Rolling Window Regression: a Simple Approach for Time Series Next value Predictions

Given a time series, predicting the next value is a problem that fascinated  programmers for a long time. Obviously, a key reason for this attention is stock markets, which promised untold riches if you can crack it. However, except for few (see A rare interview with the mathematician who cracked Wall Street), those riches have proved elusive.

Thanks to IoT (Internet of Things), time series analysis is poise to a come back into the lime light. IoT let us place ubiquitous sensors everywhere, collect data, and act on that data. IoT devices collect data through time and resulting data are almost always time series data.

Following are few use cases for time series prediction.

  1. Power load prediction
  2. Demand prediction for Retail Stores
  3. Services (e.g. airline check in counters, government offices) client prediction
  4. Revenue forecasts
  5. ICU care vital monitoring
  6. Yield and crop prediction

Let’s explore the techniques available for time series forecasts.

The first question is that “isn’t it regression?”. It is close, but not the same as regression. In a time series, each value is affected by the values just preceding this value. For example, if there is lot of traffic at 4.55 in a junction, chances are that there will be some traffic at 4.56 as well. This is called autocorrelation. If you are doing regression, you will only consider x(t) while due to auto correlation, x(t-1), x(t-2), … will also affect the outcome. So we can think about time series forecasts as regression that factor in autocorrelation as well.

For this discussion, let’s consider “Individual household electric power consumption Data Set”, which is data collected from a one house hold over four years in one minute intervals. Let’s only consider three fields, and data set will look like following.

The first question to ask is how do we measure success? We do this via a loss function, where we try to minimize the loss function. There are several loss functions, and they are different pros and cons.

  1. MAE ( Mean absolute error) — here all errors, big and small, are treated equal.
  2. Root Mean Square Error (RMSE) — this penalizes large errors due to the squared term. For example, with errors [0.5, 0.5] and [0.1, 0.9], MSE for both will be 0.5 while RMSE is 0.5 and. 0.45.
  3. MAPE ( Mean Absolute Percentage Error) — Since #1 and #2 depend on the value range of the target variable, they cannot be compared across data sets. In contrast, MAPE is a percentage, hence relative. It is like accuracy in a classification problem, where everyone knows 99% accuracy is pretty good.
  4. RMSEP ( Root Mean Square Percentage Error) — This is a hybrid between #2 and #3.
  5. Almost correct Predictions Error rate (AC_errorRate)—percentage of predictions that is within %p percentage of the true value

If we are trying to forecast the next value, we have several choices.


The gold standard for this kind of problems is ARIMA model. The core idea behind ARIMA is to break the time series in o different components such as trend component, seasonality component etc and carefully estimate a model for each component. See Using R for Time Series Analysis for a good overview.

However, ARIMA has an unfortunate problem. It needs an expert ( a good statistics degree or a grad student) to calibrate the model parameters. If you want to do multivariate ARIMA, that is to factor in multiple fields, then things get even harder.

However, R has a function called auto.arima, which estimates model parameters for you. I tried that out.

x_train <- train data set
X-test <- test data set
powerTs <- ts(x_train, frequency=525600, start=c(2006,503604))
arimaModel <- auto.arima(powerTs)
powerforecast <- forecast.Arima(arimaModel, h=length(x_test))

You can find detail discussion on how to do ARIMA from the links given above. I only used 200k from the data set as our focus is mid-size data sets. It gave a MAPE of 19.5.

Temporal Features

The second approach is to come up with a list of features that captures the temporal aspects so that the auto correlation information is not lost. For example, Stock market technical analysis uses features built using moving averages. In the simple case, an analyst will track 7 days and 21 days moving averages and take decisions based on cross-over points between those values.

Following are some feature ideas

  1. collection of moving averages/ medians(e.g. 7, 14, 30, 90 day)
  2. Time since certain event
  3. Time between two events
  4. Mathematical measures such as Entropy, Z-scores etc.
  5. X(t) raised to functions such as power(X(t),n), cos((X(t)/k)) etc

Common trick people use is to apply those features with techniques like Random Forest and Gradient Boosting, that can provide the relative feature importance. We can use that data to keep good features and drop ineffective features.

I will not dwell too much time on this topic. However, with some hard work, this method have shown to give very good results. For example, most competitions are won using this method (e.g. /).

Down side, however, is crafting features is a black art. It takes lots of work and experience to craft the features.

Rolling Windows based Regression

Now we got to the interesting part. It seems there is an another method that gives pretty good results without lots of hand holding.

Idea is to to predict X(t+1), next value in a time series, we feed not only X(t), but X(t-1), X(t-2) etc to the model. A similar idea has being discussed in Rolling Analysis of Time Series although it is used to solve a different problem.

Let’s look at an example. Let’s say that we need to predict x(t+1) given X(t). Then the source and target variables will look like following.

Data set would look like following after transformed with rolling window of three.

Then, we will use above transformed data set with a well-known regression algorithm such as linear regression and Random Forest Regression. The expectation is that the regression algorithm will figure out the autocorrelation coefficients from X(t-2) to X(t).

For example, with the above data set, applying Linear regression on the transformed data set using a rolling window of 14 data points provided following results. Here AC_errorRate considers forecast to be correct if it is within 10% of the actual value.

LR AC_errorRate=44.0 RMSEP=29.4632 MAPE=13.3814 RMSE=0.261307

This is pretty interesting as this beats the auto ARIMA right way ( MAPE 0.19 vs 0.13 with rolling windows).

So we only tried Linear regression so far. Then I tried out several other methods, and results are given below.

Linear regression still does pretty well, however, it is weak on keeping the error rate within 10%. Deep learning is better on that aspect, however, took some serious tuning. Please note that tests are done with 200k data points as my main focus is on small data sets.

I got the best results from a Neural network with 2 hidden layers of size 20 units in each layer with zero dropout or regularisation, activation function “relu”, and optimizer Adam(lr=0.001) running for 500 epochs. The network is implemented with Keras. While tuning, I found articles [1] and [2] pretty useful.

Then I tried out the same idea with few more datasets.

  1. Milk production Data set ( small < 200 data points)
  2. Bike sharing Data set (about 18,000 data points)
  3. USD to Euro Exchange rate ( about 6500 data points)
  4. Apple Stocks Prices (about 13000 data points)

Forecasts are done as univariate time series. That is we only consider time stamps and the value we are forecasting. Any missing value is imputed using padding ( using most recent value). For all tests, we used a window of size 14 for as the rolling window.

Following tables shows the results. Here except for Auto.Arima, other methods using a rolling window based data set.

There is no clear winner. However, rolling window method we discussed coupled with a regression algorithm seems to work pretty well.


We discussed three methods: ARIMA, Using Features to represent time effects, and Rolling windows to do time series next value forecasts with medium size data sets.

Among the three, the third method provides good results comparable with auto ARIMA model although it needs minimal hand holding by the end user.

Hence we believe that “Rolling Window based Regression” is a useful addition for the forecaster’s bag of tricks!

However, this does not discredit ARIMA, as with expert tuning, it will do much better. At the same time, with hand crafted features methods two and three will also do better.

One crucial consideration is picking the size of the window for rolling window method. Often we can get a good idea from the domain. Users can also do a parameter search on the window size.

Following are few things that need further exploration.

  • Can we use RNN and CNN? I tried RNN, but could not get good results so far.
  • It might be useful to feed other features such as time of day, day of the week, and also moving averages of different time windows.


  1. An overview of gradient descent optimization algorithms
  2. CS231n Convolutional Neural Networks for Visual Recognition

Understanding Causality and Big Data: Complexities, Challenges, and Tradeoffs

image credit: Wikipedia, Amitchell125

“Does smoking causes cancer?”

We have heard that lot of smokers have lung cancer. However, can we mathematically tell that smoking causes cancer?

We can look at cancer patients and check how many of them are smoking. We can look at smokers and check will they develop cancer. Let’s assume that answers come up 100%. That is, hypothetically, we can see a 1–1 relationship between smokers and cancer.

Ok great, can we claim that smoking causes cancer? Apparently it is not easy to make that claim. Let’s assume that there is a gene that causes cancer and also makes people like to smoke. If that is the cause, we will see the 1–1 relationship between cancer and smoking. In this scenario, cancer is caused by the gene. That means there may be an innocent explanation to 1–1 relationship we saw between cancer and smoking.

This example shows two interesting concepts: correlation and causality from statistics, which play a key role in Data Science and Big Data. Correlation means that we will see two readings behave together (e.g. smoking and cancer) while causality means one is the cause of the other. The key point is that if there is a causality, removing the first will change or remove the second. That is not the case with correlation.

Correlation does not mean Causation!

This difference is critical when deciding how to react to an observation. If there is causality between A and B, then A is responsible. We might decide to punish A in some way or we might decide to control A. However, correlation does warrant such actions.

For example, as described in the post The Blagojevich Upside, the state of Illinois found that having books at home is highly correlated with better test scores even if the kids have not read them. So they decide the distribute books. In retrospect, we can easily find a common cause. Having the book in a home could be an indicator of how studious parents are, which will help with better scores. Sending books home, however, is unlikely to change anything.

You see correlation without a causality when there is a common cause that drives both readings. This is a common theme of the discussion. You can find a detailed discussion on causality from the talk “Challenges in Causality” by Isabelle Guyon.

Can we prove Causality?

Great, how can I show causality? Casualty is measured through randomized experiments (a.k.a. randomized trials or AB tests). A randomized experiment selects samples and randomly break them into two groups called the control and variation. Then we apply the cause (e.g. send a book home) to variation group and measure the effects (e.g. test scores). Finally, we measure the casualty by comparing the effect in control and variation groups. This is how medications are tested.

To be precise, if error bars for groups does not overlap for both the groups, then there is a causality. Check for more details.

However, that is not always practical. For example, if you want to prove that smoking causes cancer, you need to first select a population, place them randomly into two groups, make half of the smoke, and make sure other half does not smoke. Then wait for like 50 years and compare.

Did you see the catch? it is not good enough to compare smokers and non-smokers as there may be a common cause like the gene that cause them to do so. Do prove causality, you need to randomly pick people and ask some of them to smoke. Well, that is not ethical. So this experiment can never be done. Actually, this argument has been used before (e.g. )

This can get funnier. If you want to prove that greenhouse gasses cause global warming, you need to find another copy of earth, apply greenhouse gasses to one, and wait few hundred years!!

To summarize, Casualty, sometime, might be very hard to prove and you really need to differentiate between correlation and causality.

Following are examples when causality is needed.

  • Before punishing someone
  • Diagnosing a patient
  • Measure effectiveness of a new drug
  • Evaluate the effect of a new policy (e.g. new Tax)
  • To change a behavior

Big Data and Causality

Most big data datasets are observational data collected from the real world. Hence, there is no control group. Therefore, most of the time all you can only show and it is very hard to prove causality.

There are two reactions to this problem.

First, “Big data guys does not understand what they are doing. It is stupid to try to draw conclusions without randomized experiment”.

I find this view lazy.

Obviously, there are lots of interesting knowledge in observational data. If we can find a way to use them, that will let us use these techniques in many more applications. We need to figure out a way to use it and stop complaining. If current statistics does not know how to do it, we need to find a way.

Second is “forget causality! correlation is enough”.

I find this view blind.

Playing ostrich does not make the problem go away. This kind of crude generalizations make people do stupid things and can limit the adoption of Big Data technologies.

We need to find the middle ground!

When do we need Causality?

The answer depends on what are we going to do with the data. For example, if we are going to just recommend a product based on the data, chances are that correlation is enough. However, if we are taking a life changing decision or make a major policy decision, we might need causality.

Let us investigate both types of cases.

Correlation is enough when stakes are low, or we can later verify our decision. Following are few examples.

  1. When stakes are low ( e.g. marketing, recommendations) — when showing an advertisement or recommending a product to buy, one has more freedom to make an error.
  2. As a starting point for an investigation — correlation is never enough to prove someone is guilty, however, it can show us useful places to start digging.
  3. Sometimes, it is hard to know what things are connected, but easy verify the quality given a choice. For example, if you are trying to match candidates to a job or decide good dating pairs, correlation might be enough. In both these cases, given a pair, there are good ways to verify the fit.

There are other cases where causality is crucial. Following are few examples.

  1. Find a cause for disease
  2. Policy decisions ( would 15$ minimum wage be better? would free health care is better?)
  3. When stakes are too high ( Shutting down a company, passing a verdict in court, sending a book to each kid in the state)
  4. When we are acting on the decision ( firing an employee)

Even, in these cases, correlation might be useful to find good experiments that you want to run. You can find factors that are correlated, and design the experiments to test causality, which will reduce the number of experiments you need to do. In the book example, state could have run a experiment by selecting a population and sending the book to half of them and looking at the outcome.

Some cases, you can build your system to inherently run experiments that let you measure causality. Google is famous for A/B testing every small thing, down to the placement of a button and shade of color. When they roll out a new feature, they select a polulation and rollout the feature for only part of the population and compare the two.

So in any of the cases, correlation is pretty useful. However, the key is to make sure that the decision makers understand the difference when they act on the results.

Closing Remarks

Causality can be a pretty hard thing to prove. Since most big data is observational data, often we can only show the correlation, but not causality. If we mixed up the two, we can end up doing stupid things.

Most important thing is having a clear understanding at the point when we act on the decisions. Sometimes, when stakes are low, correlation might be enough. On some other cases, it is best to run an experiment to verify our claims. Finally, some systems might warrant building experiments into the system itself, letting you draw strong causality results. Choose wisely!

Original Post from my Medium account:

Value Proposition of Big Data after a Decade

Big data is an umbrella term for many technologies: Search, NoSQL, Distributed File Systems, Batch and Realtime Processing, and Machine Learning ( Data Science). These Different technologies are developed and proven to various degree. After 10 years, is it real? Following are few success stories of what big data has done.

  1. Nate Silver predicted outcomes of 49 of the 50 states in the 2008 U.S. Presidential election
  2. Money Ball ( Baseball drafting)
  3. Cancer detection from Biopsy cells (Big Data find 12 tell-tale patterns while doctors only knew about nine). See
  4. Bristol-Myers Squibb reduced the time it takes to run clinical trial simulations by 98%
  5. Xerox used big data to reduce the attrition rate in its call centre by 20%.
  6. Kroger Loyalty programs ( growth in 45 consecutive quarters)

As these examples show, big data indeed can work. Could that work for you. Let’s explore this a bit.

The premise of big data goes as follows.

If you collect data about your business and feed it to a Big Data system, you will find useful insights that will provide a competitive advantage — (e.g. Analysis of data sets can find new correlations to “spot business trends, prevent diseases, combat crime and so on”. [Wikipedia])

When we say Big Data will make a difference, the underline assumption is that way we and organisations operate are inefficient.

This means Big Data is as an optimization technique. Hence, you must know what is worth optimizing. If your boss asked you to make sure the organization is using big data, doing “Big Data Washing” is easy.

  1. Publish or collect the data you can with a minimal effort
  2. Do a lot of simple aggregations
  3. Figure out what data combinations makes prettiest pictures
  4. Throw in some machine learning algorithms, predict something but don’t compare
  5. Create a cool dashboard and do a cool demo. Claim that you are just scratching the surface!!

However, adding value to your organization through big data is not that easy. This is because insights are not automatic. Insights are possible only if we have right data, we look at the right place, such insights exists, and we do find the insights.

Making a difference will need you to understand what is possible with big data, what are its tools, as well as the pain points in your domain and organization? Following Pictures shows some of the applications of big data within an organization.

The first step is asking, what are some of those applications that can make a difference for your organization.

The next step is understanding tools in “Big Data toolbox”. They come in many forms.

KPI ( Key Performance Indicators) — People used to take canaries into the coal mines. Since those small birds are very sensitive to the oxygen level in the air, if they got knocked out, you need to be running out of the mine. KPIs are canaries for your organization. They are numbers that can give you an idea about the performance of something — E.g. GDP, Per Capita Income, HDI index etc for a country, Company Revenue, Lifetime value of a customer, Revenue per Square foot ( in the retail industry). Chances are your organization or your domain has already defined them. Idea is to use Big Data to monitor the KPIs.

Dashboard — Think about a car dashboard. It gives you an idea about the overall system in a glance. It is boring when all is good, but it grabs attention when something is wrong. However, unlike car dashboards, Big data dashboards have support for drill down and find root cause.

Alerts — Alerts are Notifications ( sent via email, SMS, Pager etc.). Their Goal is to give you a peace of mind by not having to check all the time. They should be specific, infrequent, and have very low false positives.

Sensors — Sensors collect data and make them available to the rest of the system. They are expensive and time-consuming to install.

Analytics — Analytics take decisions. They come in four forms: batch real-time, interactive, predictive.

  • Batch Analytics— process the data that resides in the disk. If you can wait (e.g. more than an hour) for data to be available, this is what you use.
  • Interactive Analytics —It is used by a human to issue ad-hoc queries and to understand a dataset. Think of it as having a conversation with the data.
  • Realtime Analytics— It is used to detect something quickly within few milliseconds to few seconds. Realtime analytics are very powerful in detecting conditions over time (e.g. Football Analytics). Alerts are implemented through Realtime analytics
  • Predictive Analytics — It learns a solution from examples. Example, It is very hard to write a program to drive a car. This is because there are too many edge conditions. We solve that kind of problems by giving lot of examples and asking the computer to figure out a program that solves the problem ( which we call a model). Two common forms are predicting next value (e.g. electricity load prediction) and predicting a category (e.g. is this email a SPAM?).

Drill down — To make decisions, operators need to see the data in context and drill down into detail to understand the root cause. The typical model is to start from an alert or dashboard, see data in context (other transactions around the same time, what does the same user did before and after etc.) and then let the user drill down. For example, see WSO2 Fraud Detection Solution Demo.

The process of deriving insight from the data, using above tools, looks like following.

Here different roles work together to explore data, understand data, to define KPIs, create dashboards, alerts etc.

In this process, keeping the system running is a key challenge. This includes DevOps challenges, Integrate data continuously, update models, and get feedback about the effectiveness of decisions (e.g. Accuracy of Fraud). Hence doing things in production is expensive.

On the other hand, “doing it Once” is cheap. Hence, you must first try your scenarios in an ad-hoc manner first (hire some expertise if you must) and make sure it can add value to the organization before setting up a system that does it every day.

Actionable Insights are the Key!!

Insights that you generate must be actionable. That means several things.

  1. Information you share is significant and warrant attention, and they are presented with their ramifications ( e.g. more than two technical issues would lead customer to churn)
  2. Decision makers can identify the context associated with the insight ( e.g. operators can see through history of customers who qualify)
  3. Decision makers can do something about the insight ( e.g. can work with customers to reassures and fix)

For each information you show the user, think hard “why I am showing him this?”, “what can he do with this information?”, and “what other information I can show to make him understand the context?”.

Where to Start?

Big Data projects can take many forms.

  1. Use an existing Dataset: I already have a data set, and list of potential problems. I will use Big data to solve some of the problems.
  2. **Fix a known Problem: Find a problem, collect data about it, analyse, visualize, build a model and improve. Then build a dashboard to monitor.
  3. Improve Overall Process: Instrument processes ( start with most crucial parts), find KPIs, analyze and visualize the processes, and improve
  4. Find Correlations: Collect all available data, data mine the data or visualize, find interesting correlations.

My recommendation is to start with #2, fix a known problem in the organization. That is the least expensive, and that will let you demonstrate the value of Big data right away.

Finally, the following are key take away points.

  • Big Data provide a way to optimize. However, blind application does not guarantee success.
  • Learn tools in Big Data toolbox: KPIs, Analytics ( Batch, Real-time, Interactive, Predicative), Visualizations, Dashboards. Alerts, Sensors.
  • Start small. Try out with data sets before investing in a system
  • Find a high impact problem and make it work end to end

Understanding CEP, Stream Processing, and their Implementations

Real-time analytics technologies come in many flavors such as Apache Strom, and streaming analytics, and complex event processing. I am sure you have heard about the first, likely second and third. Have you heard about a technology called “Complex Event Processing”? If you follow this space, you might have heard that people believe that CEP will play a key role in IoT use cases. However, Storm and Spark Streaming are much widely known than CEP.

So what is this CEP anyway?  In this post, I am trying to explain CEP, streaming analytics and compare and contrast them. I will try to give a description of current status (as of 2015) as opposed to give a definition. If you are looking for a definition, best would be What’s the Difference Between ESP and CEP?


As the above picture shows, technically CEP is a subset of Event Stream Processing. Asking for the difference between CEP vs Stream Processing, however, is the wrong question because both CEP engines and Stream processing engines do more than suggested by their names and trespass into the other side.

The right question is “what is the difference between CEP and ESB engines?” Stream processing engines and CEP engines use to be pretty different and they come from a very different background. Use cases they target and issues they choose to handle or not to handle were different.

Stream processing engines let you create a processing graph and inject events into the processing graph. Each operator process and send events to next processors. In most Stream processing engines like Storm, S4, etc, users have to write code to create the operators, wire them up in a graph and run them.  Then the engine runs the graph in parallel using many computers. Among examples are Apache Storm, Apache Fink, and Apache Samza.

In contrast, CEP engines let users write queries using a higher level query language. CEP engines were first created for use cases related to stock market use cases where they must generate a response within milliseconds. Furthermore, CEP engines have built-in operators such as time windows, temporal event sequences integrated into their query language (see Patterns for Streaming Realtime Analytics). It is worth noting that these differences have very little to do with the definitions of CEP or stream processing. Rather, they are a by-product of history and use cases they had to handle. This is the reason that many find the difference between CEP and Stream Processing confusing.

It is worth noting that these differences do not stem from definitions of CEP or stream processing. Rather, they are a by-product of history and use cases they had to handle. This is the reason that many find the difference between CEP and Stream Processing confusing.

Hence, let’s focus on differences between two types of engines. Following are key differences between the CEP and Stream Processing engines.

  1. Stream Processing Engines are distributed and parallel by design. They support large 10-100s node computations as opposed to CEP engines, which have centralized architecture typically having two or few nodes.
  2. Stream Processing Engines force you to write code, and often they do not have higher level operators such as windows, joins, and temporal patterns. In contrast, CEP engines provide you with high-level languages and support high-level operators. This difference is similar to the relationship between MapReduce and HIVE SQL scripts.
  3. Due to their stock market-based history, CEP engines are tuned for low latency. Often they respond within few milliseconds and sometimes with sub-millisecond latency. In contrast, most Stream processing engines take close to a second to generate results.
  4. Stream Processing engines stress the reliable message processing, often consuming data from a queue such as Kafka.  In contrast, CEP engines often receive and process data in memory, and when a failure happened, they often choose to throw away failed events and continue. This behavior, however, has already changed. Most CEP engines support reliable processing of data from a queue such as Kafka.

Let us look at the history of both.

CEP engines were around for a long time. Their history goes back to 90’s (see CEP Market players – end of 2014 – from Paul Vincent). They were used in several real-world use cases. However, they were a niche and expensive. Stream Processing systems come from Aurora and Borealis research projects (2005-2008).

At the aftermath of Big Data taking off around 2012-2013, people started to look for streaming analytics solution that is similar to Hadoop. Apache Storm is created at that time. It mirrored the MapReduce model, where you can write some code and attach them to a processing graph. It stole the limelight and outshone the CEP solutions.

Meanwhile, CEP was pretty much excluded from the spotlight. Stream processing engines programming models had direct parallels with MapReduce model, which helped. (image credit tambako flicker stream).

6797307367_3df84e44be_z However, it is worth noting that Analysts always paid attention to CEP. For example, in this 2008 Gartner report, CEP has been mentioned and CEP is mentioned ever since. CEP has been mentioned in Gartner hype cycles 2012-2014 ( All big data technologies are dropped from 2015 as it is no longer emerging technology, see

Now another trend, IoT, might bring CEP back into the spotlight and into our day to day lives. This is due to three main reasons.

  1. IOT data are time series data where data is autocorrelated. CEP is much better placed to handle them due to its temporal operators.
  2. Most IoT use cases deal with use cases that connect directly with the real world. If you are to act on those insights, you need those insights very fast. CEP has an advantage in the turnaround time.
  3. Most IoT use cases are complex, and they go beyond calculating aggregating data.  Those use cases need support for complex operators like time windows and temporal query patterns.

At the same time, traditional CEP cannot handle those IoT use cases in their current form. Most IoT use cases would have very high event rates. Therefore, whatever event technology used in those use cases needed to be able to scale up. Stream processing can scale much better than CEP.

At the same time, I believe it is a mistake to ignore the higher level temporal operators introduced by CEP and asking the end users to write their own operators. You can find my thoughts on Patterns for Streaming Realtime Analytics and Streaming SQL Query Language for Real-time Streaming Analytics.

The good news is that both technologies: CEP and Stream Processing are merging and the differences are diminishing. Both can learn from the other, where CEP needs to scale and process events reliably while event processing needs high-level languages and lower latencies. IBM Infosphere, which is a stream processing engine, have had CEP like operators for a long time. WSO2 CEP can now accept Streaming SQL queries and runs on top of Apache Storm (more details). SQLStream is a CEP engine that is highly parallel. My belief is that we will end up with a combination of both and we all will be better off for it.

Update: This post was featured in Software Engineering Daily blog.

Update 2017 September: WSO2 CEP now available under the name  WSO2 Stream Processor, which is freely available under Apache Licence 2.

Introduction to Anomaly Detection: Concepts and Techniques

Why Anomaly Detection?

burglar-157142_640Machine Learning has four common classes of applications: classification, predicting next value, anomaly detection, and discovering structure. Among them, Anomaly detection detects data points in data that does not fit well with the rest of the data. It has a wide range of applications such as fraud detection, surveillance, diagnosis, data cleanup, and predictive maintenance.

Although it has been studied in detail in academia, applications of anomaly detection have been limited to niche domains like banks, financial institutions, auditing, and medical diagnosis etc. However, with the advent of IoT, anomaly detection would likely to play a key role in IoT use cases such as monitoring and predictive maintenance.

This post explores what is anomaly detection, different anomaly detection techniques,  discusses the key idea behind those techniques, and wraps up with a discussion on how to make use of those results.

Is it not just Classification?

The answer is yes if the following three conditions are met.

  1. You have labeled training data
  2. Anomalous and normal classes are balanced ( say at least 1:5)
  3. Data is not autocorrelated. ( That one data point does not depend on earlier data points. This often breaks in time series data).

If all of above is true, we do not need an anomaly detection techniques and we can use an algorithm like Random Forests or Support Vector Machines (SVM).

However, often it is very hard to find training data, and even when you can find them, most anomalies are 1:1000 to 1:10^6 events where classes are not balanced. Moreover, the most data, such as data from IoT use cases, would be autocorrelated.

Another aspect is that the false positives are a major concern as we will discuss under acting on decisions. Hence, the precision ( given model predicted an anomaly, how likely it is to be true)  and recall (how much anomalies the model will catch) trade-offs are different from normal classification use cases. We will discuss this in detail later.

What is Anomaly Detection?

Anomalies or outliers come in three types.

  1. Point Anomalies. If an individual data instance can be considered as anomalous with respect to the rest of the data (e.g. purchase with large transaction value)
  2. Contextual Anomalies, If a data instance is anomalous in a specific context, but not otherwise ( anomaly if occur at a certain time or a certain region. e.g. large spike at the middle of the night)
  3. Collective Anomalies. If a collection of related data instances is anomalous with respect to the entire dataset, but not individual values. They have two variations.
    1. Events in unexpected order ( ordered. e.g. breaking rhythm in ECG)
    2. Unexpected value combinations ( unordered. e.g. buying a large number of expensive items)

In the next section, we will discuss in detail how to handle the point and collective anomalies. Contextual anomalies are calculated by focusing on segments of data (e.g. spatial area, graphs, sequences, customer segment) and applying collective anomaly techniques within each segment independently.

Anomaly Detection Techniques

Anomaly detection can be approached in many ways depending on the nature of data and circumstances. Following is a classification of some of those techniques.

Static Rules Approach

The most simple, and maybe the best approach to start with, is using static rules. The Idea is to identify a list of known anomalies and then write rules to detect those anomalies. Rules identification is done by a domain expert, by using pattern mining techniques, or a by combination of both.

Static rules are used with the hypothesis that anomalies follow the 80/20 rule where most anomalous occurrences belong to few anomaly types. If the hypothesis is true, then we can detect most anomalies by finding few rules that describe those anomalies.

Implementing those rules can be done using one of three following methods.

  1. If they are simple and no inference is needed, you can code them using your favorite programming language
  2. If decisions need inference, then you can use a rule-based or expert system (e.g. Drools)
  3. If decisions have temporal conditions, you can use a Complex Event Processing System (e.g. WSO2 CEP, Esper)

Although simple, static rules-based systems tend to be brittle and complex. Furthermore, identifying those rules is often a complex and subjective task. Therefore, statistical or machine learning based approach, which automatically learns the general rules, are preferred to static rules.

When we have Training Data

Anomalies are rare under most conditions. Hence, even when training data is available, often there will be few dozen anomalies exists among millions of regular data points. The standard classification methods such as SVM or Random Forest will classify almost all data as normal because doing that will provide a very high accuracy score (e.g. accuracy is 99.9 if anomalies are one in thousand).

Generally, the class imbalance is solved using an ensemble built by resampling data many times.  The idea is to first create new datasets by taking all anomalous data points and adding a subset of normal data points (e.g. as 4 times as anomalous data points). Then a classifier is built for each data set using SVM or Random Forest, and those classifiers are combined using ensemble learning. This approach has worked well and produced very good results.

If the data points are autocorrelated with each other, then simple classifiers would not work well. We handle those use cases using time series classification techniques or Recurrent Neural networks.

When there is no Training Data

If you do not have training data, still it is possible to do anomaly detection using unsupervised learning and semi-supervised learning. However, after building the model, you will have no idea how well it is doing as you have nothing to test it against. Hence, the results of those methods need to be tested in the field before placing them in the critical path.

No Training Data: Point Anomalies

Point anomalies will only have one field in the data set. We use percentiles to detect point anomalies with numeric data and histograms to detect Detecting point anomalies in categorical data. Either case, we find rare data ranges or field values from the data and predict those as anomalies if it happens again. For example, if 99.9 percentile of my transaction value is 800$, one can guess any transaction greater than that value as the potential anomaly. When building models, often we use moving averages instead of point values when possible as they are much more stable to noise.

No Training Data: Univariate Collective Outliers 

Time series data are the best examples of collective outliers in a univariate dataset. In this case, anomalies happen because values occur in unexpected order. For example. the third heart beat might be anomalous not because values are out of range, but they happen in a wrong order.


There are three several approaches to handle these use cases.

Solution 1: build a predictor and look for outliers using residues: This is based on the heuristic that the values not explained by the model are anomalies. Hence we can build a model to predict the next value, and then apply percentiles on the error ( predicted value – actual value) as described before. The model can be built using regression,  time series models, or Recurrent Neural Networks.

Solution 2: Markov chains and Hidden Markov chains can measure the probability of a sequence of events happening. This approach builds a Markov chain for the underline process, and when a sequence of events has happened, we can use the Markov Chain to measure the probability of that sequence occurring and use that to detect any rare sequences.

FraudFor example, let’s consider credit card transactions. To model the transactions using Markov chains, let’s represent each transaction using two values: transaction value (L, H) and time since the last transaction (L, H). Since Markov chain’s states have to be finite, we will choose two values Low (L), High (H) to represent variable values. Then Markov chains would represent by states LL, LH, HL, HH and each transaction would be a transition from one state to another state. We can build the Markov chain using historical data and use the chain to calculate sequence probabilities. Then, we can find the probability of any new sequence happening and then mark rare sequences as anomalies. The blog post “Real Time Fraud Detection with Sequence Mining” describes this approach in detail.

No Training Data: Multivariate Collective Outliers ( Unordered)

Here data have multiple reading but does not have an order. For example, vitals collected from many people are such a multi-variate but not ordered dataset. For example, higher temperatures and slow heartbeats might be an anomaly even though both temperature and heartbeats by itself are in a normal range.

Approach 1: Clustering – the underline assumption in the first approach is that if we cluster the data, normal data will belong to clusters while anomalies will not belong to any clusters or belong to small clusters.

Then to detect anomalies we will cluster the data, and calculate the centroids and density of each cluster found.  When we receive a new data point, we calculate the distance from the new data point to known large clusters and if it is too far, then decide it as an anomaly.

Furthermore, we can improve upon the above approach by first manually inspecting ranges of each cluster and labeling each cluster as anomalous or normal and use that while doing anomaly check for a data point.

Approach 2: Nearest neighbor techniques – the underline assumption is new anomalies are closer to known anomalies. This can be implemented by using distance to k-anomalies or using the relative density of other anomalies near the new data point. While calculating the above, with numerical data, we will break the space into hypercubes, and with categorical data, we will break the space into bins using histograms. Both these approaches are described in ACM Computing Survey paper “Anomaly Detection: A Survey” in detail.

No Training Data: Multivariate Collective Outliers ( Ordered)

This class is most general and consider ordering as well as value combinations. For example, consider a series of vital readings taken from the same patient. Some reading may be normal in combination but anomalous as combinations happen in wrong order. For example, given a reading that has the blood pressure, temperature, and heart beat frequency,  each reading by itself may be normal, but not normal if it oscillates too fast in a short period of time.

Combine Markov Chains and Clustering – This method combines clustering and Markov Chains by first clustering the data, and then using clusters as the states in a Markov Chain and building a Markov Chain. Clustering will capture common value combinations and Markov chains will capture their order.

Other Techniques

There are several other techniques that have been tried out, and following are some of them. Please see Anomaly Detection: A Survey for more details.

Information Theory: The main idea is that anomalies have high information content due to irregularities, and this approach tries to find a subset of data points that have highest irregularities.

Dimension Reduction: The main idea is that after applying dimension reduction, a normal data can be easily expressed as a combination of dimensions while anomalies tend to create complex combinations.

Graph Analysis: Some processes would have interaction between different players. For example, money transfers would create a dependency graph among participants. Flow analysis of such graphs might show anomalies. On some other use cases such as insurance, stock markets, corporate payment fraud etc, similarities between player’s transactions might suggest anomalous behavior. Using PageRank to Detect Anomalies and Fraud in Healthcare and “New ways to detect fraud” white paper by Neo4j  are examples of these use cases.

Comparing Models and Acting on Results


With anomaly detection, it is natural to think that the main goal is to detect all anomalies. However, it is often a mistake.

The book, “Statistics done Wrong”, has a great example demonstrating the problem. Let’s consider there are 1000 patients and 9 of them have breast cancer. There is a test ( a model) that detect cancer, which will capture 80% of patients who have cancer (true positives). However, it says yes for 9% of healthy patients ( false positives).

This can be represented by following confusion matrix.

Healthy Not Healthy
Predicted Healthy 9091 1
Predicted Not Healthy 900 8

In this situation, when the test says someone has cancer, actually he does not have cancer at 99% of the time. So the test is useless. If we go detecting all anomalies, we might create a similar situation.

If you ignore this problem, it can cause harm in multiple ways.

  1. Reduce trust in the system – when people lost the trust, it will take a lot of threats and red tape to make them trust it
  2. Might do harm than good – in above example, emotional trauma and unnecessary tests might outweigh any benefits.
  3. Might be unfair (e.g. surveillance, arrest)

Hence, we must try to find a balance where we try to detect what we can while keeping model accuracy within acceptable limits.

Another side of the same problem is that the models are only a suggestion for investigation, but not evidence for incriminating someone. This is another twist of Correlation vs. Causality. Therefore, results of the model must never be used as evidence and the investigator must find independent evidence of the problem (e.g. Credit card Fraud).

Due to both these reasons, it is paramount that investigator should be able to see the anomaly within context to verify it and also to find evidence that something is amiss.  For example, in WSO2 Fraud Detection solution, investigators could click on the fraud alert and see the data point within the context of other data as shown below.


Furthermore, with the techniques like static rules and unsupervised methods, it is harder to predict how much alerts the techniques might lead to. For example, it is not useful for 10 person team to receive thousands of alerts. We can handle this problem by tracking percentiles on anomaly score and only triggering on the top 1% of the time. If the considered set is very big we can use a percentile approximation technique like t-digest (e.g.

Finally, we must pay attention to what investigators did with the alerts and improve their experience. For example, providing auto-silencing of repeated alerts and alert digests etc are ways to provide more control to the investigators.

Tools & Datasets

Anomaly Detection is mostly done with custom code and proprietary solutions. Hence, it’s applications has been limited to few high-value use cases. If you are looking for an open source solution, following are some options.

  1. WSO2 has been working on Fraud Detection tool built on top of WSO2 Data Analytics Platform ( Disclaimer: I am part of that team).  It is free under Apache Licence. You can find more information about
  2. Kale ( and Thyme by Etasy provide support for time series based anomaly detection.  See
  3. There are several samples done on top of other products such as

Finally, there are only a few datasets in the public domain that can be used to test anomaly detection problems. This limits the development of those techniques. Following are those I know about.

  1. KDD cup 99 intrusion detection dataset
  2. Single variable time series data sets by Numenta
  3. Breast Cancer dataset
  4. Yahoo Time Series Anomaly Detection Dataset

I think as a community we need to find more datasets as that will make it possible to compare and contrast different solutions.


In this post, we discussed anomaly detection, how it is different from machine learning, and then discussed different anomaly detection techniques

We categorized anomalies into three classes: point anomalies, contextual anomalies, and collective anomalies. Then we discussed some of the techniques for detecting them. The following picture shows a summary of those techniques.


You can find a detailed discussion of most of these techniques from the ACM Computing Survey paper “Anomaly Detection: A Survey“. Finally, the post discussed several of the pitfalls of trying to detect all anomalies and some tools.

Hope this was useful. If you have any thoughts or like to point out any other major techniques that I did not mention, please drop a comment.

Image Credit: (CC Licence)


If you enjoyed this post you might also find following interesting.

 Check out some of my most read posts and my talks (videos). Talk to me at @srinath_perera or find me.

Analysis of Retweeting Patterns in Sri Lankan 2015 General Election

The Election is done. Just the other day, we saw an analysis of twitter activities by Yudhanjaya, where he observed that few accounts shape the tone of the twitter LKA community. 

In this post, I am taking a detailed look at the retweet network, further drilling into twitter community. The dataset is the retweets graph for twitter hashtags #GESL15 and #GenElecSL collected between 4-22 of august.The archive includes 14k tweets of which about 9k are retweets. There are 2480 tweets accounts have participated. I collect the data through The analysis is done using a set of R scripts.

Following are some interesting observations.

How does the community look like?

denseNetwork sparsenetwork

The graph shows a visualization of the community. Each vertex represents an account. The first dense chart shows an edge for each retweet and the second sparse graph only shows an edge if five or more retweets have happened between the two accounts. The size of each node shows the number of retweets the node has received.

The community is arranged around few accounts that act as hubs, and first 10 authors have received about 40% retweets all retweets. This confirms Yudhanjaya‘s observations.

Furthermore, both graphs are well connected. Even the sparse graph is fully connected. Often in political conversions, different groups tend to segregate and cross talk is minimal. However, that is not the case in the LKA twitter graph. Maybe the presence of journalists as hubs in the network have enabled cross talk between groups.

The first table below shows the 15 accounts that had most retweets and the second table shows vertex betweenness values. Vertex betweenness is a measure of each node’s ability to connect different parts of the network.

betweeness toptweets

Four accounts appears on both measures, which further confirms their prominence.

What suggests a good reach?

Following two charts try to find any correlation between the number of tweets or the number of followers vs. retweets.


However, data does not show such behavior. There are several bots that generate lots of tweets, but they do not generate much retweets. Furthermore, the accounts that receive many retweets have only about 50-100 tweets ( about 4-5 per day). This is evidence supporting that it is content, not the network strcture drives retweets, although it is not concusive.

Also, the relationship between the  number of followers and retweets also is not very clear. Although there are few account that have many followers and tweets, there are notable exceptions in the graph. Most likely this is caused by highly connected nature of the network, where followers are replaced by fast propergation through the network. Hence you can have lot of reach without having lot of followers.

What did they talk about?

The following picture shows an word cloud generated using all the tweets. However, there are not much surprises there.


In contrast, top retweets in each day provide a superb chronicle on what happens in each day over time. You can get a view closer to this by typing #GenElecSL into twitter search box.



Twitter network community for Sri Lankan general election 2015 has a very well connected retweets graph. Although few accounts shape the tone of the discussion, they seem to do a good job of enabling cross-communication between different groups. Reach, measured via retweets, seem to be independent of attributes like the frequency of tweets and the number of followers the account have. Finally, most retweeted tweets in each day seem to provide a useful chronicle of what happens in the election each day.

Note: It is worth noting that the community graph does not show the followers. Hence, the retweet can happens via another account as well (e.g. B retweets A’s message, and C having seen B’s retweet, he retweets A’s post. Then both C and B will have an edge from A in this graph. Therefore, these results does not discredit follower network in the community.

13 Stream Processing Patterns for building Streaming and Realtime Applications


More and more use cases, we want to react to data faster, rather than storing them in a disk and periodically processing and acting on the data. This is done using Realtime analytics.

Realtime analytics, or what people call Realtime Analytics, have two flavors.

  1. Realtime Interactive/Ad-hoc Analytics (users issue ad-hoc dynamic queries and the system responds interactively). Examples of such tools are Druid, SAP Hana, VoltDB, MemSQL, and Apache Drill.
  2. Realtime Streaming Analytics / Stream Processing ( users issue static queries once and they do not change, and the system process data as they come in without storing). This is supported by Stream Processors and among examples of such tools are WSO2 Stream Processors and Apache Flink. ( see What is Stream Processing? for more details)

Realtime Interactive Analytics allows users to explore a large data set by issuing ad-hoc queries. Queries should respond within 10 seconds, which is considered the upper bound for acceptable human interaction. In contrast, this tutorial focuses on Stream Processing, which is processing data as they come in without storing them and react to those data very fast, often within few milliseconds. Such technologies are not new. History goes back to Active Databases (2000+), Stream processing (e.g. Aurora (2003), Borealis (2005+) and later Apache Storm), Distributed Streaming Operators(2005), and Complex Event processing. Between 2015-2018, most of these technologies have converged under the theme Steam Processing (see CEP vs. Stream Processing) for more information.

when thinking about Realtime analytics, many think only about counting use cases. Counting use cases are only the tip of the iceberg of real-life realtime use cases. Since the input data arrives as a data stream, a time dimension always presents in the data. This time dimension allows us to implement and perform many powerful use cases. For an example, Streaming SQL supported by many Stream Processors provides operators like windows, joins, and temporal event sequence detection.

Stream processing technologies like Apache Samza and Apache Storm has received much attention under the theme of large-scale streaming analytics. However, these tools force every programmer to design and implement real-time analytics processing from first principals.

For an example, if users need a time window, they need to implement it from first principals. This is like every programmer implementing his own list data structure.

Since 2016, a new idea called Streaming SQL has emerged. We call a language that enables users to write SQL like queries to query streaming data as a “Streaming SQL” language. Almost all Stream Processors now support Streaming SQL. 

However, writing Streaming Applications requires very different thinking patterns from writing code with a language like Java.  A better understanding of common patterns in Stream Processing will let us understand the domain better and build tools that handle those scenarios. This tutorial describes 13 common relatime streaming analytics patterns and how to implement them. In the discussion, we will draw heavily from real life use cases done under Stream Processing and other technologies.

Realtime Streaming Analytics Patterns

Before looking at the patterns, let’s first agree on the terminology. Stream Processing accepts input as a set of streams where each stream consists of many events ordered in time. Each event has many attributes, but all events in the same stream have the same set of attributes or schema.

Pattern 1: Preprocessing

Preprocessing is often done as a projection from one data stream to the other or through filtering. Potential operations include

  • Filtering and removing some events
  • Reshaping a stream by removing, renaming, or adding new attributes to a stream
  • Splitting and combining attributes in a stream
  • Transforming attributes

For example, from a twitter data stream, we might choose to extract the fields: author, timestamp, location, and then filter them based on the location of the author.

Pattern 2: Alerts and Thresholds

This pattern detects a condition and generates alerts based on a condition. (e.g. Alarm on high temperature). These alerts can be based on a simple value or more complex conditions such as rate of increase etc.

For an example, in TFL (Transport for London) Demo video based on transit data from London, we trigger a speed alert when the bus has exceed a given speed limit.

We can generate alerts for scenarios such as the server room temperature is continually increasing for last 5 mins.

Pattern 3: Simple Counting and Counting with Windows

This pattern includes aggregate functions like Min, Max, Percentiles etc, and they can be counted without storing any data. (e.g. counting the number of failed transactions).

However, counts are often used with a time window attached to it. ( e.g. failure count last hour). There are many types of windows: sliding windows vs. batch (tumbling) windows and time vs. length windows. There are four main variations.

  • Time, Sliding window: keeps each event for the given time window, produce an output whenever a new event has added or removed.
  • Time, Batch window: also called tumbling windows, they only produce output at the end of the time window
  • Length, Sliding: same as the time, sliding window, but keeps a window of n events instead of selecting them by time.
  • Length, Batch window: same as the time, batch window, but keeps a window of n events instead of selecting them by time

There are special windows like decaying windows and unique windows. Please refer to Stream Processing 101: From SQL to Streaming SQL in 10 Minutes for more details.

Pattern 4: Joining Event Streams

The main idea behind this pattern is to match up multiple data streams and create a new event steam. For an example, let’s assume we play a football game with both the players and the ball having sensors that emit events with current location and acceleration. We can use “joins” to detect when a player has kicked the ball. To that end, we can join the ball location stream and the player stream on the condition that they are close to each other by one meter and the ball’s acceleration has increased by more than 55m/s^2.

Among other use cases are combining data from two sensors, and detecting the proximity of two vehicles. Please refer to Stream Processing 101: From SQL to Streaming SQL in 10 Minutes for more details.

Pattern 5: Data Correlation, Missing Events, and Erroneous Data

This pattern and the pattern four a has lot in common where here too we match up multiple streams. In addition, we also correlate the data within the same stream. This is because different data sensors can send events at different rates, and many use cases require this fundamental operator.

Following are some possible scenarios.

  1. Matching up two data streams that send events at different speeds
  2. Detecting a missing event in a data stream ( e.g. detect a customer request that has not been responded within 1 hour of its reception. )
  3. Detecting erroneous data (e.g. Detect failed sensors using a set of sensors that monitor overlapping regions and using those redundant data to find erroneous sensors and removing their data from further processing)

Pattern 6: Interacting with Databases

Often we need to combine the realtime data against the historical data stored in a disk. Following are few examples.

  • When a transaction happened, look up the age using the customer ID from customer database to be used for Fraud detection (enrichment)
  • Checking a transaction against blacklists and whitelists in the database
  • Receive an input from the user (e.g. Daily discount amount may be updated in the database, and then the query will pick it automatically without human intervention.)

Pattern 7: Detecting Temporal Event Sequence Patterns

Using regular expressions with strings, we detect a pattern of characters from a sequence of characters. Similarly, given a sequence of events, we can write a regular expression to detect a temporal sequence of events arranged on time where each event or condition about the event is parallel to a character in a string in the above example.

A frequently cited example, although bit simplistic, is that a thief, having stolen a credit card, would try a smaller transaction to make sure it works and then do a large transaction. Here the small transaction followed by a large transaction is a temporal sequence of events arranged on time and can be detected using a regular expression written on top of an event sequence.

Such temporal sequence patterns are very powerful. For example, the following video shows a real time analytics done using the data collected from a real football game. This was the dataset taken from DEBS 2013 Grand Challenge.

In the video, we used patterns on event sequence to detect the ball possession, the time period a specific player controlled the ball. A player possessed the ball from the time he hits the ball until someone else hits the ball. This condition can be written as a regular expression: a hit by me, followed by any number of hits by me, followed by a hit by someone else. (We already discussed how to detect the hits on the ball in Pattern 4: Joins).

Please refer to Stream Processing 101: From SQL to Streaming SQL in 10 Minutes for more details.

Pattern 8: Tracking

The eighth pattern tracks something over space and time and detects given conditions. Following are few examples

  • Tracking a fleet of vehicles, making sure that they adhere to speed limits, routes, and geo-fences.
  • Tracking wildlife, making sure they are alive (they will not move if they are dead) and making sure they will not go out of the reservation.
  • Tracking airline luggage and making sure they are not been sent to wrong destinations
  • Tracking a logistic network and figure out bottlenecks and unexpected conditions.

For example, TFL Demo we discussed under pattern 2 shows an application that tracks and monitors London buses using the open data feeds exposed by TFL(Transport for London).

Pattern 9: Detecting Trends

We often encounter time series data. Detecting patterns from time series data and bringing them into operator attention are common use cases.

Following are some of the examples of tends.

  • Rise, Fall
  • Turn (switch from a rise to a fall)
  • Outliers
  • Complex trends like triple bottom etc.

These trends are useful in a wide variety of use cases such as

  • Stock markets and Algorithmic trading
  • Enforcing SLA (Service Level Agreement), Auto Scaling, and Load Balancing
  • Predictive maintenance ( e.g. guessing the Hard Disk will fill within next week)

Pattern 10: Running the same Query in Batch and Realtime Pipelines

This pattern runs the same query in both Relatime and batch pipeline. It is often used to fill the gap left in the data due to batch processing. For example, if batch processing takes 15 minutes, results would lack the data for last 15 minutes.

The idea of this pattern, which is sometimes called “Lambda Architecture” is to use realtime analytics to fill the gap. Jay Kreps’s article “Questioning the Lambda Architecture” discusses this pattern in detail.

Pattern 11: Detecting and switching to Detailed Analysis

The main idea of the pattern is to detect a condition that suggests some anomaly, and further analyze it using historical data. This pattern is used with the use cases where we cannot analyze all the data with full detail. Instead, we analyze anomalous cases in full detail. Following are few examples.

  • Use basic rules to detect Fraud (e.g. large transaction), then pull out all transactions done against that credit card for a larger time period (e.g. 3 months data) from a batch pipeline and run a detailed analysis
  • While monitoring weather, detect conditions like high temperature or low pressure in a given region and then start a high resolution localized forecast on that region.
  • Detect good customers, for example through the expenditure of more than $1000 within a month, and then run a detailed model to decide the potential of offering a deal.

Pattern 12: Using a Model

The idea is to train a model (often a Machine Learning model), and then use it with the Realtime pipeline to make decisions. For example, you can build a model using R, export it as PMML (Predictive Model Markup Language) and use it within your realtime pipeline.

Among examples is Fraud Detections, Segmentation, Predict next value, Predict Churn. Also see InfoQ article, Machine Learning Techniques for Predictive Maintenance, for a detailed example of this pattern.

Pattern 13: Online Control

There are many use cases where we need to control something online. The classical use cases are the autopilot, self-driving, and robotics. These would involve problems like current situation awareness, predicting next value(s), and deciding on corrective actions.

You can implement most of these use cases with a Stream Processor that supports a Streaming SQL language. Please refer to Stream Processing 101: From SQL to Streaming SQL in 10 Minutes for a detailed discussion on Streaming SQL.  You can try out above patterns with  WSO2 Stream Processor, which is freely available under Apache Licence 2. You can also find other Stream Processors from What are the best stream processing solutions out there?

This post is initially based on a tutorial in DEBS 2015 (9th ACM International Conference on Distributed Event-Based Systems), describing a set of realtime analytics patterns.  We have later edited the content to capture the trends such as Streaming SQL.

You can find details about pattern implementations from the following slide deck, and source code from Although Streaming SQL syntax closely follows the most recent release of WSO2 SP there are minor changes in the syntax. Please refer to WSO2 SP user guide for most recent syntax.

Hope this was useful. If you enjoyed this post you might also find following interesting.