Rolling Window Regression: A Simple Approach for Time Series Next Value Predictions

Given a time series, predicting the next value is a problem that has fascinated programmers for a long time. Obviously, a key reason for this attention is stock markets, which promise untold riches if you can crack them. However, except for a few (see A rare interview with the mathematician who cracked Wall Street), those riches have proved elusive.

Thanks to IoT (Internet of Things), time series analysis is poised to make a comeback into the limelight. IoT lets us place ubiquitous sensors everywhere, collect data, and act on that data. IoT devices collect data through time, and the resulting data are almost always time series data.

Following are a few use cases for time series prediction.

  1. Power load prediction
  2. Demand prediction for Retail Stores
  3. Client volume prediction for services (e.g. airline check-in counters, government offices)
  4. Revenue forecasts
  5. ICU care vital monitoring
  6. Yield and crop prediction

Let’s explore the techniques available for time series forecasts.

The first question is, "isn't this just regression?" It is close, but not the same as regression. In a time series, each value is affected by the values just preceding it. For example, if there is a lot of traffic at 4:55 at a junction, chances are that there will be some traffic at 4:56 as well. This is called autocorrelation. If you are doing plain regression, you will only consider X(t), while due to autocorrelation, X(t-1), X(t-2), … will also affect the outcome. So we can think about time series forecasts as regression that factors in autocorrelation as well.

For this discussion, let's consider the "Individual household electric power consumption Data Set", which is data collected from one household over four years at one-minute intervals. Let's only consider three fields, and the data set will look like the following.

The first question to ask is, how do we measure success? We do this via a loss function, which we try to minimize. There are several loss functions, and they have different pros and cons.

  1. MAE (Mean Absolute Error) — here all errors, big and small, are treated equally.
  2. Root Mean Square Error (RMSE) — this penalizes large errors due to the squared term. For example, with errors [0.5, 0.5] and [0.1, 0.9], MAE for both will be 0.5 while RMSE is 0.5 and 0.64 respectively.
  3. MAPE ( Mean Absolute Percentage Error) — Since #1 and #2 depend on the value range of the target variable, they cannot be compared across data sets. In contrast, MAPE is a percentage, hence relative. It is like accuracy in a classification problem, where everyone knows 99% accuracy is pretty good.
  4. RMSEP ( Root Mean Square Percentage Error) — This is a hybrid between #2 and #3.
  5. Almost Correct Predictions Error rate (AC_errorRate) — the percentage of predictions that are within p% of the true value

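To make these metrics concrete, here is a minimal sketch (plain NumPy; the function names are mine, and the AC_errorRate definition follows the list above, with p as the tolerance):

import numpy as np

def mae(actual, predicted):
    return np.mean(np.abs(actual - predicted))

def rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))

def mape(actual, predicted):
    # assumes the actual values are never zero
    return 100 * np.mean(np.abs((actual - predicted) / actual))

def rmsep(actual, predicted):
    return 100 * np.sqrt(np.mean(((actual - predicted) / actual) ** 2))

def ac_error_rate(actual, predicted, p=0.1):
    # percentage of predictions that fall within p (e.g. 10%) of the true value
    return 100 * np.mean(np.abs(actual - predicted) <= p * np.abs(actual))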
If we are trying to forecast the next value, we have several choices.


The gold standard for this kind of problem is the ARIMA model. The core idea behind ARIMA is to break the time series into different components, such as a trend component, a seasonality component, etc., and carefully estimate a model for each component. See Using R for Time Series Analysis for a good overview.

However, ARIMA has an unfortunate problem: it needs an expert (a good statistics degree or a grad student) to calibrate the model parameters. If you want to do multivariate ARIMA, that is, to factor in multiple fields, then things get even harder.

Fortunately, R has a function called auto.arima, which estimates the model parameters for you. I tried that out.

library(forecast)

x_train <- ...  # training portion of the household power series
x_test  <- ...  # held-out portion used for evaluation

# one-minute data, so frequency is 525,600 observations per year
powerTs <- ts(x_train, frequency = 525600, start = c(2006, 503604))
arimaModel <- auto.arima(powerTs)
powerforecast <- forecast(arimaModel, h = length(x_test))  # dispatches to the ARIMA forecaster

You can find a detailed discussion of how to do ARIMA from the links given above. I only used 200k records from the data set, as our focus is mid-size data sets. It gave a MAPE of 19.5.

Temporal Features

The second approach is to come up with a list of features that capture the temporal aspects so that the autocorrelation information is not lost. For example, stock market technical analysis uses features built using moving averages. In the simple case, an analyst will track 7-day and 21-day moving averages and take decisions based on crossover points between those values.

Following are some feature ideas

  1. Collection of moving averages/medians (e.g. 7, 14, 30, 90 day)
  2. Time since certain event
  3. Time between two events
  4. Mathematical measures such as Entropy, Z-scores etc.
  5. X(t) transformed via functions such as power(X(t), n), cos(X(t)/k), etc.

A common trick is to feed those features to techniques like Random Forest and Gradient Boosting, which can provide the relative feature importance. We can use that information to keep good features and drop ineffective features.
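As a rough sketch of that trick (the synthetic series, window lengths, and column names below are placeholders of mine, not the household data set):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({"value": np.sin(np.arange(2000) / 10) + rng.normal(0, 0.1, 2000)})

feats = pd.DataFrame(index=df.index)
for w in (7, 14, 30, 90):
    feats[f"ma_{w}"] = df["value"].rolling(w).mean()     # moving averages as temporal features
feats["target"] = df["value"].shift(-1)                  # the next value we want to predict
feats = feats.dropna()

X, y = feats.drop(columns="target"), feats["target"]
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, score in sorted(zip(X.columns, model.feature_importances_), key=lambda t: -t[1]):
    print(name, round(score, 3))                         # keep the strong features, drop the weak ones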

I will not dwell too much on this topic. However, with some hard work, this method has been shown to give very good results. For example, most forecasting competitions are won using this method.

The downside, however, is that crafting features is a black art. It takes lots of work and experience to craft the features.

Rolling Windows based Regression

Now we get to the interesting part. It seems there is another method that gives pretty good results without a lot of hand-holding.

The idea is that to predict X(t+1), the next value in a time series, we feed not only X(t) but also X(t-1), X(t-2), etc. to the model. A similar idea has been discussed in Rolling Analysis of Time Series, although it is used to solve a different problem.

Let’s look at an example. Let’s say that we need to predict x(t+1) given X(t). Then the source and target variables will look like following.

The data set would look like the following after being transformed with a rolling window of three.

Then, we will use the above transformed data set with a well-known regression algorithm such as linear regression or Random Forest regression. The expectation is that the regression algorithm will figure out the autocorrelation coefficients from X(t-2) to X(t).
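Here is a minimal sketch of that transformation and fit, using scikit-learn's linear regression on a placeholder series (the window size, split, and data are my assumptions, not the actual experiment):

import numpy as np
from sklearn.linear_model import LinearRegression

def to_rolling_window(series, window):
    # each row holds the previous `window` values; the target is the value that follows them
    X = np.array([series[t - window:t] for t in range(window, len(series))])
    y = series[window:]
    return X, y

series = 10 + np.sin(np.arange(1000) / 20)            # placeholder time series
X, y = to_rolling_window(series, window=14)
split = int(len(X) * 0.8)                             # train on the past, test on the most recent part

model = LinearRegression().fit(X[:split], y[:split])
pred = model.predict(X[split:])
print("MAPE:", 100 * np.mean(np.abs((y[split:] - pred) / y[split:])))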

For example, with the above data set, applying linear regression on the transformed data set using a rolling window of 14 data points provided the following results. Here AC_errorRate considers a forecast to be correct if it is within 10% of the actual value.

LR AC_errorRate=44.0 RMSEP=29.4632 MAPE=13.3814 RMSE=0.261307

This is pretty interesting, as it beats auto ARIMA right away (MAPE 19.5 for auto ARIMA vs 13.4 with rolling windows).

So far we have only tried linear regression. I then tried out several other methods, and the results are given below.

Linear regression still does pretty well; however, it is weak at keeping the error rate within 10%. Deep learning is better on that aspect, but it took some serious tuning. Please note that tests are done with 200k data points, as my main focus is on small data sets.

I got the best results from a neural network with 2 hidden layers of 20 units each, zero dropout or regularisation, the "relu" activation function, and the Adam optimizer (lr=0.001) running for 500 epochs. The network is implemented with Keras. While tuning, I found articles [1] and [2] pretty useful.
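For reference, a Keras sketch along those lines (the data here is synthetic; with tf.keras the Adam argument is learning_rate rather than the older lr):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 14))      # stand-in for the rolling window matrix
y_train = X_train.mean(axis=1)             # stand-in for the next values

model = Sequential([
    Dense(20, activation="relu", input_dim=14),
    Dense(20, activation="relu"),
    Dense(1),
])
model.compile(optimizer=Adam(learning_rate=0.001), loss="mse")
model.fit(X_train, y_train, epochs=500, batch_size=32, verbose=0)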

Then I tried out the same idea with a few more datasets.

  1. Milk production Data set ( small < 200 data points)
  2. Bike sharing Data set (about 18,000 data points)
  3. USD to Euro Exchange rate ( about 6500 data points)
  4. Apple Stocks Prices (about 13000 data points)

Forecasts are done as univariate time series; that is, we only consider time stamps and the value we are forecasting. Any missing value is imputed using padding (using the most recent value). For all tests, we used a window of size 14 as the rolling window.

The following table shows the results. Here, except for Auto.Arima, the other methods use a rolling window-based data set.

There is no clear winner. However, the rolling window method we discussed, coupled with a regression algorithm, seems to work pretty well.


We discussed three methods for time series next-value forecasts with medium-size data sets: ARIMA, using features to represent time effects, and rolling windows.

Among the three, the third method provides good results comparable with the auto ARIMA model, while needing minimal hand-holding by the end user.

Hence we believe that "Rolling Window-based Regression" is a useful addition to the forecaster's bag of tricks!

However, this does not discredit ARIMA, as with expert tuning it will do much better. At the same time, with hand-crafted features, methods two and three will also do better.

One crucial consideration is picking the size of the window for rolling window method. Often we can get a good idea from the domain. Users can also do a parameter search on the window size.

Following are a few things that need further exploration.

  • Can we use RNN and CNN? I tried RNN, but could not get good results so far.
  • It might be useful to feed other features such as time of day, day of the week, and also moving averages of different time windows.


  1. An overview of gradient descent optimization algorithms
  2. CS231n Convolutional Neural Networks for Visual Recognition

Understanding Causality and Big Data: Complexities, Challenges, and Tradeoffs

image credit: Wikipedia, Amitchell125

“Does smoking cause cancer?”

We have heard that a lot of smokers have lung cancer. However, can we mathematically tell that smoking causes cancer?

We can look at cancer patients and check how many of them smoke. We can look at smokers and check whether they develop cancer. Let's assume that the answers come up 100%. That is, hypothetically, we can see a 1–1 relationship between smokers and cancer.

Ok great, can we claim that smoking causes cancer? Apparently it is not easy to make that claim. Let's assume that there is a gene that causes cancer and also makes people like to smoke. If that is the case, we will see the 1–1 relationship between cancer and smoking. In this scenario, cancer is caused by the gene. That means there may be an innocent explanation for the 1–1 relationship we saw between cancer and smoking.

This example shows two interesting concepts: correlation and causality from statistics, which play a key role in Data Science and Big Data. Correlation means that we will see two readings behave together (e.g. smoking and cancer) while causality means one is the cause of the other. The key point is that if there is a causality, removing the first will change or remove the second. That is not the case with correlation.

Correlation does not mean Causation!

This difference is critical when deciding how to react to an observation. If there is causality between A and B, then A is responsible. We might decide to punish A in some way or we might decide to control A. However, correlation does not warrant such actions.

For example, as described in the post The Blagojevich Upside, the state of Illinois found that having books at home is highly correlated with better test scores, even if the kids have not read them. So they decided to distribute books. In retrospect, we can easily find a common cause. Having books at home could be an indicator of how studious the parents are, which will help with better scores. Sending books home, however, is unlikely to change anything.

You see correlation without causality when there is a common cause that drives both readings. This is a common theme of the discussion. You can find a detailed discussion on causality in the talk “Challenges in Causality” by Isabelle Guyon.

Can we prove Causality?

Great, so how can we show causality? Causality is measured through randomized experiments (a.k.a. randomized trials or A/B tests). A randomized experiment selects samples and randomly breaks them into two groups called the control and the variation. Then we apply the cause (e.g. send a book home) to the variation group and measure the effects (e.g. test scores). Finally, we measure the causality by comparing the effect in the control and variation groups. This is how medications are tested.

To be precise, if the error bars of the two groups do not overlap, then there is causality.
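As a toy illustration of that comparison (synthetic test scores and a two-sample t-test from SciPy; the groups and effect size are made up):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(70, 10, 500)      # test scores of kids who did not get books
variation = rng.normal(73, 10, 500)    # test scores of kids who got books sent home

t_stat, p_value = stats.ttest_ind(variation, control)
print("mean difference:", variation.mean() - control.mean(), "p-value:", p_value)
# a tiny p-value means the difference is unlikely to be chance; because group membership
# was assigned randomly, the difference can be attributed to the books themselves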

However, that is not always practical. For example, if you want to prove that smoking causes cancer, you need to first select a population, place them randomly into two groups, make half of them smoke, and make sure the other half does not smoke. Then wait for something like 50 years and compare.

Did you see the catch? It is not good enough to compare smokers and non-smokers, as there may be a common cause, like the gene, that makes them do so. To prove causality, you need to randomly pick people and ask some of them to smoke. Well, that is not ethical. So this experiment can never be done. Actually, this argument has been used before.

This can get funnier. If you want to prove that greenhouse gasses cause global warming, you need to find another copy of Earth, apply greenhouse gasses to one, and wait a few hundred years!!

To summarize, causality can sometimes be very hard to prove, and you really need to differentiate between correlation and causality.

Following are examples when causality is needed.

  • Before punishing someone
  • Diagnosing a patient
  • Measuring the effectiveness of a new drug
  • Evaluating the effect of a new policy (e.g. a new tax)
  • Changing a behavior

Big Data and Causality

Most big data datasets are observational data collected from the real world. Hence, there is no control group. Therefore, most of the time all you can show is correlation, and it is very hard to prove causality.

There are two reactions to this problem.

First, “Big data guys do not understand what they are doing. It is stupid to try to draw conclusions without a randomized experiment”.

I find this view lazy.

Obviously, there is a lot of interesting knowledge in observational data. If we can find a way to use it, that will let us use these techniques in many more applications. We need to figure out a way to use it and stop complaining. If current statistics does not know how to do it, we need to find a way.

Second is “forget causality! correlation is enough”.

I find this view blind.

Playing ostrich does not make the problem go away. This kind of crude generalization makes people do stupid things and can limit the adoption of Big Data technologies.

We need to find the middle ground!

When do we need Causality?

The answer depends on what we are going to do with the data. For example, if we are just going to recommend a product based on the data, chances are that correlation is enough. However, if we are taking a life-changing decision or making a major policy decision, we might need causality.

Let us investigate both types of cases.

Correlation is enough when stakes are low, or when we can later verify our decision. Following are a few examples.

  1. When stakes are low ( e.g. marketing, recommendations) — when showing an advertisement or recommending a product to buy, one has more freedom to make an error.
  2. As a starting point for an investigation — correlation is never enough to prove someone is guilty, however, it can show us useful places to start digging.
  3. Sometimes, it is hard to know what things are connected, but easy to verify the quality of a given choice. For example, if you are trying to match candidates to a job or decide on good dating pairs, correlation might be enough. In both these cases, given a pair, there are good ways to verify the fit.

There are other cases where causality is crucial. Following are a few examples.

  1. Find a cause for disease
  2. Policy decisions (would a $15 minimum wage be better? would free health care be better?)
  3. When stakes are too high ( Shutting down a company, passing a verdict in court, sending a book to each kid in the state)
  4. When we are acting on the decision ( firing an employee)

Even in these cases, correlation might be useful to find good experiments that you want to run. You can find factors that are correlated and design experiments to test causality, which will reduce the number of experiments you need to do. In the book example, the state could have run an experiment by selecting a population, sending the book to half of them, and looking at the outcome.

In some cases, you can build your system to inherently run experiments that let you measure causality. Google is famous for A/B testing every small thing, down to the placement of a button and the shade of a color. When they roll out a new feature, they select a population, roll out the feature for only part of that population, and compare the two.

So in any of the cases, correlation is pretty useful. However, the key is to make sure that the decision makers understand the difference when they act on the results.

Closing Remarks

Causality can be a pretty hard thing to prove. Since most big data is observational data, often we can only show correlation, but not causality. If we mix up the two, we can end up doing stupid things.

The most important thing is having a clear understanding at the point when we act on the decisions. Sometimes, when stakes are low, correlation might be enough. In some other cases, it is best to run an experiment to verify our claims. Finally, some systems might warrant building experiments into the system itself, letting you draw strong causality results. Choose wisely!


Walking the Microservices Path towards Loose coupling? Look out for these Pitfalls

(image credit: Wiros from Barcelona, Spain)

Microservices are the new architecture style of building systems using simple, lightweight, loosely coupled services that can be developed and released independently of each other.

If you need to know the basics, read Martin Fowler's post. If you would like to compare it with SOA, watch Don Ferguson's talk. Also, Martin Fowler has written about the "trade-offs of microservices" and "when it is worth doing microservices", which will let you decide when it is useful.

Let's say that you have heard, read, and got convinced about microservices. If you are trying to follow the microservices architecture, there are a few practical challenges. This post discusses how you can handle some of those challenges.

No Shared Database(s)

Each microservice should have its own database, and data MUST NOT be shared via a database. This rule removes a common cause of tight coupling between services. For example, if two services share the same database, the second service will break if the first service changes the database schema. So the teams will have to talk to each other.

I think this rule is a good one, and it should not be broken. However, there is a problem. If two services share the same data (e.g. bank account data, a shopping cart) and need to update the data transactionally, the simplest approach is to keep both in the same database and use database transactions to enforce consistency. Any other solution is hard.

Solution 1: If updates happen only in one microservice (e.g. a loan approval process checks the balance), you can use asynchronous messaging (a message queue) to share data.

Solution 2: If updates happen in both services, you can either consider merging the two services or use transactions. The post Microservices: It's not (only) the size that matters, it's (also) how you use them describes the first option. The next section describes transactions in detail.

Handling Consistency of Updates

You will run into scenarios where you need to update the data from multiple places. We discussed an example in the earlier section. (If you update the data from only one place, we have already discussed how to do it.)

Please note that this use case is typically solved using transactions. However, you can sometimes solve the problem without transactions. There are several options.

Put all updates to the same Microservice

When possible, avoid multiple updates crossing microservice boundaries. However, by doing this you might sometimes end up with a few monoliths, or worse, one big monolith. Hence, sometimes this is not possible.

Use Compensation and other lesser Guarantees

As the famous post “Starbucks Does Not Use Two-Phase Commit” describes, the normal world works without transactions. For example, the barista at Starbucks does not wait until your transaction is completed. Instead, they handle multiple customers at the same time and compensate for any erroneous conditions explicitly. You can do the same, given you are willing to do a bit more work.

One simple idea is that if an operation failed, you go and compensate. For example, if you are shipping a book, first deduct the money, then ship the book. If the shipping failed, you go and return the money.
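A minimal sketch of that compensation logic (the payment and shipping calls are placeholders, not a real API):

class ShippingError(Exception):
    pass

def charge(customer, amount):           # placeholder call to the payment service
    return "payment-123"

def refund(payment_id):                 # placeholder compensation call
    print("refunded", payment_id)

def ship(book, address):                # placeholder call to the shipping service
    raise ShippingError("no courier available")

def order_book(customer, amount, book, address):
    payment_id = charge(customer, amount)     # step 1: deduct the money
    try:
        ship(book, address)                   # step 2: ship the book
    except ShippingError:
        refund(payment_id)                    # shipping failed, so return the money
        raise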

Also, sometimes you can settle for eventual consistency or a timeout. Another simple idea is to give the user a button to forcefully refresh the page if they can tell that it is outdated. At other times, you bite the bullet and settle for lesser consistency (e.g. Vogels' post on eventual consistency is a good starting point).

Finally, Life Beyond Distributed Transactions: An Apostate’s Opinion is a detailed discussion on all the tricks.

Having said that, there are some use cases where you must do transactions to get correct results, and those MUST use transactions. See Microservices and transactions: an update. Weigh the pros and cons and choose wisely.

Microservice Security

The old approach is for the service to authenticate by calling the database or an identity server when it receives a request.

You can replace the identity server with a microservice. That, in my opinion, leads to a big complicated dependency graph.

Instead, I like the token-based approach depicted by the following figure. The idea is described in the book “Building Microservices”. Here the client (or a gateway) first talks to an identity/SSO server, which authenticates the user and issues a signed token that describes the user and his or her roles (e.g. you can do this with SAML or OpenID Connect). Each microservice verifies the token and authorizes the calls based on the user roles described in the token. For example, with this model, for the same query, a user with the role “publisher” might see different results than a user with the role “admin” because they have different permissions.
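For illustration, a sketch of the check each microservice might do on such a token, here a JWT verified with the PyJWT library (the claim names and role model are my assumptions, not something prescribed by the book):

import jwt  # PyJWT

IDENTITY_SERVER_PUBLIC_KEY = "..."   # published by the identity/SSO server

def authorize(token, required_role):
    # verify the signature and read the user and roles baked into the token
    claims = jwt.decode(token, IDENTITY_SERVER_PUBLIC_KEY, algorithms=["RS256"])
    if required_role not in claims.get("roles", []):
        raise PermissionError("user lacks role: " + required_role)
    return claims   # e.g. {"sub": "alice", "roles": ["publisher"]}

# inside a request handler: authorize(bearer_token, "publisher")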

You can find more information about this approach from How To Control User Identity Within Microservices?.

Microservice Composition

Here, “composition” means how we can connect multiple microservices into one flow to deliver what the end user needs.

Most compositions with SOA looked like the following. The idea is that there is a central server that runs the workflow.

Use of an ESB with microservices is discouraged (e.g. Top 5 Anti-ESB Arguments for DevOps Teams). You can also find some counterarguments in Do Good Microservices Architectures Spell the Death of the Enterprise Service Bus?

I do not plan to get into the ESB fight in this post. However, I want to discuss whether we need a central server to do microservices composition. There are several ways to do microservices composition.

Approach 1: Drive the Flow from the Client

The following figure shows an approach to composing microservices without a central server. The client browser handles the flow. The post Domain Service Aggregators: A Structured Approach to Microservice Composition is an example of this approach.

This approach has several problems.

  1. If the client is behind a slow network, which is the most common case, the execution will be slow. This is because multiple calls now need to be triggered by the client.
  2. It might add security concerns (I could hack my app to give myself a loan).
  3. The above example considers a website. However, most complex compositions often come from other use cases. So the general applicability of composition at the client to other use cases is yet to be demonstrated.
  4. Where do we keep the state? Can the client be trusted to keep the state of the workflow? Modeling state with REST is possible; however, it is complicated.


Approach 2: Choreography

Driving the flow from a central place is called orchestration. However, that is not the only way to coordinate multiple partners to carry out some work. For example, in a dance, there is no one person directing the performance. Instead, each dancer follows whoever is near to her and syncs up. Choreography applies the same idea to business processes.

A typical implementation includes an eventing system, where each participant in the process listens to different events and carries out his or her part. Each action generates asynchronous events that trigger participants downstream. This is the programming model used by environments like RxJava or Node.js.

For example, let's assume that a loan process includes a request, a credit check, a check for other outstanding loans, manager approval, and a decision notification. The following picture shows how to implement this using choreography. The request will be placed in a queue. It will be picked up by the next actor, who will put his results into the next queue. The process will continue until it has completed.
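A toy sketch of that queue-driven style (in-process queues stand in for a real broker such as Kafka or RabbitMQ; the stages follow the loan example):

import queue

credit_q, approval_q, notify_q = queue.Queue(), queue.Queue(), queue.Queue()

def credit_checker():                       # listens on one queue, emits to the next
    loan = credit_q.get()
    loan["credit_ok"] = loan["score"] > 600
    approval_q.put(loan)

def manager_approval():
    loan = approval_q.get()
    loan["approved"] = loan["credit_ok"] and loan["amount"] < 50_000
    notify_q.put(loan)

def notifier():
    loan = notify_q.get()
    print("loan", loan["id"], "approved" if loan["approved"] else "rejected")

credit_q.put({"id": 1, "score": 700, "amount": 20_000})   # the request enters the first queue
for actor in (credit_checker, manager_approval, notifier):
    actor()   # in a real system each actor runs as its own service, triggered by its queue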

Just like a dance needs practice, choreography is complicated. For example, you do not know when the process has finished, nor will you know if an error has happened or if the process is stuck. Choreography needs a monitoring system to track progress and recover from or notify about errors.

On the other hand, the advantage of choreography is that it creates systems that are much more loosely coupled. For example, often you can add a new actor to the process without changing the other actors. You can find more information in Scaling Microservices with an Event Stream.

Centralized Server

The last but simplest option is a centralized server (a.k.a. orchestration).

SOAs often implemented this using two methods: an ESB or business processes. Microservice folks propose an API gateway instead (e.g. Microservices: Decomposing Applications for Deployability and Scalability). I guess an API gateway is more lightweight and uses technologies like REST/JSON. However, in a pure architectural sense, all of those use the orchestration style.

Another variation of the centralized server is “backends for frontends” (BFF), which builds a server-side API per client type (one for desktop, one for iOS, etc.). This model creates a different API for each client type, optimized for each use case. See the pattern Backends For Frontends for more information.

I would suggest not going crazy with all the options here and starting with the API gateway, as that is the most straightforward approach. You can switch to more complicated options as the need arises.

Avoid Dependency Hell

We do microservices to make it possible for each service to be released and deployed independently. To do that, you must avoid dependency hell.

Let's consider a microservice "A" that has the API "A1" and has upgraded to API "A2". Now there are two cases.

  1. Microservice B might send messages intended for A1 to A2. This is backward compatibility.
  2. Microservice A might have to revert back to A1, while microservice C continues to send messages intended for A2 to A1. This is forward compatibility.

You must handle the above scenarios somehow to let the microservices evolve and be deployed independently. If not, all your effort will be wasted.

Often, handling these cases is a matter of adding optional parameters and never renaming or removing existing parameters. More complicated scenarios, however, are possible.
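In code, that usually boils down to tolerant readers that default missing fields and ignore unknown ones (the field names here are invented for illustration):

def handle_order(payload: dict):
    # A1 clients send {"item", "qty"}; A2 added an optional "currency" field
    item = payload["item"]
    qty = payload["qty"]
    currency = payload.get("currency", "USD")   # defaulting keeps old callers working
    # extra, unrecognised fields are simply ignored, so newer callers do not break this service
    return item, qty, currency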

The post “Taming Dependency Hell” within Microservices with Michael Bryzek discusses this in detail. Ask HN: How do you version control your microservices? is also another good source.

Finally, backward and forward compatibility support should be bounded by time. For example, you can have a rule that no microservice should depend on APIs that are more than three months old. That would let microservice developers eventually drop some of the old code paths.

Finally, I would like to rant a bit about how your dependency graph should look like in a microservices architecture.

One option is freely invoking other microservices whenever needed. That will create a spaghetti architecture from the pre-ESB era. I am not a fan of that model.

The other extreme is saying that microservices should not call other microservices, and all connections should be done via the API gateway or a message bus. This will lead to a one-level tree. For example, instead of microservice A calling B, we bring the result from microservice A to the gateway, which will call B with that result. This is the orchestration model. Most of the business logic will now live in the gateway. Yes, this makes the gateway fat.

My recommendation is either to go for the orchestration model or to do the hard work of implementing choreography properly. Yes, I am asking you not to do the spaghetti.


The goal of microservices is loose coupling. A carefully designed microservice architecture lets you implement a project using a set of microservices, where each is managed, developed, and released independently.

When you design with microservices, you must keep your eye on the prize, which is "loose coupling". There are quite a few challenges, and this post answered the following questions.

  1. How can I handle scenarios that need to share data between two microservices?
  2. How can I evolve microservices API while keeping loose coupling?
  3. How to handle security?
  4. How to compose microservices?

Thanks! I would love to hear your thoughts.

Value Proposition of Big Data after a Decade

Big data is an umbrella term for many technologies: search, NoSQL, distributed file systems, batch and realtime processing, and machine learning (data science). These different technologies are developed and proven to various degrees. After 10 years, is it real? Following are a few success stories of what big data has done.

  1. Nate Silver predicted outcomes of 49 of the 50 states in the 2008 U.S. Presidential election
  2. Money Ball ( Baseball drafting)
  3. Cancer detection from biopsy cells (big data found 12 tell-tale patterns while doctors only knew about nine)
  4. Bristol-Myers Squibb reduced the time it takes to run clinical trial simulations by 98%
  5. Xerox used big data to reduce the attrition rate in its call centre by 20%.
  6. Kroger Loyalty programs ( growth in 45 consecutive quarters)

As these examples show, big data can indeed work. Could it work for you? Let's explore this a bit.

The premise of big data goes as follows.

If you collect data about your business and feed it to a Big Data system, you will find useful insights that will provide a competitive advantage — (e.g. Analysis of data sets can find new correlations to “spot business trends, prevent diseases, combat crime and so on”. [Wikipedia])

When we say Big Data will make a difference, the underlying assumption is that the way we and our organisations operate is inefficient.

This means Big Data is an optimization technique. Hence, you must know what is worth optimizing. If your boss asked you to make sure the organization is using big data, doing "Big Data washing" is easy.

  1. Publish or collect the data you can with a minimal effort
  2. Do a lot of simple aggregations
  3. Figure out what data combinations makes prettiest pictures
  4. Throw in some machine learning algorithms, predict something but don’t compare
  5. Create a cool dashboard and do a cool demo. Claim that you are just scratching the surface!!

However, adding value to your organization through big data is not that easy. This is because insights are not automatic. Insights are possible only if we have the right data, we look at the right place, such insights exist, and we actually find them.

Making a difference requires you to understand what is possible with big data and what its tools are, as well as the pain points in your domain and organization. The following picture shows some of the applications of big data within an organization.

The first step is asking, what are some of those applications that can make a difference for your organization.

The next step is understanding tools in “Big Data toolbox”. They come in many forms.

KPI (Key Performance Indicators) — People used to take canaries into coal mines. Since those small birds are very sensitive to the oxygen level in the air, if they got knocked out, you knew you needed to be running out of the mine. KPIs are canaries for your organization. They are numbers that can give you an idea about the performance of something — e.g. GDP, per capita income, and the HDI index for a country; company revenue, lifetime value of a customer, and revenue per square foot (in the retail industry). Chances are your organization or your domain has already defined them. The idea is to use Big Data to monitor the KPIs.

Dashboards — Think about a car dashboard. It gives you an idea about the overall system at a glance. It is boring when all is good, but it grabs attention when something is wrong. However, unlike car dashboards, big data dashboards support drilling down to find the root cause.

Alerts — Alerts are notifications (sent via email, SMS, pager, etc.). Their goal is to give you peace of mind by not having to check all the time. They should be specific, infrequent, and have very low false positives.

Sensors — Sensors collect data and make them available to the rest of the system. They are expensive and time-consuming to install.

Analytics — Analytics take decisions. They come in four forms: batch, interactive, real-time, and predictive.

  • Batch Analytics — process the data that resides on disk. If you can wait (e.g. more than an hour) for data to be available, this is what you use.
  • Interactive Analytics — used by a human to issue ad-hoc queries and understand a dataset. Think of it as having a conversation with the data.
  • Realtime Analytics — used to detect something quickly, within a few milliseconds to a few seconds. Realtime analytics are very powerful in detecting conditions over time (e.g. football analytics). Alerts are implemented through realtime analytics.
  • Predictive Analytics — learns a solution from examples. For example, it is very hard to write a program to drive a car, because there are too many edge conditions. We solve that kind of problem by giving lots of examples and asking the computer to figure out a program that solves the problem (which we call a model). Two common forms are predicting the next value (e.g. electricity load prediction) and predicting a category (e.g. is this email SPAM?).

Drill down — To make decisions, operators need to see the data in context and drill down into the details to understand the root cause. The typical model is to start from an alert or a dashboard, see the data in context (other transactions around the same time, what the same user did before and after, etc.), and then let the user drill down. For example, see the WSO2 Fraud Detection Solution Demo.

The process of deriving insight from the data, using above tools, looks like following.

Here different roles work together to explore data, understand data, to define KPIs, create dashboards, alerts etc.

In this process, keeping the system running is a key challenge. This includes DevOps challenges, integrating data continuously, updating models, and getting feedback about the effectiveness of decisions (e.g. the accuracy of fraud detection). Hence doing things in production is expensive.

On the other hand, “doing it once” is cheap. Hence, you should first try your scenarios in an ad-hoc manner (hire some expertise if you must) and make sure they can add value to the organization before setting up a system that does it every day.

Actionable Insights are the Key!!

Insights that you generate must be actionable. That means several things.

  1. The information you share is significant and warrants attention, and it is presented with its ramifications (e.g. more than two technical issues would lead the customer to churn)
  2. Decision makers can identify the context associated with the insight (e.g. operators can see the history of the customers who qualify)
  3. Decision makers can do something about the insight (e.g. they can work with customers to reassure them and fix the issue)

For each piece of information you show the user, think hard: “why am I showing them this?”, “what can they do with this information?”, and “what other information can I show to help them understand the context?”.

Where to Start?

Big Data projects can take many forms.

  1. Use an existing Dataset: I already have a data set, and list of potential problems. I will use Big data to solve some of the problems.
  2. Fix a known problem: find a problem, collect data about it, analyse, visualize, build a model, and improve. Then build a dashboard to monitor it.
  3. Improve Overall Process: Instrument processes ( start with most crucial parts), find KPIs, analyze and visualize the processes, and improve
  4. Find Correlations: Collect all available data, data mine the data or visualize, find interesting correlations.

My recommendation is to start with #2: fix a known problem in the organization. That is the least expensive, and it will let you demonstrate the value of big data right away.

Finally, the following are key take away points.

  • Big Data provide a way to optimize. However, blind application does not guarantee success.
  • Learn the tools in the Big Data toolbox: KPIs, analytics (batch, real-time, interactive, predictive), visualizations, dashboards, alerts, and sensors.
  • Start small. Try out with data sets before investing in a system
  • Find a high impact problem and make it work end to end

Understanding CEP, Stream Processing, and their Implementations

Real-time analytics technologies come in many flavors, such as Apache Storm, streaming analytics, and complex event processing. I am sure you have heard about the first, likely the second, and maybe the third. Have you heard about the technology called "Complex Event Processing" (CEP)? If you follow this space, you might have heard that people believe CEP will play a key role in IoT use cases. However, Storm and Spark Streaming are much more widely known than CEP.

So what is this CEP anyway? In this post, I try to explain CEP and streaming analytics, and to compare and contrast them. I will describe the current status (as of 2015) as opposed to giving a definition. If you are looking for a definition, the best source would be What's the Difference Between ESP and CEP?

As the above picture shows, technically CEP is a subset of event stream processing. Asking for the difference between CEP and stream processing, however, is the wrong question, because both CEP engines and stream processing engines do more than their names suggest and trespass into the other side.

The right question is "what is the difference between CEP engines and stream processing engines?" Stream processing engines and CEP engines used to be pretty different, and they come from very different backgrounds. The use cases they target and the issues they choose to handle or not handle were different.

Stream processing engines let you create a processing graph and inject events into it. Each operator processes events and sends them to the next operator. In most stream processing engines, like Storm, S4, etc., users have to write code to create the operators, wire them up in a graph, and run them. Then the engine runs the graph in parallel using many computers. Examples include Apache Storm, Apache Flink, and Apache Samza.

In contrast, CEP engines let users write queries using a higher-level query language. CEP engines were first created for stock market use cases, where they must generate a response within milliseconds. Furthermore, CEP engines have built-in operators such as time windows and temporal event sequences integrated into their query language (see Patterns for Streaming Realtime Analytics).

It is worth noting that these differences do not stem from definitions of CEP or stream processing. Rather, they are a by-product of history and use cases they had to handle. This is the reason that many find the difference between CEP and Stream Processing confusing.
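To make the point about built-in operators concrete, here is roughly what a one-minute sliding time window average looks like when hand-rolled as user code; in a CEP query language the same thing is a single window clause:

from collections import deque

class TimeWindowAverage:
    # keeps the events of the last `window_seconds` and exposes their running average
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()                   # (timestamp, value) pairs

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        while self.events[0][0] < timestamp - self.window:
            self.events.popleft()               # expire events that fell out of the window
        return sum(v for _, v in self.events) / len(self.events)

w = TimeWindowAverage(60)
for t, price in [(0, 10.0), (30, 12.0), (90, 8.0)]:
    print(t, w.add(t, price))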

Hence, let's focus on the differences between the two types of engines. Following are the key differences between CEP and stream processing engines.

  1. Stream processing engines are distributed and parallel by design. They support large computations across 10-100s of nodes, as opposed to CEP engines, which have a centralized architecture typically using two or a few nodes.
  2. Stream processing engines force you to write code, and often they do not have higher-level operators such as windows, joins, and temporal patterns. In contrast, CEP engines provide high-level languages and support high-level operators. This difference is similar to the relationship between MapReduce and Hive SQL scripts.
  3. Due to their stock market history, CEP engines are tuned for low latency. Often they respond within a few milliseconds and sometimes with sub-millisecond latency. In contrast, most stream processing engines take close to a second to generate results.
  4. Stream processing engines stress reliable message processing, often consuming data from a queue such as Kafka. In contrast, CEP engines often receive and process data in memory, and when a failure happens, they often choose to throw away failed events and continue. This behaviour, however, has already changed: most CEP engines now support reliable processing of data from a queue such as Kafka.

Let us look at the history of both.

CEP engines have been around for a long time. Their history goes back to the 90s (see CEP Market players - end of 2014 - from Paul Vincent). They were used in several real-world use cases. However, they were niche and expensive. Stream processing systems come from the Aurora and Borealis research projects (2005-2008).

In the aftermath of Big Data taking off around 2012-2013, people started to look for a streaming analytics solution similar to Hadoop. Apache Storm was created at that time. It mirrored the MapReduce model, where you write some code and attach it to a processing graph. It stole the limelight and outshone the CEP solutions.

Meanwhile, CEP was pretty much excluded from the spotlight. Stream processing engines' programming models had direct parallels with the MapReduce model, which helped. (image credit: tambako flickr stream)

However, it is worth noting that analysts always paid attention to CEP. For example, CEP is mentioned in this 2008 Gartner report and has been mentioned ever since. CEP appeared in the Gartner hype cycles from 2012 to 2014 (all big data technologies were dropped from 2015, as they are no longer emerging technologies).

Now another trend, IoT, might bring CEP back into the spotlight and into our day to day lives. This is due to three main reasons.

  1. IoT data are time series data, where the data is autocorrelated. CEP is much better placed to handle them due to its temporal operators.
  2. Most IoT use cases connect directly with the real world. If you are to act on those insights, you need them very fast. CEP has an advantage in turnaround time.
  3. Most IoT use cases are complex and go beyond aggregating data. Those use cases need support for complex operators like time windows and temporal query patterns.

At the same time, traditional CEP cannot handle those IoT use cases in its current form. Most IoT use cases have very high event rates. Therefore, whatever event technology is used in those use cases needs to be able to scale up. Stream processing can scale much better than CEP.

At the same time, I believe it is a mistake to ignore the higher-level temporal operators introduced by CEP and to ask end users to write their own operators. You can find my thoughts in Patterns for Streaming Realtime Analytics and SQL-like Query Language for Real-time Streaming Analytics.

The good news is that the two technologies, CEP and stream processing, are merging and the differences are diminishing. Both can learn from the other: CEP needs to scale and process events reliably, while stream processing needs high-level languages and lower latencies. IBM InfoSphere, which is a stream processing engine, has had CEP-like operators for a long time. WSO2 CEP can now accept SQL-like queries and run on top of Apache Storm (more details). SQLstream is a CEP engine that is highly parallel. My belief is that we will end up with a combination of both, and we will all be better off for it.


Update: This post was featured in Software Engineering Daily blog.

Introduction to Anomaly Detection: Concepts and Techniques


Why Anomaly Detection?

Machine learning has four common classes of applications: classification, predicting the next value, anomaly detection, and discovering structure. Among them, anomaly detection detects data points that do not fit well with the rest of the data. It has a wide range of applications, such as fraud detection, surveillance, diagnosis, data cleanup, and predictive maintenance.

Although it has been studied in detail in academia, applications of anomaly detection have been limited to niche domains like banks, financial institutions, auditing, and medical diagnosis. However, with the advent of IoT, anomaly detection is likely to play a key role in IoT use cases such as monitoring and predictive maintenance.

This post explores what anomaly detection is and the different anomaly detection techniques, discusses the key ideas behind those techniques, and wraps up with a discussion on how to make use of the results.

Is it not just Classification?

The answer is yes if the following three conditions are met.

  1. You have labeled training data
  2. Anomalous and normal classes are balanced ( say at least 1:5)
  3. Data is not autocorrelated (that is, one data point does not depend on earlier data points; this often breaks in time series data)

If all of the above are true, we do not need anomaly detection techniques, and we can use an algorithm like Random Forests or Support Vector Machines (SVM).

However, often it is very hard to find training data, and even when you can find it, most anomalies are 1:1000 to 1:10^6 events, so the classes are not balanced. Moreover, most data, such as data from IoT use cases, is autocorrelated.

Another aspect is that false positives are a major concern, as we will discuss under acting on the results. Hence, the precision (given the model predicted an anomaly, how likely it is to be true) and recall (how many anomalies the model will catch) trade-offs are different from normal classification use cases. We will discuss this in detail later.

What is Anomaly Detection?

Anomalies or outliers come in three types.

  1. Point anomalies: an individual data instance can be considered anomalous with respect to the rest of the data (e.g. a purchase with a large transaction value)
  2. Contextual anomalies: a data instance is anomalous in a specific context, but not otherwise (an anomaly if it occurs at a certain time or in a certain region, e.g. a large spike in the middle of the night)
  3. Collective anomalies: a collection of related data instances is anomalous with respect to the entire data set, but the individual values are not. They have two variations.
    1. Events in an unexpected order (ordered, e.g. a broken rhythm in an ECG)
    2. Unexpected value combinations (unordered, e.g. buying a large number of expensive items)

In the next section, we will discuss in detail how to handle point and collective anomalies. Contextual anomalies are handled by focusing on segments of data (e.g. a spatial area, graphs, sequences, a customer segment) and applying collective anomaly techniques within each segment independently.

Anomaly Detection Techniques

Anomaly detection can be approached in many ways depending on the nature of the data and the circumstances. Following is a classification of some of those techniques.

Static Rules Approach

The simplest approach, and maybe the best one to start with, is using static rules. The idea is to identify a list of known anomalies and then write rules to detect them. Rule identification is done by a domain expert, by using pattern mining techniques, or by a combination of both.

Static rules are used with the hypothesis that anomalies follow the 80/20 rule, where most anomalous occurrences belong to a few anomaly types. If the hypothesis is true, then we can detect most anomalies by finding a few rules that describe those anomalies.

Implementing those rules can be done using one of three following methods.

  1. If they are simple and no inference is needed, you can code them using your favourite programming language
  2. If decisions need inference, then you can use a rule-based or expert system (e.g. Drools)
  3. If decisions have temporal conditions, you can use a Complex Event Processing System (e.g. WSO2 CEP, Esper)

Although simple, static rule based systems tend to be brittle and complex. Furthermore, identifying those rules is often a complex and subjective task. Therefore, statistical or machine learning based approaches, which automatically learn the general rules, are preferred to static rules.

When we have Training Data

Anomalies are rare under most conditions. Hence, even when training data is available, often there will be only a few dozen anomalies among millions of regular data points. Standard classification methods such as SVM or Random Forest will classify almost all data as normal, because doing that provides a very high accuracy score (e.g. the accuracy is 99.9% if anomalies are one in a thousand).

Generally, the class imbalance is solved using an ensemble built by resampling the data many times. The idea is to first create new datasets by taking all anomalous data points and adding a subset of normal data points (e.g. 4 times as many as the anomalous data points). Then a classifier is built for each data set using SVM or Random Forest, and those classifiers are combined using ensemble learning. This approach has worked well and produced very good results.
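A rough sketch of that resampling ensemble (the data is synthetic, and the number of models, the 1:4 ratio, and the classifier choice are arbitrary):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 5))
y = np.zeros(5000, dtype=int)
y[:10] = 1                                   # 10 anomalies among 5000 points (placeholder data)

anomaly_idx, normal_idx = np.where(y == 1)[0], np.where(y == 0)[0]
models = []
for _ in range(10):
    # every anomaly plus a fresh small sample of normal points (about 4x the anomalies)
    sample = rng.choice(normal_idx, size=4 * len(anomaly_idx), replace=False)
    idx = np.concatenate([anomaly_idx, sample])
    models.append(RandomForestClassifier(n_estimators=100).fit(X[idx], y[idx]))

# ensemble score: the average anomaly probability across the resampled classifiers
scores = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)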

If the data points are autocorrelated with each other, then simple classifiers would not work well. We handle those use cases using time series classification techniques or Recurrent Neural networks.

When there is no Training Data

If you do not have training data, it is still possible to do anomaly detection using unsupervised and semi-supervised learning. However, after building the model, you will have no idea how well it is doing, as you have nothing to test it against. Hence, the results of those methods need to be tested in the field before placing them in the critical path.

No Training Data: Point Anomalies

Point anomalies only have one field in the data set. We use percentiles to detect point anomalies in numeric data and histograms to detect point anomalies in categorical data. In either case, we find rare data ranges or field values from the data and flag them as anomalies if they happen again. For example, if the 99.9th percentile of my transaction value is $800, one can flag any transaction greater than that value as a potential anomaly. When building models, we often use moving averages instead of point values when possible, as they are much more stable against noise.
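A minimal sketch of the percentile idea for a numeric field (the transaction history below is synthetic; the 99.9 cut-off mirrors the example above):

import numpy as np

rng = np.random.default_rng(0)
history = rng.lognormal(mean=4.0, sigma=1.0, size=100_000)   # past transaction values (placeholder)
threshold = np.percentile(history, 99.9)                     # the "anything above $800 is suspect" idea

def is_point_anomaly(value):
    return value > threshold

print(round(threshold, 2), is_point_anomaly(2 * threshold))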

No Training Data: Univariate Collective Outliers 

Time series data are the best example of collective outliers in a univariate dataset. In this case, anomalies happen because values occur in an unexpected order. For example, the third heartbeat might be anomalous not because the values are out of range, but because they happen in the wrong order.


There are several approaches to handle these use cases.

Solution 1: Build a predictor and look for outliers using the residuals. This is based on the heuristic that values not explained by the model are anomalies. Hence we can build a model to predict the next value, and then apply percentiles on the error (predicted value minus actual value) as described before. The model can be built using regression, time series models, or recurrent neural networks.
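A compact sketch of Solution 1: fit a model on recent lags, then flag the points whose residual falls in the extreme tail (the series, window size, and percentile are arbitrary):

import numpy as np
from sklearn.linear_model import LinearRegression

series = 10 + np.sin(np.arange(2000) / 15) + np.random.default_rng(0).normal(0, 0.1, 2000)
window = 14
X = np.array([series[t - window:t] for t in range(window, len(series))])
y = series[window:]

model = LinearRegression().fit(X, y)
residuals = np.abs(y - model.predict(X))
cutoff = np.percentile(residuals, 99.5)            # "values the model cannot explain" are anomalies
anomalies = np.where(residuals > cutoff)[0] + window
print(len(anomalies), "suspect points")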

Solution 2: Markov chains and hidden Markov models can measure the probability of a sequence of events happening. This approach builds a Markov chain for the underlying process, and when a sequence of events has happened, we can use the Markov chain to measure the probability of that sequence occurring and use that to detect any rare sequences.

For example, let's consider credit card transactions. To model the transactions using Markov chains, let's represent each transaction using two values: the transaction value (L, H) and the time since the last transaction (L, H). Since a Markov chain's states have to be finite, we choose two levels, Low (L) and High (H), to represent the variable values. Then the Markov chain has the states LL, LH, HL, HH, and each transaction is a transition from one state to another. We can build the Markov chain using historical data and use the chain to calculate sequence probabilities. Then we can find the probability of any new sequence happening and mark rare sequences as anomalies. The blog post "Real Time Fraud Detection with Sequence Mining" describes this approach in detail.
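A toy sketch of that scoring (the states follow the LL/LH/HL/HH example; the training sequences are invented, and probabilities are estimated by simple counting with smoothing):

STATES = ["LL", "LH", "HL", "HH"]   # (transaction value, time since last transaction) buckets

def train_markov_chain(sequences, smoothing=1.0):
    counts = {a: {b: smoothing for b in STATES} for a in STATES}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in STATES} for a in STATES}

def sequence_probability(chain, seq):
    p = 1.0
    for a, b in zip(seq, seq[1:]):
        p *= chain[a][b]            # rare transitions drive the probability down
    return p

history = [["LL", "LL", "LH", "LL"], ["LL", "HL", "LL", "LL"]]   # past transaction sequences (placeholder)
chain = train_markov_chain(history)
print(sequence_probability(chain, ["HH", "HH", "HH"]))           # an unusually low score flags an anomaly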

No Training Data: Multivariate Collective Outliers ( Unordered)

Here the data has multiple readings but no ordering. For example, vitals collected from many people form such a multivariate but unordered dataset. In such data, a high temperature together with a slow heartbeat might be an anomaly, even though both the temperature and the heartbeat by themselves are in the normal range.

Approach 1: Clustering – the underlying assumption in the first approach is that if we cluster the data, normal data will belong to clusters while anomalies will not belong to any cluster or will belong to small clusters.

Then, to detect anomalies, we cluster the data and calculate the centroid and density of each cluster found. When we receive a new data point, we calculate the distance from the new data point to the known large clusters, and if it is too far, we flag it as an anomaly.

Furthermore, we can improve upon the above approach by first manually inspecting the ranges of each cluster, labelling each cluster as anomalous or normal, and using those labels while doing the anomaly check for a data point.
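A rough sketch of the clustering approach with k-means (synthetic vitals; the number of clusters and the 99th-percentile distance threshold would need tuning):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
normal = rng.normal(loc=[37.0, 70.0], scale=[0.3, 5.0], size=(5000, 2))   # temperature, heart rate (placeholder)

kmeans = KMeans(n_clusters=5, n_init=10).fit(normal)
dists = np.linalg.norm(normal - kmeans.cluster_centers_[kmeans.labels_], axis=1)
threshold = np.percentile(dists, 99.0)          # how far a "normal" point tends to sit from its centroid

def is_anomaly(point):
    nearest = np.linalg.norm(kmeans.cluster_centers_ - point, axis=1).min()
    return nearest > threshold                  # too far from every known cluster

print(is_anomaly(np.array([39.5, 45.0])))       # high temperature with a slow heartbeat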

Approach 2: Nearest neighbour techniques – the underlying assumption is that new anomalies are closer to known anomalies. This can be implemented by using the distance to the k nearest anomalies or the relative density of other anomalies near the new data point. While calculating the above, with numerical data we break the space into hypercubes, and with categorical data we break the space into bins using histograms. Both of these approaches are described in detail in the ACM Computing Surveys paper "Anomaly Detection: A Survey".

No Training Data: Multivariate Collective Outliers ( Ordered)

This class is the most general and considers ordering as well as value combinations. For example, consider a series of vital readings taken from the same patient. Some readings may be normal in combination but anomalous because the combinations happen in the wrong order. For example, given a reading that has blood pressure, temperature, and heartbeat frequency, each reading by itself may be normal, but not if it oscillates too fast in a short period of time.

Combine Markov chains and clustering – this method combines clustering and Markov chains by first clustering the data and then using the clusters as the states of a Markov chain. Clustering captures common value combinations, and the Markov chain captures their order.

Other Techniques

There are several other techniques that have been tried out, and following are some of them. Please see Anomaly Detection: A Survey for more details.

Information Theory: The main idea is that anomalies have high information content due to irregularities, and this approach tries to find a subset of data points that has highest irregularities.

Dimension Reduction: The main idea is that after applying dimension reduction, a normal data can be easily expressed  as a combination of dimensions while anomalies tend to create complex combinations.

Graph analysis: Some processes involve interactions between different players. For example, money transfers create a dependency graph among the participants. Flow analysis of such graphs might show anomalies. In some other use cases, such as insurance, stock markets, and corporate payment fraud, similarities between players' transactions might suggest anomalous behaviour. Using PageRank to Detect Anomalies and Fraud in Healthcare and the "New ways to detect fraud" white paper by Neo4j are examples of these use cases.

Comparing Models and Acting on Results


With anomaly detection, it is natural to think that the main goal is to detect all anomalies. However, that is often a mistake.

The book "Statistics Done Wrong" has a great example demonstrating the problem. Consider 10,000 patients, 9 of whom have breast cancer. There is a test (a model) that detects cancer and captures 80% of the patients who have cancer (true positives). However, it also says yes for 9% of healthy patients (false positives).

This can be represented with the following confusion matrix.

                          Actually Healthy    Actually Not Healthy
Predicted Healthy                     9091                       1
Predicted Not Healthy                  900                       8

In this situation, when the test says someone has cancer, that person is actually healthy about 99% of the time (900 of the 908 positive results are false). So the test is close to useless. If we insist on detecting all anomalies, we can easily create a similar situation.
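
Working through the numbers in the table makes the point obvious:

```python
# Precision of the test, taken directly from the confusion matrix above.
true_positives = 8
false_positives = 900
precision = true_positives / (true_positives + false_positives)
print(f"precision = {precision:.3f}")   # ~0.009, i.e. ~99% of positive results are wrong
```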

If you ignore this problem, it can cause harm in multiple ways.

  1. Reduced trust in the system – once people lose trust, it takes a lot of convincing and red tape to win it back.
  2. It might do more harm than good – in the above example, the emotional trauma and unnecessary tests might outweigh any benefits.
  3. It might be unfair (e.g. surveillance, arrests).

Hence, we must find a balance: detect what we can while keeping the model's accuracy within acceptable limits.

Another side of the same problem is that the model's output is only a suggestion for investigation, not evidence for incriminating someone. This is another facet of correlation vs. causality. Therefore, the results of the model must never be used as evidence; the investigator must find independent evidence of the problem (e.g. in credit card fraud).

For both these reasons, it is paramount that the investigator be able to see the anomaly in context, both to verify it and to find evidence that something is amiss. For example, in the WSO2 Fraud Detection solution, investigators can click on a fraud alert and see the data point in the context of other data, as shown below.


Furthermore, with techniques like static rules and unsupervised methods, it is hard to predict how many alerts they will generate. For example, it is not useful for a 10-person team to receive thousands of alerts. We can handle this problem by tracking percentiles of the anomaly score and only triggering an alert when the score falls in, say, the top 1%. If the set of scores is very large, we can use a percentile approximation technique such as t-digest.
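
A minimal sketch of this percentile-based rate limiting; the window size and the 99th percentile are assumptions, and for very large or unbounded streams an approximate structure such as t-digest could replace the explicit window.

```python
# Only alert when the anomaly score is in the top 1% of recent scores.
from collections import deque
import numpy as np

recent_scores = deque(maxlen=10_000)     # sliding window of recent anomaly scores

def should_alert(score):
    recent_scores.append(score)
    if len(recent_scores) < 1000:        # wait until there is enough history
        return False
    return score >= np.percentile(recent_scores, 99)
```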

Finally, we must pay attention to what investigators do with the alerts and keep improving their experience. For example, auto-silencing of repeated alerts and alert digests are ways to give investigators more control.

Tools & Datasets

Anomaly detection is mostly done with custom code and proprietary solutions. Hence, its applications have been limited to a few high-value use cases. If you are looking for an open source solution, the following are some options.

  1. WSO2 has been working on a Fraud Detection tool built on top of the WSO2 Data Analytics Platform (disclaimer: I am part of that team). It is free under the Apache Licence. You can find more information from
  2. Kale (and Thyme) by Etsy provides support for time series based anomaly detection. See
  3. There are several samples built on top of other products such as

Finally, there are only a few datasets in the public domain that can be used to test anomaly detection techniques. This limits the development of those techniques. The following are the ones I know about.

  1. KDD cup 99 intrusion detection dataset
  2. Single variable time series data sets by Numenta
  3. Breast Cancer dataset
  4. Yahoo Time Series Anomaly Detection Dataset

I think as a community we need to find more datasets as that will make it possible to compare and contrast different solutions.


In this post, we discussed anomaly detection, how it differs from standard machine learning, and different anomaly detection techniques.

We categorised anomalies into three classes: point anomalies, contextual anomalies, and collective anomalies. Then we discussed some of the techniques for detecting them. The following picture shows a summary of those techniques.


You can find a detailed discussion of most of these techniques in the ACM Computing Surveys paper "Anomaly Detection: A Survey". Finally, the post discussed several pitfalls of trying to detect all anomalies, and some tools.

Hope this was useful. If you have any thoughts, or would like to point out any major techniques that I did not mention, please drop a comment.

Image Credit: (CC Licence)


Thinking Deeply about IoT Analytics


A typical IoT system would have the following architecture.


As the picture depicts, sensors collect data and transfer them to a gateway, which in turn sends them to a processing system (an analytics cloud). The gateway may or may not summarize or preprocess the data.

The connection between the sensors and the gateway is usually via radio (e.g. ZigBee), BLE, Wi-Fi, or even a wired connection. Often, the gateway is a mobile phone.

The connection from the gateway to the analytics servers is via the Internet, a LAN, or a Wi-Fi connection, and it uses a higher-level protocol such as MQTT or CoAP (e.g. see IoT Protocols).

Since our focus is on IoT analytics, let's not drill into devices and connectivity. Assuming that part is done, how hard is IoT analytics? Is it just a matter of offloading the data into one of the IoT analytics platforms, or are there hidden surprises?

In this post, I try to answer those questions. Efforts under the theme of "Big data" have solved many IoT analytics challenges, especially the system challenges related to large-scale data management, learning, and data visualization. Data for "Big data", however, came mostly from computer-based systems (e.g. transaction logs, system logs, social networks, and mobile phones). IoT data, in contrast, will come from the natural world and will be more detailed, fuzzy, and large. The nature of the data, the assumptions, and the use cases differ between the old Big data world and the new IoT data. IoT analytics designers can build on top of big data, yet the work is far from done.

Let us look at a few things we need to worry about.

How fast do you need results?

Our design changes depending on how fast we need results from the data. This decision depends on our use cases. We should ask ourselves: does the value of our insights (results) degrade over time, and how fast? For example, if we are going to improve the design of a product using the data, we can wait days if not weeks. On the other hand, if we are dealing with stock markets and similar winner-takes-all use cases, milliseconds are a big deal.

Speed comes in several levels.

  • Few hours – send your data into a data lake and use a batch processing technology such as Hadoop or Spark.
  • Few seconds – send data into a stream processing system (e.g. Apache Storm or Apache Samza), an in-memory computing system (e.g. VoltDB, SAP HANA), or an interactive query system (e.g. Apache Drill) for processing.
  • Few milliseconds – send data to a Complex Event Processing system, where records are processed one by one and outputs are produced very quickly.

The following picture summarizes those observations.


Chances are we will have use cases that fall under more than one category, and then we will have to use multiple technologies.

How much data to keep?

Next, we should decide how much data to keep and in what form. It is a tradeoff between cost and the potential value of the data, plus the associated risks. Data is valuable: we see companies acquired just for their data, and Google and Facebook going to extraordinary lengths to access data. Furthermore, we might find a bug or an improvement in the current algorithm and want to go back and rerun it on old data. Having said that, all decisions must be made considering the big picture and current limits.

Following are our choices.

  • Keep all the data and save it to a data lake (the argument is that disk space is cheap).
  • Process all the data in a streaming fashion and keep none of it.
  • Keep a processed or summarized version of the data. However, you may not be able to recover all the original information from the summaries later.

The next question is where to do the processing and how much of that logic we should push towards the sensors. There are three options.

  • Do all processing at analytics servers
  • Push some queries into the gateway
  • Push some queries down to sensors as well.

The IoT community already has the technology to push logic to gateways. Most gateways are full-fledged computers or mobile phones, and they can run higher-level logic such as SQL-like CEP queries. For example, we have been working on placing a lightweight CEP engine into mobile phones and gateways. However, if you want to push code into the sensors themselves, in most cases you will have to write custom logic in a lower-level language like Arduino C. Another associated challenge is deploying, updating, and managing queries over time. If you choose to put custom low-level filtering code into sensors, I believe it will lead to deployment complexities in the long run.
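
As an illustration, the following sketch shows the kind of summarization and filtering a gateway might do before sending data upstream. The send_upstream function, the thresholds, and the window length are placeholders, not a real gateway API.

```python
# Gateway-side ("edge") preprocessing: forward alerts immediately,
# forward everything else as one-minute summaries.
import time
from statistics import mean

def send_upstream(payload):
    print("forwarding:", payload)        # stands in for an MQTT/HTTP publish

buffer = []
WINDOW_SECONDS = 60
last_flush = time.time()

def on_reading(device_id, value):
    global last_flush
    buffer.append(value)
    if value > 80.0:                                     # assumed alert threshold
        send_upstream({"device": device_id, "value": value, "type": "alert"})
    if time.time() - last_flush >= WINDOW_SECONDS:
        send_upstream({"device": device_id, "avg": mean(buffer), "type": "summary"})
        buffer.clear()
        last_flush = time.time()
```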

Analytics: Hindsight, Insight or Foresight?

Hindsight, insight, and foresight are the three types of questions we can ask of data: to know what happened, to understand why it happened, and to predict what will happen.

Hindsight is possible with aggregation and applied statistics. We aggregate data by different groups and compare the results using statistical techniques such as confidence intervals and statistical tests. A key component is data visualization, which shows related data in context (e.g. see Napoleon's March and Hans Rosling's famous TED talk).

Insight and foresight require machine learning and data mining. This includes finding patterns, modeling the current behavior, predicting future outcomes, and detecting anomalies. For a more detailed discussion, I suggest you start following data science and machine learning tools (e.g. R, Apache Spark MLlib, WSO2 Machine Learner, and GraphLab, to name a few).

IoT analytics will pose new types of problems and demand more focus on some existing problems. The following are some analytics problems that, in my opinion, will play a key role in IoT analytics.

Time Series Processing

Most IoT data are collected via sensors over time. Hence, they are time series data, and often the readings are autocorrelated. For example, a temperature reading is often highly affected by the reading at the previous time step. However, most machine learning algorithms (e.g. Random Forests or SVM) do not consider autocorrelation. Hence, those algorithms often do poorly when predicting with IoT data.

This problem has been extensively studied under time series analysis (e.g. the ARIMA model). In recent years, Recurrent Neural Networks (RNNs) have also shown promising results with time series data. However, widely used Big Data frameworks such as Apache Spark and Hadoop do not support these models yet. The IoT analytics community has to improve these models, build new models when needed, and incorporate them into big data analytics frameworks. For more information on the topic, please refer to the article Recurrent neural networks, Time series data and IoT: Part I.
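
For example, here is a small sketch of fitting a classical ARIMA model in Python, assuming statsmodels is available; the (p, d, q) order is an assumption and would normally be chosen by inspecting the data (or with an automated search such as auto.arima in R).

```python
# Fit an ARIMA model to a synthetic autocorrelated series and forecast ahead.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(4)
series = np.cumsum(rng.normal(size=500)) + 20.0      # synthetic autocorrelated series

model = ARIMA(series, order=(2, 1, 1)).fit()         # (p, d, q) chosen for illustration
forecast = model.forecast(steps=10)                  # next 10 values
print(forecast)
```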

Spatiotemporal Analysis and Forecasts

Similarly, most IoT data include location data, making them spatiotemporal data sets (e.g. geospatial data collected over time). Just like time series data, these models are affected by the spatial neighborhood. We need to explore and learn spatiotemporal forecasting and related techniques and build tools that support them. Among the related techniques are GIS databases (e.g. GeoTrellis) and panel data analysis. Moreover, machine learning techniques such as recurrent neural networks might also be used (see Application of a Dynamic Recurrent Neural Network in Spatio-Temporal Forecasting).

Anomaly Detection

Many IoT use cases, such as predictive maintenance, health warnings, finding plug points that consume too much power, and optimizations, depend on detecting anomalies. Anomaly detection poses several challenges.

  • Lack of training data – most use cases will not have training data, so unsupervised techniques such as clustering have to be used.
  • Class imbalance – even when training data are available, there will often be only a few dozen anomalies among millions of regular data points. This is generally handled by building an ensemble of models, where each model is trained on the anomalous observations plus a resample of the regular observations (see the sketch after this list).
  • Click and explore – once anomalies are detected, they must be understood in context and vetted by humans. Tools are therefore needed to show anomalies in context and let operators explore the data further, starting from the anomalies. For example, if an anomaly is detected in a turbine, it is useful to see that anomaly within the regular data before and after it, and to be able to study similar cases that happened before.
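
As mentioned in the class imbalance item above, here is a sketch of such a resampling ensemble, assuming scikit-learn; the number of models, the subsample size, and the synthetic data are assumptions.

```python
# Each ensemble member sees all known anomalies plus a random subsample of
# normal points; the members' votes are averaged into an anomaly probability.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
normal = rng.normal(size=(100_000, 5))
anomalies = rng.normal(loc=4.0, size=(50, 5))        # a few dozen labeled anomalies

models = []
for seed in range(10):
    sample = normal[rng.choice(len(normal), size=500, replace=False)]
    X = np.vstack([sample, anomalies])
    y = np.concatenate([np.zeros(len(sample)), np.ones(len(anomalies))])
    models.append(DecisionTreeClassifier(random_state=seed).fit(X, y))

def anomaly_probability(point):
    votes = [m.predict(point.reshape(1, -1))[0] for m in models]
    return float(np.mean(votes))

print(anomaly_probability(np.array([4.2, 3.9, 4.1, 4.0, 3.8])))
```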

What is our Response?

Finally, when we have analyzed the data and found actionable insights, we need to decide what to do with them. We have several choices.


  • Visualize the results – build a dashboard that shows the data in context and lets users explore, drill down, and do root cause analysis.
  • Alerts – detect problems and notify the user via email, SMS, or pager. The primary challenge is false positives, which severely erode the operator's trust in the system. Finding the balance between raising false positives and ignoring true problems is tricky.
  • Carrying out actions – the next level is independent actions with open control loops. However, unlike the former cases, a wrong diagnosis could have catastrophic consequences. Until we have a deeper understanding of the context, use cases will be limited to simple applications such as turning off a light or adjusting the heating, where the associated risks are small.
  • Process & environment control – this is the holy grail of automated control. The system continuously monitors and controls the environment or the underlying process in a closed control loop. It has to understand the context and the environment, and it should be able to work around failed actions. Much related work was done under the theme of autonomic computing around 2001–2005, although few of those use cases were ever deployed. Real-life production deployments of this class are several years away due to the associated risks. We can think of Nest and the Google self-driving car as early examples of such systems.

In general, we move towards automation when we need fast responses (e.g. algorithmic trading). More automation can be cheaper in the long run, but it is likely to be complex and expensive in the short run. As we learned from stock market crashes, the associated risks must not be underestimated.

It is worth noting that automation with IoT will be harder than big data automation use cases. Most big data automation use cases either monitor computer systems or controlled environments like factories. In contrast, IoT data will often be fuzzy and uncertain. It is one thing to monitor and change a variable in an automatic price-setting algorithm; automating a use case in the natural world (e.g. airport operations) is something different altogether. If we decide to go the automation route, we need to spend significant time understanding, testing, and retesting our scenarios.

Understanding IoT Use cases

Finally, let me wrap up by discussing the shape of common IoT data sets and the use cases that arise from them.

Data from most devices would have the following fields.

  • Timestamp
  • Location, Grouping, or Proximity Data
  • Several readings associated with the device e.g. temperature, voltage and power, rpm, acceleration, and torque, etc.

The first use case is to monitor, visualize, and alert on data from a single device. This use case focuses on individual device owners.

However, more interesting use cases arise when we look at devices as part of a larger system: a fleet of vehicles, buildings in a city, a farm, etc. Among the aforementioned fields, time and location play a key role in most IoT use cases. Using those two, we can categorize most use cases into two classes: stationary dots and moving dots.

Stationary dots

Examples of "stationary dot" use cases are equipment deployments (e.g. buildings, smart meters, turbines, pumps, etc.). Their location is useful only as a grouping mechanism. The main goal is to monitor an already deployed system in operation.

Following are some of the use cases.

  • View of the current status, alerts on problems, drill down and root cause analysis
  • Optimizations of current operations
  • Preventive Maintenance
  • Surveillance

Moving dots

Examples of moving dot use cases are fleet management, logistics networks, wildlife monitoring, monitoring customer movements in a shop, traffic, etc. The goal of these use cases is to understand and control the movements, interactions, and behavior of the participants.

Following are some examples.

  • Sports analytics (e.g. see the following video)
  • Geo Fencing and Speed Limits
  • Monitoring customer behavior in a shop, guided interactions, and shop design improvements
  • Visualizing (e.g. time-lapse videos) of movement dynamics
  • Surveillance
  • Route optimizations

For example, the following is a sports analytics use case built using data from a real football game.

For both types of use cases, I believe it is possible to build generic, extensible tools that provide an overall view of the devices and out-of-the-box support for some of the use cases. However, specific machine learning models such as anomaly detection will need expert intervention for best results. Such tools, if done right, could facilitate reuse, reduce cost, and improve the reliability of IoT systems. It is worth noting that this is one of the things the "Big data" community did right: a key secret of "Big data" success so far has been the availability of high-quality, generic, open source middleware tools.

Also, there is room for companies that focus on specific use cases or classes of use cases. For example, Scanalytics focuses on foot traffic monitoring and Second Spectrum focuses on sports analytics. Although expensive, they provide integrated, ready-to-go solutions. IoT system designers have a choice between going with a specialized vendor and building on top of open source tools (e.g. the Eclipse IoT platform, WSO2 Analytics Platform).


This post discussed different aspects of IoT analytics solutions, pointing out challenges you need to think about while building an IoT analytics solution or choosing one.

Big data has solved many IoT analytics challenges, especially the system challenges related to large-scale data management, learning, and data visualization. However, significant thinking and work are required to match IoT use cases to analytics systems.

Following are the highlights.

  • How fast do we need results? Real-time vs. batch, or a combination.
  • How much data to keep? Based on the use cases and the incoming data rate, we might choose to keep nothing, a summary, or everything. Edge analytics is a related aspect of the same problem.
  • From analytics, do we want hindsight, insight, or foresight? Decide between aggregation and machine learning methods. Techniques such as time series and spatiotemporal algorithms will also play a key role in IoT use cases.
  • What is our response when we have an actionable insight? Show a visualization, send alerts, or carry out automatic control.

Finally, we discussed the shape of IoT data, a few reusable scenarios, and the potential of building middleware solutions for those scenarios.

Hope this was useful. If you have any thoughts, I would love to hear from you.