Understanding Causality and Big Data: Complexities, Challenges, and Tradeoffs

image credit: Wikipedia, Amitchell125

“Does smoking causes cancer?”

We have heard that lot of smokers have lung cancer. However, can we mathematically tell that smoking causes cancer?

We can look at cancer patients and check how many of them are smoking. We can look at smokers and check will they develop cancer. Let’s assume that answers come up 100%. That is, hypothetically, we can see a 1–1 relationship between smokers and cancer.

Ok great, can we claim that smoking causes cancer? Apparently it is not easy to make that claim. Let’s assume that there is a gene that causes cancer and also makes people like to smoke. If that is the cause, we will see the 1–1 relationship between cancer and smoking. In this scenario, cancer is caused by the gene. That means there may be an innocent explanation to 1–1 relationship we saw between cancer and smoking.

This example shows two interesting concepts: correlation and causality from statistics, which play a key role in Data Science and Big Data. Correlation means that we will see two readings behave together (e.g. smoking and cancer) while causality means one is the cause of the other. The key point is that if there is a causality, removing the first will change or remove the second. That is not the case with correlation.

Correlation does not mean Causation!

This difference is critical when deciding how to react to an observation. If there is causality between A and B, then A is responsible. We might decide to punish A in some way or we might decide to control A. However, correlation does warrant such actions.

For example, as described in the post The Blagojevich Upside, the state of Illinois found that having books at home is highly correlated with better test scores even if the kids have not read them. So they decide the distribute books. In retrospect, we can easily find a common cause. Having the book in a home could be an indicator of how studious parents are, which will help with better scores. Sending books home, however, is unlikely to change anything.

You see correlation without a causality when there is a common cause that drives both readings. This is a common theme of the discussion. You can find a detailed discussion on causality from the talk “Challenges in Causality” by Isabelle Guyon.

Can we prove Causality?

Great, how can I show causality? Casualty is measured through randomized experiments (a.k.a. randomized trials or AB tests). A randomized experiment selects samples and randomly break them into two groups called the control and variation. Then we apply the cause (e.g. send a book home) to variation group and measure the effects (e.g. test scores). Finally, we measure the casualty by comparing the effect in control and variation groups. This is how medications are tested.

To be precise, if error bars for groups does not overlap for both the groups, then there is a causality. Check https://www.optimizely.com/ab-testing/ for more details.

However, that is not always practical. For example, if you want to prove that smoking causes cancer, you need to first select a population, place them randomly into two groups, make half of the smoke, and make sure other half does not smoke. Then wait for like 50 years and compare.

Did you see the catch? it is not good enough to compare smokers and non-smokers as there may be a common cause like the gene that cause them to do so. Do prove causality, you need to randomly pick people and ask some of them to smoke. Well, that is not ethical. So this experiment can never be done. Actually, this argument has been used before (e.g.https://en.wikipedia.org/wiki/A_Frank_Statement. )

This can get funnier. If you want to prove that greenhouse gasses cause global warming, you need to find another copy of earth, apply greenhouse gasses to one, and wait few hundred years!!

To summarize, Casualty, sometime, might be very hard to prove and you really need to differentiate between correlation and causality.

Following are examples when causality is needed.

  • Before punishing someone
  • Diagnosing a patient
  • Measure effectiveness of a new drug
  • Evaluate the effect of a new policy (e.g. new Tax)
  • To change a behavior

Big Data and Causality

Most big data datasets are observational data collected from the real world. Hence, there is no control group. Therefore, most of the time all you can only show and it is very hard to prove causality.

There are two reactions to this problem.

First, “Big data guys does not understand what they are doing. It is stupid to try to draw conclusions without randomized experiment”.

I find this view lazy.

Obviously, there are lots of interesting knowledge in observational data. If we can find a way to use them, that will let us use these techniques in many more applications. We need to figure out a way to use it and stop complaining. If current statistics does not know how to do it, we need to find a way.

Second is “forget causality! correlation is enough”.

I find this view blind.

Playing ostrich does not make the problem go away. This kind of crude generalizations make people do stupid things and can limit the adoption of Big Data technologies.

We need to find the middle ground!

When do we need Causality?

The answer depends on what are we going to do with the data. For example, if we are going to just recommend a product based on the data, chances are that correlation is enough. However, if we are taking a life changing decision or make a major policy decision, we might need causality.

Let us investigate both types of cases.

Correlation is enough when stakes are low, or we can later verify our decision. Following are few examples.

  1. When stakes are low ( e.g. marketing, recommendations) — when showing an advertisement or recommending a product to buy, one has more freedom to make an error.
  2. As a starting point for an investigation — correlation is never enough to prove someone is guilty, however, it can show us useful places to start digging.
  3. Sometimes, it is hard to know what things are connected, but easy verify the quality given a choice. For example, if you are trying to match candidates to a job or decide good dating pairs, correlation might be enough. In both these cases, given a pair, there are good ways to verify the fit.

There are other cases where causality is crucial. Following are few examples.

  1. Find a cause for disease
  2. Policy decisions ( would 15$ minimum wage be better? would free health care is better?)
  3. When stakes are too high ( Shutting down a company, passing a verdict in court, sending a book to each kid in the state)
  4. When we are acting on the decision ( firing an employee)

Even, in these cases, correlation might be useful to find good experiments that you want to run. You can find factors that are correlated, and design the experiments to test causality, which will reduce the number of experiments you need to do. In the book example, state could have run a experiment by selecting a population and sending the book to half of them and looking at the outcome.

Some cases, you can build your system to inherently run experiments that let you measure causality. Google is famous for A/B testing every small thing, down to the placement of a button and shade of color. When they roll out a new feature, they select a polulation and rollout the feature for only part of the population and compare the two.

So in any of the cases, correlation is pretty useful. However, the key is to make sure that the decision makers understand the difference when they act on the results.

Closing Remarks

Causality can be a pretty hard thing to prove. Since most big data is observational data, often we can only show the correlation, but not causality. If we mixed up the two, we can end up doing stupid things.

Most important thing is having a clear understanding at the point when we act on the decisions. Sometimes, when stakes are low, correlation might be enough. On some other cases, it is best to run an experiment to verify our claims. Finally, some systems might warrant building experiments into the system itself, letting you draw strong causality results. Choose wisely!

Original Post from my Medium account: https://medium.com/@srinathperera/understanding-causality-and-big-data-complexities-challenges-and-tradeoffs-db6755e8e220#.ca4j2smy3


Walking the Microservices Path towards Loose coupling? Look out for these Pitfalls

(image credit: Wiros from Barcelona, Spain)

Microservices are the new architecture style of building systems using simple, lightweight, loosely coupled services that can be developed and released independently of each other.

If you need to know the basics, read Martin Fowler’s Post. If you like to compare it with SOA, watch the Don Ferguson’s talk.). Also, Martin Fowler has written about “trade off of micro services” and “when it is worth doing microservices”, which let you decide when it is useful.

Let’s say that you heard, read, and got convinced about microservices. If you are trying to follow the microservices architecture, there are few practical challenges. This post discusses how you can handle some of those challenges.

No Shared Database(s)

Each microservice should have it’s own databases and Data MUST not be shared via a database. This rule removes a common case that leads to tight coupling between services. For example, if two services share the same database, the second service will break if the first service has changed the database schema. So teams will have to talk to each other.

I think this rule is a good one, and should not be broken. However, there is a problem. If two services share the same data (e.g. bank account data, shopping cart) and need to update the data transactionally, simplest approach is to keep both in the same database and use database transactions to enforce consistency. Any other solution is hard.

Solution 1: If updates happen only in one microservice (e.g. loan approval process check the balance), you can use asynchronous messaging (message queue) to share data.

Solution 1: If updates happen in both services, you can either consider merging the two services or use transactions. The post Microservices: It’s not (only) the size that matters, it’s (also) how you use them describes the first option. The next section will describe the transactions in detail.

Handling Consistency of Updates

You will run into scenarios where you will update the data from multiple places. We discuss an example in the earlier section. ( If you update the data only from one place, we already discussed how to do it).

Please note this use case typically solved using transactions. However, you can sometimes solve the problem without transactions. There are several options.

Put all updates to the same Microservice

When possible, avoid multiple updates crossing microservice boundaries. However, sometimes by doing this you might end up with few or worse one big monoliths. Hence, sometimes, this is not possible.

Use Compensation and other lesser Guarantees

As the famous post “Starbucks Does Not Use Two-Phase Commit” describes, the normal world works without transactions. For example, barista atStarbucks does not wait until your transaction is completed. Instead, they handle multiple customers same time and compensate for any erroneous conditions explicitly. You can do the same, given you are willing to do a bit more work.

One simple idea is if an option failed, you go and compensate. For example, if you are shipping the book, first deduct the money, then ship the book. If the shipping failed, you go and return the money.

Also, sometimes you can settle for eventual consistency or timeout. Another simple idea is give a button to the use to forcefully refresh the page if he can tell that it is outdated. Some other times, you bite the bullet and settle for lesser consistency (e.g. Vogel’s post is a good starting point).

Finally, Life Beyond Distributed Transactions: An Apostate’s Opinion is a detailed discussion on all the tricks.

Having said that, there are some use cases where you must do transactions to get correct results. And those MUST use transactions. see Microservices and transactions-an update. Weigh the pros and cons and choose wisely.

Microservice Security

Old approach is the service to authenticate by calling the database or Identity Server when it has received a request.

You can replace the identity server with a microservice. That, in my opinion, leads to a big complicated dependency graph.

Instead, I like the token based approach depicted by the following figure. The idea is described in the book, “Building Microservices”. Here the client ( or a gateway) would first talk to an identity/SSO server who will authenticate the user and issue a signed token that describes the user and his roles. (e.g. you can do this with SAML or OpenIDConnect). Each microservice verifies the token and authorizes the calls based on the user roles described in the token. For example, with this model, for the same query, a user with role “publisher” might see different results than a user with role “admin” because they have different permissions.

You can find more information about this approach from How To Control User Identity Within Microservices?.

Microservice Composition

Here, “composition” means “how can connect multiple microservices into one flow to deliver what end user needs”.

Most compositions with SOA looked like following. The idea is that there is a central server that runs the workflow.

Use of ESB with microservices is discouraged (e.g. Top 5 Anti-ESB Arguments for DevOps Teams). Also, you can find some counter arguments in Do Good Microservices Architectures Spell the Death of the Enterprise Service Bus?

I do not plan to get into the ESB flight in this post. However, I want to discuss whether we need a central server to do the microservices composition. There are several way to do the microservices composition.

Approach 1:Drive flow from Client

The following figure shows an approach to do microservices without a central server. The client browser handles the flow. The post, Domain Service Aggregators: A Structured Approach to Microservice Composition, is an example of this approach.

This has approach has several problems.

  1. If the client is behind a slow network, which is the most common case, the execution will be slow. This is because now multiple calls need to be triggered by the client.
  2. Might add security concerns ( I can hack my app to give me a loan)
  3. Above example thinks about a website. However, most complex compositions often come from other use cases. So general applicability of composition at the client to other use cases yet to be demonstrated.
  4. Where to keep the State? Can client be trusted to keep the state of the workflow. Modeling state with REST is possible. However, it is complicated.


Driving the flow from a central place is called orchestration. However, that is not the only way to coordinate multiple partners to carry out some work. For example, in a dance, there is no one person directing the performance. Instead, each dancer would follow who is near to her and sync up. Choreography applies the same idea to businesses process.

Typical implementation includes an eventing system, where each participant in the process listens to different events and carries out his or her parts. Each action generates asynchronous events that will trigger participants down the stream. This is the programming model used by environments like RxJava or Node.js.

For example, let’s assume that a loan process includes a request, a credit check, other outstanding loans check, manager approval, and a decision notification. The following picture shows how to implement this using choreograph. The request will be placed in a queue. It will be picked up by next actor, who will put his results into the next queue. The process will continue until the it has completed.

Just like a dance needs practice, choreography is complicated. For example, you did not know when the process has finished, nor you will know if an error has happened, or if the process is stuck. Choreography needs a monitoring system to track progress and recover or notify about the error.

On the other hand, the advantage of choreography is that it creates systems that are much loosely coupled. For example, often you can add a new actor to the process without changing other actors. You can find more information from Scaling Microservices with an Event Stream.

Centralized Server

The last but most simple option is a centralized server (a.k.a orchestration).

SOA’s implemented this often using two methods: ESB or Business Processes. Microservice folks propose an API Gateway (e.g. Microservices: Decomposing Applications for Deployability and Scalability). I guess API gateway is more lightweight and use technologies like REST/JSON. However, in a pure architectural sense, all those uses orchestration style.

Another variation of the centralized server is “backend for frontends” (BEF), which build a server side API per client type ( one for desktop, one for iOS etc). This model creates different APIs per each client type, optimized for each use case. See the pattern: Backends For Frontends for more information.

I would suggest not to go crazy with all options here and start with the API gateway as that is the most straightforward approach. You can switch to more complicated options as need arises.

Avoid Dependency Hell

We do microservices to make it possible that each service can release and deploy independently. To do that, you must avoid the dependancy hell.

Let’s consider microservices “A” who has the API “A1” and have upgraded to API “A2”. Now there are two cases.

  1. Microservice B might send messages intended for A1 to A2. This is backward compatibility.
  2. Microservice A might have to revert back to A1, and microservices C might continue to send messages intended to A2 to A1.

You must handle above scenarios somehow, and let the microservices evolve and deployed independently. If not, all your effort will be wasted.

Often, handling these cases is a matter of adding optional parameters and never renaming or removing existing parameters. More complicated scenarios, however, are possible.

The post “Taming Dependency Hell” within Microservices with Michael Bryzek discuss this in detail. Ask HN: How do you version control your microservices? is also another good source.

Finally, backward and forward compatibility support should be bounded by time. For example, you can have a rule that no microservice should depend on APIs that are more than three months old. That would let the microservices developers to eventually drop some of the code paths.

Finally, I would like to rant a bit about how your dependency graph should look like in a microservices architecture.

One option is freely invoking other microservices whenever it is needed. That will create a spaghetti architecture from the pre-ESB era. I am not a fan of that model.

The other extreme is saying that microservices should not call other microservices and all connection should be done via API gateway or message bus. This will lead to a one level tree. For example, instead of the microservice A calling B, we bring result from the microservice A to the gateway, which will call B with the results. This is the orchestration model. Most of the business logic will now live in the gateway. Yes, this makes the gateway fat.

My recommendation is either to go for the orchestration model or do the hard work of implementing choreography properly. Yes, I am asking not to do the spaghetti.


The goal of Microservices is loose coupling. Carefully designed microservice architecture let you implement a project using a set of microservices, where each is managed, developed, and released independently.

When you designed with microservices, you must keep the eye on the prize, which is “loose coupling”. There are quite a few challenges, and this post answer following questions.

  1. How can I handle scenario that needs to share data between two microservices?
  2. How can I evolve microservices API while keeping loose coupling?
  3. How to handle security?
  4. How to compose microservices?

Thanks! love to hear your thoughts.

Thinking Deeply about IoT Analytics

A typical IoT system would have following architecture.


As the picture depicts, sensors would collect data and transfer them to a gateway, which  in turn would send them to a processing system ( analytics cloud). Gateway can choose either to or not to summarizing  or preprocess the data.

The Connection between sensors and gateway would be via Radio Frequency (e.g. Zigbee), BLE, Wifi, or even wired connections. Often, the gateway is a mobile phone.

The connection from the gateway to Analytic servers would be  via Internet, LAN, or WiFi connection, and it will use a higher level protocol such as MQTT or CoAp (e.g. see IoT Protocols).

Since our focus is on IoT analytics, let’s not drill into devices and connectivity. Assuming that part is done, then how hard is IoT analytics? is it just a matter of offloading the data into one of the IoT analytics platforms or are there hidden surprises?

In this post, I am trying to answer those questions.  Efforts under the theme “Big data”  has solved many IoT analytics challenges. Especially, the system challenges related to large-scale data management, learning, and data visualizations. Data for “Big data”, however, came mostly from computer based systems (e.g. transaction logs, system logs, social networks,  and mobile phones). IoT data, in contrast, will come from the natural world, would be more detailed, fuzzy, and large. Nature of that data, assumptions, and use cases differ between old Big data and new IoT data. IoT analytics designers can build on top of big data, yet work is far from being done.

Let us look at few things we need to worry about.

How fast you need results?

Depends on how fast we need results from the data, our design changes. This decision depends on our use cases. We should ask ourselves, does the value of our insights ( results) degrade over time and how fast? For example, if we are going to improve the design of a product using data, then we can wait days if not weeks. On the other hand, if we are dealing with stock markets and other similar use cases where winner takes all,  milliseconds are a big deal.

Speed comes in several levels.

  • Few hours – send your data into a Data Lake and use a MapReduce technology such as Hadoop or Spark for processing.
  • Few Seconds – send data into a stream processing system (e.g. Apache Storm or Apache Samza),  an in-memory computing system (e.g. VoltDB, Sap Hana), or an interactive query system (e.g. Apache Drill) for processing.
  • Few milliseconds – send data to a system like Complex Event Processing where records are processed one by one and produce very fast outputs.

The following picture summarizes those observations.


Chances are we will have use cases that falls under more than one, and then we will have to use multiple technologies.

How much data to keep?

Next, we should decide how much data to keep and in what form. It is a tradeoff between cost vs. potential value of data and associated risks. Data is valuable. We see companies acquired just for their data and Google, Facebook going an extraordinary length to access data. Furthermore, we might find a bug or improvement in the current algorithm, and we might want to go back and rerun the algorithm on old data. Having said that, all decision must be made thinking about the big picture and current limits.

Following are our choices.

  • keep all the data and save it to a data lake ( the argument is that disk is cheap)
  • process all the data in a streaming fashion and not keep any data at all.
  • keep a processed or summarized version of the data. However, it is possible that you cannot recover all the information from the summaries later.

The next question is where to do the processing and how much of that logic we should push towards the sensors. There are three options.

  • Do all processing at analytics servers
  • Push some queries into the gateway
  • Push some queries down to sensors as well.

IoT community already has the technology to push the logic to gateways. Most gateways are full-fledged computers or mobile phones, and they can run higher level logic such as SQL-like CEP queries. For example, we have been working to place a light-weight CEP engine into mobile phones and gateways. However, if you want to push code into sensors, most of the cases, you would have to write custom logic using a lower level language like Arduino C. Another associated challenge is deploying, updating, and managing queries over time. If you choose to put custom low-level filtering code into sensors, I believe that will lead to a deployment complexities in the long run.

Analytics: Hindsight, Insight or Foresight?

Hindsight, insight, and foresight are three question types we can ask from data: To know what happened? to understand what happened? and predict what will happen.

Hindsight is possible with aggregations and applied statistics. We will aggregate data by different groups and compare those results using statistical techniques such as confidence intervals and statistical tests. A key component is  data visualizations that will show related data in context. (e.g. see Napoleon’s March and Hans Rosling’s famous Ted talk).

Insights and foresight would require machine learning and data mining. This includes finding patterns, modeling the current behavior, predicting future outcomes, and detecting anomalies. For more detailed discussion, I suggest you start following data science and machine learning tools (e.g. R, Apache Spark MLLib, WSO2 Machine Learner, GraphLab to name a few).

IoT analytics will pose new types of problems and demand more focus on some existing problems. Following are some analytics problems,  in my opinion, will play a key role in IoT analytics.

Time Series Processing

Most IoT data are collected via sensors over time. Hence, they are time series data,  and often most readings are autocorrelated. For example, a temperature reading is often highly affected by the earlier time step’s reading. However, most machine learning algorithms (e.g. Random Forests or SVM) do not consider autocorrelation. Hence, those algorithms would often do poorly while predicting  using IoT data.

This problem has been extensively studied under time series analysis (e.g. ARIMA model). Also, in recent years, Recurrent Neural Networks (RNN) has shown promising results with time series data. However, widely used Big Data frameworks such as Apache Spark and Hadoop do not support these models yet. IoT analytics community has to improve these models, build new models when needed, and incorporate them to big data analytics frameworks. For more information about the topic, please refer to the article Recurrent neural networks, Time series data and IoT: Part I.

Spatiotemporal Analysis and Forecasts

Similary, most IoT data would include location data, making them spatiotemporal data sets. (e.g. geospatial data collected over time). Just like time series data, these models would be affected by the spatial neighborhood. We would need to explore and learn spatiotemporal forecasting and other techniques and build tools that support them. Among related techniques are GIS databases (e.g. Geotrelis), and Panel Data analysis. Moreover, Machine learning techniques such as Recurrent Neural networks might also be used (see Application of a Dynamic Recurrent Neural Network in Spatio-Temporal Forecasting).

Anomaly detections

Many IoT use cases such as predictive maintenance, health warnings, finding plug points that consumes too much power, optimizations etc depend on detecting Anomalies. Anomaly detection poses several challenges.

  • Lack of training data – most use cases would not have training data, and hence unsupervised techniques such as clustering should be used.
  • Class imbalance – Even when training data is available, often there will be few dozen anomalies exists among millions of regular data points. This problem is generally handled by building an ensemble of models where each model is trained with anomalous observations and resampled data from regular observations.
  • Click and explore – after detecting anomalies, they must be understood in context and vetted by humans. Tools, therefore, are required to show those anomalies in context and enable operators to explore data further starting from the anomalies. For example, if  an anomaly in a turbine is detected, it is useful to see that anomaly within regular data before and after the anomaly as well as to be able to study similar cases happened before.

What is our Response?

Finally, when we have analyzed and found actionable insights, we need to decide what to do with them. We have several choices.


  • Visualize the Results – build a dashboard that shows the data in context and let users explore, drill-down, and do root cause analysis.
  • Alerts – detect problems and notify the user using emails, SMS, or pager devices. Your primary challenge would be false positives that would severely affect the operator’s trust on the system. Finding the balance between false positives and ignoring true problems will be tricky.
  • Carrying out  Actions – next level is independent actions with open control loops. However, unlike the former case, the risk of a wrong diagnosis could have catastrophic consequences. Until we have a deeper understanding about the context, use cases would be limited to simple applications such as turning off a light, adjusting heating etc where associated risk are small.
  • Process & Environment control – this is the holy-grail of automated control. The system would continuously monitor and control the environment or the underline process in a closed control loop. The system has to understand the context, environment, and should be able to work around failures of actions etc. Much related work has been done under theme Autonomic computing  2001-2005 although a few use cases ever got deployed. Real life production deployment of this class, however, are several years away due to associated risks. We can think as NEST and Google Auto driving Car as first examples of such systems.

In general, we move towards automation when we need fast responses (e.g. algorithmic trading). More automation can be cheaper in the long run, but likely to be complex and expensive in the short run. As we learned from stock market crashes, the associated risks must not be underestimated.

It is worth noting that doing automation with IoT will be harder than big data automation use cases.  Most big data automation use cases either monitor computer systems or controlled environments like factories. In contrast, IoT data would be often fuzzy and uncertain. It is one thing to monitor and change a variable in automatic price setting algorithm. However, automating a use case in the natural world (e.g. an airport operations) is something different altogether. If we decide to go in the automation route, we need to spend significant time understanding, testing, retesting our scenarios.

Understanding IoT Use cases

Finally, let me wrap up by discussing the shape of common IoT data sets and use cases arises from them.

Data from most devices would have following fields.

  • Timestamp
  • Location, Grouping, or Proximity Data
  • Several readings associated with the device e.g. temperature, voltage and power, rpm, acceleration, and torque, etc.

The first use case is to monitor, visualize, and alerts about a single device data. This use case focuses on individual device owners.

However, more interesting use cases occur when we look at devices as part of a larger system: a fleet of vehicles, buildings in a city, a farm etc. Among aforementioned fields, time and location will play a key role in most IoT use cases. Using those two, we can categorize most use cases into two classes: stationary dots and moving dots.

Stationary dots

Among examples of “stationary dot” use cases are equipment deployments (e.g. buildings, smart meters, turbines, pumps etc). Their location is useful only as a grouping mechanism. The main goal is to monitor an already deployed system in operation.

Following are some of the use cases.

  • View of the current status, alerts on problems, drill down and root cause analysis
  • Optimizations of current operations
  • Preventive Maintenance
  • Surveillance

Moving dots

Among examples of moving dot use cases are fleet management, logistic networks, wildlife monitoring, monitoring customer interactions in a shop, traffic, etc. The goal of these use cases is to understand and control movements, interactions, and behavior of participants.

WSO2_CEP_TfL_Demo_-_YouTubeFollowing are some examples.

  • Sports analytics (e.g. see the following video)
  • Geo Fencing and Speed Limits
  • Monitoring customer behavior in a shop, guided interactions, and shop design improvements
  • Visualizing (e.g. time-lapse videos) of movement dynamics
  • Surveillance
  • Route optimizations

For example, the following is a sports analytics use case built using data from a real football game.

For both types of use cases, I believe it is possible to build generic extensible tools that provide an overall view of the devices and provide out of the box support for some of the use cases. However, specific machine learning models such as anomaly detection would need expert intervention for best results.  Such tools, if done right, could facilitate reuse, reduce cost, and improve the reliability of IoT systems. It is worth noting that this is one of the things “Big data” community did right. A key secret of “Big data” success so far has been the availability of high quality, generic open source middleware tools.

Also, there is room for companies that focus on specific use cases or classes of use cases. For example, Scanalytics focuses on foot traffic monitoring and Second spectrum focuses on sport analytics.  Although expensive, they would provide an integrated ready to go solutions. IoT system designers have a choice either going with a specialized vendor or building on top of open source tools (e.g. Eclipse IoT platform, WSO2 Analytics Platform).


This post discusses different aspects of an IoT analytics solutions pointing out challenges that you need to think about while building IoT analytics solutions or choosing analytics solutions.

Big data has solved many IoT analytics challenges. Specially system challenges related to large-scale data management, learning, and data visualizations. However, significant thinking and work required to match the IoT use cases to analytics systems.

Following are the highlights.

  • How fast we need results? Real-time vs. batch or a combination.
  • How much data to keep? based on use cases and incoming data rate, we might choose between keeping none, summary, or everything. Edge analytics is also a related aspect of the same problem.
  • From analytics, do we want hindsight, insight or foresight? decide between aggregation and Machine learning methods. Also, techniques such as time series and spatiotemporal algorithms will play a key role with IoT use cases.
  • What is our Response from the system when we have an actionable insight? show a visualization, send alerts, or to do automatic control.

Finally, we discussed the shape of IoT data and few reusable scenarios and the potential of building middleware solutions for those scenarios.

Hope this was useful. If you have any thoughts, I would love to hear from you.


Beyond Distributed ML Algorithms: Techniques for Learning from Large Data Sets

Almost always, more data let machine learning (ML) algorithms do better. Sometimes, more data let simpler algorithms like logistic regression do better than complex algorithms such as SVM. This has been observed in academia (e.g. see A Few Useful Things to Know about Machine Learning), community (e.g. In Machine Learning, What is Better: More Data or better Algorithms) and Keggale competitions.

Moreover, more data has enabled previously underperforming algorithms like Neural Networks to come back and take over the limelight. For an example, Google has used the new reincarnation of Neural networks, Deep Learning, for image recognition with amazing results. Try a query like “boy on a tree” in Google image search, and the results will amaze you.


In this post, let’s explore different methods for learning from large datasets. An obvious method is parallel and distributed execution. One of the key points I want to make is that although effective, distributed executions are not the only option.

Le’ts start with a great talk by Ron Beckerman on the topic.

He provides a great overview into our topic. Let’s start with Hadoop.

Use Hadoop

When community looked to learn from large datasets, they already knew a way to do parallel executions: Hadoop (MapReduce). So everyone tried ML algorithms using Hadoop, which kind of worked. There are hundreds of papers written and Apache Mahout came out as the opensource implementation of those ML algorithms.

That got people started. Hadoop-based processing, however, had a big flow. Most Machine learning algorithms have an iterative part (see the famous paper A Few Useful Things to Know about Machine Learning). To run the iterative part, the Hadoop model must load the data from the file system again and again. Since Network and Disk IO are the main bottlenecks for distributed computations like MapReduce, the Hadoop was very slow. The article, MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! , is a very good treatment of the related aspects.

Of course, this was fiercely competed (e.g. see MapReduce is Good Enough). However, arguments do not make performance problems go away. When an alternative, in the form of Apache Spark, become available, people started to move on.

New Techniques for Scaling ML

To run an algorithm parallelly, we need to somehow break the problem into smaller parts and assign it to different threads or machines. This is a problem that has been well studied (e.g. see famous 13 Dwarfs paper). The post, An Introduction to Distributed Machine Learning by Krishna Sridhar, describes the motivation behind this approach.

We have to either partition the data (e.g.  KD trees, Max-margin trees, Convex trees) or partition the execution. However, most machine learning algorithms were not embarrassingly parallel, which means you need communications between your threads or machines. This is bad news. Amdahl’s law says that resulting sequential parts in the algorithms are prohibitively expensive.

Then come a breakthrough. Machine learning algorithms are optimization problems, and they search a large parameter space to find the function or representation that best represent the data. For this search, data need not be consistent. Instead, the algorithm can continue while lazily updating each other, and still the answer will be correct.  The post, Parallel Machine Learning with Hogwild!, by Krishna Sridhar describes this beautifully.

That means we can just break most machine learning algorithms (e.g. by data) and run them parallel while communicating lazily without slowing down sender or the receiver. This is the approach used by Apache Spark. Coupled with its ability to process data again and again, it was much easier to implement algorithms with Apache Spark. So much so that Apache Mahout, the Hadoop Machine Learning project, switched to Spark and stop adding new Hadoop-based executions.

Above approach partitions the data and run it in a batch execution style. However, lazy communication between different jobs is complicated in batch style systems like Spark. Alternative is to break the data and assign them to different nodes, which will pin data always to a one node. Then, while carrying out computations, nodes can periodically broadcast their current state to other nodes in an asynchronous style.

However, broadcasting in a distributed system is both expensive and complicated. To solve this problem, a new centralized approach is used. The idea is to use a centerlized server called “parameter server”, that collects the current state of nodes periodically and redistributes it back to everyone. Die hard distributed people does not like this due to the central server, but the state of machine learning algorithms are small and this approach scale for most practical applications. Indeed, Google uses this. You can find more information from the following talk by Jeff Dean.

This is primarily used to scale up Neural networks and Probabilistic Graphical Models (Kalman filters, Belief Networks). You can find an opensource implementation from http://parameterserver.org. In the following talk, Alex Somla talks about parameter servers in detail.

Avoid Parallelism and Make Data Small

However, it has not been clearly established that parallel distributed execution is indeed the superior approach for all kind of problems. For an example, Ben Hamner from Kaggle observes in the following talk that  down sampling 1/10 to 1/100 often does not affect final results significantly in most competitions. Furthermore, he observes that most winners are teams that can iterate and improve their solutions faster.

Hence sampling is a viable and very powerful approach. Specially, at the initial stages where the data scientist explores possible solutions. An interesting related work has done by prof. Michal Jorden’s group, which they call Bag of Little Bootstrap (BLB). The main Idea is to sample the dataset with replacements, build models, and then looking at error bars to decide on  the quality of models. You can find more information from their paper from A scalable bootstrap for massive data.

The second idea is to observe that in distributed computations, a significant part of the computing power is spent on communications. If we have enough of memory and use technologies like GPUs, can we solve most problems in single multi-core computing? The answer is yes. It has been demonstrated that this approach can handle moderate size data sets. For example, in 2009, GPU based KMeans algorithm clustering  1 billion data points looking for 1,000 clusters took only 26 minutes  while distributed approach took 6 days. You can find more information from the blog post, GPU and Large Scale Data Mining, by Suleiman Shehu.

Finally, streaming can also help. Most of the time we collect data for hours and want to build a model using those data very fast. However, if we build the model in streaming fashion as data arrives, we have much more time available for the computation and in some cases even a single machine might be enough. However, one major weakness is that streaming algorithms are fixed, and cannot be used to do explorative data analysis.

Parting Thoughts

I believe we should be practical. Although, in some large use cases like Google’s image search, we must use large distributed machine learning algorithms. However, when possible, we should also use simpler methods. Specially, at the initial exploration phase while exploring possible models. Remember that is is often who can iterate fastest wins in Kaggle.

Other Resources

Following are some of the other content that are relevant to the topic, although I did not refer to them above.

  1. Ron Bekkerman, http://hunch.net/~large_scale_survey/
  2. Scaling Decision trees http://hunch.net/~large_scale_survey/TreeEnsembles.pdf
  3. What is Scalable Machine Learning?, http://blog.mikiobraun.de/2014/07/what-is-scalable-machine-learning.html
  4. Scaling big data mining infrastructure: the twitter experience J Lin, D Ryaboy – ACM SIGKDD Explorations Newsletter, 2013 – dl.acm.org
  5. Monoidify! monoids as a design principle for efficient mapreduce algorithms, J Lin, http://arxiv.org/abs/1304.7544 
  6. Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML, http://www.vldb.org/pvldb/vol7/p553-boehm.pdf

WSO2 CEP 4.0.0: What is New? Storm support, Dashboards, and Templates

WSO2 CEP 4.0 is out. You can find

The first thing to note is that we have integrated batch, realtime, interactive, and predictive analytics into one platform called WSO2 Data Analytics Server (DAS). Please refer to my earlier blog, Introducing WSO2 Analytics Platform: Note for Architects, to understand WSO2 CEP fits into DAS. DAS release is coming soon.

Let us discuss what is new in WSO2 CEP 4.0.0, and what would those features mean to the end user.

Storm Integration

WSO2 CEP supports distributed query executions on top of Apache Storm. Users can provide CEP queries that have partitions, and WSO2 CEP can automatically build a Storm Topology that is equivalent to the query, deploy, and run it. Following is an example of such a query. CEP will build a Storm topology, which will first partition the data by region, run the first query within each partition, and then collect the results and run the second query.

define partition on TempStream.region {
  from TempStream[temp > 33]
  insert into HighTempStream; 
from HighTempStream#window(1h)
  select max(temp)as max 
  insert into HourlyMaxTempStream;

Effectively, WSO2 CEO provides a SQL-like, stream processing language that runs on Apache Storm. Please refer to the following talk I did at O’reilly Strata for more information (slides).

Analytics Dashboard

WSO2 CEP now includes a dashboard and a Wizard for creating charts using data from event streams.


From the Wizard, you can choose a stream, select a chart type, assign its properties into different dimensions in the plot via a Wizard, and generate a chart. For an example, you can tell that you need a scatter plot where x-axis maps to time, y-axis maps to hit count where point colours maps to country, and point size maps to their population. The charts will be connected to CEP though web sockets and the scatter plot will update when new data become available in the underline event stream.

Query Templates

CEP queries are complicated. It is not very simple for non-technical user to write new queries. With these templates, developers can write parameterised queries and save them as a template. Then users can provide values for that template using a form and deploy them as a query.

For example, let’s assume we want the end users to write a query to detect high-speed vehicles where end user defines the speed. Then we will write parametrized query template like following.

from VehicleStream[speed > $1]

The end user, when he select the template, will see a form that let him specify the speed value. CEP will deploy a new query using the template and speed value by a click of a button. Following is a example form.


Furthermore, WSO2 CEP now includes a Geo Dashboard, that you can configure via query templates. Following video shows the visualization of London traffic data using geo dashboard.

Siddhi Language Improvements

With WSO2 CEP 4.0, queries that have partitions defined would run each partition in parallel. Earlier, all executions would run in a single thread, and CEP will only use a single core  per Execution plan ( a collection of queries). New approach would significantly improve performance for some usecases.

Furthermore, CEP can now run Machine Learning models built with WSO2 ML and PMML models. It supports several anomaly detection algorithms as described in Fraud Detection and Prevention: A Data Analytics Approach white paper.

In addition, we have added time series regression and a forecaster as Siddhi functions. It also includes several new functions for string manipulations and mathematics. Furthermore, it includes a CronWindow that will trigger based on a Cron expression (see sample 115 for more details), which users can used to define time windows that starts in a specific time.

Also, now you can pack all queries and related artifacts into a WSO2 single Carbon archive, which will make it easier for users to manage their CEP execution plans as a single unit.

New Transports

New WSO2 CEP can receive event and send events using MQTT, which is one of the leading Internet of Things (IoT) protocols. Also it includes support for WebSockets that will make it much easier to build web apps that uses WSO2 CEP.


WSo2 CEP now includes an Event Simulator, that you can use to replay events stored in a CSV file for testing and demo purposes. Furthermore, it has a “Try it” feature that let user send events into CEP using it’s Web console, which is also useful for testing.


Please try it out. It is all free under apache Licence. We will love to hear your thoughts. If you find any problems or have suggestions, drop me a note or get in touch with us via architecture@wso2.org.

WSO2 Machine Learner: Why would You care?

After about a year worth of work, WSO2 Machine Learner (WSO2 ML) is officially out. You can find the pack from http://wso2.com/products/machine-learner/ ( also Code and User Guide). It is free and Opensource under Apache Licence ( which pretty much means you can do whatever with the code as long as you keep the same Licence).

Let me try to answer “the question”. How is it different and why would you care?

What is it?

The short answer is it is a Wizard and a system on top of Apache Spark MLLib. The long answer is the following picture.


You can use it to do the following

  1. User can start with data ( in his disk, in HDFS, or in WSO2 DAS)
  2. Explore the data ( more about that later)
  3. Create a Project and build machine learning models going through a Wizard
  4. Compare those models and find the best model
  5. Export that model and use it with WSO2 CEP, WSO2 ESB, or from Java Code.

For someone from Enterprise World?

WSO2 Machine Learner is designed for the Enterprise world. It comes as an integrated solution with the rest of the Big Data processing technologies: batch, realtime, and interactive analytics. Also, it includes support from data collection, analysis,  to communication (e.g. visualizations, APIs, and alerts). Please see the earlier post “Introducing WSO2 Analytics Platform: Note for Architects” for more details.  Hence, it is part of a complete analytics solution.

WSO2 ML handles the full predictive analytics lifecycle, including model deployment and management.


If you are already collecting data, we can pull that data, process them, and build models. Models you built are immediately available to use from your main transaction flow ( via WSO2 ESB) or  data analysis flow ( via WSO2 CEP). Basically, you copy the model ID and add it to WSO2 ESB mediation scripts or WSO2 CEP queries, and now you have a Machine Learning integrated into your business. (Please see in Using Models for more information.) This handles details like keeping a central store of Models while deploying models in production and also let you quickly switch between models.

If you are not collecting data, you can start with WSO2 DAS and go from there. The same story holds.

Furthermore, it gives you the concept of a project where you can try out and keep track of multiple machine learning models. Also, it handles details like sending you an email when a long running machine learning algorithm execution has completed.

Finally, as we discuss in the next section, the ML Wizard is built such a way that you can use it with minimal understanding about Machine Learning. Sure, you will not get the same accuracy as the experts who will know how to tune the thing, but it can get you started and give you OK accuracy.

For a Machine Learning Newbie?

First of all, you need to understand what Machine Learning can do for you. Most problems, we know the exact steps to be followed to solve the problem. With those kinds of problems, all we have to do is to write a code that does those steps. This is what we call programming and lot of us do this day in day out.

However, there are other problems that you will learn by example. Driving a car, cycling, and drawing a picture are problems that we learn by looking at examples. If you want a computer to solve those problems, you cannot write a program to solve them because you do not know the algorithm. Machine Learning is used to solve specifically those problems. Instead of the algorithm, you give it lots of examples, and Machine Learning will learn a model (a function) from those examples. You can use the model to solve your initial problem. Google’s driverless car does exactly this.

If you are new to Machine Learning, I highly recommend looking at A Visual Introduction to Machine Learning and the following talk by Ron Beckerman.

The Machine Learner wizard tries to model the experience around what you want to do as oppose to showing you lot of ML algorithms. For example, you can choose to predict the next value, classify something to a one of the categories, or detect an anomaly. You can click through, use defaults, and get a model. You can try several algorithms and compare them with each other.

We support several standard techniques to compare ML models such as ROC curve, confusion Matrix, etc. CD’s blog post “Machine Learning for Everyone” talks about this in detail.

For example, following confusion matrix shows how much of true positives, false negatives etc resulted from the model.


The figure on the left chart shows a scatter plot of data points that are predicted correctly and incorrectly while the right-hand side shows the RoC curve.



However, at this point I suggest that you read How to Evaluate Machine Learning Models: Classification Metrics by Alice Zheng. It is ok to not to know how ML algorithms work, but you must know what models are better and why.

However, there is a catch. If you try well known Machine Learning datasets, they would work well ( You can find few of such data sets from the sample directory of the pack). However, sometimes with real datasets, getting good results need transforming features into different features, and that might be beyond you if you have just started. If you want to go pro and learn to transform features ( a.k.a. Feature Engineering) and other fascinating stuff, then Andrew Ng’s famous course https://www.coursera.org/learn/machine-learning is the best place to start.

For a Machine Learning Expert?

If you are an ML expert, still WSO2 Machine Learner can help in several ways.

First, it provides pretty sophisticated support for exploring the dataset based on a random sample. This includes scatter plots for looking at any two numerical features, parallel sets for looking at categorical data, Trellis sets for looking at 4-5 numerical dimensions at the same time, and cluster diagrams ( see below for some examples).

cluster-diagram trellis-chart parallel-set

Second, it gives you access to a large collection of scalable machine learning algorithms pretty easily. For a single node setup, you just download and unzip it. ( see below for how to do it).

Third, it provides an extensive set of model comparison measures as visualizations and also let you compare models side by side.

Fourth, in addition to predictive analytics, you have access to batch analytics though SparkSQL, interactive analytics with Lucence, and relatime analytics through WSO2 CEP. This will make understanding dataset as well as preprocessing data much easier. One limitation of this release is that those other types of analytics must be done before using data within WSO2 ML. However, the next release will enable you to run queries within the WSO2 ML pipeline as well.

Finally, you will also have all advantages listed under enterprise user such as seamless deployment of models and ability to switch the model easily.

Furthermore, many interesting features are coming shortly in the next release.

  • Support for Deep Learning and Neural Networks
  • Support for out of the Box Anomaly detection using Markov Chains and Clustering
  • Support to data cleanup and preprocessing using Data Wrangler and SparkSQL
  • Support for out of the box ensembles that let you combine models
  • Improvements to pipeline to warn the user on cases like class imbalances in classifications

Trying it Out

Carry out following steps

  1. Download WSO2 ML from http://wso2.com/products/machine-learner/
  2. Make sure you have Java 7 installed in your machine and set JAVA_HOME.
  3. Unzip the pack and run bin/wso2server.sh from the unpacked directory. Wait for WSO2 ML to start.
  4. Go to https://hostname:9443/ml and Login using username admin and password admin.
  5. Now you can upload your own dataset and follow along with the wizard. You can find more info from the User Guide. However, Wizard should be self-explanatory.

Remember, it is all free under apache Licence. Give it a try, and we will love to hear your thoughts. If you find any problems or have suggestions, report them via https://wso2.org/jira/browse/ML.

Dissecting the Big data Twitter Community through a Big data Lense

BigData hashtag is hyperactive, with close to 2000 tweets each day with more than 20,000 tweeps. This post digs into the tweet archive from August 03-25 to understand dynamics about the Big data community.

Tweeter communities have activities: tweets, retweets, replies, and followers. Among them, retweets suggest a strong agreement by the actor with the tweet’s content. Hence, retweets graph is a good representation of actual connections in the network, their strengths, as well as the propagation of information through the network.

The Network

This post, therefore, will focus on the retweets graph. The following graph shows a visualization of the retweets graph  where vertex represents an account, edges represent retweets, and the size of the node represents the number of retweets each node has received. Each edge is weighted by the number of retweets between two accounts, and it shows an edge from account A to B only if B has retweeted two or more tweets by A.


The first thing you will notice is that the top three tweeps have received a large proportion of retweets. The following heat map shows retweets received by top tweeps.


  1. KirkDBorne 2588
  2. jose_garde 1730
  3. craigbrownphd 1546

In the network, we can see that the three of them have their own following. However, the graph has a phantom edge which has lots of edges placed around it in the right middle. That turns out to be a twitter bot ( BigDataTweetBot) which has tweeted lots of other people’s tweets.

network10The following figure shows a more spare version of the same graph that only shows an edge if two accounts have more than 10 retweets between them. On this network, the KirkDBorne community seems to be pretty well-connected, while others are pretty isolated. This suggests that his community is stronger.

Is it a Small World Network?

As shown by the following plot, the Retweets distribution follows a Power law, but edge distribution is close to Power law but falls short. The network is close to a scale-free network.



However, the network has a very high diameter of 154 and a mean path length 11. Hence, it is not a small world network. Furthermore, it’s Cluster coefficient is very small (0.0009953724), which suggest that the cross chatter in the network is very small. So the Big data retweets do not create a cohesive community.

How can I get more Retweets?

When we talk about retweets, this is the thought on everyone’s mind. The plot shows number of tweets per day in X-axis, the number of followers on Y axis in log scale, and each point’s size and the color is decided by the number of retweets it has received.

According to the plot, having a lot of followers helps and necessary, but it is not a sufficient. However,  tweeting a lot seems to help, and most tweeps tweeting more than ten tweets a day have received at least 10 retweets. ( Retweets are not included).

Are Tweet Bots Useful?

Do retweets bots (e.g. BigDataTweetBot, NoSQLDigest) are useful or do they just create noise by retweeting things blindly? Let us investigate. Let’s look at the betweenness centrality, which is a measure of the role of each node in connecting the network, to understand who are key connectors in the network. @Espenel takes the first while the fourth takes by @KirkDBorne. Second and third are taken by twitter bots (BigDataTweetBot, NoSQLDigest), which suggests that twitter bots are indeed useful.

What did Community talking about?


Following word cloud shows the words that have been most often used. The word cloud has most of the usual suspects, like links to IoT and cloud,  businesses, marketing etc. Among companies Google, Intel and IBM have been mentioned.

Interestingly, we do not see any of the big data tools. It is possible that related discussions happen in their own hashtags such as #hadoop and #spark.

Following are most tweeted tweets through the time period

  1. #bigdata to our users !!! check the new keyword suggestions for an improved
  2. 4 predictive #analytics and practical applications for the everyday marketer (422)
  3. marrying #data to #analytics a major theme at #hp’s conference  (153)
  4. combining analytics and security to treat vulnerabilities like ants (150)
  5. sbi uses big data mining to check defaults biz loss: when state bank of india  (141)

Following are most tweeted tweets by day. We only list tweets that have had more than 75 retweets in a day. It shows the number of tweets it has received within brackets.

  1. Aug 05: guidelines to optimize #bigdata transfers (89)
  2. Aug 10:#nfl taps #bigdata to study #concussions but major game changes far off (139)
  3. Aug 10: sbi uses big data mining to check defaults biz loss: when state bank of india (sbi)  (140)
  4. Aug 12: #iot facts + how to make business sense of the internet of things (85)
  5. Aug 18: idf 2015: intel teams with google to bring realsense to project tango (113)
  6. Aug 18: marrying #data to #analytics a major theme at #hp’s conference (152)
  7. Aug 19: combining analytics and security to treat vulnerabilities like ants: bill franks chief analytics off (149)
  8. Aug 20: qantas annual profit soars to au$975m: australia’s flying kangaroo is out of the red having boosted (115)
  9. Aug 20: top news: sap oem on twitter: “top 10 #bigdata twitter handles to follow @merv (78)
  10. Aug 22: five open source big data projects to watch (132)
  11. Aug 22: 3 ways that big data are used to study #climatechange (126)
  12. Aug 22: should #bigdata be used to measure #employee #productivity? (110)
  13. Aug 23: e-commerce market #analytics to #ebay #amazon #alibaba sellers and buyers
  14. Aug 23: should #bigdata be used to measure #employee #productivity? (134)

One interesting observation is that most trending tweets were about usecases, not about tools or techniques.


  1. Few well-known tweeps have a lot of retweets, and top three roughly have their own communities.
  2. The network is roughly scale-free, but not a small world network. Nodes are weakly connected, which suggests non-cohesive  communities.
  3. A large number of followers is a necessary but not a sufficient condition to receive a lot of retweets. Tweeting a lot seems to help.
  4. Tweet bots are centrally placed and likely useful.
  5. Most retweeted tweets seem to focus on use cases.