Thinking Deeply about IoT Analytics

A typical IoT system would have following architecture.


As the picture depicts, sensors would collect data and transfer them to a gateway, which  in turn would send them to a processing system ( analytics cloud). Gateway can choose either to or not to summarizing  or preprocess the data.

The Connection between sensors and gateway would be via Radio Frequency (e.g. Zigbee), BLE, Wifi, or even wired connections. Often, the gateway is a mobile phone.

The connection from the gateway to Analytic servers would be  via Internet, LAN, or WiFi connection, and it will use a higher level protocol such as MQTT or CoAp (e.g. see IoT Protocols).

Since our focus is on IoT analytics, let’s not drill into devices and connectivity. Assuming that part is done, then how hard is IoT analytics? is it just a matter of offloading the data into one of the IoT analytics platforms or are there hidden surprises?

In this post, I am trying to answer those questions.  Efforts under the theme “Big data”  has solved many IoT analytics challenges. Especially, the system challenges related to large-scale data management, learning, and data visualizations. Data for “Big data”, however, came mostly from computer based systems (e.g. transaction logs, system logs, social networks,  and mobile phones). IoT data, in contrast, will come from the natural world, would be more detailed, fuzzy, and large. Nature of that data, assumptions, and use cases differ between old Big data and new IoT data. IoT analytics designers can build on top of big data, yet work is far from being done.

Let us look at few things we need to worry about.

How fast you need results?

Depends on how fast we need results from the data, our design changes. This decision depends on our use cases. We should ask ourselves, does the value of our insights ( results) degrade over time and how fast? For example, if we are going to improve the design of a product using data, then we can wait days if not weeks. On the other hand, if we are dealing with stock markets and other similar use cases where winner takes all,  milliseconds are a big deal.

Speed comes in several levels.

  • Few hours – send your data into a Data Lake and use a MapReduce technology such as Hadoop or Spark for processing.
  • Few Seconds – send data into a stream processing system (e.g. Apache Storm or Apache Samza),  an in-memory computing system (e.g. VoltDB, Sap Hana), or an interactive query system (e.g. Apache Drill) for processing.
  • Few milliseconds – send data to a system like Complex Event Processing where records are processed one by one and produce very fast outputs.

The following picture summarizes those observations.


Chances are we will have use cases that falls under more than one, and then we will have to use multiple technologies.

How much data to keep?

Next, we should decide how much data to keep and in what form. It is a tradeoff between cost vs. potential value of data and associated risks. Data is valuable. We see companies acquired just for their data and Google, Facebook going an extraordinary length to access data. Furthermore, we might find a bug or improvement in the current algorithm, and we might want to go back and rerun the algorithm on old data. Having said that, all decision must be made thinking about the big picture and current limits.

Following are our choices.

  • keep all the data and save it to a data lake ( the argument is that disk is cheap)
  • process all the data in a streaming fashion and not keep any data at all.
  • keep a processed or summarized version of the data. However, it is possible that you cannot recover all the information from the summaries later.

The next question is where to do the processing and how much of that logic we should push towards the sensors. There are three options.

  • Do all processing at analytics servers
  • Push some queries into the gateway
  • Push some queries down to sensors as well.

IoT community already has the technology to push the logic to gateways. Most gateways are full-fledged computers or mobile phones, and they can run higher level logic such as SQL-like CEP queries. For example, we have been working to place a light-weight CEP engine into mobile phones and gateways. However, if you want to push code into sensors, most of the cases, you would have to write custom logic using a lower level language like Arduino C. Another associated challenge is deploying, updating, and managing queries over time. If you choose to put custom low-level filtering code into sensors, I believe that will lead to a deployment complexities in the long run.

Analytics: Hindsight, Insight or Foresight?

Hindsight, insight, and foresight are three question types we can ask from data: To know what happened? to understand what happened? and predict what will happen.

Hindsight is possible with aggregations and applied statistics. We will aggregate data by different groups and compare those results using statistical techniques such as confidence intervals and statistical tests. A key component is  data visualizations that will show related data in context. (e.g. see Napoleon’s March and Hans Rosling’s famous Ted talk).

Insights and foresight would require machine learning and data mining. This includes finding patterns, modeling the current behavior, predicting future outcomes, and detecting anomalies. For more detailed discussion, I suggest you start following data science and machine learning tools (e.g. R, Apache Spark MLLib, WSO2 Machine Learner, GraphLab to name a few).

IoT analytics will pose new types of problems and demand more focus on some existing problems. Following are some analytics problems,  in my opinion, will play a key role in IoT analytics.

Time Series Processing

Most IoT data are collected via sensors over time. Hence, they are time series data,  and often most readings are autocorrelated. For example, a temperature reading is often highly affected by the earlier time step’s reading. However, most machine learning algorithms (e.g. Random Forests or SVM) do not consider autocorrelation. Hence, those algorithms would often do poorly while predicting  using IoT data.

This problem has been extensively studied under time series analysis (e.g. ARIMA model). Also, in recent years, Recurrent Neural Networks (RNN) has shown promising results with time series data. However, widely used Big Data frameworks such as Apache Spark and Hadoop do not support these models yet. IoT analytics community has to improve these models, build new models when needed, and incorporate them to big data analytics frameworks. For more information about the topic, please refer to the article Recurrent neural networks, Time series data and IoT: Part I.

Spatiotemporal Analysis and Forecasts

Similary, most IoT data would include location data, making them spatiotemporal data sets. (e.g. geospatial data collected over time). Just like time series data, these models would be affected by the spatial neighborhood. We would need to explore and learn spatiotemporal forecasting and other techniques and build tools that support them. Among related techniques are GIS databases (e.g. Geotrelis), and Panel Data analysis. Moreover, Machine learning techniques such as Recurrent Neural networks might also be used (see Application of a Dynamic Recurrent Neural Network in Spatio-Temporal Forecasting).

Anomaly detections

Many IoT use cases such as predictive maintenance, health warnings, finding plug points that consumes too much power, optimizations etc depend on detecting Anomalies. Anomaly detection poses several challenges.

  • Lack of training data – most use cases would not have training data, and hence unsupervised techniques such as clustering should be used.
  • Class imbalance – Even when training data is available, often there will be few dozen anomalies exists among millions of regular data points. This problem is generally handled by building an ensemble of models where each model is trained with anomalous observations and resampled data from regular observations.
  • Click and explore – after detecting anomalies, they must be understood in context and vetted by humans. Tools, therefore, are required to show those anomalies in context and enable operators to explore data further starting from the anomalies. For example, if  an anomaly in a turbine is detected, it is useful to see that anomaly within regular data before and after the anomaly as well as to be able to study similar cases happened before.

What is our Response?

Finally, when we have analyzed and found actionable insights, we need to decide what to do with them. We have several choices.


  • Visualize the Results – build a dashboard that shows the data in context and let users explore, drill-down, and do root cause analysis.
  • Alerts – detect problems and notify the user using emails, SMS, or pager devices. Your primary challenge would be false positives that would severely affect the operator’s trust on the system. Finding the balance between false positives and ignoring true problems will be tricky.
  • Carrying out  Actions – next level is independent actions with open control loops. However, unlike the former case, the risk of a wrong diagnosis could have catastrophic consequences. Until we have a deeper understanding about the context, use cases would be limited to simple applications such as turning off a light, adjusting heating etc where associated risk are small.
  • Process & Environment control – this is the holy-grail of automated control. The system would continuously monitor and control the environment or the underline process in a closed control loop. The system has to understand the context, environment, and should be able to work around failures of actions etc. Much related work has been done under theme Autonomic computing  2001-2005 although a few use cases ever got deployed. Real life production deployment of this class, however, are several years away due to associated risks. We can think as NEST and Google Auto driving Car as first examples of such systems.

In general, we move towards automation when we need fast responses (e.g. algorithmic trading). More automation can be cheaper in the long run, but likely to be complex and expensive in the short run. As we learned from stock market crashes, the associated risks must not be underestimated.

It is worth noting that doing automation with IoT will be harder than big data automation use cases.  Most big data automation use cases either monitor computer systems or controlled environments like factories. In contrast, IoT data would be often fuzzy and uncertain. It is one thing to monitor and change a variable in automatic price setting algorithm. However, automating a use case in the natural world (e.g. an airport operations) is something different altogether. If we decide to go in the automation route, we need to spend significant time understanding, testing, retesting our scenarios.

Understanding IoT Use cases

Finally, let me wrap up by discussing the shape of common IoT data sets and use cases arises from them.

Data from most devices would have following fields.

  • Timestamp
  • Location, Grouping, or Proximity Data
  • Several readings associated with the device e.g. temperature, voltage and power, rpm, acceleration, and torque, etc.

The first use case is to monitor, visualize, and alerts about a single device data. This use case focuses on individual device owners.

However, more interesting use cases occur when we look at devices as part of a larger system: a fleet of vehicles, buildings in a city, a farm etc. Among aforementioned fields, time and location will play a key role in most IoT use cases. Using those two, we can categorize most use cases into two classes: stationary dots and moving dots.

Stationary dots

Among examples of “stationary dot” use cases are equipment deployments (e.g. buildings, smart meters, turbines, pumps etc). Their location is useful only as a grouping mechanism. The main goal is to monitor an already deployed system in operation.

Following are some of the use cases.

  • View of the current status, alerts on problems, drill down and root cause analysis
  • Optimizations of current operations
  • Preventive Maintenance
  • Surveillance

Moving dots

Among examples of moving dot use cases are fleet management, logistic networks, wildlife monitoring, monitoring customer interactions in a shop, traffic, etc. The goal of these use cases is to understand and control movements, interactions, and behavior of participants.

WSO2_CEP_TfL_Demo_-_YouTubeFollowing are some examples.

  • Sports analytics (e.g. see the following video)
  • Geo Fencing and Speed Limits
  • Monitoring customer behavior in a shop, guided interactions, and shop design improvements
  • Visualizing (e.g. time-lapse videos) of movement dynamics
  • Surveillance
  • Route optimizations

For example, the following is a sports analytics use case built using data from a real football game.

For both types of use cases, I believe it is possible to build generic extensible tools that provide an overall view of the devices and provide out of the box support for some of the use cases. However, specific machine learning models such as anomaly detection would need expert intervention for best results.  Such tools, if done right, could facilitate reuse, reduce cost, and improve the reliability of IoT systems. It is worth noting that this is one of the things “Big data” community did right. A key secret of “Big data” success so far has been the availability of high quality, generic open source middleware tools.

Also, there is room for companies that focus on specific use cases or classes of use cases. For example, Scanalytics focuses on foot traffic monitoring and Second spectrum focuses on sport analytics.  Although expensive, they would provide an integrated ready to go solutions. IoT system designers have a choice either going with a specialized vendor or building on top of open source tools (e.g. Eclipse IoT platform, WSO2 Analytics Platform).


This post discusses different aspects of an IoT analytics solutions pointing out challenges that you need to think about while building IoT analytics solutions or choosing analytics solutions.

Big data has solved many IoT analytics challenges. Specially system challenges related to large-scale data management, learning, and data visualizations. However, significant thinking and work required to match the IoT use cases to analytics systems.

Following are the highlights.

  • How fast we need results? Real-time vs. batch or a combination.
  • How much data to keep? based on use cases and incoming data rate, we might choose between keeping none, summary, or everything. Edge analytics is also a related aspect of the same problem.
  • From analytics, do we want hindsight, insight or foresight? decide between aggregation and Machine learning methods. Also, techniques such as time series and spatiotemporal algorithms will play a key role with IoT use cases.
  • What is our Response from the system when we have an actionable insight? show a visualization, send alerts, or to do automatic control.

Finally, we discussed the shape of IoT data and few reusable scenarios and the potential of building middleware solutions for those scenarios.

Hope this was useful. If you have any thoughts, I would love to hear from you.



5 thoughts on “Thinking Deeply about IoT Analytics

  1. Very thoughtful and educational post with lots to think about. A couple of questions:

    a) The volume of IoT data is talked about as bigger than Big Data but won’t most of it be redundant (duplicate or almost duplicate) data?

    b) IoT is realtime and so will it not require realtime machine learning which today is batch-based?


    1. Hi Dinesh,

      a) Some of that data might be redundant, but some use cases would have substantial real data. Point is each of us, each house can generate so much, and they add up.

      b) I think we can live with real-time evaluation of models ( e.g. checking is this a anomaly) while building the model we can do in batch fashion. Streaming ML is nice, but not mandatory IMO.



  2. If only some of the data is redundant then the amount of data will be colossal. I’m guessing that a first-level filter will attempt to reduce the amount of data.

    Would there be latency and data integrity problems between checking for anomalies against the model in realtime and retraining the model in batch? I’m thinking that for IoT, the data will be coming in pretty fast, the models will not be too large so that retraining is faster.


  3. I think it depends on the use case. Sometimes we can filter/ summarize and reduce and sometimes we cannot. I do not believe it will work most of the time.

    Regarding retraining as batch, then the model will be based on slightly late data. However, this is only a problem if underline data’s behaviour changes very fast ( then model will change). I think we can often ignore this.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s