Thinking Deeply about IoT Analytics

A typical IoT system has the following architecture.

[Figure: typical IoT architecture]

As the picture depicts, sensors collect data and transfer it to a gateway, which in turn sends it to a processing system (e.g. an analytics cloud). The gateway may optionally summarize or preprocess the data before forwarding it.

The connection between the sensors and the gateway is typically via radio frequency (e.g. Zigbee), BLE, Wi-Fi, or even wired connections. Often, the gateway is a mobile phone.

The connection from the gateway to the analytics servers is via the Internet, LAN, or a Wi-Fi connection, and uses a higher-level protocol such as MQTT or CoAP (e.g. see IoT Protocols).

Since our focus is on IoT analytics, let’s not drill into devices and connectivity. Assuming that part is done, how hard is IoT analytics? Is it just a matter of offloading the data into one of the IoT analytics platforms, or are there hidden surprises?

In this post, I try to answer those questions. Efforts under the theme of “Big data” have solved many IoT analytics challenges, especially the system challenges related to large-scale data management, learning, and data visualization. Data for “Big data”, however, came mostly from computer-based systems (e.g. transaction logs, system logs, social networks, and mobile phones). IoT data, in contrast, will come from the natural world and will be more detailed, fuzzy, and large. The nature of the data, the assumptions, and the use cases differ between old Big data and new IoT data. IoT analytics designers can build on top of big data, yet the work is far from done.

Let us look at a few things we need to worry about.

How fast do you need results?

Our design changes depending on how fast we need results from the data, and that decision depends on our use cases. We should ask ourselves: does the value of our insights (results) degrade over time, and how fast? For example, if we are going to improve the design of a product using data, then we can wait days if not weeks. On the other hand, if we are dealing with stock markets and other winner-takes-all use cases, milliseconds are a big deal.

Speed comes in several levels.

  • Few hours – send your data into a data lake and use a batch technology such as Hadoop or Spark for processing (a minimal sketch follows this list).
  • Few seconds – send data into a stream processing system (e.g. Apache Storm or Apache Samza), an in-memory computing system (e.g. VoltDB, SAP HANA), or an interactive query system (e.g. Apache Drill) for processing.
  • Few milliseconds – send data to a Complex Event Processing (CEP) system, where records are processed one by one and outputs are produced very fast.
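For the batch end of that spectrum, the following is a minimal sketch of what such processing could look like with Spark; the file paths and field names (deviceId, hourOfDay, temperature) are assumptions for illustration only.

# Minimal batch-processing sketch (assumed paths and field names, illustrative only).
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, max as smax

spark = SparkSession.builder.appName("iot-batch-report").getOrCreate()

# Read raw sensor readings that were dumped into the data lake as JSON files.
readings = spark.read.json("hdfs:///datalake/sensor-readings/2015/08/")

# Hourly average and peak temperature per device; results go back into the lake.
report = (readings
          .groupBy("deviceId", "hourOfDay")
          .agg(avg("temperature").alias("avgTemp"),
               smax("temperature").alias("maxTemp")))
report.write.mode("overwrite").parquet("hdfs:///datalake/reports/hourly-temps/")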

The following picture summarizes those observations.

[Figure: big data tooling landscape by processing latency]

Chances are we will have use cases that fall under more than one category, and then we will have to use multiple technologies.

How much data to keep?

Next, we should decide how much data to keep and in what form. It is a tradeoff between cost and the potential value of the data and the associated risks. Data is valuable: we see companies acquired just for their data, and Google and Facebook going to extraordinary lengths to access data. Furthermore, we might find a bug or an improvement in the current algorithm, and we might want to go back and rerun the algorithm on old data. Having said that, all decisions must be made with the big picture and current limits in mind.

Following are our choices.

  • Keep all the data in a data lake (the argument is that disk is cheap).
  • Process all the data in a streaming fashion and keep none of it.
  • Keep a processed or summarized version of the data. However, you may not be able to recover all the information from the summaries later.

The next question is where to do the processing and how much of that logic we should push towards the sensors. There are three options.

  • Do all processing at analytics servers
  • Push some queries into the gateway
  • Push some queries down to sensors as well.

The IoT community already has the technology to push logic to gateways. Most gateways are full-fledged computers or mobile phones, and they can run higher-level logic such as SQL-like CEP queries. For example, we have been working to place a lightweight CEP engine into mobile phones and gateways. However, if you want to push code into sensors, in most cases you would have to write custom logic using a lower-level language like Arduino C. Another associated challenge is deploying, updating, and managing queries over time. If you choose to put custom low-level filtering code into sensors, I believe that will lead to deployment complexities in the long run.

Analytics: Hindsight, Insight or Foresight?

Hindsight, insight, and foresight are three types of questions we can ask of data: what happened, why it happened, and what will happen.

Hindsight is possible with aggregations and applied statistics. We aggregate data by different groups and compare the results using statistical techniques such as confidence intervals and statistical tests. A key component is data visualization that shows related data in context (e.g. see Napoleon’s March and Hans Rosling’s famous Ted talk).
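As a purely illustrative example of this kind of hindsight analysis, the sketch below groups readings and attaches 95% confidence intervals to the group means; the file name and column names are assumptions.

# Hindsight sketch: per-group means with 95% confidence intervals (assumed columns).
import pandas as pd
from scipy import stats

df = pd.read_csv("readings.csv")            # assumed columns: building, energy_kwh

for building, group in df.groupby("building"):
    values = group["energy_kwh"].to_numpy()
    mean = values.mean()
    # 95% confidence interval for the mean using the t distribution.
    low, high = stats.t.interval(0.95, len(values) - 1,
                                 loc=mean, scale=stats.sem(values))
    print(f"{building}: mean={mean:.1f} kWh, 95% CI=({low:.1f}, {high:.1f})")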

Insight and foresight require machine learning and data mining: finding patterns, modeling current behavior, predicting future outcomes, and detecting anomalies. For a more detailed discussion, I suggest you start following data science and machine learning tools (e.g. R, Apache Spark MLlib, WSO2 Machine Learner, and GraphLab, to name a few).

IoT analytics will pose new types of problems and demand more focus on some existing problems. Following are some analytics problems that, in my opinion, will play a key role in IoT analytics.

Time Series Processing

Most IoT data are collected via sensors over time. Hence, they are time series data, and most readings are autocorrelated; for example, a temperature reading is often strongly affected by the reading from the previous time step. However, most machine learning algorithms (e.g. Random Forests or SVMs) do not consider autocorrelation, so they often do poorly when predicting with IoT data.

This problem has been extensively studied under time series analysis (e.g. the ARIMA model). Also, in recent years, Recurrent Neural Networks (RNNs) have shown promising results with time series data. However, widely used Big Data frameworks such as Apache Spark and Hadoop do not support these models yet. The IoT analytics community has to improve these models, build new models when needed, and incorporate them into big data analytics frameworks. For more information about the topic, please refer to the article Recurrent neural networks, Time series data and IoT: Part I.
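As a rough illustration of the classical time series approach (not tied to any particular big data framework), here is a minimal ARIMA sketch using statsmodels; the input file, column names, and model order are assumptions.

# Minimal ARIMA sketch for autocorrelated sensor readings (illustrative only).
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Assumed: a CSV with a timestamp column and a temperature column.
series = (pd.read_csv("temperature.csv", parse_dates=["timestamp"])
            .set_index("timestamp")["temperature"])

# AR(1)-style model: the next reading depends mostly on the previous one.
model = ARIMA(series, order=(1, 0, 0))
fitted = model.fit()

print(fitted.summary())
print(fitted.forecast(steps=12))   # forecast the next 12 readings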

Spatiotemporal Analysis and Forecasts

Similarly, most IoT data would include location data, making them spatiotemporal data sets (e.g. geospatial data collected over time). Just as with time series data, readings are affected by their spatial neighborhood. We would need to explore and learn spatiotemporal forecasting and other techniques and build tools that support them. Among the related techniques are GIS databases (e.g. GeoTrellis) and panel data analysis. Moreover, machine learning techniques such as recurrent neural networks might also be used (see Application of a Dynamic Recurrent Neural Network in Spatio-Temporal Forecasting).

Anomaly detections

Many IoT use cases, such as predictive maintenance, health warnings, finding plug points that consume too much power, and various optimizations, depend on detecting anomalies. Anomaly detection poses several challenges.

  • Lack of training data – most use cases will not have training data, and hence unsupervised techniques such as clustering should be used (a minimal sketch follows this list).
  • Class imbalance – even when training data is available, often there will be only a few dozen anomalies among millions of regular data points. This problem is generally handled by building an ensemble of models, where each model is trained with the anomalous observations and resampled data from the regular observations.
  • Click and explore – after detecting anomalies, they must be understood in context and vetted by humans. Tools, therefore, are required to show those anomalies in context and enable operators to explore the data further, starting from the anomalies. For example, if an anomaly in a turbine is detected, it is useful to see that anomaly within the regular data before and after it, as well as to be able to study similar cases that happened before.
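The sketch below illustrates the unsupervised route mentioned in the first point: cluster the (mostly normal) data and flag the points that sit far from every cluster centre. The feature names and the 99th-percentile threshold are assumptions for illustration.

# Clustering-based anomaly detection sketch (assumed features and threshold).
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("turbine_readings.csv")        # assumed columns
X = StandardScaler().fit_transform(df[["rpm", "vibration", "temperature"]])

# Cluster the data, then score each point by its distance to the nearest centre.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
distances = np.min(kmeans.transform(X), axis=1)

# Flag the most distant points (top 1%) as candidate anomalies for human review.
threshold = np.percentile(distances, 99)
anomalies = df[distances > threshold]
print(anomalies.head())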

What is our Response?

Finally, when we have analyzed and found actionable insights, we need to decide what to do with them. We have several choices.

[Figure: IoT use case types by response type]

  • Visualize the results – build a dashboard that shows the data in context and lets users explore, drill down, and do root cause analysis.
  • Alerts – detect problems and notify the user via email, SMS, or pager devices. The primary challenge is false positives, which severely affect the operator’s trust in the system. Finding the balance between false positives and ignoring true problems is tricky.
  • Carrying out actions – the next level is independent actions with an open control loop. Here, however, the risk of a wrong diagnosis could have catastrophic consequences. Until we have a deeper understanding of the context, use cases will be limited to simple applications such as turning off a light or adjusting heating, where the associated risks are small.
  • Process & environment control – this is the holy grail of automated control. The system continuously monitors and controls the environment or the underlying process in a closed control loop. The system has to understand the context and the environment, and should be able to work around failed actions. Much related work was done under the theme of Autonomic Computing (roughly 2001-2005), although few use cases ever got deployed. Real-life production deployments of this class are several years away due to the associated risks. We can think of Nest and the Google self-driving car as early examples of such systems.

In general, we move towards automation when we need fast responses (e.g. algorithmic trading). More automation can be cheaper in the long run, but likely to be complex and expensive in the short run. As we learned from stock market crashes, the associated risks must not be underestimated.

It is worth noting that automation with IoT will be harder than in big data automation use cases. Most big data automation use cases monitor either computer systems or controlled environments like factories. In contrast, IoT data will often be fuzzy and uncertain. It is one thing to monitor and change a variable in an automatic price-setting algorithm; automating a use case in the natural world (e.g. airport operations) is something different altogether. If we decide to go down the automation route, we need to spend significant time understanding, testing, and retesting our scenarios.

Understanding IoT Use cases

Finally, let me wrap up by discussing the shape of common IoT data sets and the use cases arising from them.

Data from most devices will have the following fields.

  • Timestamp
  • Location, Grouping, or Proximity Data
  • Several readings associated with the device e.g. temperature, voltage and power, rpm, acceleration, and torque, etc.

The first use case is to monitor, visualize, and alert on data from a single device. This use case focuses on individual device owners.

However, more interesting use cases occur when we look at devices as part of a larger system: a fleet of vehicles, buildings in a city, a farm etc. Among aforementioned fields, time and location will play a key role in most IoT use cases. Using those two, we can categorize most use cases into two classes: stationary dots and moving dots.

Stationary dots

Among examples of “stationary dot” use cases are equipment deployments (e.g. buildings, smart meters, turbines, pumps etc). Their location is useful only as a grouping mechanism. The main goal is to monitor an already deployed system in operation.

Following are some of the use cases.

  • View of the current status, alerts on problems, drill down and root cause analysis
  • Optimizations of current operations
  • Preventive Maintenance
  • Surveillance

Moving dots

Among examples of moving dot use cases are fleet management, logistic networks, wildlife monitoring, monitoring customer interactions in a shop, traffic, etc. The goal of these use cases is to understand and control movements, interactions, and behavior of participants.

[Video: WSO2 CEP TfL Demo]

Following are some examples.

  • Sports analytics (e.g. see the following video)
  • Geo Fencing and Speed Limits
  • Monitoring customer behavior in a shop, guided interactions, and shop design improvements
  • Visualizing (e.g. time-lapse videos) of movement dynamics
  • Surveillance
  • Route optimizations

For example, the following is a sports analytics use case built using data from a real football game.

For both types of use cases, I believe it is possible to build generic, extensible tools that provide an overall view of the devices and out-of-the-box support for some of the use cases. However, specific machine learning models such as anomaly detection would need expert intervention for best results. Such tools, if done right, could facilitate reuse, reduce cost, and improve the reliability of IoT systems. It is worth noting that this is one of the things the “Big data” community did right: a key secret of “Big data” success so far has been the availability of high-quality, generic, open source middleware tools.

Also, there is room for companies that focus on specific use cases or classes of use cases. For example, Scanalytics focuses on foot traffic monitoring and Second Spectrum focuses on sports analytics. Although expensive, they provide integrated, ready-to-go solutions. IoT system designers have a choice between going with a specialized vendor and building on top of open source tools (e.g. the Eclipse IoT platform, WSO2 Analytics Platform).

Conclusion

This post discussed different aspects of IoT analytics solutions, pointing out challenges that you need to think about while building or choosing an IoT analytics solution.

Big data has solved many IoT analytics challenges, especially the system challenges related to large-scale data management, learning, and data visualization. However, significant thinking and work is required to match IoT use cases to analytics systems.

Following are the highlights.

  • How fast do we need results? Real-time vs. batch, or a combination.
  • How much data do we keep? Based on the use cases and the incoming data rate, we might choose between keeping nothing, a summary, or everything. Edge analytics is a related aspect of the same problem.
  • From analytics, do we want hindsight, insight, or foresight? Decide between aggregation and machine learning methods. Also, techniques such as time series and spatiotemporal algorithms will play a key role in IoT use cases.
  • What is our response when we have an actionable insight? Show a visualization, send alerts, or do automatic control.

Finally, we discussed the shape of IoT data, a few reusable scenarios, and the potential of building middleware solutions for those scenarios.

Hope this was useful. If you have any thoughts, I would love to hear from you.


Taxonomy of IoT Usecases: Seeing IoT Forest from the Trees

IoT comes in many forms, and the variety of use cases seems endless. IoT devices themselves come in many types and can be arranged in different configurations.

Following are some of those device classes.

  • Ad-hoc / Home / Consumer (embeddables, wearables, holdables, surroundables; see Four types of Internet of Things?)
  • Smart Systems – they monitor the outside world, have lots of small sensors, and have hubs that connect via Zigbee or cellular, with the hubs connecting to the cloud
  • M2M / Industrial Internet (sensors are inbuilt and often pre-designed)
  • Drones and Cameras (never underestimate the most ubiquitous IoT device, the video camera)

Those devices can be used to solve a wide range of problems. Obviously, it is hard to do a complete taxonomy, yet writing even a subset down would help us a lot in understanding IoT.

The taxonomy is arranged around people, and each level moves further away from the individual and becomes higher level. The levels range from personal (e.g. wearables) to macro-level control (smart cities). The following picture shows each category.

[Figure: taxonomy of IoT use cases]

Let us look at each category in detail.

1. Wearables

Wearables are devices that are with you. They range from pills you might swallow, a Fitbit, a watch, to your mobile phone. The goal of these use cases is to make your life better.

  • Health: Fitbit, personal health (e.g. Incentives for good habits)
  • From asset tracking to smart signage, and safety
  • Sports – digital coach, better sport analytics
  • Facial Recognition with real-life analytics and interactions

2. Smart Homes

These use cases try to monitor and improve your home giving you peace of mind, comfort, and efficiency.

  • Energy efficiency, smart lighting, smart metering, smart elements, smart heating, smart rooms and bedrooms
  • Integration with calendars and other data, deriving context, and taking decisions that drive the home environment based on the current context
  • Safety and security via home surveillance, monitoring health and kids, perimeter checks for pets and kids, etc.
  • Smart gardens (e.g. watering, status monitoring)

You can find more information from 9 Ways A Smart Home Can Improve Your Life.

3. Appliances

Appliances have a dual role. On one hand, they provide new experiences to the end user and hence play a role in the Smart Home. On the other hand, they give the manufacturer better visibility into, and control of, the appliance. Devices include your car, smart lawn mowers, kettles, etc. Most products will have a digital twin that provides analytics and important information to both the consumer and the manufacturer.

Following are some use cases.

  • Products can interact with users better and optimize, learn, and adapt to the user (e.g. smart washers and dryers that notify you when done, and product displays being replaced with apps)
  • Better after-sales services: better diagnosis, remote diagnosis (efficient customer support), faster updates and critical patches
  • Adaptive and proactive maintenance as needed; with IoT, products can monitor themselves and act if there is a problem
  • Using product usage data to improve product design
  • Getting some appliances (e.g. expensive ones like lawn mowers) under a pay-per-use model rather than buying them
  • Knowing the customer better: better segmentation, avoiding churn (if the customer is not using the product, find out)
  • Hobbyists / entertainment (e.g. drone racing, drone cameras)
  • Advertisements via your appliance (e.g. a refrigerator that lets you order missing food via an app, where the manufacturer may charge companies for the recommendations it makes)

The HBR article How Smart, Connected Products Are Transforming Companies provides a good discussion of some of these use cases.

4. Smart Spaces

Smart space use cases monitor and manage a space such as a farm, a shop, or a forest. They involve pre-designed sensors as well as ad-hoc sensors like drones. Often, computer vision via cameras also plays a key role.

Following are some of the use cases.

  • Smart agriculture (watering based on moisture levels, pest control, livestock management), correlating with other data sources like weather, and delivery of pesticides etc. via drones
  • Surveillance (wildlife, endangered species, forest cover, forest fires)
  • Smart retail: smart stores (sensors to monitor what gets attention), fast checkouts (e.g. via RFID), customer analytics for stores, in-store targeted offers via smartphones, and better customer service at the store
  • Quick service restaurants (QSR) – measure staff performance & service, improve floor plans & remove bottlenecks, optimize queues & turnover
  • Smart buildings (power, security, proactive maintenance, HVAC, etc.)

For related use cases, see How The Internet of Things Will Shake Up Retail In 2015 and The Future Of Agriculture?

5. Smart Services Industries/ Logistics

These use cases use IoT to improve the services industry and logistics. They focus on monitoring and improving the underlying processes of those businesses. Following are a few examples.

  • Smart logistics and supply chain (tracking, RFID tags)
  • Service industries: airlines, hospitality, etc. The goal is efficient operations, visibility (e.g. where is my baggage?), and proactive maintenance.
  • Financial services: smart banking, usage-based insurance, better data for insurance, and fraud detection via better data
  • Better delivery of products via drones
  • Aviation – report and diagnose the problem in flight, so the fix and the parts are ready before the plane lands
  • Telecommunications networks

6. Smart Health

Smart health will be a combination of wearables, smart home, and smart services. This would include use cases like better health data through wearables, better care at hospitals, in-home care, smart pill bottles etc that would monitor and make sure medications are taken, and better integration of health records.

7. Industrial Internet

The idea of the industrial internet is to use sensors and automation to better understand and manage complex processes. Unlike smart spaces, these use cases give owners much more flexibility and control. Most of these environments already have sensors and actuators installed, and most of these use cases predate IoT and fall under M2M.

Following are some use cases.

  • Smart manufacturing
  • Power and renewable energy (e.g. wind turbines, oil and gas): operations and predictive maintenance. The goal is to add value on top of existing assets (which take about 40 years to replace).
  • Mining
  • Transport : Trains, Busses
  • HVAC and industrial machines

You can find more use cases from GE’s “making the world 1% better” initiative.

8. Smart Cities

Smart cities (and maybe nations) bring everything together and provide a macro view of everything. They focus on improving the public infrastructure and services that make urban living better.

Following are some of the use cases.

  • Waste management, smart parking (e.g. finding parking spots)
  • Traffic management (sensors, drones), air quality and water quality, smart road tax
  • Security: surveillance, gunfire sensors, smart street lighting, flooding alerts
  • Smart buildings (energy, elevators, lighting, HVAC), smart bridges and constructions (embedding lots of sensors into the concrete, etc.)
  • Urban planning

You can find more information from articles How Big Data And The Internet Of Things Create Smarter Cities, and Smart Cities — A $1.5 Trillion Market Opportunity.

Conclusion

As we saw, use cases come in many forms and shapes, and they will likely get integrated with, and change, our lives at many different levels. This is why analysts have forecast an unprecedented number of devices (e.g. 15-50B by 2020) and a market size (e.g. 1-7 trillion by 2020) for IoT that dwarf any earlier trend like SOA or Big data.

Following are a few observations about the use cases.

  • Each use case tries to solve a real problem. They do this by finding a problem, instrumenting data around it, and analyzing that data and providing actionable insights or carrying out actions.
  • Some use cases are enabled by creative sensors, such as using camera to measure your heart rate or sensors mixed into the concrete while building a bridge.
  • Analytics are present in almost all use cases. One of the key, yet often unspoken, assumptions is that all data gets collected and analyzed later. We call this batch analytics.
  • However, a lot of use cases need realtime decisions and sometimes need to act on those decisions. There have been many efforts on realtime analytics, but comparatively less work has been done on acting on the decisions.
  • These use cases might lead to other use cases such as showing related advertisements on your appliance or on the associated mobile App.

Hope this was useful. I would love to hear your thoughts about the different categories and use cases.

WSO2 CEP 4.0.0: What is New? Storm support, Dashboards, and Templates

WSO2 CEP 4.0 is out. You can find the pack on the WSO2 website.

The first thing to note is that we have integrated batch, realtime, interactive, and predictive analytics into one platform called WSO2 Data Analytics Server (DAS). Please refer to my earlier blog, Introducing WSO2 Analytics Platform: Note for Architects, to understand how WSO2 CEP fits into DAS. The DAS release is coming soon.

Let us discuss what is new in WSO2 CEP 4.0.0, and what would those features mean to the end user.

Storm Integration

WSO2 CEP supports distributed query executions on top of Apache Storm. Users can provide CEP queries that have partitions, and WSO2 CEP can automatically build a Storm Topology that is equivalent to the query, deploy, and run it. Following is an example of such a query. CEP will build a Storm topology, which will first partition the data by region, run the first query within each partition, and then collect the results and run the second query.

 
define partition on TempStream.region {
  from TempStream[temp > 33]
  insert into HighTempStream; 
}
from HighTempStream#window(1h)
  select max(temp) as max
  insert into HourlyMaxTempStream;

Effectively, WSO2 CEP provides a SQL-like stream processing language that runs on Apache Storm. Please refer to the talk I did at O’Reilly Strata for more information (slides).

Analytics Dashboard

WSO2 CEP now includes a dashboard and a Wizard for creating charts using data from event streams.

[Figure: chart creation wizard]

From the Wizard, you can choose a stream, select a chart type, assign its properties to different dimensions of the plot, and generate a chart. For example, you can specify a scatter plot where the x-axis maps to time, the y-axis maps to hit count, point colour maps to country, and point size maps to population. The charts are connected to CEP through WebSockets, and the scatter plot updates when new data become available in the underlying event stream.

Query Templates

CEP queries are complicated, and it is not simple for a non-technical user to write new queries. With these templates, developers can write parameterized queries and save them as templates. Users can then provide values for a template using a form and deploy it as a query.

For example, let’s assume we want end users to create a query that detects high-speed vehicles, where the end user defines the speed. We would write a parameterized query template like the following.

 
from VehicleStream[speed > $1]

When the end user selects the template, they will see a form that lets them specify the speed value. CEP will deploy a new query using the template and the speed value at the click of a button. Following is an example form.

[Figure: example query template form]

Furthermore, WSO2 CEP now includes a Geo Dashboard that you can configure via query templates. The following video shows a visualization of London traffic data using the Geo Dashboard.

Siddhi Language Improvements

With WSO2 CEP 4.0, queries that define partitions run each partition in parallel. Earlier, all executions ran in a single thread, and CEP would only use a single core per execution plan (a collection of queries). The new approach significantly improves performance for some use cases.

Furthermore, CEP can now run Machine Learning models built with WSO2 ML and PMML models. It supports several anomaly detection algorithms as described in Fraud Detection and Prevention: A Data Analytics Approach white paper.

In addition, we have added time series regression and a forecaster as Siddhi functions, along with several new functions for string manipulation and mathematics. It also includes a CronWindow that triggers based on a Cron expression (see sample 115 for more details), which users can use to define time windows that start at a specific time.

Also, you can now pack all queries and related artifacts into a single WSO2 Carbon archive, which makes it easier for users to manage their CEP execution plans as a single unit.

New Transports

WSO2 CEP can now receive and send events using MQTT, one of the leading Internet of Things (IoT) protocols. It also includes support for WebSockets, which makes it much easier to build web apps that use WSO2 CEP.

Tools

WSO2 CEP now includes an Event Simulator that you can use to replay events stored in a CSV file for testing and demo purposes. Furthermore, it has a “Try it” feature that lets users send events into CEP using its web console, which is also useful for testing.

Conclusion

Please try it out. It is all free under the Apache License. We would love to hear your thoughts. If you find any problems or have suggestions, drop me a note or get in touch with us via architecture@wso2.org.

WSO2 Machine Learner: Why would You care?

After about a year’s worth of work, WSO2 Machine Learner (WSO2 ML) is officially out. You can find the pack at http://wso2.com/products/machine-learner/ (also Code and User Guide). It is free and open source under the Apache License (which pretty much means you can do whatever you want with the code as long as you retain the license notice).

Let me try to answer “the question”. How is it different and why would you care?

What is it?

The short answer is that it is a Wizard and a system on top of Apache Spark MLlib. The long answer is the following picture.

[Figure: WSO2 ML overview]

You can use it to do the following

  1. Start with your data (on disk, in HDFS, or in WSO2 DAS)
  2. Explore the data (more about that later)
  3. Create a Project and build machine learning models going through a Wizard
  4. Compare those models and find the best model
  5. Export that model and use it with WSO2 CEP, WSO2 ESB, or from Java Code.

For someone from Enterprise World?

WSO2 Machine Learner is designed for the enterprise world. It comes as an integrated solution with the rest of the Big Data processing technologies: batch, realtime, and interactive analytics. It also includes support all the way from data collection and analysis to communication (e.g. visualizations, APIs, and alerts). Please see the earlier post “Introducing WSO2 Analytics Platform: Note for Architects” for more details. Hence, it is part of a complete analytics solution.

WSO2 ML handles the full predictive analytics lifecycle, including model deployment and management.

[Figure: ML model deployment]

If you are already collecting data, we can pull that data, process it, and build models. The models you build are immediately available for use from your main transaction flow (via WSO2 ESB) or data analysis flow (via WSO2 CEP). Basically, you copy the model ID and add it to WSO2 ESB mediation scripts or WSO2 CEP queries, and now you have machine learning integrated into your business (please see Using Models for more information). This handles details like keeping a central store of models while deploying them in production, and it also lets you quickly switch between models.

If you are not collecting data, you can start with WSO2 DAS and go from there. The same story holds.

Furthermore, it gives you the concept of a project, where you can try out and keep track of multiple machine learning models. It also handles details like sending you an email when a long-running machine learning algorithm execution has completed.

Finally, as we discuss in the next section, the ML Wizard is built in such a way that you can use it with a minimal understanding of machine learning. Sure, you will not get the same accuracy as an expert who knows how to tune the thing, but it can get you started and give you OK accuracy.

For a Machine Learning Newbie?

First of all, you need to understand what machine learning can do for you. For most problems, we know the exact steps to follow to solve the problem. With those kinds of problems, all we have to do is write code that performs those steps. This is what we call programming, and a lot of us do it day in, day out.

However, there are other problems that we learn by example. Driving a car, cycling, and drawing a picture are problems that we learn by looking at examples. If you want a computer to solve those problems, you cannot write a program for them because you do not know the algorithm. Machine learning is used to solve exactly those problems. Instead of the algorithm, you give it lots of examples, and machine learning will learn a model (a function) from those examples. You can then use the model to solve your original problem. Google’s driverless car does exactly this.

If you are new to Machine Learning, I highly recommend looking at A Visual Introduction to Machine Learning and the following talk by Ron Beckerman.

The Machine Learner wizard tries to model the experience around what you want to do, as opposed to showing you a lot of ML algorithms. For example, you can choose to predict the next value, classify something into one of several categories, or detect an anomaly. You can click through, use the defaults, and get a model. You can try several algorithms and compare them with each other.

We support several standard techniques to compare ML models, such as the ROC curve, confusion matrix, etc. CD’s blog post “Machine Learning for Everyone” talks about this in detail.

For example, the following confusion matrix shows how many true positives, false negatives, etc. resulted from the model.

[Figure: confusion matrix]

The left chart shows a scatter plot of data points that were predicted correctly and incorrectly, while the right-hand side shows the ROC curve.

[Figures: predicted vs. actual scatter plot and ROC curve]

However, at this point I suggest that you read How to Evaluate Machine Learning Models: Classification Metrics by Alice Zheng. It is OK not to know how ML algorithms work, but you must know which models are better and why.
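If you want to compute the same kinds of metrics outside the Wizard, a minimal scikit-learn sketch looks like the following; the toy labels and scores are made up purely for illustration.

# Classification metrics sketch: confusion matrix and ROC AUC on toy data.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true   = [0, 0, 1, 1, 1, 0, 1, 0]                    # actual labels
y_scores = [0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6]    # model scores
y_pred   = [1 if s >= 0.5 else 0 for s in y_scores]    # threshold at 0.5

print(confusion_matrix(y_true, y_pred))                # rows: actual, columns: predicted
print("ROC AUC:", roc_auc_score(y_true, y_scores))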

However, there is a catch. If you try well-known machine learning datasets, they will work well (you can find a few such data sets in the samples directory of the pack). However, with real datasets, getting good results sometimes needs transforming features into different features, and that might be beyond you if you have just started. If you want to go pro and learn to transform features (a.k.a. feature engineering) and other fascinating stuff, then Andrew Ng’s famous course https://www.coursera.org/learn/machine-learning is the best place to start.

For a Machine Learning Expert?

If you are an ML expert, WSO2 Machine Learner can still help in several ways.

First, it provides pretty sophisticated support for exploring the dataset based on a random sample. This includes scatter plots for looking at any two numerical features, parallel sets for looking at categorical data, Trellis charts for looking at 4-5 numerical dimensions at the same time, and cluster diagrams (see below for some examples).

[Figures: cluster diagram, Trellis chart, and parallel sets]

Second, it gives you access to a large collection of scalable machine learning algorithms pretty easily. For a single-node setup, you just download and unzip it (see below for how to do it).

Third, it provides an extensive set of model comparison measures as visualizations and also lets you compare models side by side.

Fourth, in addition to predictive analytics, you have access to batch analytics through Spark SQL, interactive analytics with Lucene, and realtime analytics through WSO2 CEP. This makes understanding the dataset as well as preprocessing the data much easier. One limitation of this release is that those other types of analytics must be done before using the data within WSO2 ML. However, the next release will enable you to run queries within the WSO2 ML pipeline as well.

Finally, you will also have all advantages listed under enterprise user such as seamless deployment of models and ability to switch the model easily.

Furthermore, many interesting features are coming shortly in the next release.

  • Support for Deep Learning and Neural Networks
  • Support for out of the Box Anomaly detection using Markov Chains and Clustering
  • Support for data cleanup and preprocessing using Data Wrangler and Spark SQL
  • Support for out of the box ensembles that let you combine models
  • Improvements to the pipeline to warn the user about cases like class imbalance in classification

Trying it Out

Carry out the following steps.

  1. Download WSO2 ML from http://wso2.com/products/machine-learner/
  2. Make sure you have Java 7 installed on your machine and JAVA_HOME set.
  3. Unzip the pack and run bin/wso2server.sh from the unpacked directory. Wait for WSO2 ML to start.
  4. Go to https://hostname:9443/ml and Login using username admin and password admin.
  5. Now you can upload your own dataset and follow along with the Wizard. You can find more info in the User Guide; however, the Wizard should be self-explanatory.

Remember, it is all free under the Apache License. Give it a try; we would love to hear your thoughts. If you find any problems or have suggestions, report them via https://wso2.org/jira/browse/ML.

What Data Science and Big Data can do for Sri Lanka?

I am sure you have heard enough about Big data (processing and handling large amounts of data) and data science (how to make decisions with data). There is lots of chatter on how they are going to solve all the problems we know, bring about world peace, and how we will live happily ever after.

Let’s try to slow down, look around, and discuss what they can really do for a country like Sri Lanka. Well, first, we can build some great Big Data tools, sell them, and bring lots of export revenue to Sri Lanka. However, that is selling shovels to the gold diggers at the gold rush (not a bad business proposition). Instead, let’s try to understand how Big Data can make a difference in day-to-day lives.

Thinking about BigData

Big data must be viewed not as a large infrastructure operation, but as a medium to connect different entities and to collect and analyze information, which lets us instill order into existing processes and create new ones. It can give us a holistic picture of what is going on, sometimes predict what will happen, and add order to chaos by ranking and rating different items and entities. For example, given a sea of information (e.g. web, social media, error tickets, requests for help, transactions, etc.), it can find out what the most important items are and who has something important to tell. Furthermore, by creating alignment between individual gain and quality information in the system, it will nudge participants to create better content and sometimes better behavior.

Following are some use cases that, in my opinion, could help Sri Lanka. They are arranged by how practical they are, and I have listed any reservations and challenges with each.

Urban Planning and Policy Decisions

[CC-licensed image]

Understand social dynamics like people’s geographic distribution, demographic distribution, and mobility patterns to aid in policy and urban planning. This can be done through data sets like the census, CDR data, social media data (in the right context), etc. The good news is that this is already underway at LIRNEasia (see the Big Data for Development project, http://lirneasia.net/projects/bd4d/). However, there are many problems left to solve. If you are a Sri Lankan research student looking for a thesis topic, chances are you can find a dozen good problems in this project.

Traffic

If you work in Colombo, you are no stranger to this. I travel about 35 km to work daily, and on a bad day, we travel at about 15 km/h. To be fair, Sri Lankan traffic is better than most places in India and even some places in the US (e.g. San Francisco 101 traffic). Yet a large number of people waste lots of time, and at the current rate of vehicle increase, things will soon become unmanageable.

[Image: traffic]

The Colombo traffic plan introduced 6-7 years ago fixed many things, and new roads certainly helped. However, we still cannot measure traffic fast enough. Most decisions are made via manual vehicle counting and a few automatic counters. We need a way to measure traffic at a higher resolution, faster, and more accurately. Then we can understand what is going on and plan around it.

Among ideas to collect data are

  1. Build an automatic traffic counter (the University of Moratuwa ENT department has built one already), or use number plate reading technology; each unit, IMO, should end up costing less than 10K LKR
  2. Use social media feeds like @road_lk (e.g. see Real-time Natural Language Processing for Crowdsourced Road Traffic Alerts)
  3. Collect data from traffic officers
  4. Or a combination of the above.

We must understand that the reason this does not happen is neither want of technology nor want of money, but want of concentrated effort. If we have more data, fast enough, we can do better modeling and plan around bottlenecks. Eventually, we can also act on traffic incidents in realtime.

Manage Donors and Charities

We are a culture that donates from what little we have; Sri Lankans, both rich and poor, donate alike. However, it is not clear how much of that is put to good use, how much gets lost along the way, or how much lasting impact it leaves.


Using data collection, social media, and independent verification, we could build much more accountability and visibility into charitable activities, and we could prioritize and try to make a lasting impact.

For example, if a random person asked for help, I might not trust him. However, if a newspaper reporter has done a report, then I have a bit more trust. If a well-known person in society asked for help, there is even more trust. If the recommendation comes from a personal friend whom I know, that is better still. And if someone with credibility can pledge to follow up, it makes a big difference. We could build such a system, rank requests as well as the people involved, and bring greater trust and efficiency into the system. The model can be extended to independent verification of what was carried out, and also to track long-term change. Data collected over the process can be used to rate the different parties involved as well as to optimize the process.

Day-to-day Maintenance

[Images: newspaper reports]

It seems that to get anything fixed, Sri Lanka needs to create news. A system that needs a newspaper report to get a public lavatory fixed cannot go far. This can be fixed by borrowing an issue reporting system from open source. We need a geo-tagged complaints and maintenance request map that lets people upvote and downvote tickets with photographic evidence. Government authorities can then monitor this and act accordingly, and the government can enforce SLAs to check and act. The most important aspect, however, is that this creates a paper trail that makes sure the relevant authorities cannot claim ignorance. Moreover, you cannot stop issues from being reported by chasing people away.

Do we have the connectivity to make this work? I think we do. Chances are that it is easier to find a Nanasala (community Internet stations placed in public places in Sri Lanka) than to go to an office and convince officials to write down a complaint.

Will we sink in a sea of complaints? Yes, we will! This is where data science comes in. We need a rating system against tickets and reporters that enforces reputation. That can be done!

It is worth noting that garbage handling is another version of the same problem that we can solve using a similar method, where people can report illegal dumping, provide intelligence, and flag inefficient collection.

Few More ideas

  1. Law and order (police investigations) – tracking data about crimes committed, building a database of known felons that people can check against, and studying the distribution and dynamics of crime to adjust officer deployments.
  2. Health records – let each person keep his or her own health record history, with the ability for researchers to anonymously query health records to find higher-level patterns. Furthermore, let patients rate and complain about doctors, with a system to verify and act.
  3. Health – build a wearable-based, in-home health care solution (a good idea for a startup) based on a subscription. Sri Lanka is one of the countries that treat their elderly very well.
  4. Connecting export opportunities and social enterprises – bring technology to what organizations like Sarvodaya are doing while acting as the bridge: finding potential markets, introducing potential suppliers, and providing training and micro-financing.
  5. Crisis response – analyzing and coordinating efforts
  6. Disease spread – hotspot identification, prediction, preventive actions

Analysis of Retweeting Patterns in Sri Lankan 2015 General Election

The election is done. Just the other day, we saw an analysis of twitter activity by Yudhanjaya, where he observed that a few accounts shape the tone of the twitter LKA community.

In this post, I take a detailed look at the retweet network, drilling further into the twitter community. The dataset is the retweet graph for the twitter hashtags #GESL15 and #GenElecSL collected between the 4th and 22nd of August. The archive includes 14k tweets, of which about 9k are retweets, and 2,480 accounts have participated. I collected the data through hawkeys.info. The analysis was done using a set of R scripts.

Following are some interesting observations.

How does the community look like?

[Figures: dense and sparse retweet network visualizations]

The graph shows a visualization of the community. Each vertex represents an account. The first dense chart shows an edge for each retweet and the second sparse graph only shows an edge if five or more retweets have happened between the two accounts. The size of each node shows the number of retweets the node has received.

The community is arranged around a few accounts that act as hubs, and the first 10 authors have received about 40% of all retweets. This confirms Yudhanjaya’s observations.

Furthermore, both graphs are well connected; even the sparse graph is fully connected. Often in political conversations, different groups tend to segregate and cross talk is minimal. However, that is not the case in the LKA twitter graph. Maybe the presence of journalists as hubs in the network has enabled cross talk between groups.

The first table below shows the 15 accounts that had most retweets and the second table shows vertex betweenness values. Vertex betweenness is a measure of each node’s ability to connect different parts of the network.

[Tables: top accounts by retweets received and by vertex betweenness]

Four accounts appear in both measures, which further confirms their prominence.
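The original analysis used R scripts; purely as an illustration, an equivalent sketch in Python with networkx could compute retweets received and betweenness from an edge list as follows (the file and column names are assumptions).

# Retweet-graph sketch: retweets received and betweenness per account (illustrative).
import networkx as nx
import pandas as pd

# Assumed edge list: one row per retweet, columns 'retweeter' and 'author'.
edges = pd.read_csv("retweets.csv")

G = nx.DiGraph()
for _, row in edges.iterrows():
    # Edge points to the account being retweeted; parallel retweets between the
    # same pair collapse into one edge in this simple sketch.
    G.add_edge(row["retweeter"], row["author"])

# Retweets received per account (in-degree), top 15.
print(sorted(G.in_degree(), key=lambda x: x[1], reverse=True)[:15])

# Vertex betweenness: a node's ability to connect different parts of the network.
betweenness = nx.betweenness_centrality(G)
print(sorted(betweenness.items(), key=lambda x: x[1], reverse=True)[:15])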

What suggests a good reach?

The following two charts look for a correlation between the number of tweets or the number of followers and the number of retweets.

[Figures: tweets vs. retweets (bots marked) and followers vs. retweets]

However, the data does not show such behavior. There are several bots that generate lots of tweets, but they do not generate many retweets. Furthermore, the accounts that receive many retweets have only about 50-100 tweets (about 4-5 per day). This is evidence that it is content, not the network structure, that drives retweets, although it is not conclusive.

The relationship between the number of followers and retweets is also not very clear. Although there are a few accounts that have many followers and retweets, there are notable exceptions in the graph. Most likely this is caused by the highly connected nature of the network, where followers are replaced by fast propagation through the network. Hence, you can have a lot of reach without having a lot of followers.

What did they talk about?

The following picture shows a word cloud generated using all the tweets. There are not many surprises there.

[Figure: word cloud of the tweets]

In contrast, the top retweets of each day provide a superb chronicle of what happened day by day. You can get a view close to this by typing #GenElecSL into the twitter search box.

[Figure: top retweets by day]

Conclusion

The twitter community for the Sri Lankan general election 2015 has a very well connected retweet graph. Although a few accounts shape the tone of the discussion, they seem to do a good job of enabling cross-communication between different groups. Reach, measured via retweets, seems to be independent of attributes like the frequency of tweets and the number of followers the account has. Finally, the most retweeted tweets of each day seem to provide a useful chronicle of what happened in the election each day.

Note: the community graph does not show followers. Hence, a retweet can happen via another account as well (e.g. B retweets A’s message, and C, having seen B’s retweet, retweets A’s post; then both C and B have an edge from A in this graph). Therefore, these results do not discredit the follower network in the community.

Patterns for Streaming Realtime Analytics

Introduction

We did a tutorial at DEBS 2015 (the 9th ACM International Conference on Distributed Event-Based Systems) describing a set of realtime analytics patterns. Following is a summary of the tutorial.

Realtime analytics (or what people call real-time analytics) has two flavors.

  1. Realtime Interactive / Ad-hoc Analytics – users issue ad-hoc dynamic queries and the system responds interactively. Examples of such systems are Druid, SAP HANA, VoltDB, MemSQL, and Apache Drill.
  2. Realtime Streaming Analytics – users issue static queries once, the queries do not change, and the system processes data as they come in without storing them. CEP and stream processing are two example technologies that enable streaming analytics.

Realtime interactive analytics lets users explore a large data set by issuing ad-hoc queries, which should respond within about 10 seconds, the upper bound for acceptable human interaction. In contrast, this tutorial focuses on realtime streaming analytics: processing data as they come in, without storing them, and reacting very fast, often within a few milliseconds. Such technologies are not new; the history goes back to active databases (2000+), stream processing (e.g. Aurora (2003), Borealis (2005+), and later Apache Storm), distributed streaming operators (2005), and Complex Event Processing.

Still, when thinking about realtime analytics, many people think only of counting use cases. As we shall discuss, counting use cases are only the tip of the iceberg of real-life realtime use cases. Since the input data arrive as a data stream, a time dimension is always present in the data. This time dimension allows us to implement much more powerful use cases. For example, Complex Event Processing technology provides operators like windows, joins, and temporal event sequence detection.

Stream processing technologies like Apache Samza and Apache Storm have received much attention under the theme of large-scale streaming analytics. However, these tools force every programmer to design and implement realtime analytics processing from first principles.

For example, if users need a time window, they need to implement it from first principles. This is like every programmer implementing his own list data structure. A better understanding of common patterns will let us understand the domain better and build tools that handle those scenarios. This tutorial tries to address that gap by describing 13 common realtime streaming analytics patterns and how to implement them. In the discussion, we draw heavily from real-life use cases done with Complex Event Processing and other technologies.

Realtime Streaming Analytics Patterns

Before looking at the patterns, let’s first agree on the terminology. Realtime streaming analytics accepts input as a set of streams, where each stream consists of many events ordered in time. Each event has many attributes, but all events in the same stream have the same set of attributes, or schema.

Pattern 1: Preprocessing

Preprocessing is often done as a projection from one data stream to another or through filtering. Potential operations include

  • Filtering and removing some events
  • Reshaping a stream by removing, renaming, or adding new attributes to a stream
  • Splitting and combining attributes in a stream
  • Transforming attributes

For example, from a twitter data stream, we might choose to extract the fields: author, timestamp, location, and then filter them based on the location of the author.
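In a general-purpose language, the same projection and filter takes only a few lines. The sketch below uses a plain Python generator and assumed field names; a CEP engine would express this declaratively instead.

# Pattern 1 sketch: project a few fields out of each event and filter by location.
def preprocess(tweet_stream, wanted_location="London"):
    """Yield (author, timestamp, location) for tweets from the wanted location."""
    for event in tweet_stream:                  # each event is an assumed dict
        projected = {
            "author": event["user"],
            "timestamp": event["created_at"],
            "location": event.get("location"),
        }
        if projected["location"] == wanted_location:
            yield projected

# Usage: for e in preprocess(incoming_tweets): handle(e)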

Pattern 2: Alerts and Thresholds

This pattern detects a condition and generates alerts when it is met (e.g. an alarm on high temperature). The alert can be based on a simple value or on more complex conditions such as the rate of increase.

For example, in the TFL (Transport for London) demo video, based on transit data from London, we trigger a speed alert when a bus has exceeded a given speed limit.

We can also generate alerts for scenarios such as the server room temperature continually increasing for the last 5 minutes.

Pattern 3: Simple Counting and Counting with Windows

This pattern includes aggregate functions like min, max, percentiles, etc., which can often be computed without storing all the data (e.g. counting the number of failed transactions).

However, counts are often more useful with a time window attached (e.g. the failure count for the last hour). There are many types of windows: sliding vs. batch (tumbling) windows, and time-based vs. length-based windows. There are four main variations.

  • Time, sliding window: keeps each event for the given time window and produces an output whenever a new event is added or an old one is removed.
  • Time, batch window: also called a tumbling window; it only produces output at the end of the time window.
  • Length, sliding window: same as the time, sliding window, but keeps a window of the last n events instead of selecting them by time.
  • Length, batch window: same as the time, batch window, but keeps a window of n events instead of selecting them by time.

Also there are special windows like decaying windows and unique windows.
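As an illustration, a length-based sliding window with a running count can be sketched with a double-ended queue; a time-based window would evict events by timestamp instead. This is not how any particular engine implements windows, just the idea.

# Pattern 3 sketch: sliding length window that keeps a running failure count.
from collections import deque

class SlidingLengthWindow:
    def __init__(self, length):
        self.events = deque(maxlen=length)   # the oldest event drops out automatically

    def add(self, event):
        self.events.append(event)
        # Emit an output every time the window content changes.
        return sum(1 for e in self.events if e.get("status") == "FAILED")

window = SlidingLengthWindow(length=100)
for event in [{"status": "OK"}, {"status": "FAILED"}, {"status": "OK"}]:
    print("failures in window:", window.add(event))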

Pattern 4: Joining Event Streams

The main idea behind this pattern is to match up multiple data streams and create a new event stream. For example, let’s assume we play a football game where both the players and the ball have sensors that emit events with their current location and acceleration. We can use joins to detect when a player has kicked the ball. To that end, we can join the ball location stream and the player stream on the condition that they are within one meter of each other and the ball’s acceleration has increased by more than 55 m/s².

Among other use cases are combining data from two sensors and detecting the proximity of two vehicles.
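A rough sketch of such a join, assuming both streams are already keyed by timestamp, is shown below; the one-meter and 55 m/s² thresholds come from the example above, everything else is assumed.

# Pattern 4 sketch: join ball and player events on time and proximity (illustrative).
import math

def detect_kicks(ball_events, player_events):
    """ball_events: dict timestamp -> ball reading; player_events: dict timestamp -> list of player readings."""
    for ts, ball in ball_events.items():
        for player in player_events.get(ts, []):
            distance = math.hypot(ball["x"] - player["x"], ball["y"] - player["y"])
            if distance < 1.0 and ball["accel"] > 55.0:
                yield {"time": ts, "player": player["id"]}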

Pattern 5: Data Correlation, Missing Events, and Erroneous Data

This pattern and pattern 4 have a lot in common: here too we match up multiple streams, but in addition we also correlate the data within the same stream. This is needed because different sensors can send events at different rates, and many use cases require this fundamental operator.

Following are some possible scenarios.

  1. Matching up two data streams that send events in different speeds
  2. Detecting a missing event in a data stream (e.g. detecting a customer request that has not been responded to within 1 hour of its reception)
  3. Detecting erroneous data (e.g. Detect failed sensors using a set of sensors that monitor overlapping regions and using those redundant data to find erroneous sensors and removing their data from further processing)

Pattern 6: Interacting with Databases

Often we need to combine realtime data with historical data stored on disk. Following are a few examples.

  • When a transaction happened, lookup the age using the customer ID from customer database to be used for Fraud detection (enrichment)
  • Checking a transaction against blacklists and whitelists in the database
  • Receive an input from the user (e.g. Daily discount amount may be updated in the database, and then the query will pick it automatically without human intervention.)

Pattern 7: Detecting Temporal Event Sequence Patterns

With regular expressions, we detect a pattern of characters within a sequence of characters. Similarly, given a sequence of events, we can write a regular expression to detect a temporal sequence of events arranged in time, where each event, or condition on an event, plays the role a character plays in a string.

An often cited example, although a bit simplistic, is that a thief, having stolen a credit card, would try a small transaction to make sure the card works and then do a large transaction. Here the small transaction followed by the large transaction is a temporal sequence of events arranged in time, and it can be detected using a regular expression written on top of the event sequence.
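The same small-then-large sequence can be sketched as a tiny state machine keyed by card number; a CEP engine would express this declaratively, and the thresholds here are made up.

# Pattern 7 sketch: flag a small transaction followed by a large one on the same card.
SMALL, LARGE = 10.0, 1000.0           # illustrative thresholds

def detect_suspicious(transactions):
    last_small = {}                    # card -> timestamp of the last small transaction
    for txn in transactions:           # txn: {"card": ..., "amount": ..., "time": ...}
        card, amount = txn["card"], txn["amount"]
        if amount <= SMALL:
            last_small[card] = txn["time"]
        elif amount >= LARGE and card in last_small:
            yield {"card": card, "small_at": last_small.pop(card), "large_at": txn["time"]}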

Such temporal sequence patterns are very powerful. For example, the following video shows realtime analytics done using data collected from a real football game. This was the dataset from the DEBS 2013 Grand Challenge.

In the video, we used patterns on event sequence to detect the ball possession, the time period a specific player controlled the ball. A player possessed the ball from the time he hits the ball, until someone else hits the ball. This condition can be written as a regular expression: a hit by me, followed by any number of hits by me, followed by a hit by someone else. (We already discussed how to detect the hits on the ball in Pattern 4: Joins).

Pattern 8: Tracking

The eighth pattern tracks something over space and time and detects given conditions. Following are a few examples.

  • Tracking a fleet of vehicles, making sure that they adhere to speed limits, routes, and geo-fences.
  • Tracking wildlife, making sure they are alive (they will not move if they are dead) and making sure they will not go out of the reservation.
  • Tracking airline luggage and making sure it is not sent to the wrong destination
  • Tracking a logistics network and figuring out bottlenecks and unexpected conditions.

For example, the TFL demo we discussed under pattern 2 shows an application that tracks and monitors London buses using the open data feeds exposed by TFL (Transport for London).

Pattern 9: Detecting Trends

We often encounter time series data. Detecting patterns in time series data and bringing them to the operator’s attention are common use cases.

Following are some examples of trends (a small detection sketch follows the list).

  • Rise, Fall
  • Turn (switch from rise to a fall)
  • Outliers
  • Complex trends like triple bottom etc.
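Here is a minimal sketch of detecting one such trend, a turn (a rise followed by a fall), by comparing consecutive readings; the three-point window is an assumption for illustration.

# Pattern 9 sketch: detect a "turn" (rise followed by fall) over consecutive readings.
def detect_turns(values):
    for prev, curr, nxt in zip(values, values[1:], values[2:]):
        if prev < curr > nxt:          # local peak: the series rose, then fell
            yield curr

print(list(detect_turns([1, 3, 7, 5, 4, 6, 9, 8])))   # -> [7, 9]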

These trends are useful in a wide variety of use cases such as

  • Stock markets and Algorithmic trading
  • Enforcing SLA (Service Level Agreement), Auto Scaling, and Load Balancing
  • Predictive maintenance (e.g. guessing that the hard disk will fill up within the next week)

Pattern 10: Running the same Query in Batch and Realtime Pipelines

This pattern runs the same query in both the realtime and batch pipelines. It is often used to fill the gap left in the data due to batch processing. For example, if batch processing takes 15 minutes, the results would lack the data from the last 15 minutes.

The idea of this pattern, which is sometimes called the “Lambda Architecture”, is to use realtime analytics to fill that gap. Jay Kreps’s article “Questioning the Lambda Architecture” discusses this pattern in detail.

Pattern 11: Detecting and switching to Detailed Analysis

The main idea of this pattern is to detect a condition that suggests an anomaly and then analyze it further using historical data. It is used when we cannot analyze all the data in full detail; instead, we analyze only the anomalous cases in full detail. Following are a few examples.

  • Use basic rules to detect fraud (e.g. a large transaction), then pull out all transactions done against that credit card for a larger time period (e.g. 3 months of data) from the batch pipeline and run a detailed analysis
  • While monitoring weather, detect conditions like high temperature or low pressure in a given region, and then start a high-resolution localized forecast for that region.
  • Detect good customers, for example through expenditure of more than $1000 within a month, and then run a detailed model to decide on the potential of offering a deal.

Pattern 12: Using a Model

The idea is to train a model (often a machine learning model) and then use it within the realtime pipeline to make decisions. For example, you can build a model using R, export it as PMML (Predictive Model Markup Language), and use it within your realtime pipeline. Examples include fraud detection, segmentation, predicting the next value, and predicting churn.

Pattern 13: Online Control

There are many use cases where we need to control something online. The classical use cases are autopilot, self-driving, and robotics. These would involve problems like current situation awareness, predicting next value(s), and deciding on corrective actions.

You can find details about pattern implementations from the following slide deck, and source code from https://github.com/suhothayan/DEBS-2015-Realtime-Analytics-Patterns.