WSO2 CEP 4.0.0: What is New? Storm support, Dashboards, and Templates

WSO2 CEP 4.0 is out, and you can find the pack on the WSO2 CEP product page.

The first thing to note is that we have integrated batch, realtime, interactive, and predictive analytics into one platform called WSO2 Data Analytics Server (DAS). Please refer to my earlier blog, Introducing WSO2 Analytics Platform: Note for Architects, to understand how WSO2 CEP fits into DAS. The DAS release is coming soon.

Let us discuss what is new in WSO2 CEP 4.0.0 and what those features mean to the end user.

Storm Integration

WSO2 CEP supports distributed query execution on top of Apache Storm. Users can provide CEP queries that have partitions, and WSO2 CEP can automatically build an equivalent Storm topology, deploy it, and run it. Following is an example of such a query. CEP will build a Storm topology that first partitions the data by region, runs the first query within each partition, and then collects the results and runs the second query.

 
define partition on TempStream.region {
  from TempStream[temp > 33]
  insert into HighTempStream; 
}
from HighTempStream#window(1h)
  select max(temp) as max
  insert into HourlyMaxTempStream;

Effectively, WSO2 CEP provides a SQL-like stream processing language that runs on Apache Storm. Please refer to the talk I did at O’Reilly Strata for more information (slides).

Analytics Dashboard

WSO2 CEP now includes a dashboard and a Wizard for creating charts using data from event streams.

wizard

From the Wizard, you can choose a stream, select a chart type, assign stream attributes to different dimensions of the plot, and generate a chart. For example, you can specify a scatter plot where the x-axis maps to time, the y-axis maps to hit count, point colour maps to country, and point size maps to population. The charts are connected to CEP through WebSockets, so the scatter plot updates as new data becomes available in the underlying event stream.

Query Templates

CEP queries can be complicated, and it is not easy for a non-technical user to write new queries. With query templates, developers can write parameterised queries and save them as templates. Users can then provide values for a template through a form and deploy it as a query.

For example, let’s assume we want end users to deploy a query that detects high-speed vehicles, where the end user defines the speed threshold. We would write a parameterised query template like the following.

 
from VehicleStream[speed > $1]

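For illustration, a complete template would also name an output stream; the HighSpeedStream name below is hypothetical and only added to make the sketch self-contained:

from VehicleStream[speed > $1]
select *
insert into HighSpeedStream;

When a user enters, say, 100 in the form, CEP deploys this query with $1 replaced by 100.
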
The end user, when selecting the template, sees a form that lets them specify the speed value. CEP deploys a new query using the template and the speed value at the click of a button. Following is an example form.

templates

Furthermore, WSO2 CEP now includes a Geo Dashboard that you can configure via query templates. The following video shows a visualization of London traffic data using the Geo Dashboard.

Siddhi Language Improvements

With WSO2 CEP 4.0, queries that define partitions run each partition in parallel. Earlier, all executions ran in a single thread, so CEP used only a single core per execution plan (a collection of queries). The new approach significantly improves performance for some use cases.

Furthermore, CEP can now run machine learning models built with WSO2 ML, as well as PMML models. It supports several anomaly detection algorithms, as described in the Fraud Detection and Prevention: A Data Analytics Approach white paper.
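
As a rough sketch, using such a model from a Siddhi query could look like the following. This assumes the ml:predict Siddhi extension that WSO2 ML provides for CEP; the extension parameters, the registry path, and the stream names here are all illustrative, so check the documentation of your release for the exact syntax.

from TransactionStream#ml:predict('registry/path/to/fraud-model', 'double')
select *
insert into PredictionStream;

The idea is that the extension scores each incoming event against the stored model and adds the prediction to the output event.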

In addition, we have added time series regression and a forecaster as Siddhi functions. This release also includes several new functions for string manipulation and mathematics, as well as a Cron window that triggers based on a Cron expression (see sample 115 for more details), which users can use to define time windows that start at a specific time.
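
For instance, an hourly maximum that starts exactly on the hour could be written as in the sketch below, assuming the Cron window is exposed as #window.cron and takes a Quartz-style expression (sample 115 shows the authoritative syntax):

from TempStream#window.cron('0 0 * * * ?')
select max(temp) as maxTemp
insert into HourlyMaxTempStream;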

Also, you can now pack all queries and related artifacts into a single WSO2 Carbon archive, which makes it easier for users to manage their CEP execution plans as a single unit.

New Transports

WSO2 CEP can now receive and send events using MQTT, one of the leading Internet of Things (IoT) protocols. It also includes support for WebSockets, which makes it much easier to build web apps that use WSO2 CEP.

Tools

WSO2 CEP now includes an Event Simulator that you can use to replay events stored in a CSV file for testing and demo purposes. Furthermore, it has a “Try it” feature that lets users send events into CEP from its web console, which is also useful for testing.

Conclusion

Please try it out. It is all free under the Apache License. We would love to hear your thoughts. If you find any problems or have suggestions, drop me a note or get in touch with us via architecture@wso2.org.


WSO2 Machine Learner: Why would You care?

After about a year’s worth of work, WSO2 Machine Learner (WSO2 ML) is officially out. You can find the pack at http://wso2.com/products/machine-learner/ (also Code and User Guide). It is free and open source under the Apache License (which pretty much means you can do whatever you want with the code as long as you keep the same license).

Let me try to answer “the question”. How is it different and why would you care?

What is it?

The short answer is that it is a Wizard and a system on top of Apache Spark MLlib. The long answer is the following picture.

ML-overview

You can use it to do the following

  1. Start with data (on your disk, in HDFS, or in WSO2 DAS)
  2. Explore the data (more about that later)
  3. Create a project and build machine learning models through a Wizard
  4. Compare those models and find the best one
  5. Export that model and use it with WSO2 CEP, WSO2 ESB, or from Java code

For Someone from the Enterprise World?

WSO2 Machine Learner is designed for the enterprise world. It comes as an integrated solution with the rest of the Big Data processing technologies: batch, realtime, and interactive analytics. Also, it covers everything from data collection and analysis to communication (e.g. visualizations, APIs, and alerts). Please see the earlier post “Introducing WSO2 Analytics Platform: Note for Architects” for more details. Hence, it is part of a complete analytics solution.

WSO2 ML handles the full predictive analytics lifecycle, including model deployment and management.

MLDeployment

If you are already collecting data, we can pull that data, process it, and build models. The models you build are immediately available for use from your main transaction flow (via WSO2 ESB) or data analysis flow (via WSO2 CEP). Basically, you copy the model ID and add it to WSO2 ESB mediation scripts or WSO2 CEP queries, and now you have machine learning integrated into your business. (Please see Using Models for more information.) This handles details like keeping a central store of models while deploying them in production, and it also lets you quickly switch between models.

If you are not collecting data, you can start with WSO2 DAS and go from there. The same story holds.

Furthermore, it gives you the concept of a project where you can try out and keep track of multiple machine learning models. Also, it handles details like sending you an email when a long running machine learning algorithm execution has completed.

Finally, as we discuss in the next section, the ML Wizard is built in such a way that you can use it with minimal understanding of machine learning. Sure, you will not get the same accuracy as experts who know how to tune the thing, but it can get you started and give you OK accuracy.

For a Machine Learning Newbie?

First of all, you need to understand what machine learning can do for you. For most problems, we know the exact steps to follow to solve them. With those kinds of problems, all we have to do is write code that carries out those steps. This is what we call programming, and a lot of us do it day in, day out.

However, there are other problems that we learn by example. Driving a car, cycling, and drawing a picture are problems we learn by looking at examples. If you want a computer to solve such problems, you cannot write a program for them because you do not know the algorithm. Machine learning is used to solve exactly those problems. Instead of an algorithm, you give it lots of examples, and machine learning will learn a model (a function) from those examples. You can then use the model to solve your initial problem. Google’s driverless car does exactly this.

If you are new to Machine Learning, I highly recommend looking at A Visual Introduction to Machine Learning and the following talk by Ron Bekkerman.

The Machine Learner Wizard tries to model the experience around what you want to do, as opposed to showing you a lot of ML algorithms. For example, you can choose to predict the next value, classify something into one of several categories, or detect an anomaly. You can click through, use the defaults, and get a model. You can try several algorithms and compare them with each other.

We support several standard techniques for comparing ML models, such as the ROC curve, confusion matrix, etc. CD’s blog post “Machine Learning for Everyone” talks about this in detail.

For example, the following confusion matrix shows how many true positives, false negatives, etc. resulted from the model.

confusion-matrix_r

The chart on the left shows a scatter plot of data points that were predicted correctly and incorrectly, while the right-hand side shows the ROC curve.

predicted-vs-actual

roc-graph

At this point, I suggest that you read How to Evaluate Machine Learning Models: Classification Metrics by Alice Zheng. It is OK not to know how ML algorithms work, but you must know which models are better and why.
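
For quick reference, the standard metrics derived from a confusion matrix are (textbook definitions, not anything specific to WSO2 ML):

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \]

The ROC curve then plots the true positive rate (recall) against the false positive rate, FP / (FP + TN), as the classification threshold varies; the larger the area under it, the better the model separates the classes.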

However, there is a catch. If you try well-known machine learning datasets, they will work well (you can find a few such datasets in the samples directory of the pack). However, with real datasets, getting good results sometimes requires transforming features into different features, and that might be beyond you if you have just started. If you want to go pro and learn to transform features (a.k.a. feature engineering) and other fascinating stuff, then Andrew Ng’s famous course https://www.coursera.org/learn/machine-learning is the best place to start.

For a Machine Learning Expert?

If you are an ML expert, WSO2 Machine Learner can still help you in several ways.

First, it provides pretty sophisticated support for exploring the dataset based on a random sample. This includes scatter plots for looking at any two numerical features, parallel sets for looking at categorical data, Trellis charts for looking at 4-5 numerical dimensions at the same time, and cluster diagrams (see below for some examples).

cluster-diagram trellis-chart parallel-set

Second, it gives you access to a large collection of scalable machine learning algorithms pretty easily. For a single-node setup, you just download and unzip it (see below for how to do it).

Third, it provides an extensive set of model comparison measures as visualizations and also lets you compare models side by side.

Fourth, in addition to predictive analytics, you have access to batch analytics through Spark SQL, interactive analytics with Lucene, and realtime analytics through WSO2 CEP. This makes understanding the dataset, as well as preprocessing the data, much easier. One limitation of this release is that those other types of analytics must be done before bringing the data into WSO2 ML. However, the next release will enable you to run queries within the WSO2 ML pipeline as well.

Finally, you also get all the advantages listed for the enterprise user, such as seamless deployment of models and the ability to switch models easily.

Furthermore, many interesting features are coming in the next release.

  • Support for deep learning and neural networks
  • Support for out-of-the-box anomaly detection using Markov chains and clustering
  • Support for data cleanup and preprocessing using Data Wrangler and Spark SQL
  • Support for out-of-the-box ensembles that let you combine models
  • Improvements to the pipeline to warn the user about issues like class imbalance in classification

Trying it Out

Carry out the following steps:

  1. Download WSO2 ML from http://wso2.com/products/machine-learner/
  2. Make sure you have Java 7 installed on your machine and that JAVA_HOME is set.
  3. Unzip the pack and run bin/wso2server.sh from the unpacked directory. Wait for WSO2 ML to start.
  4. Go to https://hostname:9443/ml and log in with username admin and password admin.
  5. Now you can upload your own dataset and follow along with the Wizard. You can find more info in the User Guide; however, the Wizard should be self-explanatory.

Remember, it is all free under the Apache License. Give it a try; we would love to hear your thoughts. If you find any problems or have suggestions, report them via https://wso2.org/jira/browse/ML.

Dissecting the Big Data Twitter Community through a Big Data Lens

The #BigData hashtag is hyperactive, with close to 2,000 tweets each day from more than 20,000 tweeps. This post digs into the tweet archive from August 03-25 to understand the dynamics of the Big Data community.

Twitter communities have several kinds of activity: tweets, retweets, replies, and follows. Among them, a retweet suggests strong agreement by the actor with the tweet’s content. Hence, the retweet graph is a good representation of the actual connections in the network, their strengths, and the propagation of information through the network.

The Network

This post, therefore, focuses on the retweet graph. The following graph shows a visualization of the retweet graph, where each vertex represents an account, edges represent retweets, and the size of a node represents the number of retweets that account has received. Each edge is weighted by the number of retweets between the two accounts, and an edge from account A to B is shown only if B has retweeted two or more tweets by A.

network2

The first thing you will notice is that the top three tweeps have received a large proportion of retweets. The following heat map shows retweets received by top tweeps.

retweetsHeatmap

  1. KirkDBorne 2588
  2. jose_garde 1730
  3. craigbrownphd 1546

In the network, we can see that the three of them each have their own following. However, the graph has a phantom node in the middle right with lots of edges placed around it. That turns out to be a Twitter bot (BigDataTweetBot), which has tweeted lots of other people’s tweets.

network10

The following figure shows a sparser version of the same graph, which shows an edge only if the two accounts have more than 10 retweets between them. In this network, the KirkDBorne community seems to be pretty well connected, while the others are fairly isolated. This suggests that his community is stronger.

Is it a Small World Network?

As shown by the following plots, the retweet distribution follows a power law, while the degree distribution is close to a power law but falls short. The network is close to a scale-free network.

Degree

RetweetDistribution

However, the network has a very high diameter of 154 and a mean path length of 11. Hence, it is not a small-world network. Furthermore, its clustering coefficient is very small (0.0009953724), which suggests that cross chatter in the network is minimal. So the Big Data retweets do not create a cohesive community.
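
For reference, the standard characterizations behind these claims are (textbook definitions, not computed from this dataset):

\[ P(k) \propto k^{-\gamma} \ \text{(scale-free: power-law degree distribution)}, \qquad L \sim \log N \ \text{and} \ C \gg C_{\text{random}} \ \text{(small-world)} \]

where L is the mean path length, N the number of nodes, and C the clustering coefficient. The network’s high diameter and near-zero clustering coefficient are what rule out the small-world label here.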

How can I get more Retweets?
twtRTFw

When we talk about retweets, this is the thought on everyone’s mind. The plot shows the number of tweets per day on the X-axis, the number of followers on the Y-axis in log scale, and each point’s size and colour are determined by the number of retweets it has received.

According to the plot, having a lot of followers helps and is necessary, but it is not sufficient. Tweeting a lot also seems to help: most tweeps tweeting more than ten times a day have received at least 10 retweets (retweets are not included in the tweet count).

Are Tweet Bots Useful?

Are retweet bots (e.g. BigDataTweetBot, NoSQLDigest) useful, or do they just create noise by retweeting things blindly? Let us investigate. Let’s look at betweenness centrality, a measure of each node’s role in connecting the network, to understand who the key connectors are. @Espenel takes first place, while fourth goes to @KirkDBorne. Second and third are taken by the Twitter bots (BigDataTweetBot, NoSQLDigest), which suggests that Twitter bots are indeed useful.
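
For completeness, the betweenness centrality of a node v counts how often v lies on shortest paths between other pairs of nodes:

\[ g(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}} \]

where \sigma_{st} is the number of shortest paths from s to t and \sigma_{st}(v) is the number of those paths that pass through v. A bot scores highly on this measure only if it actually bridges parts of the network that would otherwise be far apart, which is why a high rank here supports the conclusion that the bots are useful rather than just noisy.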

What was the Community talking about?

wordcloud

The word cloud shows the words that have been used most often. It has most of the usual suspects, like links to IoT and cloud, business, marketing, etc. Among companies, Google, Intel, and IBM have been mentioned.

Interestingly, we do not see any of the big data tools. It is possible that related discussions happen in their own hashtags such as #hadoop and #spark.

Following are the most retweeted tweets over the period (retweet counts in brackets):

  1. #bigdata to our users !!! check the new keyword suggestions for an improved
  2. 4 predictive #analytics and practical applications for the everyday marketer (422)
  3. marrying #data to #analytics a major theme at #hp’s conference  (153)
  4. combining analytics and security to treat vulnerabilities like ants (150)
  5. sbi uses big data mining to check defaults biz loss: when state bank of india  (141)

Following are the most retweeted tweets by day. We only list tweets that received more than 75 retweets in a day; the number of retweets each received is shown in brackets.

  1. Aug 05: guidelines to optimize #bigdata transfers (89)
  2. Aug 10: #nfl taps #bigdata to study #concussions but major game changes far off (139)
  3. Aug 10: sbi uses big data mining to check defaults biz loss: when state bank of india (sbi)  (140)
  4. Aug 12: #iot facts + how to make business sense of the internet of things (85)
  5. Aug 18: idf 2015: intel teams with google to bring realsense to project tango (113)
  6. Aug 18: marrying #data to #analytics a major theme at #hp’s conference (152)
  7. Aug 19: combining analytics and security to treat vulnerabilities like ants: bill franks chief analytics off (149)
  8. Aug 20: qantas annual profit soars to au$975m: australia’s flying kangaroo is out of the red having boosted (115)
  9. Aug 20: top news: sap oem on twitter: “top 10 #bigdata twitter handles to follow @merv (78)
  10. Aug 22: five open source big data projects to watch (132)
  11. Aug 22: 3 ways that big data are used to study #climatechange (126)
  12. Aug 22: should #bigdata be used to measure #employee #productivity? (110)
  13. Aug 23: e-commerce market #analytics to #ebay #amazon #alibaba sellers and buyers
  14. Aug 23: should #bigdata be used to measure #employee #productivity? (134)

One interesting observation is that most trending tweets were about use cases, not about tools or techniques.

Summary

  1. A few well-known tweeps receive a lot of retweets, and the top three roughly have their own communities.
  2. The network is roughly scale-free, but not a small-world network. Nodes are weakly connected, which suggests non-cohesive communities.
  3. A large number of followers is a necessary but not a sufficient condition to receive a lot of retweets. Tweeting a lot seems to help.
  4. Tweet bots are centrally placed and likely useful.
  5. Most retweeted tweets seem to focus on use cases.

What Can Data Science and Big Data Do for Sri Lanka?

I am sure you have heard enough about Big Data (processing and handling large amounts of data) and data science (how to make decisions with data). There is a lot of chatter about how they are going to solve all the problems we know of, bring about world peace, and let us live happily ever after.

Let’s slow down, look around, and discuss what they can really do for a country like Sri Lanka. Well, the first answer is that we can build some great Big Data tools, sell them, and bring lots of export income to Sri Lanka. However, that is selling shovels to the gold diggers during the gold rush; not a bad business proposition, but instead, let’s try to understand how Big Data can make a difference in day-to-day lives.

Thinking about BigData

Big Data must be viewed not as a large infrastructure operation, but as a medium to connect different entities and to collect and analyze information, which lets us instill order into existing processes and create new ones. It can give us a holistic picture of what is going on, sometimes predict what will happen, and add order to chaos by ranking and rating different items and entities. For example, given a sea of information (e.g. the web, social media, error tickets, requests for help, transactions, etc.), it can find the most important items and who has something important to say. Furthermore, by aligning individual gain with quality information in the system, it nudges participants to create better content and sometimes better behavior.

Following are some use cases that, in my opinion, could help Sri Lanka. They are arranged in order of how practical they are, and I have listed reservations and challenges with each.

Urban Planning and Policy Decisions

17263788584_ac66ab842b_z

(image credits) cc license.

We can understand social dynamics such as the geographic distribution of people, demographic distribution, mobility patterns, etc. to aid policy and urban planning. This can be done through datasets like the census, CDR data, social media data (in the right context), etc. The good news is that this is already underway at LIRNEasia (see the Big Data for Development project, http://lirneasia.net/projects/bd4d/). However, there are many problems left to solve. If you are a Sri Lankan research student looking for a thesis topic, chances are you can find a dozen good problems in this project.

Traffic

If you work in Colombo, you are no stranger to this. I travel about 35 km to work daily, and on a bad day we travel at about 15 km/h. To be fair, Sri Lankan traffic is better than most places in India and even some places in the US (e.g. San Francisco 101 traffic). Yet a large number of people waste a lot of time, and with the rate at which vehicles are increasing, things will become unmanageable soon.

traffic

The Colombo traffic plan introduced 6-7 years ago fixed many things, and new roads certainly helped. However, we still cannot measure traffic fast enough. Most decisions are made via manual vehicle counting and a few automatic counters. We need a way to measure traffic at higher resolution, faster, and more accurately. Then we can understand what is going on and plan around it.

Ideas for collecting data include:

  1. Build an automatic traffic counter (the University of Moratuwa ENTC department has built this already, or we can use number plate reading technology; each unit, IMO, should end up costing less than 10K LKR)
  2. Use social media feeds like @road_lk (e.g. see Real-time Natural Language Processing for Crowdsourced Road Traffic Alerts)
  3. Collect data from traffic officers
  4. Or a combination of the above.

We must understand that the reason this does not happen is neither a want of technology nor a want of money, but a want of concentrated effort. If we have more data, fast enough, we can do better modeling and plan around bottlenecks. Eventually, we can also act on traffic incidents in realtime.

Manage Donors and Charities

We are a culture that donates from what little we have. Sri Lankans, both rich and poor, donate alike. However, it is not clear how much of that is put to good use, how much gets lost along the way, or how lasting an impact it leaves.

5655247429_677bc162aa_z

Using data collection, social media, and independent verification, we could build much more accountability and visibility into charitable activities, and we could prioritize and try to make a lasting impact.

For example, if a random person asked for help, I might not trust him. However, if a newspaper reporter has done a report, then I have a bit more trust. If a well-known person in society asked for help, there is even more trust. And if the recommendation comes from a personal friend whom I know, that is even better. If someone with credibility can pledge to follow up, it makes a big difference. We could build such a system, rank requests as well as the people involved, and bring greater trust and efficiency into the process. The model can be extended to independent verification of what was carried out, and to tracking long-term change. Data collected along the way can be used to rate the different parties involved as well as to optimize the process.

Day-to-day Maintenance

report2 report1

It seems that to get anything fixed, Sri Lanka needs to create news. A system that needs a newspaper report to get a public lavatory fixed cannot go far. This can be fixed by borrowing the issue-reporting model from open source. We need a geo-tagged complaints and maintenance-request map that lets people up-vote and down-vote tickets with photographic evidence. Government authorities can then monitor this and act accordingly. The government can enforce SLAs to check and act. However, the most important aspect is that this creates a paper trail that ensures the relevant authorities cannot claim ignorance. Moreover, you cannot stop issues from being reported by chasing people away.

Do we have the connectivity to make this work? I think we do. Chances are it is easier to find a Nanasala (community internet stations placed in public places in Sri Lanka) than to go to an office and convince officials to write down a complaint.

Will we sink in a sea of complaints? Yes, we will! This is where data science comes in. We need a rating system against tickets and reporters that enforces reputation. That can be done!

It is worth noting that garbage handling is another version of the same problem, which we can solve using a similar method: people can report illegal dumping, provide intelligence, and flag inefficient collection.

A Few More Ideas

  1. Law and Order (police investigation) – tracking data about crimes committed, building a database of known felons that people can check against, studying the distribution and dynamics of crime, and adjusting officer deployments.
  2. Health records – let each person keep his own health record history, and give researchers the ability to anonymously query health records to find higher-level patterns. Furthermore, let patients rate and complain about doctors, with a system to verify and act.
  3. Health – build a wearable-based, subscription-driven in-home health care solution (a good idea for a startup). Sri Lanka is one of the countries that treat their elderly very well.
  4. Connecting Export Opportunities and Social Enterprises – bring technology to what organizations like Sarvodaya are doing while acting as a bridge: finding potential markets, introducing potential suppliers, and providing training and micro-financing.
  5. Crisis response – analyzing and coordinating efforts
  6. Disease spread – hotspot identification, prediction, preventive actions