We did a tutorial on DEBS 2015 (9th ACM International Conference on Distributed Event-Based Systems), describing a set of realtime analytics patterns. Following is the summary of the tutorial.
Realtime analytics, or what people call Realtime Analytics, have two flavors.
- Realtime Interactive/Ad-hoc Analytics (users issue ad-hoc dynamic queries and the system responds interactively). Among example of such systems are Druid, SAP Hana, VoltDB, MemSQL, and Apache Drill.
- Realtime Streaming Analytics ( users issue static queries once and they do not change, and the system process data as they come in without storing). CEP and Stream Processing technologies are two example technologies that enable streaming analytics.
Realtime Interactive Analytics allow users to explore a large data set by issuing ad-hoc queries. Queries should respond within 10 seconds, which is considered the upper bound for acceptable human interaction. In contrast, this tutorial focuses on Realtime Streaming Analytics, which is processing data as they come in without storing them and react to those data very fast, often within few milliseconds. Such technologies are not new. History goes back to Active Databases (2000+), Stream processing (e.g. Aurora (2003) , Borealis (2005+) and later Apache Storm), Distributed Streaming Operators(2005), and Complex Event processing.
Still when thinking about Realtime analytics, many think only counting usecases. As we shall discuss, counting usecases are only the tip of the iceberg of real-life realtime usecases. Since the input data arrives as a data stream, a time dimension always presents in the data. This time dimension allows us to implement and perform much powerful usecases. For an example, Complex Event Processing technology provide operators like windows, joins, and temporal event sequence detection.
Stream processing technologies like Apache Samza and Apache Storm has received much attention under the theme large scale streaming analytics. However, these tools force every programmer to design and implement realtime analytics processing from first principals.
For an example, if users need a time window, they need to implement it from first principals. This is like every programmer implementing his own list data structure. Better understanding of common patterns will let us understand the domain better and build tools that handle those scenarios. This tutorial try to address this gap by describing 13 common relatime streaming analytics patterns and how to implement them. In the discussion, we will draw heavily from real life usecases done under Complex Event Processing and other technologies.
Realtime Streaming Analytics Patterns
Before looking at the patterns, let’s first agree on the terminology. Realtime Streaming Analytics accepts input as a set of streams where each stream consists of many events ordered in time. Each event has many attributes, but all events in a same stream have the same set of attributes or schema.
Pattern 1: Preprocessing
Preprocessing is often done as a projection from one data stream to the other or through filtering. Potential operations include
- Filtering and removing some events
- Reshaping a stream by removing, renaming, or adding new attributes to a stream
- Splitting and combining attributes in a stream
- Transforming attributes
For example, from a twitter data stream, we might choose to extract the fields: author, timestamp, location, and then filter them based on the location of the author.
Pattern 2: Alerts and Thresholds
This pattern detects a condition and generates alerts based on a condition. (e.g. Alarm on high temperature). These alerts can be based on a simple value or more complex conditions such as rate of increase etc.
For an example, in TFL (Transport for London) Demo video based on transit data from London, we trigger a speed alert when the bus has exceed a given speed limit.
Also we can generate alerts for scenarios such as the server room temperature is continually increasing for last 5 mins.
Pattern 3: Simple Counting and Counting with Windows
This pattern includes aggregate functions like Min, Max, Percentiles etc, and they can be counted without storing any data. (e.g. counting number of failed transactions).
However, count are often useful with a time window attached to it.( e.g. failure count last hour). There are many types of windows: sliding windows vs. batch (tumbling) windows and time vs. length windows. There are four main variations.
- Time, Sliding window: keeps each event for the given time window, produce an output whenever new event has added or removed.
- Time, Batch window: also called tumbling windows, they only produce output at the end of the time window
- Length, Sliding : same as the time, sliding window, but keeps an window of n events instead of selecting them by time.
- Length, Batch window: same as the time, batch window, but keeps an window of n events instead of selecting them by time
Also there are special windows like decaying windows and unique windows.
Pattern 4: Joining Event Streams
Main idea behind this pattern is to match up multiple data streams and create a new event steam. For an example, lets assume we play a football game with both the players and the ball having sensors that emits events with current location and acceleration. We can use “joins” to detect when a player have kicked the ball. To that end, we can join the ball location stream and the player stream on the condition that they are close to each other by one meter and the ball’s acceleration has increased by more than 55m/s^2.
Among other usecases are combining data from two sensors, and detecting proximity of two vehicles.
Pattern 5: Data Correlation, Missing Events, and Erroneous Data
This pattern and the pattern 4 has lot in common where here too we match up multiple stream. In addition, we also correlate the data within the same stream. This is because different data sensors can send events in different rates, and many usecases require this fundamental operator.
Following are some possible scenarios.
- Matching up two data streams that send events in different speeds
- Detecting a missing event in a data stream ( e.g. detect a customer request that has not been responded within in 1 hour of its reception. )
- Detecting erroneous data (e.g. Detect failed sensors using a set of sensors that monitor overlapping regions and using those redundant data to find erroneous sensors and removing their data from further processing)
Pattern 6: Interacting with Databases
Often we need to combine the realtime data against the historical data stored in a disk. Following are few examples.
- When a transaction happened, lookup the age using the customer ID from customer database to be used for Fraud detection (enrichment)
- Checking a transaction against blacklists and whitelists in the database
- Receive an input from the user (e.g. Daily discount amount may be updated in the database, and then the query will pick it automatically without human intervention.)
Pattern 7: Detecting Temporal Event Sequence Patterns
Using regular expressions with strings, we detect a pattern of characters from sequence of characters. Similarly, given a sequence of events, we can write a regular expression to detect a temporal sequence of events arranged on time where each event or condition about the event is parallel to a character in a string in the above example.
Often cited example, although bit simplistic, is that a thief, having stolen a credit card, would try a smaller transaction to make sure it works and then do a large transaction. Here the small transaction followed by a large transaction is a temporal sequence of events arranged on time, and can be detected using a regular expression written on top of an event sequence.
Such temporal sequence patterns are very powerful. For example, the follow video shows a real time analytics done using the data collected from a real football game. This was the dataset taken from DEBS 2013 Grand Challenge.
In the video, we used patterns on event sequence to detect the ball possession, the time period a specific player controlled the ball. A player possessed the ball from the time he hits the ball, until someone else hits the ball. This condition can be written as a regular expression: a hit by me, followed by any number of hits by me, followed by a hit by someone else. (We already discussed how to detect the hits on the ball in Pattern 4: Joins).
Pattern 8: Tracking
The eighth pattern tracks something over space and time and detects given conditions. Following are few examples
- Tracking a fleet of vehicles, making sure that they adhere to speed limits, routes, and geo-fences.
- Tracking wildlife, making sure they are alive (they will not move if they are dead) and making sure they will not go out of the reservation.
- Tracking airline luggages and making sure they are not been sent to wrong destinations
- Tracking a logistic network and figure out bottlenecks and unexpected conditions.
For example, TFL Demo we discussed under pattern 2 shows an application that tracks and monitors London buses using the open data feeds exposed by TFL(Transport for London).
Pattern 9: Detecting Trends
We often encounter time series data. Detecting patterns from time series data and bringing them into operator attention are common use cases.
Following are some of the examples of tends.
- Rise, Fall
- Turn (switch from rise to a fall)
- Complex trends like triple bottom etc.
These trends are useful in a wide variety of use cases such as
- Stock markets and Algorithmic trading
- Enforcing SLA (Service Level Agreement), Auto Scaling, and Load Balancing
- Predictive maintenance ( e.g. guessing the Hard Disk will fill within next week)
Pattern 10: Running the same Query in Batch and Realtime Pipelines
This pattern runs the same query in both Relatime and batch pipeline. It is often used to fill the gap left in the data due to batch processing. For example, if batch processing takes 15 minutes, results would lack the data for last 15 minutes.
Idea of this pattern, which is sometimes called “Lambda Architecture” is to use realtime analytics to fill the gap. Jay Kreps’s article “Questioning the Lambda Architecture” discusses this pattern in detail.
Pattern 11: Detecting and switching to Detailed Analysis
Main idea of the pattern is to detect a condition that suggests some anomaly, and further analyze it using historical data. This pattern is used with the use cases where we cannot analyze all the data with full detail. Instead we analyze anomalous cases in full detail. Following are few examples.
- Use basic rules to detect Fraud (e.g. large transaction), then pull out all transactions done against that credit card for larger time period (e.g. 3 months data) from batch pipeline and run a detailed analysis
- While monitoring weather, detect conditions like high temperature or low pressure in a given region, and then start a high resolution localized forecast on that region.
- Detect good customers, for example through expenditure of more than $1000 within a month, and then run a detailed model to decide potential of offering a deal.
Pattern 12: Using a Model
Idea is to train a model (often a Machine Learning model), and then use it with the Realtime pipeline to make decisions. For example, you can build a model using R, export it as PMML (Predictive Model Markup Language) and use it within your realtime pipeline. Among examples are Fraud Detections, Segmentation, Predict next value, Predict Churn.
Pattern 13: Online Control
There are many use cases where we need to control something online. The classical use cases are autopilot, self-driving, and robotics. These would involve problems like current situation awareness, predicting next value(s), and deciding on corrective actions.
You can find details about pattern implementations from the following slide deck, and source code from https://github.com/suhothayan/DEBS-2015-Realtime-Analytics-Patterns.