Analysis of Retweeting Patterns in Sri Lankan 2015 General Election

The Election is done. Just the other day, we saw an analysis of twitter activities by Yudhanjaya, where he observed that few accounts shape the tone of the twitter LKA community. 

In this post, I am taking a detailed look at the retweet network, further drilling into twitter community. The dataset is the retweets graph for twitter hashtags #GESL15 and #GenElecSL collected between 4-22 of august.The archive includes 14k tweets of which about 9k are retweets. There are 2480 tweets accounts have participated. I collect the data through hawkeys.info. The analysis is done using a set of R scripts.

Following are some interesting observations.

How does the community look like?

denseNetwork sparsenetwork

The graph shows a visualization of the community. Each vertex represents an account. The first dense chart shows an edge for each retweet and the second sparse graph only shows an edge if five or more retweets have happened between the two accounts. The size of each node shows the number of retweets the node has received.

The community is arranged around few accounts that act as hubs, and first 10 authors have received about 40% retweets all retweets. This confirms Yudhanjaya‘s observations.

Furthermore, both graphs are well connected. Even the sparse graph is fully connected. Often in political conversions, different groups tend to segregate and cross talk is minimal. However, that is not the case in the LKA twitter graph. Maybe the presence of journalists as hubs in the network have enabled cross talk between groups.

The first table below shows the 15 accounts that had most retweets and the second table shows vertex betweenness values. Vertex betweenness is a measure of each node’s ability to connect different parts of the network.

betweeness toptweets

Four accounts appears on both measures, which further confirms their prominence.

What suggests a good reach?

Following two charts try to find any correlation between the number of tweets or the number of followers vs. retweets.

tweetsvsRetweetsBotsMarkedfollowersAndRetweets

However, data does not show such behavior. There are several bots that generate lots of tweets, but they do not generate much retweets. Furthermore, the accounts that receive many retweets have only about 50-100 tweets ( about 4-5 per day). This is evidence supporting that it is content, not the network strcture drives retweets, although it is not concusive.

Also, the relationship between the  number of followers and retweets also is not very clear. Although there are few account that have many followers and tweets, there are notable exceptions in the graph. Most likely this is caused by highly connected nature of the network, where followers are replaced by fast propergation through the network. Hence you can have lot of reach without having lot of followers.

What did they talk about?

The following picture shows an word cloud generated using all the tweets. However, there are not much surprises there.

wc

In contrast, top retweets in each day provide a superb chronicle on what happens in each day over time. You can get a view closer to this by typing #GenElecSL into twitter search box.

topretweets

Conclusion

Twitter network community for Sri Lankan general election 2015 has a very well connected retweets graph. Although few accounts shape the tone of the discussion, they seem to do a good job of enabling cross-communication between different groups. Reach, measured via retweets, seem to be independent of attributes like the frequency of tweets and the number of followers the account have. Finally, most retweeted tweets in each day seem to provide a useful chronicle of what happens in the election each day.

Note: It is worth noting that the community graph does not show the followers. Hence, the retweet can happens via another account as well (e.g. B retweets A’s message, and C having seen B’s retweet, he retweets A’s post. Then both C and B will have an edge from A in this graph. Therefore, these results does not discredit follower network in the community.

Patterns for Streaming Realtime Analytics

Introduction

We did a tutorial on DEBS 2015 (9th ACM International Conference on Distributed Event-Based Systems), describing a set of realtime analytics patterns. Following is the summary of the tutorial.

Realtime analytics, or what people call Realtime Analytics, have two flavors.

  1. Realtime Interactive/Ad-hoc Analytics (users issue ad-hoc dynamic queries and the system responds interactively). Among example of such systems are Druid, SAP Hana, VoltDB, MemSQL, and Apache Drill.
  2. Realtime Streaming Analytics ( users issue static queries once and they do not change, and the system process data as they come in without storing). CEP and Stream Processing technologies are two example technologies that enable streaming analytics.

Realtime Interactive Analytics allow users to explore a large data set by issuing ad-hoc queries. Queries should respond within 10 seconds, which is considered the upper bound for acceptable human interaction. In contrast, this tutorial focuses on Realtime Streaming Analytics, which is processing data as they come in without storing them and react to those data very fast, often within few milliseconds. Such technologies are not new. History goes back to Active Databases (2000+), Stream processing (e.g. Aurora (2003) , Borealis (2005+) and later Apache Storm), Distributed Streaming Operators(2005), and Complex Event processing.

Still when thinking about Realtime analytics, many think only counting usecases. As we shall discuss, counting usecases are only the tip of the iceberg of real-life realtime usecases. Since the input data arrives as a data stream, a time dimension always presents in the data. This time dimension allows us to implement and perform much powerful usecases. For an example, Complex Event Processing technology provide operators like windows, joins, and temporal event sequence detection.

Stream processing technologies like Apache Samza and Apache Storm has received much attention under the theme large scale streaming analytics. However, these tools force every programmer to design and implement realtime analytics processing from first principals.

For an example, if users need a time window, they need to implement it from first principals. This is like every programmer implementing his own list data structure. Better understanding of common patterns will let us understand the domain better and build tools that handle those scenarios. This tutorial try to address this gap by describing 13 common relatime streaming analytics patterns and how to implement them. In the discussion, we will draw heavily from real life usecases done under Complex Event Processing and other technologies.

Realtime Streaming Analytics Patterns

Before looking at the patterns, let’s first agree on the terminology. Realtime Streaming Analytics accepts input as a set of streams where each stream consists of many events ordered in time. Each event has many attributes, but all events in a same stream have the same set of attributes or schema.

Pattern 1: Preprocessing

Preprocessing is often done as a projection from one data stream to the other or through filtering. Potential operations include

  • Filtering and removing some events
  • Reshaping a stream by removing, renaming, or adding new attributes to a stream
  • Splitting and combining attributes in a stream
  • Transforming attributes

For example, from a twitter data stream, we might choose to extract the fields: author, timestamp, location, and then filter them based on the location of the author.

Pattern 2: Alerts and Thresholds

This pattern detects a condition and generates alerts based on a condition. (e.g. Alarm on high temperature). These alerts can be based on a simple value or more complex conditions such as rate of increase etc.

For an example, in TFL (Transport for London) Demo video based on transit data from London, we trigger a speed alert when the bus has exceed a given speed limit.

Also we can generate alerts for scenarios such as the server room temperature is continually increasing for last 5 mins.

Pattern 3: Simple Counting and Counting with Windows

This pattern includes aggregate functions like Min, Max, Percentiles etc, and they can be counted without storing any data. (e.g. counting number of failed transactions).

However, count are often useful with a time window attached to it.( e.g. failure count last hour). There are many types of windows: sliding windows vs. batch (tumbling) windows and time vs. length windows. There are four main variations.

  • Time, Sliding window: keeps each event for the given time window, produce an output whenever new event has added or removed.
  • Time, Batch window: also called tumbling windows, they only produce output at the end of the time window
  • Length, Sliding : same as the time, sliding window, but keeps an window of n events instead of selecting them by time.
  • Length, Batch window: same as the time, batch window, but keeps an window of n events instead of selecting them by time

Also there are special windows like decaying windows and unique windows.

Pattern 4: Joining Event Streams

Main idea behind this pattern is to match up multiple data streams and create a new event steam. For an example, lets assume we play a football game with both the players and the ball having sensors that emits events with current location and acceleration. We can use “joins” to detect when a player have kicked the ball. To that end, we can join the ball location stream and the player stream on the condition that they are close to each other by one meter and the ball’s acceleration has increased by more than 55m/s^2.

Among other usecases are combining data from two sensors, and detecting proximity of two vehicles.

Pattern 5: Data Correlation, Missing Events, and Erroneous Data

This pattern and the pattern 4 has lot in common where here too we match up multiple stream. In addition, we also correlate the data within the same stream. This is because different data sensors can send events in different rates, and many usecases require this fundamental operator.

Following are some possible scenarios.

  1. Matching up two data streams that send events in different speeds
  2. Detecting a missing event in a data stream ( e.g. detect a customer request that has not been responded within in 1 hour of its reception. )
  3. Detecting erroneous data (e.g. Detect failed sensors using a set of sensors that monitor overlapping regions and using those redundant data to find erroneous sensors and removing their data from further processing)

Pattern 6: Interacting with Databases

Often we need to combine the realtime data against the historical data stored in a disk. Following are few examples.

  • When a transaction happened, lookup the age using the customer ID from customer database to be used for Fraud detection (enrichment)
  • Checking a transaction against blacklists and whitelists in the database
  • Receive an input from the user (e.g. Daily discount amount may be updated in the database, and then the query will pick it automatically without human intervention.)

Pattern 7: Detecting Temporal Event Sequence Patterns

Using regular expressions with strings, we detect a pattern of characters from sequence of characters. Similarly, given a sequence of events, we can write a regular expression to detect a temporal sequence of events arranged on time where each event or condition about the event is parallel to a character in a string in the above example.

Often cited example, although bit simplistic, is that a thief, having stolen a credit card, would try a smaller transaction to make sure it works and then do a large transaction. Here the small transaction followed by a large transaction is a temporal sequence of events arranged on time, and can be detected using a regular expression written on top of an event sequence.

Such temporal sequence patterns are very powerful. For example, the follow video shows a real time analytics done using the data collected from a real football game. This was the dataset taken from DEBS 2013 Grand Challenge.

In the video, we used patterns on event sequence to detect the ball possession, the time period a specific player controlled the ball. A player possessed the ball from the time he hits the ball, until someone else hits the ball. This condition can be written as a regular expression: a hit by me, followed by any number of hits by me, followed by a hit by someone else. (We already discussed how to detect the hits on the ball in Pattern 4: Joins).

Pattern 8: Tracking

The eighth pattern tracks something over space and time and detects given conditions. Following are few examples

  • Tracking a fleet of vehicles, making sure that they adhere to speed limits, routes, and geo-fences.
  • Tracking wildlife, making sure they are alive (they will not move if they are dead) and making sure they will not go out of the reservation.
  • Tracking airline luggages and making sure they are not been sent to wrong destinations
  • Tracking a logistic network and figure out bottlenecks and unexpected conditions.

For example, TFL Demo we discussed under pattern 2 shows an application that tracks and monitors London buses using the open data feeds exposed by TFL(Transport for London).

Pattern 9: Detecting Trends

We often encounter time series data. Detecting patterns from time series data and bringing them into operator attention are common use cases.

Following are some of the examples of tends.

  • Rise, Fall
  • Turn (switch from rise to a fall)
  • Outliers
  • Complex trends like triple bottom etc.

These trends are useful in a wide variety of use cases such as

  • Stock markets and Algorithmic trading
  • Enforcing SLA (Service Level Agreement), Auto Scaling, and Load Balancing
  • Predictive maintenance ( e.g. guessing the Hard Disk will fill within next week)

Pattern 10: Running the same Query in Batch and Realtime Pipelines

This pattern runs the same query in both Relatime and batch pipeline. It is often used to fill the gap left in the data due to batch processing. For example, if batch processing takes 15 minutes, results would lack the data for last 15 minutes.

Idea of this pattern, which is sometimes called “Lambda Architecture” is to use realtime analytics to fill the gap. Jay Kreps’s article “Questioning the Lambda Architecture” discusses this pattern in detail.

Pattern 11: Detecting and switching to Detailed Analysis

Main idea of the pattern is to detect a condition that suggests some anomaly, and further analyze it using historical data. This pattern is used with the use cases where we cannot analyze all the data with full detail. Instead we analyze anomalous cases in full detail. Following are few examples.

  • Use basic rules to detect Fraud (e.g. large transaction), then pull out all transactions done against that credit card for larger time period (e.g. 3 months data) from batch pipeline and run a detailed analysis
  • While monitoring weather, detect conditions like high temperature or low pressure in a given region, and then start a high resolution localized forecast on that region.
  • Detect good customers, for example through expenditure of more than $1000 within a month, and then run a detailed model to decide potential of offering a deal.

Pattern 12: Using a Model

Idea is to train a model (often a Machine Learning model), and then use it with the Realtime pipeline to make decisions. For example, you can build a model using R, export it as PMML (Predictive Model Markup Language) and use it within your realtime pipeline. Among examples are Fraud Detections, Segmentation, Predict next value, Predict Churn.

Pattern 13: Online Control

There are many use cases where we need to control something online. The classical use cases are autopilot, self-driving, and robotics. These would involve problems like current situation awareness, predicting next value(s), and deciding on corrective actions.

You can find details about pattern implementations from the following slide deck, and source code from https://github.com/suhothayan/DEBS-2015-Realtime-Analytics-Patterns.