Dissecting the Big data Twitter Community through a Big data Lens

The #BigData hashtag is hyperactive, with close to 2,000 tweets each day from more than 20,000 tweeps. This post digs into the tweet archive from August 03-25 to understand the dynamics of the Big data community.

Twitter communities have several kinds of activity: tweets, retweets, replies, and follows. Among them, a retweet suggests strong agreement by the actor with the tweet's content. Hence, the retweet graph is a good representation of the actual connections in the network, their strengths, and the propagation of information through the network.

The Network

This post, therefore, focuses on the retweet graph. The following graph shows a visualization of it, where each vertex represents an account, each edge represents retweets, and the size of a vertex represents the number of retweets that account has received. Each edge is weighted by the number of retweets between the two accounts, and an edge from account A to account B is shown only if B has retweeted two or more tweets by A.
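To make the edge rule concrete, here is a minimal sketch of how such a graph could be assembled. It is not the code used for this post: the input format (one `(retweeter, original_author)` pair per retweet) and the function names are assumptions made for illustration.

```python
from collections import Counter


def build_retweet_graph(retweets, min_weight=2):
    """Return {(author, retweeter): weight}, keeping edges with weight >= min_weight.

    `retweets` is an iterable of (retweeter, original_author) pairs,
    one pair per retweet (a hypothetical input format).
    """
    weights = Counter()
    for retweeter, author in retweets:
        if retweeter != author:  # ignore self-retweets
            weights[(author, retweeter)] += 1
    return {edge: w for edge, w in weights.items() if w >= min_weight}


def retweets_received(retweets):
    """Count retweets received per account, usable as the vertex size."""
    return Counter(author for retweeter, author in retweets if retweeter != author)


# Made-up sample data: bob retweets alice twice, carol once, dave retweets bob three times.
sample = [
    ("bob", "alice"), ("bob", "alice"), ("carol", "alice"),
    ("dave", "bob"), ("dave", "bob"), ("dave", "bob"),
]
edges = build_retweet_graph(sample)
sizes = retweets_received(sample)
```

In this sample, carol's single retweet of alice does not produce an edge, but it still counts toward alice's vertex size. Whether sizes should be computed before or after filtering is a design choice this sketch makes on its own.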


The first thing you will notice is that the top three tweeps have received a large proportion of the retweets. The following heat map shows the retweets received by the top tweeps.


  1. KirkDBorne 2588
  2. jose_garde 1730
  3. craigbrownphd 1546

In the network, we can see that each of the three has their own following. However, the graph also has a phantom node in the right middle with lots of edges placed around it. That turns out to be a Twitter bot (BigDataTweetBot), which has retweeted lots of other people's tweets.

The following figure shows a sparser version of the same graph that only shows an edge if two accounts have more than 10 retweets between them. In this network, the KirkDBorne community seems to be fairly well connected, while the others are fairly isolated. This suggests that his community is stronger.

Is it a Small World Network?

As shown by the following plot, the retweet distribution follows a power law, while the edge distribution is close to a power law but falls short. The network is therefore close to a scale-free network.
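One rough way to check for power-law behaviour is to see whether the frequency of each value falls on a straight line in log-log space. The sketch below fits that line with ordinary least squares. It is only an illustration, not the method used for the plot, and a least-squares fit on a log-log plot is a quick heuristic rather than a rigorous test. The function name and sample data are assumptions.

```python
import math
from collections import Counter


def loglog_slope(values):
    """Least-squares slope of log(frequency) against log(value).

    A roughly straight line with a negative slope hints at a power law.
    """
    freq = Counter(v for v in values if v > 0)
    xs = [math.log(v) for v in freq]
    ys = [math.log(c) for c in freq.values()]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two distinct positive values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic retweet counts where the frequency of k is proportional to k**-2.
counts = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope = loglog_slope(counts)
```

For the synthetic data the fitted slope is close to -2, matching the exponent used to build it. Real retweet counts will be noisier, especially in the tail.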



However, the network has a very high diameter of 154 and a mean path length of 11. Hence, it is not a small-world network. Furthermore, its clustering coefficient is very small (0.0009953724), which suggests that there is very little cross chatter in the network. So the Big data retweets do not create a cohesive community.
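For readers who want to see what these three measures mean in practice, here is a minimal sketch on a toy undirected graph. It assumes unweighted edges and uses the global clustering coefficient (closed triples over all connected triples). The post does not say which variants were used, so the figures above may come from different definitions.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_stats(adj):
    """Return (diameter, mean shortest path length) over reachable pairs."""
    longest, total, pairs = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                longest = max(longest, d)
                total += d
                pairs += 1
    return longest, (total / pairs if pairs else 0.0)


def transitivity(adj):
    """Global clustering coefficient: closed triples / connected triples."""
    closed = triples = 0
    for u, nbrs in adj.items():
        for v, w in combinations(nbrs, 2):
            triples += 1
            if w in adj[v]:
                closed += 1
    return closed / triples if triples else 0.0


# Toy graph: a triangle (a, b, c) with one extra account d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
diameter, mean_path = path_stats(adj)
clustering = transitivity(adj)
```

A small-world network combines a short mean path length with a high clustering coefficient. The toy graph has both, whereas the retweet network described above has neither.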

How can I get more Retweets?

When we talk about retweets, this is the question on everyone's mind. The plot shows the number of tweets per day on the X axis and the number of followers on the Y axis in log scale; each point's size and color are determined by the number of retweets the account has received.

According to the plot, having a lot of followers helps and is necessary, but it is not sufficient. However, tweeting a lot seems to help: most tweeps who tweet more than ten times a day have received at least 10 retweets. (Retweets are not included.)

Are Tweet Bots Useful?

Are retweet bots (e.g. BigDataTweetBot, NoSQLDigest) useful, or do they just create noise by retweeting things blindly? Let us investigate by looking at betweenness centrality, which measures the role each node plays in connecting the network, to understand who the key connectors are. @Espenel takes first place, while @KirkDBorne takes fourth. Second and third places are taken by Twitter bots (BigDataTweetBot, NoSQLDigest), which suggests that Twitter bots are indeed useful.
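As an illustration of what betweenness centrality measures, the sketch below implements Brandes' algorithm for an unweighted, undirected graph. It is not the tool used for the ranking above, and the toy graph and account names are made up.

```python
from collections import deque


def betweenness(adj):
    """Betweenness centrality via Brandes' algorithm.

    `adj` maps each node to the set of its neighbours (undirected, unweighted).
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                        # nodes in order of discovery
        preds = {v: [] for v in adj}      # predecessors on shortest paths
        sigma = {v: 0 for v in adj}       # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back towards s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


# Toy graph: a hypothetical bot that links three otherwise unconnected accounts.
adj = {
    "bot": {"a", "b", "c"},
    "a": {"bot"},
    "b": {"bot"},
    "c": {"bot"},
}
scores = betweenness(adj)
```

In the toy graph every shortest path between a, b, and c passes through the bot, so it gets the only non-zero score. A bot that retweets many otherwise unconnected accounts can end up in a similar bridging position.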

What Was the Community Talking About?


The following word cloud shows the words that were used most often. It has most of the usual suspects, such as links to IoT and cloud, businesses, marketing, etc. Among companies, Google, Intel, and IBM are mentioned.

Interestingly, we do not see any of the big data tools. It is possible that related discussions happen under their own hashtags, such as #hadoop and #spark.

The following are the most retweeted tweets over the time period:

  1. #bigdata to our users !!! check the new keyword suggestions for an improved
  2. 4 predictive #analytics and practical applications for the everyday marketer (422)
  3. marrying #data to #analytics a major theme at #hp’s conference  (153)
  4. combining analytics and security to treat vulnerabilities like ants (150)
  5. sbi uses big data mining to check defaults biz loss: when state bank of india  (141)

The following are the most retweeted tweets by day. We only list tweets that had more than 75 retweets in a day; the number of retweets each received is shown in brackets.

  1. Aug 05: guidelines to optimize #bigdata transfers (89)
  2. Aug 10: #nfl taps #bigdata to study #concussions but major game changes far off (139)
  3. Aug 10: sbi uses big data mining to check defaults biz loss: when state bank of india (sbi)  (140)
  4. Aug 12: #iot facts + how to make business sense of the internet of things (85)
  5. Aug 18: idf 2015: intel teams with google to bring realsense to project tango (113)
  6. Aug 18: marrying #data to #analytics a major theme at #hp’s conference (152)
  7. Aug 19: combining analytics and security to treat vulnerabilities like ants: bill franks chief analytics off (149)
  8. Aug 20: qantas annual profit soars to au$975m: australia’s flying kangaroo is out of the red having boosted (115)
  9. Aug 20: top news: sap oem on twitter: “top 10 #bigdata twitter handles to follow @merv (78)
  10. Aug 22: five open source big data projects to watch (132)
  11. Aug 22: 3 ways that big data are used to study #climatechange (126)
  12. Aug 22: should #bigdata be used to measure #employee #productivity? (110)
  13. Aug 23: e-commerce market #analytics to #ebay #amazon #alibaba sellers and buyers
  14. Aug 23: should #bigdata be used to measure #employee #productivity? (134)
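A per-day list like the one above can be produced with a simple count and filter. The following is a minimal sketch, assuming each retweet is available as a `(day, tweet_text)` pair; that input format, the function name, and the sample data are all hypothetical.

```python
from collections import Counter


def trending_by_day(retweet_events, min_retweets=75):
    """Return (day, text, count) for tweets with more than min_retweets in a day.

    `retweet_events` is an iterable of (day, tweet_text) pairs,
    one pair per retweet (a hypothetical input format).
    """
    counts = Counter(retweet_events)  # (day, text) -> retweets on that day
    hits = [(day, text, n) for (day, text), n in counts.items() if n > min_retweets]
    return sorted(hits)


# Made-up events: "tweet a" gets 80 retweets on one day, "tweet b" gets
# 10 on the same day and 76 on a later day.
events = (
    [("08-05", "tweet a")] * 80
    + [("08-05", "tweet b")] * 10
    + [("08-10", "tweet b")] * 76
)
trending = trending_by_day(events)
```

Because counts are kept per day, the same tweet can appear on more than one day, as happens with the #employee #productivity tweet on Aug 22 and Aug 23 above.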

One interesting observation is that most trending tweets were about use cases, not about tools or techniques.

Conclusion


  1. A few well-known tweeps receive a lot of the retweets, and the top three each roughly have their own community.
  2. The network is roughly scale-free, but it is not a small-world network. Nodes are weakly connected, which suggests non-cohesive communities.
  3. A large number of followers is a necessary but not a sufficient condition to receive a lot of retweets. Tweeting a lot seems to help.
  4. Tweet bots are centrally placed and likely useful.
  5. Most retweeted tweets seem to focus on use cases.
