Thinking Deeply about IoT Analytics

A typical IoT system would have the following architecture.

[Figure: typical IoT architecture]

As the picture depicts, sensors collect data and transfer it to a gateway, which in turn sends it to a processing system (the analytics cloud). The gateway can choose whether or not to summarize or preprocess the data.

The connection between the sensors and the gateway would be via radio frequency (e.g. Zigbee), BLE, WiFi, or even a wired connection. Often, the gateway is a mobile phone.

The connection from the gateway to the analytics servers would be via the Internet, a LAN, or a WiFi connection, and it would use a higher-level protocol such as MQTT or CoAP (e.g. see IoT Protocols).

Since our focus is on IoT analytics, let’s not drill into devices and connectivity. Assuming that part is done, how hard is IoT analytics? Is it just a matter of offloading the data into one of the IoT analytics platforms, or are there hidden surprises?

In this post, I try to answer those questions. Efforts under the theme of “Big data” have solved many IoT analytics challenges, especially the system challenges related to large-scale data management, learning, and data visualization. Data for “Big data”, however, came mostly from computer-based systems (e.g. transaction logs, system logs, social networks, and mobile phones). IoT data, in contrast, will come from the natural world and will be more detailed, fuzzy, and large. The nature of that data, the assumptions, and the use cases differ between old Big data and new IoT data. IoT analytics designers can build on top of big data, yet the work is far from done.

Let us look at a few things we need to worry about.

How fast do you need results?

Our design changes depending on how fast we need results from the data, and that decision depends on our use cases. We should ask ourselves: does the value of our insights (results) degrade over time, and how fast? For example, if we are going to improve the design of a product using data, we can wait days if not weeks. On the other hand, if we are dealing with stock markets and other similar use cases where the winner takes all, milliseconds are a big deal.

Speed comes in several levels.

  • Few hours – send your data into a data lake and use a batch processing technology such as Hadoop MapReduce or Apache Spark for processing.
  • Few seconds – send data into a stream processing system (e.g. Apache Storm or Apache Samza), an in-memory computing system (e.g. VoltDB, SAP HANA), or an interactive query system (e.g. Apache Drill) for processing.
  • Few milliseconds – send data to a Complex Event Processing (CEP) system, where records are processed one by one to produce very fast outputs.

The following picture summarizes those observations.

[Figure: the big data tooling landscape, arranged by processing latency]

Chances are we will have use cases that fall under more than one of these categories, and then we will have to use multiple technologies.
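
To make the fastest tier concrete, here is a minimal, illustrative sketch of record-at-a-time processing in the CEP style. A real deployment would use a CEP engine or stream processor rather than plain Python, and the window size and threshold here are assumptions.

```python
# A sketch of the "few milliseconds" tier: each record is processed as it
# arrives, so an alert can be raised without waiting for a batch job.
from collections import deque

def detect_spikes(readings, window=10, threshold=3.0):
    """Yield an alert as soon as a reading deviates sharply from the recent window."""
    recent = deque(maxlen=window)
    for ts, value in readings:
        if len(recent) == window:
            mean = sum(recent) / window
            if abs(value - mean) > threshold:
                yield ts, value, mean        # reacts per record, no batch wait
        recent.append(value)

# Usage with synthetic data: a flat signal with one spike at t=50
readings = [(t, 20.0 + (5.0 if t == 50 else 0.0)) for t in range(100)]
for alert in detect_spikes(readings):
    print("spike detected:", alert)
```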

How much data to keep?

Next, we should decide how much data to keep and in what form. It is a tradeoff between cost and the potential value of the data and its associated risks. Data is valuable: we see companies acquired just for their data, and Google and Facebook going to extraordinary lengths to access data. Furthermore, we might find a bug or an improvement to the current algorithm and want to go back and rerun it on old data. Having said that, all decisions must be made with the big picture and current limits in mind.

Following are our choices.

  • keep all the data and save it to a data lake (the argument is that disk is cheap)
  • process all the data in a streaming fashion and not keep any data at all.
  • keep a processed or summarized version of the data (a minimal sketch follows this list). However, it is possible that you cannot recover all the information from the summaries later.
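
As an illustration of the third choice, the following sketch keeps only hourly summaries of raw readings. The hourly granularity and the (timestamp, value) record shape are assumptions, not a prescription; a real pipeline would run this in the stream processor or batch layer.

```python
# Collapse raw readings into per-hour summaries so the raw data can be dropped.
from collections import defaultdict
from datetime import datetime, timezone

def summarize_hourly(readings):
    """Collapse raw (epoch_seconds, value) readings into per-hour min/mean/max."""
    buckets = defaultdict(list)
    for ts, value in readings:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).replace(
            minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: {"min": min(vals), "mean": sum(vals) / len(vals),
                   "max": max(vals), "count": len(vals)}
            for hour, vals in buckets.items()}

# Once summaries are stored, the raw readings can be discarded, but detail lost
# here (e.g. the exact shape of a short spike) cannot be recovered later.
```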

The next question is where to do the processing and how much of that logic we should push towards the sensors. There are three options.

  • Do all processing at analytics servers
  • Push some queries into the gateway
  • Push some queries down to sensors as well.

The IoT community already has the technology to push logic to gateways. Most gateways are full-fledged computers or mobile phones, and they can run higher-level logic such as SQL-like CEP queries. For example, we have been working to place a lightweight CEP engine on mobile phones and gateways. However, if you want to push code into sensors, in most cases you would have to write custom logic in a lower-level language like Arduino C. Another associated challenge is deploying, updating, and managing queries over time. If you choose to put custom low-level filtering code into sensors, I believe that will lead to deployment complexities in the long run.
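
As a rough illustration of edge filtering, the sketch below forwards a reading from the gateway only when it differs meaningfully from the last value sent. The delta threshold is an assumed, tunable knob and is not tied to any particular gateway product.

```python
# Gateway-side filtering: cut traffic to the analytics servers by dropping
# readings that add little information.
def gateway_filter(readings, min_delta=0.5):
    """Drop readings that differ little from the last forwarded value."""
    last_sent = None
    for ts, value in readings:
        if last_sent is None or abs(value - last_sent) >= min_delta:
            last_sent = value
            yield ts, value   # in practice: publish upstream, e.g. over MQTT

# Example: only 3 of these 5 readings are forwarded
print(list(gateway_filter([(1, 20.0), (2, 20.1), (3, 20.9), (4, 21.0), (5, 22.0)])))
```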

Analytics: Hindsight, Insight or Foresight?

Hindsight, insight, and foresight are three types of questions we can ask of data: to know what happened, to understand why it happened, and to predict what will happen.

Hindsight is possible with aggregations and applied statistics. We aggregate data by different groups and compare the results using statistical techniques such as confidence intervals and statistical tests. A key component is data visualization that shows related data in context (e.g. see Napoleon’s March and Hans Rosling’s famous TED talk).
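
For example, a hindsight-style comparison of two groups might look like the following sketch, which uses synthetic readings and a standard t-test; the group names and numbers are purely illustrative.

```python
# Aggregate readings by group and compare the groups with a statistical test.
import numpy as np
from scipy import stats

building_a = np.random.normal(21.0, 1.5, size=500)   # e.g. temperature readings
building_b = np.random.normal(22.2, 1.5, size=500)

print("mean A: %.2f  mean B: %.2f" % (building_a.mean(), building_b.mean()))
t_stat, p_value = stats.ttest_ind(building_a, building_b)
print("t = %.2f, p = %.4f" % (t_stat, p_value))  # a small p suggests a real difference
```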

Insight and foresight require machine learning and data mining. This includes finding patterns, modeling the current behavior, predicting future outcomes, and detecting anomalies. For a more detailed discussion, I suggest you start following data science and machine learning tools (e.g. R, Apache Spark MLlib, WSO2 Machine Learner, and GraphLab, to name a few).

IoT analytics will pose new types of problems and demand more focus on some existing problems. Following are some analytics problems that, in my opinion, will play a key role in IoT analytics.

Time Series Processing

Most IoT data are collected via sensors over time. Hence, they are time series data, and often the readings are autocorrelated. For example, a temperature reading is usually highly affected by the reading at the previous time step. However, most machine learning algorithms (e.g. Random Forests or SVMs) do not consider autocorrelation. Hence, those algorithms often do poorly when predicting with IoT data.

This problem has been extensively studied under time series analysis (e.g. the ARIMA model). Also, in recent years, Recurrent Neural Networks (RNNs) have shown promising results with time series data. However, widely used Big Data frameworks such as Apache Spark and Hadoop do not support these models yet. The IoT analytics community has to improve these models, build new models when needed, and incorporate them into big data analytics frameworks. For more information about the topic, please refer to the article Recurrent Neural Networks, Time Series Data and IoT: Part I.
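
One common workaround, sketched below with synthetic data, is to turn the series into a supervised problem by adding lagged readings as features, so that a standard learner such as a Random Forest can exploit the previous time steps. This is not a full forecasting pipeline, just an illustration of the idea.

```python
# Build lag features so a standard regressor can use autocorrelation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

series = np.sin(np.linspace(0, 40, 1000)) + np.random.normal(0, 0.1, 1000)

def make_lagged(series, n_lags=5):
    """Each row holds the previous n_lags readings; the target is the next reading."""
    X = np.column_stack([series[i:len(series) - n_lags + i] for i in range(n_lags)])
    y = series[n_lags:]
    return X, y

X, y = make_lagged(series)
model = RandomForestRegressor(n_estimators=100).fit(X[:800], y[:800])
print("one-step-ahead R^2 on held-out data:", model.score(X[800:], y[800:]))
```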

Spatiotemporal Analysis and Forecasts

Similarly, most IoT data will include location data, making them spatiotemporal data sets (e.g. geospatial data collected over time). Just as time series readings are affected by earlier readings, these data are affected by their spatial neighborhood. We need to explore and learn spatiotemporal forecasting and other such techniques and build tools that support them. Among the related techniques are GIS databases (e.g. GeoTrellis) and panel data analysis. Moreover, machine learning techniques such as Recurrent Neural Networks might also be used (see Application of a Dynamic Recurrent Neural Network in Spatio-Temporal Forecasting).
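
As a rough illustration, the sketch below predicts a sensor’s next reading from its own lagged value plus the lagged readings of assumed spatial neighbours; in practice a GIS layer would supply the neighbourhood, and real models would be far richer.

```python
# Add spatial lags (neighbouring sensors' previous readings) as features.
import numpy as np
from sklearn.linear_model import LinearRegression

T, n_sensors = 500, 4
common = np.cumsum(np.random.randn(T, 1), axis=0)            # shared regional trend
readings = common + 0.3 * np.cumsum(np.random.randn(T, n_sensors), axis=0)
neighbours_of_0 = [1, 2]                                      # assumed sensors near sensor 0

X = np.column_stack([readings[:-1, 0]] +
                    [readings[:-1, j] for j in neighbours_of_0])
y = readings[1:, 0]                                           # sensor 0's next reading
model = LinearRegression().fit(X, y)
print("in-sample R^2 with spatial lags:", model.score(X, y))
```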

Anomaly Detection

Many IoT use cases, such as predictive maintenance, health warnings, finding plug points that consume too much power, and various optimizations, depend on detecting anomalies. Anomaly detection poses several challenges.

  • Lack of training data – most use cases will not have training data, and hence unsupervised techniques such as clustering should be used (see the sketch after this list).
  • Class imbalance – even when training data is available, there will often be only a few dozen anomalies among millions of regular data points. This problem is generally handled by building an ensemble of models, where each model is trained with the anomalous observations plus resampled data from the regular observations.
  • Click and explore – after detecting anomalies, they must be understood in context and vetted by humans. Tools are therefore required to show those anomalies in context and to let operators explore the data further, starting from the anomalies. For example, if an anomaly is detected in a turbine, it is useful to see that anomaly within the regular data before and after it, as well as to be able to study similar cases that happened before.
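
To illustrate the unsupervised route, the following sketch clusters the regular behaviour and flags points that sit far from every cluster centre; the synthetic data and the percentile threshold are assumptions.

```python
# Distance-to-nearest-cluster as a simple unsupervised anomaly score.
import numpy as np
from sklearn.cluster import KMeans

normal = np.random.normal([20.0, 5.0], [1.0, 0.5], size=(1000, 2))   # temp, vibration
anomalies = np.array([[28.0, 9.0], [12.0, 1.0]])                     # injected outliers
data = np.vstack([normal, anomalies])

km = KMeans(n_clusters=3, n_init=10).fit(data)
dist = np.min(km.transform(data), axis=1)          # distance to the nearest centre
threshold = np.percentile(dist, 99.5)              # illustrative cut-off
print("flagged points:", np.where(dist > threshold)[0])
```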

What is our Response?

Finally, when we have analyzed the data and found actionable insights, we need to decide what to do with them. We have several choices.

[Figure: IoT use case types, arranged by how we respond to insights]

  • Visualize the results – build a dashboard that shows the data in context and lets users explore, drill down, and do root cause analysis.
  • Alerts – detect problems and notify the user via email, SMS, or pager devices. The primary challenge is false positives, which severely affect the operator’s trust in the system. Finding the balance between false positives and ignoring true problems is tricky (see the sketch after this list).
  • Carrying out actions – the next level is independent actions with open control loops. However, unlike the former case, a wrong diagnosis could have catastrophic consequences. Until we have a deeper understanding of the context, these use cases will be limited to simple applications such as turning off a light or adjusting heating, where the associated risks are small.
  • Process & environment control – this is the holy grail of automated control. The system continuously monitors and controls the environment or the underlying process in a closed control loop. It has to understand the context and the environment, and it should be able to work around failed actions. Much related work was done under the theme of autonomic computing around 2001-2005, although few of those use cases were ever deployed. Production deployments of this class are still several years away due to the associated risks. We can think of NEST and Google’s self-driving car as first examples of such systems.

In general, we move towards automation when we need fast responses (e.g. algorithmic trading). More automation can be cheaper in the long run, but it is likely to be complex and expensive in the short run. As we learned from stock market crashes, the associated risks must not be underestimated.

It is worth noting that automation with IoT will be harder than in big data automation use cases. Most big data automation use cases either monitor computer systems or controlled environments like factories. In contrast, IoT data will often be fuzzy and uncertain. It is one thing to monitor and change a variable in an automatic price-setting algorithm. Automating a use case in the natural world (e.g. airport operations) is something different altogether. If we decide to go the automation route, we need to spend significant time understanding, testing, and retesting our scenarios.

Understanding IoT Use cases

Finally, let me wrap up by discussing the shape of common IoT data sets and the use cases that arise from them.

Data from most devices would have the following fields (an illustrative record is sketched after the list).

  • Timestamp
  • Location, Grouping, or Proximity Data
  • Several readings associated with the device, e.g. temperature, voltage, power, rpm, acceleration, torque, etc.
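
For illustration, a single reading with these fields might look like the record below; the names and units are assumptions, not a standard schema.

```python
# A hypothetical IoT reading combining timestamp, location/grouping, and sensor values.
reading = {
    "device_id": "pump-17",
    "timestamp": 1449571200,                                   # epoch seconds
    "location": {"lat": 6.9271, "lon": 79.8612, "group": "plant-colombo"},
    "readings": {"temperature_c": 41.2, "rpm": 1450, "power_kw": 3.8},
}
```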

The first use case is to monitor, visualize, and alert on data from a single device. This use case focuses on individual device owners.

However, more interesting use cases occur when we look at devices as part of a larger system: a fleet of vehicles, buildings in a city, a farm, etc. Among the aforementioned fields, time and location play a key role in most IoT use cases. Using those two, we can categorize most use cases into two classes: stationary dots and moving dots.

Stationary dots

Examples of “stationary dot” use cases are equipment deployments (e.g. buildings, smart meters, turbines, and pumps). Their location is useful only as a grouping mechanism. The main goal is to monitor an already deployed system in operation.

Following are some of the use cases.

  • View of the current status, alerts on problems, drill down and root cause analysis
  • Optimizations of current operations
  • Preventive Maintenance
  • Surveillance

Moving dots

Examples of “moving dot” use cases are fleet management, logistics networks, wildlife monitoring, monitoring customer interactions in a shop, traffic, etc. The goal of these use cases is to understand and control the movements, interactions, and behavior of the participants.

[Video: WSO2 CEP TfL demo on YouTube]

Following are some examples.

  • Sports analytics (e.g. see the following video)
  • Geo Fencing and Speed Limits
  • Monitoring customer behavior in a shop, guided interactions, and shop design improvements
  • Visualizing (e.g. time-lapse videos) of movement dynamics
  • Surveillance
  • Route optimizations

For example, the following is a sports analytics use case built using data from a real football game.

For both types of use cases, I believe it is possible to build generic, extensible tools that provide an overall view of the devices and out-of-the-box support for some of the use cases. However, specific machine learning models, such as anomaly detection, would need expert intervention for best results. Such tools, if done right, could facilitate reuse, reduce cost, and improve the reliability of IoT systems. It is worth noting that this is one of the things the “Big data” community did right: a key secret of “Big data” success so far has been the availability of high-quality, generic, open source middleware tools.

Also, there is room for companies that focus on specific use cases or classes of use cases. For example, Scanalytics focuses on foot traffic monitoring and Second Spectrum focuses on sports analytics. Although expensive, they provide integrated, ready-to-go solutions. IoT system designers have a choice between going with a specialized vendor and building on top of open source tools (e.g. the Eclipse IoT platform or the WSO2 Analytics Platform).

Conclusion

This post discusses different aspects of IoT analytics solutions, pointing out the challenges you need to think about while building or choosing an IoT analytics solution.

Big data has solved many IoT analytics challenges, especially the system challenges related to large-scale data management, learning, and data visualization. However, significant thinking and work are required to match IoT use cases to analytics systems.

Following are the highlights.

  • How fast do we need results? Real-time vs. batch, or a combination.
  • How much data to keep? Based on the use cases and the incoming data rate, we might choose between keeping nothing, a summary, or everything. Edge analytics is a related aspect of the same problem.
  • Do we want hindsight, insight, or foresight from our analytics? Decide between aggregation and machine learning methods. Also, techniques such as time series and spatiotemporal algorithms will play a key role in IoT use cases.
  • What is our response when we have an actionable insight? Show a visualization, send alerts, or do automatic control.

Finally, we discussed the shape of IoT data, a few reusable scenarios, and the potential for building middleware solutions for those scenarios.

Hope this was useful. If you have any thoughts, I would love to hear from you.

 


Taxonomy of IoT Use Cases: Seeing the IoT Forest from the Trees

IoT comes in many forms, and the variation in use cases seems endless. IoT devices themselves come in many types and can be arranged in different configurations.

Following are some of those device classes.

  • Ad-hoc/ Home/ Consumer (embeddables, wearables, holdables, surroundables; see Four Types of Internet of Things?)
  • Smart Systems – they monitor the outside world, have lots of small sensors, and have hubs that connect via Zigbee or cellular, with a connection from the hubs to the cloud
  • M2M/ Industrial Internet (sensors are built in and often pre-designed)
  • Drones and Cameras (never underestimate the most ubiquitous IoT device, the video camera)

Those devices can be used to solve a wide range of problems. Obviously, it is hard to do a complete taxonomy, yet writing even a subset down helps us a lot in understanding IoT.

The taxonomy is arranged around people: each level moves further away from the individual and becomes higher level. The levels range from the personal (e.g. wearables) to macro-level control (smart cities). The following picture shows each category.

[Figure: taxonomy of IoT use cases]

Let us look at each category in detail.

1. Wearables

Wearables are devices that are with you. They range from pills you might swallow, to a Fitbit, a watch, or your mobile phone. The goal of these use cases is to make your life better.

  • Health: Fitbit, personal health (e.g. Incentives for good habits)
  • From asset tracking to smart signage, and safety
  • Sports – digital coach, better sport analytics
  • Facial Recognition with real-life analytics and interactions

2. Smart Homes

These use cases try to monitor and improve your home, giving you peace of mind, comfort, and efficiency.

  • Energy efficiency, smart lighting, smart metering, smart elements, smart heating, smart rooms and bedrooms
  • Integration with calendars and other data, deriving context, and taking decisions to drive the home environment based on the current context
  • Safety and security via home surveillance, monitoring health and kids, perimeter checks for pets and kids, etc.
  • Smart gardens (e.g. watering, status monitoring)

You can find more information from 9 Ways A Smart Home Can Improve Your Life.

3. Appliances

Appliances have a dual role. On one hand, they provide new experiences to the end user and hence play a role in the smart home. On the other hand, they give the manufacturer better visibility into and control of the appliance. Devices include your car, smart lawn mowers, kettles, etc. Most products will have a digital twin that provides analytics and important information to both the consumer and the manufacturer.

Following are some use cases.

  • Products can interact with users better, and can optimize, learn, and adapt to the user (e.g. smart washers and dryers that notify you when done, and product displays being replaced with apps)
  • Better after-sales service: better diagnosis, remote diagnosis (efficient customer support), faster updates and critical patches
  • Adaptive and proactive maintenance as needed. With IoT, products can monitor themselves and act if there is a problem
  • Using product usage data to improve product design
  • Getting some appliances (e.g. expensive ones like lawn mowers) under a pay-per-use model rather than buying them
  • Knowing the customer better: better segmentation, avoiding churn (if the customer is not using the product, find out)
  • Hobbyists/ entertainment (e.g. drone racing, drone cameras)
  • Advertisements via your appliance (e.g. a refrigerator lets you order missing food via an app, and the manufacturer may charge companies for the recommendations it makes)

The HBR article How Smart, Connected Products Are Transforming Companies provides a good discussion of some of these use cases.

4. Smart Spaces

Smart space use cases monitor and manage a space such as a farm, a shop, or a forest. They involve pre-designed sensors as well as ad-hoc sensors such as drones. Often, computer vision from cameras also plays a key role.

Following are some of the use cases.

  • Smart agriculture: watering based on moisture levels, pest control, and livestock management, correlated with other data sources like weather, and delivery of pesticides via drones
  • Surveillance (wildlife, endangered species, forest cover, forest fires)
  • Smart retail: smart stores (sensors to monitor what gets attention), fast checkouts (e.g. via RFID), customer analytics for stores, in-store targeted offers via smartphones, and better customer service at the store
  • Quick service restaurants (QSR) – measure staff performance and service, improve floor plans and remove bottlenecks, optimize queues and turnover
  • Smart buildings (power, security, proactive maintenance, HVAC, etc.)

For related use cases, see How The Internet of Things Will Shake Up Retail In 2015 and The Future Of Agriculture?

5. Smart Services Industries/ Logistics

These use cases apply IoT to improve the services industry and logistics. They focus on monitoring and improving the underlying processes of those businesses. Following are a few examples.

  • Smart logistics and supply chains (tracking, RFID tags)
  • Service industries: airlines, hospitality, etc. The goals are efficient operations, visibility (e.g. where is my baggage?), and proactive maintenance
  • Financial services, smart banking, usage-based insurance, better data for insurance, and fraud detection via better data
  • Better delivery of products via drones
  • Aviation – report and find the problem, and have the fix and parts ready before the plane lands
  • Telecommunications networks

6. Smart Health

Smart health will be a combination of wearables, smart homes, and smart services. This includes use cases like better health data through wearables, better care at hospitals, in-home care, smart pill bottles that monitor and make sure medications are taken, and better integration of health records.

7. Industrial Internet

The idea of the industrial internet is to use sensors and automation to better understand and manage complex processes. Unlike smart spaces, these use cases give the owners much more flexibility and control. Most of these environments already have sensors and actuators installed, and most of these use cases predate IoT and fall under M2M.

Following are some use cases.

  • Smart manufacturing
  • Power and renewable energy (e.g. wind turbines, oil and gas): operations and predictive maintenance. The goal is to add value on top of existing assets (which take about 40 years to replace)
  • Mining
  • Transport: trains, buses
  • HVAC and industrial machines

You can find more use cases in GE’s “making the world 1% better” initiative.

8. Smart Cities

Smart cities (and maybe nations) bring everything together and provide a macro view of everything. They focus on improving the public infrastructure and services that make urban living better.

Following are some of the use cases.

  • Waste management, smart parking (e.g. finding parking spots)
  • Traffic management (sensors, drones), air and water quality, smart road tax
  • Security: surveillance, gunfire sensors, smart street lighting, flooding alerts
  • Smart buildings (energy, elevators, lighting, HVAC), smart bridges and constructions (with many sensors embedded in the concrete, etc.)
  • Urban planning

You can find more information from articles How Big Data And The Internet Of Things Create Smarter Cities, and Smart Cities — A $1.5 Trillion Market Opportunity.

Conclusion

As we saw, use cases come in many forms and shapes, and they will likely get integrated with and change our lives at many different levels. This is why analysts have forecast an unprecedented number of devices (e.g. 15-50B by 2020) as well as a market size (e.g. 1-7 trillion dollars by 2020) for IoT, dwarfing earlier trends like SOA and Big data.

Following are a few observations about the use cases.

  • Each use case tries to solve a real problem. It does so by finding the problem, instrumenting data around it, analyzing that data, and providing actionable insights or carrying out actions.
  • Some use cases are enabled by creative sensors, such as using a camera to measure your heart rate or sensors mixed into the concrete while building a bridge.
  • Analytics is present in almost all use cases. A key, yet often unspoken, assumption is that all data gets collected and analyzed later. We call this batch analytics.
  • However, a lot of use cases need real-time decisions and sometimes need to act on those decisions. There have been many efforts on real-time analytics, but comparatively less work has been done on acting on the decisions.
  • These use cases might lead to other use cases, such as showing related advertisements on your appliance or on the associated mobile app.

Hope this was useful. I would love to hear your thoughts about the different categories and use cases.

Beyond Distributed ML Algorithms: Techniques for Learning from Large Data Sets

Almost always, more data lets machine learning (ML) algorithms do better. Sometimes, more data lets simpler algorithms like logistic regression do better than complex algorithms such as SVMs. This has been observed in academia (e.g. see A Few Useful Things to Know about Machine Learning), in the community (e.g. In Machine Learning, What is Better: More Data or Better Algorithms?), and in Kaggle competitions.

Moreover, more data has enabled previously underperforming algorithms like neural networks to come back and take over the limelight. For example, Google has used the new reincarnation of neural networks, deep learning, for image recognition with amazing results. Try a query like “boy on a tree” in Google image search, and the results will amaze you.

[Figure: Google image search results for “boy on a tree”]

In this post, let’s explore different methods for learning from large datasets. An obvious method is parallel and distributed execution. One of the key points I want to make is that, although effective, distributed execution is not the only option.

Let’s start with a great talk by Ron Bekkerman on the topic.

He provides a great overview of our topic. Let’s start with Hadoop.

Use Hadoop

When the community looked to learn from large datasets, they already knew a way to do parallel execution: Hadoop (MapReduce). So everyone tried ML algorithms using Hadoop, which kind of worked. Hundreds of papers were written, and Apache Mahout came out as the open source implementation of those ML algorithms.

That got people started. Hadoop-based processing, however, had a big flaw. Most machine learning algorithms have an iterative part (see the famous paper A Few Useful Things to Know about Machine Learning). To run the iterative part, the Hadoop model must load the data from the file system again and again. Since network and disk IO are the main bottlenecks for distributed computations like MapReduce, Hadoop was very slow. The article MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! is a very good treatment of the related aspects.

Of course, this was fiercely contested (e.g. see MapReduce is Good Enough). However, arguments do not make performance problems go away. When an alternative, in the form of Apache Spark, became available, people started to move on.

New Techniques for Scaling ML

To run an algorithm in parallel, we need to somehow break the problem into smaller parts and assign them to different threads or machines. This is a problem that has been well studied (e.g. see the famous 13 Dwarfs paper). The post An Introduction to Distributed Machine Learning by Krishna Sridhar describes the motivation behind this approach.

We have to either partition the data (e.g. KD trees, max-margin trees, convex trees) or partition the execution. However, most machine learning algorithms are not embarrassingly parallel, which means the threads or machines need to communicate. This is bad news: Amdahl’s law says that the resulting sequential parts of the algorithm become prohibitively expensive.

Then came a breakthrough. Machine learning algorithms are optimization problems: they search a large parameter space to find the function or representation that best represents the data. For this search, the data does not need to be consistent. Instead, the workers can continue while lazily updating each other, and the answer will still be correct. The post Parallel Machine Learning with Hogwild!, by Krishna Sridhar, describes this beautifully.

That means we can break most machine learning algorithms apart (e.g. by data) and run them in parallel while communicating lazily, without slowing down the sender or the receiver. This is the approach used by Apache Spark. Coupled with Spark’s ability to process the same data again and again in memory, it became much easier to implement these algorithms. So much so that Apache Mahout, the Hadoop machine learning project, switched to Spark and stopped adding new Hadoop-based implementations.
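
For illustration, fitting a model with Spark’s MLlib might look like the following sketch, which caches the training data in memory so the optimizer’s iterative passes reuse it. A local Spark installation is assumed, and the file name and column names are placeholders of my own, not anything from a real dataset.

```python
# A sketch of iterative training on cached data with Apache Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("iot-ml-sketch").getOrCreate()
df = spark.read.csv("sensor_features.csv", header=True, inferSchema=True)

# Assemble assumed numeric columns into a single feature vector
assembler = VectorAssembler(inputCols=["temp", "vibration", "rpm"], outputCol="features")
train = assembler.transform(df).select("features", "label").cache()   # kept in memory

model = LogisticRegression(maxIter=50).fit(train)   # iterations reuse the cached data
print(model.coefficients)
```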

The above approach partitions the data and runs it in a batch execution style. However, lazy communication between different jobs is complicated in batch-style systems like Spark. An alternative is to break up the data and assign it to different nodes, pinning each partition to a single node. Then, while carrying out computations, the nodes can periodically broadcast their current state to the other nodes asynchronously.

However, broadcasting in a distributed system is both expensive and complicated. To solve this problem, a centralized approach is used: a central server called the “parameter server” periodically collects the current state from the nodes and redistributes it back to everyone. Die-hard distributed-systems people do not like this because of the central server, but the state of a machine learning algorithm is small, and this approach scales for most practical applications. Indeed, Google uses it. You can find more information in the following talk by Jeff Dean.

This is primarily used to scale up neural networks and probabilistic graphical models (Kalman filters, belief networks). You can find an open source implementation at http://parameterserver.org. In the following talk, Alex Smola discusses parameter servers in detail.
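
To make the idea concrete, here is a toy, single-process illustration of the parameter-server pattern: workers compute gradients on their own data shard and periodically pull and push a shared parameter vector, tolerating slightly stale values. Real systems do this across machines and asynchronously; everything here is simplified for illustration.

```python
# A toy parameter-server loop: a shared parameter vector updated by per-shard gradients.
import numpy as np

def sgd_with_parameter_server(shards, dim, rounds=100, lr=0.1):
    params = np.zeros(dim)                          # held by the "parameter server"
    for _ in range(rounds):
        for X, y in shards:                         # each shard plays one worker
            local = params.copy()                   # pull (possibly stale) parameters
            grad = X.T @ (X @ local - y) / len(y)   # least-squares gradient on the shard
            params -= lr * grad                     # push the update back to the server
    return params

# Synthetic example: two workers whose data share the same true weights
true_w = np.array([2.0, -1.0])
shards = []
for _ in range(2):
    X = np.random.randn(500, 2)
    shards.append((X, X @ true_w + 0.01 * np.random.randn(500)))
print(sgd_with_parameter_server(shards, dim=2))     # should approach [2.0, -1.0]
```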

Avoid Parallelism and Make Data Small

However, it has not been clearly established that parallel distributed execution is indeed the superior approach for all kinds of problems. For example, Ben Hamner from Kaggle observes in the following talk that downsampling to 1/10 or even 1/100 of the data often does not significantly affect the final results in most competitions. Furthermore, he observes that most winners are the teams that can iterate on and improve their solutions fastest.

Hence, sampling is a viable and very powerful approach, especially in the initial stages when the data scientist is exploring possible solutions. Interesting related work has been done by Prof. Michael Jordan’s group, which they call the Bag of Little Bootstraps (BLB). The main idea is to resample the dataset with replacement, build models, and then look at the error bars to decide on the quality of the models. You can find more information in their paper, A Scalable Bootstrap for Massive Data.
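
As a rough illustration of why sampling plus error bars works, the sketch below fits a trivial “model” (the mean) on several small resamples and uses the spread of the estimates to judge whether the downsampled answer is good enough. BLB formalizes and corrects this idea, so treat the code only as an intuition pump, not as an implementation of BLB itself.

```python
# Resample small subsets, fit on each, and inspect the spread of the estimates.
import numpy as np

data = np.random.exponential(scale=3.0, size=1_000_000)   # stand-in for a huge dataset

estimates = []
for _ in range(20):
    subset = np.random.choice(data, size=10_000, replace=True)
    estimates.append(subset.mean())                        # the "model" here is just the mean

# A tight spread suggests the downsampled estimate can be trusted
print("estimate: %.3f  +/- %.3f" % (np.mean(estimates), 2 * np.std(estimates)))
```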

The second idea comes from observing that in distributed computations, a significant part of the computing power is spent on communication. If we have enough memory and use technologies like GPUs, can we solve most problems on a single multi-core computer? The answer is yes: it has been demonstrated that this approach can handle moderately sized data sets. For example, in 2009, a GPU-based KMeans algorithm clustered 1 billion data points into 1,000 clusters in only 26 minutes, while a distributed approach took 6 days. You can find more information in the blog post GPU and Large Scale Data Mining by Suleiman Shehu.

Finally, streaming can also help. Most of the time we collect data for hours and then want to build a model from it very fast. However, if we build the model in a streaming fashion as the data arrives, we have much more time available for the computation, and in some cases even a single machine might be enough. One major weakness, however, is that streaming algorithms are fixed and cannot be used for explorative data analysis.
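
As an illustration of the streaming approach, the sketch below updates a model incrementally with scikit-learn’s partial_fit as mini-batches arrive; the batch shape and labels are synthetic placeholders for incoming data windows.

```python
# Incremental training: the model is updated as data arrives, so no large
# historical dataset needs to be kept around.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])

for _ in range(100):                                   # e.g. one mini-batch per minute
    X = np.random.randn(200, 5)                        # features from the latest window
    y = (X[:, 0] + X[:, 1] > 0).astype(int)            # stand-in labels
    model.partial_fit(X, y, classes=classes)           # model stays continuously up to date

print("coefficients so far:", model.coef_)
```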

Parting Thoughts

I believe we should be practical. In some large use cases, like Google’s image search, we must use large distributed machine learning algorithms. However, when possible, we should also use simpler methods, especially at the initial phase while exploring possible models. Remember that it is often whoever iterates fastest who wins in Kaggle.

Other Resources

Following is some other content relevant to the topic, although I did not refer to it above.

  1. Ron Bekkerman, http://hunch.net/~large_scale_survey/
  2. Scaling Decision trees http://hunch.net/~large_scale_survey/TreeEnsembles.pdf
  3. What is Scalable Machine Learning?, http://blog.mikiobraun.de/2014/07/what-is-scalable-machine-learning.html
  4. Scaling big data mining infrastructure: the twitter experience J Lin, D Ryaboy – ACM SIGKDD Explorations Newsletter, 2013 – dl.acm.org
  5. Monoidify! monoids as a design principle for efficient mapreduce algorithms, J Lin, http://arxiv.org/abs/1304.7544 
  6. Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML, http://www.vldb.org/pvldb/vol7/p553-boehm.pdf