Almost always, more data let machine learning (ML) algorithms do better. Sometimes, more data let simpler algorithms like logistic regression do better than complex algorithms such as SVM. This has been observed in academia (e.g. see A Few Useful Things to Know about Machine Learning), community (e.g. In Machine Learning, What is Better: More Data or better Algorithms) and Keggale competitions.
Moreover, more data has enabled previously underperforming algorithms like Neural Networks to come back and take over the limelight. For an example, Google has used the new reincarnation of Neural networks, Deep Learning, for image recognition with amazing results. Try a query like “boy on a tree” in Google image search, and the results will amaze you.
In this post, let’s explore different methods for learning from large datasets. An obvious method is parallel and distributed execution. One of the key points I want to make is that although effective, distributed executions are not the only option.
Le’ts start with a great talk by Ron Beckerman on the topic.
He provides a great overview into our topic. Let’s start with Hadoop.
When community looked to learn from large datasets, they already knew a way to do parallel executions: Hadoop (MapReduce). So everyone tried ML algorithms using Hadoop, which kind of worked. There are hundreds of papers written and Apache Mahout came out as the opensource implementation of those ML algorithms.
That got people started. Hadoop-based processing, however, had a big flow. Most Machine learning algorithms have an iterative part (see the famous paper A Few Useful Things to Know about Machine Learning). To run the iterative part, the Hadoop model must load the data from the file system again and again. Since Network and Disk IO are the main bottlenecks for distributed computations like MapReduce, the Hadoop was very slow. The article, MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! , is a very good treatment of the related aspects.
Of course, this was fiercely competed (e.g. see MapReduce is Good Enough). However, arguments do not make performance problems go away. When an alternative, in the form of Apache Spark, become available, people started to move on.
New Techniques for Scaling ML
To run an algorithm parallelly, we need to somehow break the problem into smaller parts and assign it to different threads or machines. This is a problem that has been well studied (e.g. see famous 13 Dwarfs paper). The post, An Introduction to Distributed Machine Learning by Krishna Sridhar, describes the motivation behind this approach.
We have to either partition the data (e.g. KD trees, Max-margin trees, Convex trees) or partition the execution. However, most machine learning algorithms were not embarrassingly parallel, which means you need communications between your threads or machines. This is bad news. Amdahl’s law says that resulting sequential parts in the algorithms are prohibitively expensive.
Then come a breakthrough. Machine learning algorithms are optimization problems, and they search a large parameter space to find the function or representation that best represent the data. For this search, data need not be consistent. Instead, the algorithm can continue while lazily updating each other, and still the answer will be correct. The post, Parallel Machine Learning with Hogwild!, by Krishna Sridhar describes this beautifully.
That means we can just break most machine learning algorithms (e.g. by data) and run them parallel while communicating lazily without slowing down sender or the receiver. This is the approach used by Apache Spark. Coupled with its ability to process data again and again, it was much easier to implement algorithms with Apache Spark. So much so that Apache Mahout, the Hadoop Machine Learning project, switched to Spark and stop adding new Hadoop-based executions.
Above approach partitions the data and run it in a batch execution style. However, lazy communication between different jobs is complicated in batch style systems like Spark. Alternative is to break the data and assign them to different nodes, which will pin data always to a one node. Then, while carrying out computations, nodes can periodically broadcast their current state to other nodes in an asynchronous style.
However, broadcasting in a distributed system is both expensive and complicated. To solve this problem, a new centralized approach is used. The idea is to use a centerlized server called “parameter server”, that collects the current state of nodes periodically and redistributes it back to everyone. Die hard distributed people does not like this due to the central server, but the state of machine learning algorithms are small and this approach scale for most practical applications. Indeed, Google uses this. You can find more information from the following talk by Jeff Dean.
This is primarily used to scale up Neural networks and Probabilistic Graphical Models (Kalman filters, Belief Networks). You can find an opensource implementation from http://parameterserver.org. In the following talk, Alex Somla talks about parameter servers in detail.
Avoid Parallelism and Make Data Small
However, it has not been clearly established that parallel distributed execution is indeed the superior approach for all kind of problems. For an example, Ben Hamner from Kaggle observes in the following talk that down sampling 1/10 to 1/100 often does not affect final results significantly in most competitions. Furthermore, he observes that most winners are teams that can iterate and improve their solutions faster.
Hence sampling is a viable and very powerful approach. Specially, at the initial stages where the data scientist explores possible solutions. An interesting related work has done by prof. Michal Jorden’s group, which they call Bag of Little Bootstrap (BLB). The main Idea is to sample the dataset with replacements, build models, and then looking at error bars to decide on the quality of models. You can find more information from their paper from A scalable bootstrap for massive data.
The second idea is to observe that in distributed computations, a significant part of the computing power is spent on communications. If we have enough of memory and use technologies like GPUs, can we solve most problems in single multi-core computing? The answer is yes. It has been demonstrated that this approach can handle moderate size data sets. For example, in 2009, GPU based KMeans algorithm clustering 1 billion data points looking for 1,000 clusters took only 26 minutes while distributed approach took 6 days. You can find more information from the blog post, GPU and Large Scale Data Mining, by Suleiman Shehu.
Finally, streaming can also help. Most of the time we collect data for hours and want to build a model using those data very fast. However, if we build the model in streaming fashion as data arrives, we have much more time available for the computation and in some cases even a single machine might be enough. However, one major weakness is that streaming algorithms are fixed, and cannot be used to do explorative data analysis.
I believe we should be practical. Although, in some large use cases like Google’s image search, we must use large distributed machine learning algorithms. However, when possible, we should also use simpler methods. Specially, at the initial exploration phase while exploring possible models. Remember that is is often who can iterate fastest wins in Kaggle.
Following are some of the other content that are relevant to the topic, although I did not refer to them above.
- Ron Bekkerman, http://hunch.net/~large_scale_survey/
- Scaling Decision trees http://hunch.net/~large_scale_survey/TreeEnsembles.pdf
- What is Scalable Machine Learning?, http://blog.mikiobraun.de/2014/07/what-is-scalable-machine-learning.html
- Scaling big data mining infrastructure: the twitter experience J Lin, D Ryaboy – ACM SIGKDD Explorations Newsletter, 2013 – dl.acm.org
- Monoidify! monoids as a design principle for efficient mapreduce algorithms, J Lin, http://arxiv.org/abs/1304.7544
- Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML, http://www.vldb.org/pvldb/vol7/p553-boehm.pdf