2 March 2017

Scalable Machine Learning

Reasons to scale machine learning:
  • training data doesn't fit on a single machine
  • training a model takes too long
  • the volume of incoming data is too high
  • predictions must be served with low latency
How to spend less effort (and money) on scalable infrastructure:
  • choose a fast, lean ML algorithm that works accurately on a single machine
  • subsample the data (see the sketch after this list)
  • scale vertically with a bigger machine
  • sacrifice accuracy when that is cheaper than scaling out
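As a concrete example of subsampling, here is a minimal NumPy sketch; the subsample function and its fraction/seed parameters are illustrative names, not from any particular library:

    import numpy as np

    def subsample(X, y, fraction=0.1, seed=42):
        """Keep a uniform random fraction of the training rows."""
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=int(len(X) * fraction), replace=False)
        return X[idx], y[idx]

    # e.g. train on 10% of a large dataset to cut training time:
    # X_small, y_small = subsample(X, y, fraction=0.1)

A smaller, representative sample often trains in a fraction of the time with only a modest loss in accuracy.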
Horizontal scalability options:
  • Hadoop ecosystem with Mahout
  • Spark ecosystem with MLlib
  • Turi (formerly GraphLab)
  • streaming technologies like Kafka, Storm, AWS Kinesis, Flink, and Spark Streaming
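To illustrate the Spark route, here is a minimal PySpark sketch that trains a logistic regression with MLlib; the input path data/train.libsvm is a placeholder for your own dataset:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("scalable-lr").getOrCreate()

    # Load a libsvm-format training set; Spark partitions it across workers.
    train = spark.read.format("libsvm").load("data/train.libsvm")

    # Training runs as a distributed optimization over the partitions.
    lr = LogisticRegression(maxIter=20, regParam=0.01)
    model = lr.fit(train)

    model.transform(train).select("prediction").show(5)
    spark.stop()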
Scalability considerations for a model-building pipeline:
  • choose a scalable algorithm like logistic regression or a linear SVM
  • scale up nonlinear algorithms by approximating them, e.g. replacing an exact kernel with random features (see the sketch after this list)
  • use a distributed infrastructure to scale out
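A minimal sketch of the approximation idea, assuming scikit-learn's RBFSampler (random Fourier features) as the kernel approximation with a linear SGD classifier on top, trained here on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

    # Random Fourier features approximate an RBF kernel; a linear model on
    # the transformed features scales near-linearly in the number of rows,
    # unlike an exact kernel SVM, at the cost of some accuracy.
    model = make_pipeline(
        RBFSampler(gamma=1.0, n_components=300, random_state=0),
        SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3),
    )
    model.fit(X, y)
    print("training accuracy:", model.score(X, y))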
How to scale predictions in both volume and velocity:
  • use an infrastructure that scales out across many workers
  • send the same prediction request to multiple workers and return the first response, trading extra compute for lower latency (see the sketch after this list)
  • choose an algorithm that can parallelize across multiple machines
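A minimal sketch of the duplicate-request trick using Python threads; predict_on_worker is a hypothetical stand-in for a real RPC or HTTP call to a worker, simulated here with a random sleep:

    import random
    import time
    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def predict_on_worker(worker_id, features):
        """Stand-in for a call to one prediction worker (simulated latency)."""
        time.sleep(random.uniform(0.01, 0.2))
        return worker_id, sum(features)  # dummy "prediction"

    def fastest_prediction(worker_ids, features):
        """Fan the same request out to every worker; return the first reply.

        Duplicating work wastes compute but caps tail latency at the
        fastest worker's response time.
        """
        with ThreadPoolExecutor(max_workers=len(worker_ids)) as pool:
            futures = [pool.submit(predict_on_worker, w, features)
                       for w in worker_ids]
            done, pending = wait(futures, return_when=FIRST_COMPLETED)
            for f in pending:
                f.cancel()  # best effort; threads already running will finish
            return next(iter(done)).result()

    print(fastest_prediction(["w1", "w2", "w3"], [0.5, 1.2, -0.3]))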
A curious alternative to Hadoop for scalability is Vowpal Wabbit, which builds models on large datasets without requiring a big data system. Feature selection also comes in handy when one wants to shrink the training data by selecting and retaining the most predictive subset of features; Lasso is a linear algorithm often used for this purpose (a sketch follows below).

With respect to prediction velocity and volume, scaling in volume means being able to handle more data, while scaling in velocity means doing it fast enough for the use case. One also has to weigh the trade-off between speed and accuracy of predictions.
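To make the Lasso-based feature selection concrete, here is a minimal scikit-learn sketch on synthetic data; the alpha value is illustrative and would need tuning on real data:

    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=1000, n_features=100,
                           n_informative=10, random_state=0)

    # The L1 penalty drives coefficients of uninformative features to
    # exactly zero, so the surviving columns are the predictive subset.
    selector = SelectFromModel(Lasso(alpha=1.0))
    X_reduced = selector.fit_transform(X, y)
    print(X.shape, "->", X_reduced.shape)  # far fewer columns to train on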