2 March 2017

Scalable Machine Learning

Reasons to scale machine learning:
  • training data doesn't fit on a single machine
  • training a model takes too long
  • the volume of incoming data is too high
  • predictions must be served with low latency
How to spend less effort (and money) on scalable infrastructure:
  • choose a fast, lean ML algorithm that works accurately on a single machine
  • subsample the data (see the sketch after this list)
  • scale vertically with a bigger machine
  • sacrifice accuracy when that is cheaper than scaling out
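As a concrete example of subsampling, here is a minimal NumPy sketch; the subsample function and its fraction/seed parameters are illustrative names, not from any particular library:

    import numpy as np

    def subsample(X, y, fraction=0.1, seed=42):
        """Keep a uniform random fraction of the training rows."""
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=int(len(X) * fraction), replace=False)
        return X[idx], y[idx]

    # e.g. train on 10% of a large dataset to cut training time:
    # X_small, y_small = subsample(X, y, fraction=0.1)

A smaller, representative sample often trains in a fraction of the time with only a modest loss in accuracy.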
Horizontal scalability options:
  • Hadoop ecosystem with Mahout
  • Spark ecosystem with MLlib
  • Turi (formerly GraphLab)
  • streaming technologies like Kafka, Storm, AWS Kinesis, Flink, and Spark Streaming
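To illustrate the Spark route, here is a minimal PySpark sketch that trains a logistic regression with MLlib; the input path data/train.libsvm is a placeholder for your own dataset:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("scalable-lr").getOrCreate()

    # Load a libsvm-format training set; Spark partitions it across workers.
    train = spark.read.format("libsvm").load("data/train.libsvm")

    # Training runs as a distributed optimization over the partitions.
    lr = LogisticRegression(maxIter=20, regParam=0.01)
    model = lr.fit(train)

    model.transform(train).select("prediction").show(5)
    spark.stop()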
Scalability considerations for a model-building pipeline:
  • choose a scalable algorithm like logistic regression or a linear SVM
  • scale up nonlinear algorithms by approximating them, e.g. replacing an exact kernel with random features (see the sketch after this list)
  • use a distributed infrastructure to scale out
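A minimal sketch of the approximation idea, assuming scikit-learn's RBFSampler (random Fourier features) as the kernel approximation with a linear SGD classifier on top, trained here on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

    # Random Fourier features approximate an RBF kernel; a linear model on
    # the transformed features scales near-linearly in the number of rows,
    # unlike an exact kernel SVM, at the cost of some accuracy.
    model = make_pipeline(
        RBFSampler(gamma=1.0, n_components=300, random_state=0),
        SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3),
    )
    model.fit(X, y)
    print("training accuracy:", model.score(X, y))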
How to scale predictions in both volume and velocity:
  • use an infrastructure that scales out across many workers
  • send the same prediction request to multiple workers and return the first response, trading extra compute for lower latency (see the sketch after this list)
  • choose an algorithm that can parallelize across multiple machines
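A minimal sketch of the duplicate-request trick using Python threads; predict_on_worker is a hypothetical stand-in for a real RPC or HTTP call to a worker, simulated here with a random sleep:

    import random
    import time
    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def predict_on_worker(worker_id, features):
        """Stand-in for a call to one prediction worker (simulated latency)."""
        time.sleep(random.uniform(0.01, 0.2))
        return worker_id, sum(features)  # dummy "prediction"

    def fastest_prediction(worker_ids, features):
        """Fan the same request out to every worker; return the first reply.

        Duplicating work wastes compute but caps tail latency at the
        fastest worker's response time.
        """
        with ThreadPoolExecutor(max_workers=len(worker_ids)) as pool:
            futures = [pool.submit(predict_on_worker, w, features)
                       for w in worker_ids]
            done, pending = wait(futures, return_when=FIRST_COMPLETED)
            for f in pending:
                f.cancel()  # best effort; threads already running will finish
            return next(iter(done)).result()

    print(fastest_prediction(["w1", "w2", "w3"], [0.5, 1.2, -0.3]))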
A curious alternative to Hadoop for scalability is Vowpal Wabbit, which builds models on large datasets without requiring a big data system. Feature selection also comes in handy when one wants to shrink the training data by selecting and retaining the most predictive subset of features; Lasso is a linear algorithm often used for this purpose (a sketch follows below).

With respect to prediction velocity and volume, scaling in volume means being able to handle more data, while scaling in velocity means doing it fast enough for the use case. One also has to weigh the trade-off between speed and accuracy of predictions.
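To make the Lasso-based feature selection concrete, here is a minimal scikit-learn sketch on synthetic data; the alpha value is illustrative and would need tuning on real data:

    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=1000, n_features=100,
                           n_informative=10, random_state=0)

    # The L1 penalty drives coefficients of uninformative features to
    # exactly zero, so the surviving columns are the predictive subset.
    selector = SelectFromModel(Lasso(alpha=1.0))
    X_reduced = selector.fit_transform(X, y)
    print(X.shape, "->", X_reduced.shape)  # far fewer columns to train on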