Friday, June 28, 2013

Large-scale machine learning at Twitter

This paper presents a case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform.

The core idea of the paper is that a machine learning job becomes just another Pig script, which allows seamless integration with existing infrastructure for data management, scheduling, and monitoring in a production environment.

The paper makes three contributions:

  1. Provide an overview of Twitter's analytics stack.
  2. Describe Pig extensions that allow seamless integration of machine learning capabilities into this production platform.
  3. Identify stochastic gradient descent and ensemble methods as being particularly amenable to large-scale machine learning.

Training Models

The paper includes a figure illustrating how learners are integrated into Pig storage functions.

By controlling the number of reducers in the final MapReduce job, we can control the number of models constructed: on the left, a single classifier, and on the right, a two-classifier ensemble.
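
To make that idea concrete, here is a small self-contained Python sketch (not Twitter's code; the sharding scheme, the tiny logistic-regression learner, and the toy data are all illustrative assumptions). Each "reducer" trains its own model on the shard of data it receives, so choosing k reducers yields a k-classifier ensemble that predicts by majority vote.

import math
import random

def train_sgd(instances, dim, epochs=10, lr=0.1):
    # Train one logistic-regression model with SGD on a single shard.
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in instances:  # y is 0 or 1
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

random.seed(0)

def make_instance():
    # Toy data: label is 1 iff x0 + x1 > 1; the leading 1.0 is a bias feature.
    x0, x1 = random.random(), random.random()
    return ((1.0, x0, x1), 1 if x0 + x1 > 1.0 else 0)

data = [make_instance() for _ in range(2000)]

k = 2  # the "number of reducers" in the figure
shards = [[] for _ in range(k)]
for idx, instance in enumerate(data):
    shards[idx % k].append(instance)  # stand-in for the MapReduce shuffle

models = [train_sgd(shard, dim=3) for shard in shards]  # one model per "reducer"

def ensemble_predict(x):
    # Majority vote over the k independently trained classifiers.
    votes = sum(predict(w, x) for w in models)
    return 1 if 2 * votes > len(models) else 0

print(ensemble_predict((1.0, 0.9, 0.8)))  # expect 1
print(ensemble_predict((1.0, 0.1, 0.2)))  # expect 0

Setting k to 1 collapses this back to a single classifier, which is exactly the knob the paper exposes through the reducer count.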


Their machine learning algorithms can be divided into two classes: batch learners and online learners.

Batch learners require all data to be held in memory, and therefore the Pig storage functions wrapping such learners must first internally buffer all training instances before training. 

Online learners have no such restriction: the Pig storage function simply streams through incoming instances, feeds each one to the learner, and then discards it.
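
As a rough Python sketch of the contrast (an assumed interface, not Pig's actual storage-function API): put_next stands in for the per-tuple hook a storage function exposes, and finish for the cleanup step where the trained model would be written out. The batch wrapper must buffer everything before training, while the online wrapper updates the model per instance and keeps no buffer.

import math

class OnlineLogisticLearner:
    # Minimal online learner: one SGD step per incoming instance.
    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.lr = lr

    def update(self, instance):
        x, y = instance
        p = 1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(self.w, x))))
        self.w = [w + self.lr * (y - p) * xi for w, xi in zip(self.w, x)]

class BatchLearnerWriter:
    # Batch style: buffer every instance, train only once all data has arrived.
    def __init__(self, learner, epochs=10):
        self.learner, self.epochs, self.buffer = learner, epochs, []

    def put_next(self, instance):
        self.buffer.append(instance)  # all training data must fit in memory

    def finish(self):
        for _ in range(self.epochs):  # multiple passes over the buffer are possible
            for instance in self.buffer:
                self.learner.update(instance)
        return self.learner

class OnlineLearnerWriter:
    # Online style: feed each instance to the learner immediately, then discard it.
    def __init__(self, learner):
        self.learner = learner

    def put_next(self, instance):
        self.learner.update(instance)  # single pass, constant extra memory

    def finish(self):
        return self.learner

# Usage: stream a few toy instances through the online wrapper.
writer = OnlineLearnerWriter(OnlineLogisticLearner(dim=2))
for instance in [((1.0, 0.2), 0), ((1.0, 0.9), 1), ((1.0, 0.8), 1)]:
    writer.put_next(instance)
print(writer.finish().w)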



Conclusion:
Pig allows the same scripts to scale down seamlessly to individual servers or even laptops by running in "local mode".
The paper also describes how stochastic gradient descent and ensemble methods are applied in this setting.
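
The reason stochastic gradient descent fits the online storage-function model so well is that it only ever needs the current example: for logistic regression with label $y \in \{0, 1\}$, each incoming instance $(x, y)$ triggers the standard per-example update (generic textbook form, not quoted from the paper)

$$w_{t+1} = w_t + \gamma_t \,\bigl(y - \sigma(w_t \cdot x)\bigr)\, x, \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$

where $\gamma_t$ is the learning rate, so each example can be discarded as soon as its update has been applied.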
