Model selection and tuning at scale

March 2016

About us
Owen Zhang
Chief Product Officer @ DataRobot
Former #1 ranked Data Scientist on Kaggle
Former VP, Science @ AIG
Peter Prettenhofer
Software Engineer @ DataRobot
Scikit-learn core developer

Agenda
● Introduction
● Case study: Criteo 1TB
● Conclusion / Discussion

Model Selection
● Estimating the performance of different models in order to choose the best one.
● K-Fold Cross-validation
● The devil is in the detail:
○ Partitioning
○ Leakage
○ Sample size
○ Stacked models require nested layers
[Figure: data partitioned into Train / Validation / Holdout, with folds numbered 1 to 5]
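The partitioning idea on this slide can be sketched in a few lines. This is a minimal illustration, not the presenters' code: the function names `partition` and `cv_splits`, the 20% holdout fraction, and the purely random split are all assumptions (the slide itself warns that partitioning details matter, and a time-based or grouped split may be needed to avoid leakage).

```python
import random

def partition(n_rows, n_folds=5, holdout_frac=0.2, seed=0):
    """Split row indices into a holdout set and n_folds cross-validation folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)          # assumption: rows are exchangeable
    n_holdout = int(n_rows * holdout_frac)
    holdout, rest = idx[:n_holdout], idx[n_holdout:]
    folds = [rest[i::n_folds] for i in range(n_folds)]
    return folds, holdout

def cv_splits(folds):
    """Yield (train, validation) index lists, one pair per fold."""
    for k, valid in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, valid

folds, holdout = partition(100)
splits = list(cv_splits(folds))
print(len(splits), len(holdout))
```

The holdout rows never appear in any fold, so they can be used once, at the end, to check the model chosen by cross-validation.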

Model Complexity & Overfitting

Underfitting or Overfitting?
http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html

Model Tuning
● Optimizing the performance of a model
● Example: Gradient Boosted Trees
○ Number of trees
○ Learning rate
○ Tree depth / Number of leaf nodes
○ Min leaf size
○ Example subsampling rate
○ Feature subsampling rate

Search Space
Number of candidate values per hyperparameter:

Hyperparameter            GBRT (naive)   GBRT   RandomForest
Number of trees                      5      1              1
Learning rate                        5      5              -
Tree depth                           5      5              1
Min leaf size                        3      3              3
Example subsample rate               3      1              1
Feature subsample rate               2      2              5
Total                             2250    150             15
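The totals in the table are the products of the per-hyperparameter counts, since a full grid tries every combination. A short check, with the counts copied from the table (the dictionary keys are illustrative names, not library parameters):

```python
from math import prod

# Number of candidate values per hyperparameter, copied from the table.
grid_sizes = {
    "GBRT (naive)": {"n_trees": 5, "learning_rate": 5, "tree_depth": 5,
                     "min_leaf_size": 3, "example_subsample": 3,
                     "feature_subsample": 2},
    "GBRT": {"n_trees": 1, "learning_rate": 5, "tree_depth": 5,
             "min_leaf_size": 3, "example_subsample": 1,
             "feature_subsample": 2},
    # RandomForest has no learning rate, so that entry is omitted.
    "RandomForest": {"n_trees": 1, "tree_depth": 1, "min_leaf_size": 3,
                     "example_subsample": 1, "feature_subsample": 5},
}

# A full grid evaluates every combination, so the sizes multiply.
totals = {model: prod(sizes.values()) for model, sizes in grid_sizes.items()}
print(totals)  # {'GBRT (naive)': 2250, 'GBRT': 150, 'RandomForest': 15}
```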

Hyperparameter Optimization
● Grid Search
● Random Search
● Bayesian optimization
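As a rough illustration of how random search differs from grid search: instead of enumerating every combination, it draws a fixed number of configurations at random. The search space values and the `toy_score` function below are made up for the sketch; in practice the score would be a cross-validated metric from a real model fit.

```python
import random

# Hypothetical search space; the values are illustrative, not from the slides.
space = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3, 1.0],
    "tree_depth": [2, 4, 6, 8, 10],
    "min_leaf_size": [1, 10, 100],
    "feature_subsample": [0.5, 1.0],
}

def random_search(space, n_iter, score, seed=0):
    """Evaluate n_iter randomly drawn configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

def toy_score(params):
    # Stand-in for a cross-validated metric (higher is better).
    return -abs(params["tree_depth"] - 6) - abs(params["learning_rate"] - 0.1)

best_params, best_score = random_search(space, n_iter=20, score=toy_score)
print(best_params, best_score)
```

The full grid over this space has 5 * 5 * 3 * 2 = 150 combinations; the random search above fits only 20 models, which is the point when each fit is expensive.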

Challenges at Scale
● Why is learning with more data harder?
○ Paradox: more data would let us use more complex models, but computational constraints prevent it*
○ => we need more efficient ways of creating complex models!
● Need to account for the combined cost: model fitting + model selection / tuning
○ Smart hyperparameter tuning tries to decrease the # of model fits
○ … we can accomplish this with fewer hyperparameters too**
* Pedro Domingos, A few useful things to know about machine learning, 2012.
** Practitioners often favor algorithms with few hyperparameters, such as RandomForest or AveragedPerceptron (see http://nlpers.blogspot.co.at/2014/10/hyperparameter-search-bayesian.html)

A case study -- binary classification on 1TB of data
● Criteo click through data
● Downsampled ad impression data covering 24 days
● Fully anonymized dataset:
○ 1 target
○ 13 integer features
○ 26 hashed categorical features
● Experiment setup:
○ Using day 0 - day 22 data for training, day 23 data for testing

Big Data?
Data size:
● ~46GB/day
● ~180,000,000 rows/day
However, it is very imbalanced (even after downsampling non-events)
● ~3.5% event rate
Further downsampling of non-events to a balanced dataset will reduce the size of data to ~70GB
● Will fit into a single node under “optimal” conditions
● Loss of model accuracy is negligible in most situations
Assuming a 0.1% raw event (click-through) rate:
Raw data: 35TB @ 0.1% => Data: 1TB @ 3.5% => Data: 70GB @ 50%
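The three sizes on this slide follow from simple arithmetic if all events are kept and only non-events are dropped. A small sketch, assuming every row takes roughly the same space and 1TB = 1000GB (the function name is hypothetical):

```python
def size_after_downsampling(size, event_rate, target_event_rate):
    """Size after dropping non-events until events make up target_event_rate.

    Assumes all events are kept and every row takes about the same space.
    """
    events = size * event_rate          # volume taken by the events alone
    return events / target_event_rate   # events are now this share of the data

# 35TB of raw data at a 0.1% event rate, downsampled to a 3.5% event rate.
sampled_tb = size_after_downsampling(35.0, 0.001, 0.035)
# 1TB (1000GB) at a 3.5% event rate, downsampled to a balanced 50%.
balanced_gb = size_after_downsampling(1000.0, 0.035, 0.5)

print(f"{sampled_tb:.1f} TB, {balanced_gb:.1f} GB")
```

In both steps the events occupy about 35GB; only the amount of non-event data kept alongside them changes.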

Where to start?
● 70GB (~260,000,000 data points) is still a lot of data
● Let’s take a tiny slice of that to experiment
○ Take 0.25%, then 0.5%, then 1%, and run a grid search on each
[Figure: results plotted against Time (Seconds) for RF, ASVM, Regularized Regression, GBM (with Count), and GBM (without Count); an arrow marks the "Better" direction]

GBM is the way to go, let’s go up to 10% data
[Figure: x-axis: # of Trees; curves labeled by Sample Size / Depth of Tree / Time to Finish]

A “Fairer” Way of Comparing Models
A better model when time is the constraint

Can We Extrapolate?
Where we (can) do better than generic Bayesian optimization

Tree Depth vs Data Size
● A natural heuristic -- increment tree depth by 1 every time data size doubles
[Figure: curves for 1%, 2%, 4%, and 10% data samples]
Optimal Depth = a + b * log(DataSize)
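The doubling heuristic is the special case of this formula with b = 1 when the logarithm is taken base 2. A minimal sketch, where the starting point (depth 6 tuned on a 1% sample) is a made-up example rather than a number from the slides:

```python
import math

def depth_for_sample(data_size, base_size, base_depth):
    """Heuristic: add one level of tree depth each time the data size doubles.

    Equivalent to Optimal Depth = a + b * log2(DataSize) with b = 1 and
    a = base_depth - log2(base_size).
    """
    return base_depth + math.log2(data_size / base_size)

# Hypothetical starting point: depth 6 was tuned on a 1% sample.
for pct in (1, 2, 4, 10):
    print(f"{pct}% sample -> depth {depth_for_sample(pct, 1, 6):.2f}")
```

In practice the result would be rounded to an integer depth, and a and b could instead be fitted to the optimal depths observed on the small samples.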

What about VW?
● Highly efficient online learning algorithm
● Supports adaptive learning rates
● Inherently linear; the user needs to specify non-linear features or interactions explicitly
● 2-way and 3-way interactions can be generated on the fly
● Supports “every k” validation
● The only “tuning” REQUIRED is specification of interactions
○ Thanks to progressive validation, bad interactions can be detected immediately, so no time is wasted on them
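Progressive validation can be illustrated with a toy online learner: each example is scored before the model is updated on it, so the running loss behaves like an out-of-sample estimate that is available from the first pass. The sketch below is not VW code; it uses plain SGD logistic regression with a fixed learning rate (not VW's adaptive updates), and the function name and synthetic data are assumptions made for the example.

```python
import math
import random

def progressive_validation(examples, n_features, lr=0.1, report_every=1000):
    """Online logistic regression that scores each example BEFORE learning from it."""
    w = [0.0] * n_features
    total_loss, history = 0.0, []
    for t, (x, y) in enumerate(examples, start=1):
        z = sum(wi * xi for wi, xi in zip(w, x))
        z = max(-30.0, min(30.0, z))              # keep exp() and log() well-behaved
        p = 1.0 / (1.0 + math.exp(-z))
        # Log loss on an example the model has not been trained on yet.
        total_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        # Only now update the weights with this example (plain SGD step).
        for i, xi in enumerate(x):
            w[i] -= lr * (p - y) * xi
        if t % report_every == 0:
            history.append(total_loss / t)        # running progressive loss
    return w, history

# Synthetic stream: bias term plus two features; the label depends on the first.
rng = random.Random(0)
stream = []
for _ in range(5000):
    x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    stream.append(([1.0, x1, x2], 1 if x1 > 0 else 0))

weights, history = progressive_validation(stream, n_features=3)
print([round(h, 3) for h in history])
```

If a candidate set of interactions is unhelpful, the running loss shows it after the first few reports, so the run can be stopped early.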

Data pipeline for VW
[Diagram labels: Training and Test data; training files T1, T2, ..., Tm; Random Split; T1s, T2s, ..., Tms; Random Shuffle; Concat + Interleave]
It takes longer to prep the data than to run the model!

VW Results
[Figure: VW results on 1% Data, 10% Data, and 100% Data, comparing "Without" against "With Count + Count*Numeric Interaction"]

Putting it All Together
[Figure: time marks at 1 Hour and 1 Day]

Do We Really “Tune/Select Model @ Scale”?
● What we claim we do:
○ Model tuning and selection on big data
● What we actually do:
○ Model tuning and selection on small data
○ Re-run the model and expect/hope that performance and hyperparameters extrapolate as expected
● If you start the model tuning/selection process with GBs (even 100s of MBs) of data, you are doing it wrong!

Some Interesting Observations
● At least for some datasets, it is very hard for a “pure linear” model to outperform (accuracy-wise) a non-linear model, even with much more data
● There is meaningful structure in the hyperparameter space
● When we have limited time (relative to data size), running “deeper” models on a smaller data sample may actually yield better results
● To fully exploit the data, model estimation time is usually at least proportional to n*log(n), and we need models whose # of parameters can scale with the # of data points
○ GBM can have as many parameters as we want
○ So can factorization machines
● For any data and any model we will run into a “diminishing returns” issue as the data gets bigger and bigger
