Automated Machine Learning (AutoML)
By Hayim Makabee
November 2019
Automated Machine Learning
Automated Machine Learning (AutoML)
systems find the right algorithm and
hyperparameters in a data-driven way
without any human intervention.
AutoML Benefits
AutoML allows data scientists to increase their productivity
without adding more members to the data science team.
AutoML addresses the skills gap
between the demand for data
science talent and the
availability of this talent.
Olson Experiment on Parameter Tuning
The experiment used 165 classification data sets from a variety of sources
and 13 different classification algorithms from scikit-learn.
It compared the classification accuracy of each algorithm with default
parameters to that of a tuned version of the same algorithm.
On average, tuning improved classification accuracy by 5–10% over the
default parameters.
However, no single parameter combination works best for all problems.
Tuning is therefore needed to realize this improvement, and it is
built into all AutoML solutions.
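The comparison of default and tuned parameters can be illustrated on a single data set. The following is a minimal sketch, not the Olson benchmark itself: the data set (scikit-learn's bundled breast cancer data), the choice of SVC, and the small parameter grid are all assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Baseline: 5-fold cross-validated accuracy with scikit-learn's default parameters
default_score = cross_val_score(SVC(), X, y, cv=5).mean()

# Tuned: search a small, illustrative grid that also contains the defaults
# (C=1.0, gamma="scale"), so the tuned score cannot be worse than the baseline
param_grid = {"C": [0.1, 1.0, 10.0, 100.0], "gamma": ["scale", 1e-4, 1e-3]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(f"default accuracy: {default_score:.3f}")
print(f"tuned accuracy:   {search.best_score_:.3f} with {search.best_params_}")
```

Because the best grid point is selected on the same folds it is scored on, the tuned figure is somewhat optimistic; a held-out test set would give a fairer comparison.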
Example: Learning Rate
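The figure from this slide is not available here. As a stand-in, the sketch below shows why the learning rate is a hyperparameter worth tuning: plain gradient descent on f(x) = x**2 with three assumed learning rates, one too small, one reasonable, and one too large. The function, starting point, and step count are illustrative choices, not taken from the slide.

```python
def gradient_descent(learning_rate, start=10.0, steps=20):
    """Minimize f(x) = x**2 with plain gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2.0 * x          # derivative of x**2
        x -= learning_rate * gradient
    return x

# Three illustrative (assumed) learning rates: too small, reasonable, too large
results = {lr: gradient_descent(lr) for lr in (0.01, 0.4, 1.1)}
for lr, final_x in results.items():
    print(f"learning_rate={lr}: final x = {final_x:.4g}")
```

With the small rate the iterate is still far from the minimum at 0 after 20 steps, with the moderate rate it is essentially at 0, and with the large rate each step overshoots and the iterate diverges.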
Bayesian Optimization
Bayesian Optimization for Hyperparameter Selection
Build a probabilistic model to capture the relationship
between hyperparameter settings and their
performance.
Use the model to select useful hyperparameter
settings to try next by trading off exploration
(searching in parts of the space where the model is
uncertain) and exploitation (focusing on parts of the
space predicted to perform well).
Run the machine learning algorithm with those
hyperparameter settings, measure the performance
and update the probabilistic model.
Bayesian Optimization Algorithm
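The three steps of Bayesian optimization (build a model, choose the next setting, evaluate and update) can be sketched in a few lines. This is a toy version under stated assumptions: a single hyperparameter in [0, 1], a Gaussian process with a fixed RBF kernel as the probabilistic model, an upper-confidence-bound rule for the exploration/exploitation trade-off, and a made-up `validation_score` function in place of real model training. AutoML tools use more elaborate models and acquisition functions.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.15):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_query."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_oq = rbf_kernel(x_obs, x_query)
    solve = np.linalg.solve(k_oo, k_oq)            # K^-1 k(x_obs, x_query)
    mean = solve.T @ y_obs
    var = 1.0 - np.sum(k_oq * solve, axis=0)       # prior variance is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))

def validation_score(h):
    """Hypothetical stand-in for 'train a model with hyperparameter h and score it'."""
    return -(h - 0.7) ** 2

def bayesian_optimization(n_iter=15, kappa=2.0):
    grid = np.linspace(0.0, 1.0, 101)              # candidate hyperparameter values
    x_obs = np.array([0.0, 1.0])                   # two initial evaluations
    y_obs = validation_score(x_obs)
    for _ in range(n_iter):
        # Step 1: probabilistic model of hyperparameter -> performance
        mean, std = gp_posterior(x_obs, y_obs, grid)
        # Step 2: upper confidence bound trades off exploitation (mean)
        # against exploration (std)
        next_h = grid[np.argmax(mean + kappa * std)]
        # Step 3: run the evaluation and add the result to the model's data
        x_obs = np.append(x_obs, next_h)
        y_obs = np.append(y_obs, validation_score(next_h))
    return x_obs[np.argmax(y_obs)], x_obs

best_h, tried = bayesian_optimization()
print(f"best hyperparameter found: {best_h:.2f}")
```

The parameter `kappa` (an assumed name) controls the trade-off: larger values favor uncertain regions, smaller values favor regions already predicted to perform well.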
Auto-sklearn
Auto-sklearn is open source, implemented
in Python, and built around the scikit-learn
library.
It contains a machine learning pipeline
which takes care of missing values,
categorical features, sparse and dense data,
and rescaling the data.
Next, the pipeline applies a preprocessing
algorithm and an ML algorithm.
Generalizing the Bayesian Algorithm
Bayesian Optimization can be generalized to jointly select algorithms,
preprocessing methods, and their hyperparameters as follows:
• The choices of classifier / regressor and preprocessing methods are
top-level, categorical hyperparameters, and based on their settings the
hyperparameters of the selected methods become active.
• The combined space can then be searched with Bayesian optimization
methods that handle such high-dimensional, conditional spaces.
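The conditional structure described in the two bullets above can be made concrete with a small sketch. The search space below is hypothetical and heavily reduced (the component names, hyperparameters, and ranges are illustrative), and uniform random sampling stands in for the Bayesian optimizer; the point is only that a top-level choice decides which hyperparameters are active.

```python
import random

# Hypothetical, heavily reduced search space: each top-level choice activates
# only its own hyperparameters (names and ranges are illustrative)
SEARCH_SPACE = {
    "preprocessor": {
        "no_preprocessing": {},
        "pca": {"keep_variance": (0.5, 0.9999)},
    },
    "classifier": {
        "random_forest": {"max_features": (0.1, 1.0), "min_samples_leaf": (1, 20)},
        "k_nearest_neighbors": {"n_neighbors": (1, 100)},
    },
}

def sample_configuration(space, rng):
    """Draw one configuration: pick each top-level choice, then sample only
    the hyperparameters that this choice makes active."""
    config = {}
    for component, choices in space.items():
        choice = rng.choice(sorted(choices))
        config[f"{component}:__choice__"] = choice
        for name, (low, high) in choices[choice].items():
            if isinstance(low, int) and isinstance(high, int):
                value = rng.randint(low, high)      # integer-valued hyperparameter
            else:
                value = rng.uniform(low, high)      # real-valued hyperparameter
            config[f"{component}:{choice}:{name}"] = value
    return config

rng = random.Random(0)
config = sample_configuration(SEARCH_SPACE, rng)
for key, value in config.items():
    print(key, "=", value)
```

The key layout (`classifier:__choice__`, `classifier:random_forest:max_features`) imitates the configuration dictionaries shown in the result slides later in this deck.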
Hyperparameters
Auto-sklearn includes 15 ML algorithms, 14
preprocessing methods, and all their respective
hyperparameters, yielding a total of 110
hyperparameters.
Meta-learning
Optimizing performance in Auto-sklearn’s space
of 110 hyperparameters can of course be slow.
To jumpstart this process, it uses meta-learning
to start from hyperparameter settings that
worked well on similar previous datasets.
Specifically, Auto-sklearn comes with a
database of previous optimization runs on 140
diverse datasets from OpenML.
For a new dataset, it first identifies the most
similar datasets and starts from the saved best
settings for those.
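The warm-start idea can be sketched as a nearest-neighbor lookup over dataset meta-features. Everything in this example is hypothetical: the three meta-features, the stored runs, and the configurations are made up for illustration and are not Auto-sklearn's actual meta-features or database.

```python
import math

# Hypothetical meta-knowledge base: meta-features of earlier datasets and the
# best configuration found for each (all values are made up for illustration)
PREVIOUS_RUNS = [
    {"meta": {"log_n_samples": 3.0, "log_n_features": 1.0, "class_ratio": 0.50},
     "best_config": {"classifier": "k_nearest_neighbors", "n_neighbors": 5}},
    {"meta": {"log_n_samples": 5.0, "log_n_features": 2.0, "class_ratio": 0.10},
     "best_config": {"classifier": "random_forest", "min_samples_leaf": 20}},
    {"meta": {"log_n_samples": 4.0, "log_n_features": 3.0, "class_ratio": 0.45},
     "best_config": {"classifier": "linear_svm", "C": 1.0}},
]

def distance(meta_a, meta_b):
    """Euclidean distance between two meta-feature dictionaries."""
    return math.sqrt(sum((meta_a[k] - meta_b[k]) ** 2 for k in meta_a))

def warm_start_configs(new_meta, previous_runs, k=2):
    """Return the best configurations of the k most similar earlier datasets."""
    ranked = sorted(previous_runs, key=lambda run: distance(new_meta, run["meta"]))
    return [run["best_config"] for run in ranked[:k]]

new_dataset_meta = {"log_n_samples": 4.8, "log_n_features": 2.1, "class_ratio": 0.12}
initial_configs = warm_start_configs(new_dataset_meta, PREVIOUS_RUNS)
print(initial_configs)
```

The returned configurations would be evaluated first, before the optimizer starts proposing new settings of its own.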
Ensemble Selection
• Instead of returning a single hyperparameter setting,
Auto-sklearn automatically constructs ensembles from
the models trained during the Bayesian optimization.
• Specifically, Auto-sklearn uses Ensemble Selection to
create small, powerful ensembles with increased
predictive power and robustness.
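A common form of ensemble selection is greedy forward selection with replacement: starting from an empty ensemble, repeatedly add the model that most improves the validation score of the averaged predictions. The sketch below follows that scheme under simplifying assumptions; the three models' validation probabilities are made up, accuracy is used as the metric, and the function names are illustrative rather than Auto-sklearn's own.

```python
def accuracy(probabilities, labels):
    """Fraction of examples where thresholding at 0.5 matches the label."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probabilities, labels))
    return hits / len(labels)

def average(prediction_lists):
    """Element-wise mean of several lists of predicted probabilities."""
    return [sum(column) / len(column) for column in zip(*prediction_lists)]

def ensemble_selection(model_predictions, labels, ensemble_size=5):
    """Greedy forward selection with replacement: at each step add the model
    whose inclusion gives the best validation accuracy of the averaged ensemble."""
    chosen = []
    for _ in range(ensemble_size):
        best_name, best_score = None, -1.0
        for name, preds in model_predictions.items():
            candidate = [model_predictions[n] for n in chosen] + [preds]
            score = accuracy(average(candidate), labels)
            if score > best_score:
                best_name, best_score = name, score
        chosen.append(best_name)
    # A model's weight is the fraction of ensemble slots it occupies
    return {name: chosen.count(name) / len(chosen) for name in set(chosen)}

# Hypothetical validation-set probabilities from three already-trained models
labels = [1, 0, 1, 1, 0, 0]
model_predictions = {
    "random_forest":       [0.9, 0.2, 0.4, 0.8, 0.3, 0.6],
    "k_nearest_neighbors": [0.6, 0.4, 0.7, 0.3, 0.1, 0.2],
    "gradient_boosting":   [0.7, 0.6, 0.6, 0.9, 0.4, 0.4],
}
weights = ensemble_selection(model_predictions, labels)
print(weights)
```

The resulting weights play the same role as the leading numbers (such as 0.52 and 0.02) in the ensemble output shown in the result slides.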
Winning the AutoML challenge
The ChaLearn AutoML challenge was a machine
learning competition.
Auto-sklearn placed in the top three for nine out of
ten phases and won six of them.
In particular, in the last two phases Auto-sklearn won
both the auto track and the tweakathon.
During the last two phases of the tweakathon the
team combined Auto-sklearn with Auto-Net for
several datasets to further boost performance.
Auto-sklearn Example
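import autosklearn.classification
import sklearn.model_selection

# X, y: feature matrix and labels, loaded beforehand; hold out 30% for testing
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.3)
# Search for up to one hour in total, at most six minutes per candidate model
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600, per_run_time_limit=360)
automl.fit(X_train, y_train)
# Show the models that make up the final ensemble
print(automl.show_models())
predictions = automl.predict(X_test)
# Probability of the positive class (binary classification)
probabilities = automl.predict_proba(X_test)[:, 1]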
Result = Ensemble
(0.520000, SimpleClassificationPipeline({'balancing:strategy': 'weighting',
'categorical_encoding:__choice__': 'one_hot_encoding',
'classifier:__choice__': 'random_forest', 'imputation:strategy':
'most_frequent', 'preprocessor:__choice__': 'no_preprocessing',
'rescaling:__choice__': 'quantile_transformer',
'categorical_encoding:one_hot_encoding:use_minimum_fraction': 'True',
'classifier:random_forest:bootstrap': 'True',
'classifier:random_forest:criterion': 'entropy',
'classifier:random_forest:max_depth': 'None',
'classifier:random_forest:max_features': 0.7884268823432835,
'classifier:random_forest:max_leaf_nodes': 'None',
'classifier:random_forest:min_impurity_decrease': 0.0,
'classifier:random_forest:min_samples_leaf': 20,
'classifier:random_forest:min_samples_split': 15,
'classifier:random_forest:min_weight_fraction_leaf': 0.0,
'classifier:random_forest:n_estimators': 100,
'rescaling:quantile_transformer:n_quantiles': 1000,
'rescaling:quantile_transformer:output_distribution': 'uniform',
'categorical_encoding:one_hot_encoding:minimum_fraction':
0.002615346832354839}
Result = Ensemble
(0.020000, SimpleClassificationPipeline({'balancing:strategy': 'none',
'categorical_encoding:__choice__': 'no_encoding', 'classifier:__choice__':
'k_nearest_neighbors', 'imputation:strategy': 'mean',
'preprocessor:__choice__': 'no_preprocessing', 'rescaling:__choice__':
'standardize', 'classifier:k_nearest_neighbors:n_neighbors': 1,
'classifier:k_nearest_neighbors:p': 2,
'classifier:k_nearest_neighbors:weights': 'uniform'}
Performance – Impact of Time
Run                     Accuracy  Precision  Recall  ROC AUC
AutoML 20 Minutes Run   0.89      0.89       1.0     0.61
AutoML 60 Minutes Run   0.89      0.90       0.99    0.72
Performance – Over-Fitting
Run                     Accuracy  Precision  Recall  ROC AUC
AutoML 60 Minutes Run   0.89      0.90       0.99    0.72
AutoML 120 Minutes Run  0.87      0.91       0.95    0.70
Performance – AutoML vs. Non-AutoML (train data 1)
Model                   Accuracy  Precision  Recall  ROC AUC
AutoML 60 Minutes Run   0.89      0.90       0.99    0.72
XGBoost                 0.89      0.90       0.99    0.71
Performance – AutoML vs. Non-AutoML (train data 2)
Model                   Accuracy  Precision  Recall  ROC AUC
AutoML 60 Minutes Run   0.73      0.69       0.56    0.79
XGBoost                 0.66      0.61       0.42    0.62
Performance – Balance
Class ratio               Accuracy  Precision  Recall  ROC AUC
Negative = 8 × Positive   0.99      1.0        0.91    0.97
Negative = 20 × Positive  0.99      0.99       0.76    0.94
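The figures above show accuracy staying at 0.99 while recall drops as the classes become more imbalanced. The arithmetic behind that pattern can be worked through with confusion-matrix counts. The counts below are hypothetical: they assume 100 positive examples and were chosen only to be consistent with the reported rates; they are not the author's data.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return round(accuracy, 2), round(precision, 2), round(recall, 2)

# Hypothetical counts with 100 positives each, chosen only to be consistent
# with the rates on the slide; they are not the author's data
ratio_8 = metrics(tp=91, fp=0, fn=9, tn=800)      # negatives = 8 x positives
ratio_20 = metrics(tp=76, fp=1, fn=24, tn=1999)   # negatives = 20 x positives
print(ratio_8)    # (0.99, 1.0, 0.91)
print(ratio_20)   # (0.99, 0.99, 0.76)
```

In the second case 24 of 100 positives are missed, yet accuracy barely moves because the 1999 correctly classified negatives dominate the total. This is why recall and ROC AUC are more informative than accuracy on imbalanced data.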
TPOT
• TPOT = Tree-based Pipeline Optimization Tool
• TPOT is a Python Automated Machine Learning tool that
optimizes machine learning pipelines using Genetic Algorithms.
TPOT uses Genetic Algorithms to
find the best ML model and
hyperparameters based on the
training / validation set.
The model options include many of the
algorithms implemented in the
scikit-learn library.
Parameters include population size
and number of generations to run
the Genetic Algorithm.
TPOT Genetic Algorithm
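The role of population size and number of generations can be seen in a minimal genetic algorithm. This sketch is deliberately much simpler than TPOT, which evolves whole pipelines: here each individual is a flat configuration with two hypothetical hyperparameters, and `fitness` is a made-up stand-in for a validation score.

```python
import random

def fitness(config):
    """Hypothetical stand-in for a pipeline's validation score (higher is better);
    it peaks at max_depth=6 and learning_rate=0.1."""
    return -abs(config["max_depth"] - 6) - 10 * abs(config["learning_rate"] - 0.1)

def random_config(rng):
    return {"max_depth": rng.randint(1, 12), "learning_rate": rng.uniform(0.01, 1.0)}

def crossover(parent_a, parent_b, rng):
    """Child takes each hyperparameter from one of the two parents."""
    return {key: rng.choice((parent_a[key], parent_b[key])) for key in parent_a}

def mutate(config, rng, rate=0.3):
    """With probability `rate`, resample each hyperparameter."""
    fresh = random_config(rng)
    return {k: (fresh[k] if rng.random() < rate else v) for k, v in config.items()}

def genetic_search(population_size=20, generations=10, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the fitter half unchanged (elitism)
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        # Variation: refill the population with mutated crossovers of survivors
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness)

best = genetic_search()
print(best, fitness(best))
```

Because the fitter half always survives, the best fitness never decreases from one generation to the next; larger populations and more generations cost more evaluations but explore more of the space.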
Practical Questions
Can we really move AutoML from the Lab to
production environments?
What would be the latency of using an
Ensemble of models in production?
Would the AutoML training time be prohibitive
for big datasets?
I think we need Incremental AutoML, in which
the previous model (together with new data)
serves as an input to find the next best model.
My personal experience: Semi-AutoML at Yahoo Labs
Finite (large) number of manually pre-defined
model configurations (hyperparameters).
Incremental Learning: previous model was used
as input for training new models.
Used Hadoop Map-Reduce: each Reducer used
one configuration, trained a model and
measured its performance (parallel training).
The model with best performance was chosen
for deployment.
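The scheme described above (one pre-defined configuration per worker, incremental training from the previous model, best model deployed) can be sketched on a single machine. This is only an analogy: a thread pool stands in for the Hadoop reducers, and the configurations, the one-parameter "model", and the update rule in `train_and_score` are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, manually pre-defined configurations (one per "reducer")
CONFIGURATIONS = [
    {"learning_rate": 0.01, "regularization": 0.0},
    {"learning_rate": 0.1, "regularization": 0.1},
    {"learning_rate": 0.5, "regularization": 1.0},
]

def train_and_score(config, previous_model):
    """Placeholder for incremental training: start from the previous model's
    weight, apply one made-up update, and return (score, model, config)."""
    weight = previous_model["weight"]
    target = 2.0                                   # made-up optimum
    gradient = 2 * (weight - target) + config["regularization"] * weight
    new_weight = weight - config["learning_rate"] * gradient
    score = -(new_weight - target) ** 2            # higher is better
    return score, {"weight": new_weight}, config

def select_best_model(configurations, previous_model):
    # Each configuration is trained independently, so the work parallelizes
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda c: train_and_score(c, previous_model),
                                configurations))
    # Deploy the model with the best measured performance
    return max(results, key=lambda result: result[0])

score, model, config = select_best_model(CONFIGURATIONS, {"weight": 0.0})
print(config, score)
```

On the next round, the returned `model` would be passed back in as `previous_model` together with the new data, which is the incremental part of the approach.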
What next? My personal opinion
Automated ML will not replace the Data Scientist but
will enable the Data Scientist to produce more models
in less time with higher quality.
This is probably the end of “good enough” models that use
standard parameters only because the Data Scientist did not
have time to check different ones.
The main advantage is not saving time. The main
benefit is doing things that were never done because of
lack of time.
Data scientists will have more time to collaborate with
business experts to get domain knowledge and use it in
feature engineering.
Thanks!
Questions?
Comments?
