LEVERAGING OPEN SOURCE AUTOMATED DATA SCIENCE TOOLS
EDUARDO ARIÑO DE LA RUBIA
CHIEF DATA SCIENTIST, DOMINO DATA LAB
EDUARDO@DOMINODATALAB.COM
TWITTER: @EARINO
CONTENTS
WELCOME TO MY DATA POPUP TALK
1. Introduction
2. Some background
3. Tools Available
A nice self-serving way to eat up at least a few minutes of this talk.
INTRODUCTION
PICTURE SLIDE
DATA SCIENTIST
A BIT ABOUT ME
A QUICK TIMELINE
Manufacturing & Logistics
Leveraging Open Source Automated Data Science Tools
Let’s discuss what ML is, what data science is, and make sure we’re all using the same words to mean the same things.
SOME BACKGROUND
WHAT IS MACHINE LEARNING?
"Field of study that gives computers the ability to learn without being explicitly programmed"
FIND A CATEGORY
Detect defects, classify workloads, categorize vendors
KNN, NEURAL NET, ETC.
FIND A NUMBER
Predict yields, decide optimal run rates, predict tolerances
GLM, RIDGE, ETC…
FIND STRUCTURE
Competitive intelligence, understand vendor processes, market segments
KMEANS, KOHONEN SOM
Biology is not the study of microscopes. Microscopes sure make biology a whole lot easier, but they are a tool. ML plays a part in the data science process, but data science is not just applied ML. ML makes data science a whole lot easier, but it is a tool.
ML IS NOT DATA SCIENCE
SO WHAT CAN WE AUTOMATE?
(C) SZILARD PAFKA
So now that we’ve spent some time together, what are some good open source tools we can use?
TOOLS AVAILABLE
ANGRY OLD MAN RANT
Data Science tools are incredibly automated!
We’re in a golden age of data science automation.
It’s really not very long ago that in order to train a model you had to go out to some professor’s FTP server and figure out how to get some library to even compile.
Here are some things we just take for granted that
are now automated…
The original sample is
randomly partitioned into k
equal sized subsamples
CROSS VALIDATION
1
Hyperparameter sweeps
are something that you just
simply had to code by hand
GRID SEARCH
3
Scaling? Centering? Box-Cox? These were things that you had to do by hand, and doing them wrong was bad.
PRE PROCESSING
2
Have you ever used a plotting library which allowed you to facet? That used to be a thing you just had to make by hand
VISUALIZATION
4
Both R and Python now provide multiple feature selection strategies, from RFE to threshold approaches
FEATURE SELECTION
5
This one blows my mind. With tools like h2o’s ensembling, you can literally just build ensembles of learners with 1 line of code.
ENSEMBLING
6
All the interesting problems
are unbalanced class
problems.
balance_classes=TRUE???
CLASS BALANCES
8
This space intentionally left
empty for future
developments
ETC…
3
Oh for goodness’ sake, Google’s Automatic Machine Learning freaking designs entirely new deep learning architectures???
DEEP ARCHITECTURES
9
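To make the "had to code by hand" point concrete, here is a minimal sketch of what the cross validation and grid search items look like without a library: a random k-fold partition plus a hyperparameter sweep. The toy 1-D nearest-neighbour regressor, the data, and every function name are assumptions made for illustration; nothing here comes from the talk or from a specific library.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Randomly partition range(n) into k folds of (nearly) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, x, n_neighbors):
    """Average the targets of the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    return mean(y for _, y in nearest)

def cv_mse(data, n_neighbors, folds):
    """Mean squared error over every held-out point, pooled across folds."""
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [p for i, p in enumerate(data) if i not in held_out]
        for i in fold:
            x, y = data[i]
            errors.append((knn_predict(train, x, n_neighbors) - y) ** 2)
    return mean(errors)

def grid_search(data, grid, k=5, seed=0):
    """Score every candidate on the same folds and return the best one."""
    folds = k_fold_indices(len(data), k, seed)
    scores = {g: cv_mse(data, g, folds) for g in grid}
    best = min(scores, key=scores.get)
    return best, scores

# Toy data (an assumption for the sketch): y = 2x plus Gaussian noise.
rng = random.Random(42)
data = [(x / 10, 2 * x / 10 + rng.gauss(0, 0.3)) for x in range(50)]
best_k, scores = grid_search(data, grid=[1, 3, 5, 9, 15])
print(best_k, scores)
```

Every candidate is scored on the same folds, so the comparison between hyperparameter values is not muddied by a different random split per candidate.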
BUT DON’T FORGET HOW LUCKY WE ARE
Between the massive hardware that is available to us, and the
incredible libraries that have been created by the community,
we’re infinitely more productive than we were just a few years
ago.
But we want even more automation… so let’s talk about some
cool tools :)
WE’RE SPOILED
AUTOMATED DATA SCIENCE IS HUNGRY FOR RESOURCES
FEATURE ENGINEERING
Feature engineering is often considered the dark art of data science. Like
when your differential equations professor told you that you should “stare at
it” until it made sense.
scikit-feature is an open-source feature selection repository in Python developed by the Data Mining and Machine Learning Lab at Arizona State University. It is built upon the widely used machine learning package scikit-learn and two scientific computing packages, NumPy and SciPy. scikit-feature contains around 40 popular feature selection algorithms, including traditional feature selection algorithms and some structural and streaming feature selection algorithms.
SCIKIT FEATURE
SO COOL, RIGHT?
SADLY IT SEEMS TO BE MOSTLY ABANDONED
HELPS MAKE THE SAUSAGE
A 'data.frame' processor/conditioner that prepares real-world data for
predictive modeling in a statistically sound manner. 'vtreat' prepares
variables so that data has fewer exceptional cases, making it easier
to safely use models in production. Common problems 'vtreat'
defends against: 'Inf', 'NA', too many categorical levels, rare
categorical levels, and new categorical levels (levels seen during
application, but not during training).
VTREAT
THERE’S A TON MORE
SO MANY PROBLEMS…
1. Bad numerical values (NA, NaN, sentinels)
2. Categorical values (missing levels, novel levels in production)
3. Categorical values with too many levels
4. Weird skew
vtreat provides “y-aware” processing
1. Treatment of missing values through safe replacement plus indicator column (a simple but very powerful method when combined with downstream machine learning algorithms).
2. Treatment of novel levels (new values of categorical variable seen during test or application, but not seen during training) through sub-models (or impact/effects coding of pooled rare events).
3. Explicit coding of categorical variable levels as new indicator variables (with optional suppression of non-significant indicators).
4. User specified significance pruning on levels coded into effects/impact sub-models
5. Treatment of categorical variables with very large numbers of levels through sub-models
6. Collaring/Winsorizing of unexpected out of range numeric inputs (clipping)
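Three of the ideas in this list (safe replacement plus an indicator column, indicator coding that tolerates novel levels, and collaring) are easy to show in a few lines. The sketch below is plain Python and is not vtreat's API; the function names are hypothetical, and coding a novel level as all zeros is a simplifying assumption standing in for the sub-model treatment described above.

```python
import math
from statistics import mean

def fit_numeric_plan(train_values):
    """Learn a replacement value and collar limits from the usable training values."""
    good = [v for v in train_values if v is not None and math.isfinite(v)]
    return {"fill": mean(good), "low": min(good), "high": max(good)}

def apply_numeric_plan(plan, value):
    """Return (clean_value, is_bad): bad inputs are replaced and flagged, others clipped."""
    if value is None or not math.isfinite(value):
        return plan["fill"], 1
    return min(max(value, plan["low"]), plan["high"]), 0

def fit_levels(train_levels):
    """Remember the categorical levels seen during training."""
    return sorted(set(train_levels))

def apply_levels(levels, value):
    """Indicator-code a level; a level never seen in training becomes all zeros."""
    return [1 if value == lvl else 0 for lvl in levels]

# The plan is fitted on training data only, then applied to anything that arrives later.
plan = fit_numeric_plan([1.0, 2.0, None, float("nan"), 3.0])
print(apply_numeric_plan(plan, float("inf")))  # replaced by the training mean and flagged
print(apply_numeric_plan(plan, 10.0))          # clipped to the largest training value

levels = fit_levels(["a", "b", "a"])
print(apply_levels(levels, "b"))    # known level
print(apply_levels(levels, "zzz"))  # novel level seen only in production
```

The point of the fit/apply split is that production data never changes the plan, which is what keeps a level or value the model has never seen from breaking scoring.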
WARNING
Your data had better be pretty clean!
These automated ML tools are amazing,
but your data needs to be in pretty good
shape. Nice, numerical, no weird missing
values…
So chain them together and use vtreat!
AND…
auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.
auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advances in Bayesian optimization, meta-learning and ensemble construction. Learn more about the technology behind auto-sklearn by reading the paper published at NIPS 2015.
AUTO-SKLEARN
AWARDS
Of additional note, Auto-sklearn won both
the auto and the tweakathon tracks of the
ChaLearn AutoML challenge.
RANDAL OLSON
TPOT will automate the most tedious part of
machine learning by intelligently exploring
thousands of possible pipelines to find the
best one for your data.
Once TPOT is finished searching (or you get
tired of waiting), it provides you with the
Python code for the best pipeline it found so
you can tinker with the pipeline from there.
TPOT CREATOR
Though both projects are open source, written in Python, and aimed at simplifying a machine learning process by way of AutoML, in contrast to Auto-sklearn using Bayesian optimization, TPOT's approach is based on genetic programming.
One of the real benefits of TPOT is that it produces ready-to-run, standalone Python code for the best-performing model, in the form of a scikit-learn pipeline. This code, representing the best performing of all candidate models, can then be modified or inspected for additional insight, effectively being able to serve as a starting point as opposed to solely as an end product.
GENETIC PROGRAMMING
- MATTHEW MAYO, KDNUGGETS.
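For a feel of what an evolutionary search over pipelines involves, here is a toy sketch. It is not TPOT's algorithm: TPOT evolves tree-structured pipelines with genetic programming, while this is a simpler fixed-length genetic algorithm. The search space, the function names, and the stand-in fitness function are all assumptions made for illustration; a real run would train and cross-validate each candidate.

```python
import random

# Hypothetical search space: each pipeline is one choice per stage.
SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "selector": ["all", "top10", "top5"],
    "model": ["tree", "knn", "linear"],
}
STAGES = list(SPACE)

def random_pipeline(rng):
    """Draw one option per stage."""
    return tuple(rng.choice(SPACE[s]) for s in STAGES)

def mutate(pipe, rng):
    """Redraw the option for one randomly chosen stage."""
    i = rng.randrange(len(STAGES))
    new = list(pipe)
    new[i] = rng.choice(SPACE[STAGES[i]])
    return tuple(new)

def crossover(a, b, rng):
    """Take each stage from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(a, b))

def evolve(fitness, generations=10, pop_size=8, seed=0):
    """Keep the best half each generation and refill with mutated offspring."""
    rng = random.Random(seed)
    pop = [random_pipeline(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]  # survivors are carried over unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=fitness), history

def toy_fitness(pipe):
    """Stand-in for a cross-validated score (made-up numbers)."""
    bonus = {"standard": 0.05, "top10": 0.03, "linear": 0.10}
    return 0.7 + sum(bonus.get(choice, 0.0) for choice in pipe)

best, history = evolve(toy_fitness)
print(best, history)
```

Because the best half of each generation survives unchanged, the best score per generation never goes down; the search is not guaranteed to reach the optimum, it just keeps what it has found so far.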
COMING SOON?
Supposedly it is going to take advantage of a lot of the existing infrastructure in h2o, with ensembles in the back end, hyperparameter search, etc…
VERY excited to see what happens next!
AUTOML
The current version of AutoML trains and cross-validates a Random Forest, an
Extremely-Randomized Forest, a random grid of Gradient Boosting Machines
(GBMs), a random grid of Deep Neural Nets, and a Stacked Ensemble of all
the models.
http://tiny.cc/automl
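The stacked ensemble is the least obvious item in that list, so here is a minimal sketch of the idea: base models produce out-of-fold predictions, and a metalearner is fitted on those predictions rather than on the raw features. This is not H2O's implementation. The two base learners, the single blend weight used as a metalearner, and the toy data are all assumptions chosen to keep the example short.

```python
import random
from statistics import mean

def fit_mean(train):
    """Base learner 1: always predict the training mean."""
    m = mean(y for _, y in train)
    return lambda x: m

def fit_line(train):
    """Base learner 2: one-variable least-squares line."""
    mx = mean(x for x, _ in train)
    my = mean(y for _, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sum((x - mx) * (y - my) for x, y in train) / sxx
    return lambda x: my + slope * (x - mx)

def out_of_fold(data, fit, k=5):
    """Predict every point with a model that never saw it during training."""
    preds = [0.0] * len(data)
    for f in range(k):
        train = [p for i, p in enumerate(data) if i % k != f]
        model = fit(train)
        for i in range(f, len(data), k):
            preds[i] = model(data[i][0])
    return preds

def blend_weight(p1, p2, y):
    """Metalearner: w in [0, 1] minimising squared error of w*p1 + (1-w)*p2."""
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    if den == 0:
        return 0.5  # identical predictions, any weight gives the same blend
    num = sum((a - b) * (t - b) for a, b, t in zip(p1, p2, y))
    return min(1.0, max(0.0, num / den))

def mse(pred, y):
    """Mean squared error between predictions and targets."""
    return mean((p - t) ** 2 for p, t in zip(pred, y))

# Toy data (an assumption for the sketch): a noisy line.
rng = random.Random(1)
data = [(x / 10, 1.5 * x / 10 + rng.gauss(0, 0.5)) for x in range(40)]
y = [t for _, t in data]
p_mean = out_of_fold(data, fit_mean)
p_line = out_of_fold(data, fit_line)
w = blend_weight(p_mean, p_line, y)
stacked = [w * a + (1 - w) * b for a, b in zip(p_mean, p_line)]
print(w, mse(p_mean, y), mse(p_line, y), mse(stacked, y))
```

Fitting the metalearner on out-of-fold predictions matters: if the base models had seen the points they are predicting, the blend would reward whichever model overfits the most.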
THANK YOU FOR COMING TO MY TALK
REACH OUT AT EDUARDO@DOMINODATALAB.COM
@EARINO
WE ARE HIRING!
HTTPS://WWW.DOMINODATALAB.COM/CAREERS/
How I Learned to Stop Worrying and Love Linked Data
Domino Data Lab
 
Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...
Domino Data Lab
 
Building Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technologyBuilding Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technology
Domino Data Lab
 
Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...
Domino Data Lab
 
The Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data ScienceThe Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data Science
Domino Data Lab
 
Fuzzy Matching to the Rescue
Fuzzy Matching to the RescueFuzzy Matching to the Rescue
Fuzzy Matching to the Rescue
Domino Data Lab
 
How to Effectively Combine Numerical Features and Categorical Features
How to Effectively Combine Numerical Features and Categorical FeaturesHow to Effectively Combine Numerical Features and Categorical Features
How to Effectively Combine Numerical Features and Categorical Features
Domino Data Lab
 
Building Up Local Models of Customers
Building Up Local Models of CustomersBuilding Up Local Models of Customers
Building Up Local Models of Customers
Domino Data Lab
 
Making Investing A Science
Making Investing A ScienceMaking Investing A Science
Making Investing A Science
Domino Data Lab
 
Ad

Recently uploaded (20)

Ann Naser Nabil- Data Scientist Portfolio.pdf
Ann Naser Nabil- Data Scientist Portfolio.pdfAnn Naser Nabil- Data Scientist Portfolio.pdf
Ann Naser Nabil- Data Scientist Portfolio.pdf
আন্ নাসের নাবিল
 
50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd
emir73065
 
AI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptxAI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptx
AyeshaJalil6
 
Lagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdfLagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdf
benuju2016
 
Automated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptxAutomated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptx
handrymaharjan23
 
Process Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce DowntimeProcess Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce Downtime
Process mining Evangelist
 
hersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distributionhersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distribution
hershtara1
 
Process Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - JourneyProcess Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - Journey
Process mining Evangelist
 
What is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdfWhat is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdf
SaikatBasu37
 
national income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptxnational income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptx
j2492618
 
Mining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - MicrosoftMining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - Microsoft
Process mining Evangelist
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
Lesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdfLesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdf
hemelali11
 
problem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursingproblem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursing
vishnudathas123
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682
way to join real illuminati Agent In Kampala Call/WhatsApp+256782561496/0756664682
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
report (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhsreport (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhs
AngelPinedaTaguinod
 
HershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistributionHershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistribution
hershtara1
 
50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd
emir73065
 
AI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptxAI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptx
AyeshaJalil6
 
Lagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdfLagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdf
benuju2016
 
Automated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptxAutomated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptx
handrymaharjan23
 
Process Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce DowntimeProcess Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce Downtime
Process mining Evangelist
 
hersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distributionhersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distribution
hershtara1
 
What is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdfWhat is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdf
SaikatBasu37
 
national income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptxnational income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptx
j2492618
 
Mining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - MicrosoftMining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - Microsoft
Process mining Evangelist
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
Lesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdfLesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdf
hemelali11
 
problem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursingproblem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursing
vishnudathas123
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
report (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhsreport (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhs
AngelPinedaTaguinod
 
HershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistributionHershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistribution
hershtara1
 

Leveraging Open Source Automated Data Science Tools

  • 1. LEVERAGING OPEN SOURCE AUTOMATED DATA SCIENCE TOOLS Eduardo Ariño de la Rubia, Chief Data Scientist, Domino Data Lab. eduardo@dominodatalab.com Twitter: @earino
  • 2. CONTENTS WELCOME TO MY DATA POPUP TALK 1. Introduction 2. Some background 3. Tools Available
  • 3. A nice self-serving way to eat up at least a few minutes of this talk. INTRODUCTION
  • 8. Let’s discuss what is ML, what is data science, and make sure we’re all using the same words to mean the same things. SOME BACKGROUND
  • 9. WHAT IS MACHINE LEARNING? "Field of study that gives computers the ability to learn without being explicitly programmed." FIND A CATEGORY (KNN, NEURAL NET, ETC.): Detect defects, classify workloads, categorize vendors. FIND A NUMBER (GLM, RIDGE, ETC…): Predict yields, decide optimal run rates, predict tolerances. FIND STRUCTURE (KMEANS, KOHONEN SOM): Competitive intelligence, understand vendor processes, market segments.
  • 10. Biology is not the study of microscopes. Though they sure make biology a whole lot easier, they are a tool. ML plays a part in the data science process, but data science is not just applied ML. ML makes data science a whole lot easier, but it is a tool. ML IS NOT DATA SCIENCE SO WHAT CAN WE AUTOMATE?
  • 16. So now that we’ve spent some time together, what are some good open source tools we can use? TOOLS AVAILABLE
  • 17. ANGRY OLD MAN RANT Data Science tools are incredibly automated! We’re in a golden age of data science automation. It’s really not very long ago that in order to train a model you had to go out into some professor’s FTP server and figure out how to get some library to even compile. Here are some things we just take for granted that are now automated…
  • 18. CROSS VALIDATION (1): The original sample is randomly partitioned into k equal-sized subsamples. PRE PROCESSING (2): Scaling? Centering? Box-Cox? These were things that you had to do by hand, and doing them wrong was bad. GRID SEARCH (3): Hyperparameter sweeps are something that you simply had to code by hand.
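To make the "by hand" part of slide 18 concrete, here is a minimal sketch of what a k-fold partition and a hyperparameter sweep look like without any library support. It uses only the Python standard library; the function names and the `alpha`/`depth` parameters are hypothetical, chosen for illustration and not taken from the talk.

```python
import itertools
import random

def k_fold_indices(n_samples, k, seed=0):
    """Randomly partition range(n_samples) into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]

def train_test_splits(n_samples, k, seed=0):
    """Yield (train, test) index lists, using each fold once as the test set."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def parameter_grid(grid):
    """Expand a dict of candidate values into every combination (a full sweep)."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

# Hypothetical example: 10 samples, 5 folds, and a 2 x 3 grid of settings.
folds = k_fold_indices(10, k=5)
combos = list(parameter_grid({"alpha": [0.1, 1.0], "depth": [2, 4, 8]}))
```

A full search would train and score one model per `(combo, split)` pair and keep the combination with the best average score, which is the loop that modern libraries now run for you.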
  • 19. VISUALIZATION (4): Have you ever used a plotting library that allowed you to facet? That used to be a thing you just had to make by hand. FEATURE SELECTION (5): Both R and Python now provide multiple feature selection strategies, from RFE to threshold approaches. ENSEMBLING (6): This one blows my mind. With tools like h2o’s ensembling, you can literally just build ensembles of learners with 1 line of code.
  • 20. CLASS BALANCES (8): All the interesting problems are unbalanced class problems. balance_classes=TRUE??? ETC… (3): This space intentionally left empty for future developments. DEEP ARCHITECTURES (9): Oh for goodness sakes, Google’s Automatic Machine Learning freaking designs entire new deep learning architectures???
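For a sense of what a flag like `balance_classes` saves you from writing, the sketch below balances a dataset by randomly oversampling the smaller classes. It illustrates the general idea only and is not h2o's implementation; the `oversample` helper and the toy data are assumptions made for this example.

```python
import random
from collections import Counter

def oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Candidate rows to duplicate for this class.
        pool = [row for row, row_label in zip(rows, labels) if row_label == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical data: four normal parts and a single defective one.
rows = [[0.1], [0.2], [0.3], [0.4], [5.0]]
labels = ["ok", "ok", "ok", "ok", "defect"]
balanced_rows, balanced_labels = oversample(rows, labels)
```

Oversampling should be applied to the training folds only; duplicating rows before splitting leaks copies of the same observation into the test set.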
  • 21. BUT DON’T FORGET HOW LUCKY WE ARE Between the massive hardware that is available to us, and the incredible libraries that have been created by the community, we’re infinitely more productive than we were just a few years ago. But we want even more automation… so let’s talk about some cool tools :) WE’RE SPOILED
  • 23. FEATURE ENGINEERING Feature engineering is often considered the dark art of data science. Like when your differential equations professor told you that you should “stare at it” until it made sense.
  • 24. scikit-feature is an open-source feature selection repository in Python developed by Data Mining and Machine Learning Lab at Arizona State University. It is built upon one widely used machine learning package scikit-learn and two scientific computing packages NumPy and SciPy. scikit-feature contains around 40 popular feature selection algorithms, including traditional feature selection algorithms and some structural and streaming feature selection algorithms. SCIKIT FEATURE SO COOL RIGHT
  • 25. SADLY IT SEEMS TO BE MOSTLY ABANDONED
  • 26. HELPS MAKE THE SAUSAGE A 'data.frame' processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. 'vtreat' prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems 'vtreat' defends against: 'Inf', 'NA', too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training). VTREAT
  • 27. THERE’S A TON MORE SO MANY PROBLEMS… 1. Bad numerical values (NA, NaN, sentinels) 2. Categorical values (missing levels, novel levels in production) 3. Categorical values with too many levels 4. Weird skew. vtreat provides “y-aware” processing
  • 28. (1) Treatment of missing values through safe replacement plus indicator column (a simple but very powerful method when combined with downstream machine learning algorithms). (2) Treatment of novel levels (new values of categorical variable seen during test or application, but not seen during training) through sub-models (or impact/effects coding of pooled rare events). (3) Explicit coding of categorical variable levels as new indicator variables (with optional suppression of non-significant indicators).
  • 29. (4) User-specified significance pruning on levels coded into effects/impact sub-models. (5) Treatment of categorical variables with very large numbers of levels through sub-models. (6) Collaring/Winsorizing of unexpected out-of-range numeric inputs (clipping).
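Three of the treatments listed above (safe replacement plus indicator column, collaring, and tolerance of novel levels) can be sketched in a few lines of plain Python. This is a simplified illustration, not vtreat's API: vtreat itself is y-aware and does significance pruning and sub-model coding, none of which appears here. All function names below are hypothetical.

```python
import math

def fit_numeric_treatment(values):
    """Learn the mean and observed range from training values, skipping None/NaN/Inf."""
    clean = [v for v in values if v is not None and math.isfinite(v)]
    return {"mean": sum(clean) / len(clean), "low": min(clean), "high": max(clean)}

def apply_numeric_treatment(value, plan):
    """Return (clean_value, is_bad): bad values get the training mean plus a flag."""
    if value is None or not math.isfinite(value):
        return plan["mean"], 1
    # Collar (Winsorize) values outside the range seen during training.
    return min(max(value, plan["low"]), plan["high"]), 0

def fit_levels(levels):
    """Record the categorical levels seen during training."""
    return sorted(set(levels))

def encode_level(level, known_levels):
    """One indicator per training level; a novel level encodes as all zeros."""
    return [1 if level == known else 0 for known in known_levels]

# Hypothetical training column with a missing value and a NaN.
plan = fit_numeric_treatment([1.0, 2.0, None, 3.0, float("nan")])
known = fit_levels(["b", "a", "b"])
```

The key design point is that the plan is fitted on training data and then applied unchanged in production, so a value or level never seen before yields a well-defined row instead of an error.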
  • 31. WARNING Your data had better be pretty clean! These automated ML tools are amazing, but your data needs to be in pretty good shape. Nice, numerical, no weird missing values… So chain them together and use vtreat!
  • 32. AND… auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator: auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advances in Bayesian optimization, meta-learning and ensemble construction. Learn more about the technology behind auto-sklearn by reading the paper published at NIPS 2015. AUTO-SKLEARN
  • 33. AWARDS Of additional note, Auto-sklearn won both the auto and the tweakathon tracks of the ChaLearn AutoML challenge.
  • 35. RANDAL OLSON TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data. Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there. TPOT CREATOR
  • 36. Though both projects are open source, written in Python, and aimed at simplifying a machine learning process by way of AutoML, in contrast to Auto-sklearn using Bayesian optimization, TPOT's approach is based on genetic programming. One of the real benefits of TPOT is that it produces ready-to-run, standalone Python code for the best-performing model, in the form of a scikit-learn pipeline. This code, representing the best performing of all candidate models, can then be modified or inspected for additional insight, effectively being able to serve as a starting point as opposed to solely as an end product. GENETIC PROGRAMMING - MATTHEW MAYO, KDNUGGETS.
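The evolutionary idea behind this kind of search can be shown with a toy loop: keep a population of candidate pipelines, retain the fitter half, and refill it with mutated crossovers. This is a plain genetic algorithm over a fixed three-step pipeline, a deliberate simplification of the genetic programming TPOT uses; it does not call TPOT, and the search space and `toy_fitness` scoring function are invented for illustration.

```python
import random

# Hypothetical search space: each "pipeline" is one choice per step.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "selector": ["none", "variance", "rfe"],
    "model": ["knn", "glm", "forest"],
}

def random_pipeline(rng):
    return {step: rng.choice(options) for step, options in SEARCH_SPACE.items()}

def mutate(pipeline, rng):
    """Copy the pipeline and re-draw one randomly chosen step."""
    child = dict(pipeline)
    step = rng.choice(sorted(SEARCH_SPACE))
    child[step] = rng.choice(SEARCH_SPACE[step])
    return child

def crossover(a, b, rng):
    """Build a child by taking each step from one of the two parents."""
    return {step: rng.choice([a[step], b[step]]) for step in SEARCH_SPACE}

def evolve(fitness, generations=20, population_size=12, seed=0):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]  # keep the fitter half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

def toy_fitness(pipeline):
    """Stand-in for a cross-validated score; a real run would train models here."""
    target = {"scaler": "standard", "selector": "rfe", "model": "forest"}
    return sum(pipeline[step] == choice for step, choice in target.items())

best = evolve(toy_fitness)
```

Because the fitter half always survives, the best score never decreases from one generation to the next; the expensive part in practice is the fitness call, which is why such searches can run until "you get tired of waiting".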
  • 39. COMING SOON? It is supposedly going to take advantage of a lot of the existing infrastructure in h2o, with ensembles in the back end, hyperparameter search, etc… VERY excited to see what happens next! AUTOML
  • 42. The current version of AutoML trains and cross-validates a Random Forest, an Extremely-Randomized Forest, a random grid of Gradient Boosting Machines (GBMs), a random grid of Deep Neural Nets, and a Stacked Ensemble of all the models. http://tiny.cc/automl
  • 44. THANK YOU FOR COMING TO MY TALK Reach out at eduardo@dominodatalab.com, @earino. We are hiring! https://www.dominodatalab.com/careers/