OpenML
O P E N , A U T O M A T E D M A C H I N E L E A R N I N G
J O A Q U I N V A N S C H O R E N , T U / E
@open_ml www.openml.org
OpenML
You can be part of this presentation :)
Follow the code examples:
• On Google Colab: goo.gl/VwbKb4
• On Github: https://git.io/fA3eL
J O A Q U I N V A N S C H O R E N , T U / E @open_ml www.openml.org
World-wide telescope
Networked science
(Not so)
Automatic
Machine Learning
It’s hard to find and learn from prior machine learning data
(Auto)ML: manual work, unnecessary friction
Hard to find and reuse prior results
No standards / hubs for sharing and organizing results
Scattered, ill-described datasets
Manual searching, reformatting, making assumptions
Hard to automate model building end-to-end
Requires automated data organization, clean APIs,…
Myriad algorithms, versions, languages
Write code, set up experiments, store results,…
Reproducibility is hard
Manually tracking every detail is error-prone
Easy to use: Integrated in many ML tools/environments
Easy to contribute: Automated sharing of data, code, results
Organized data: APIs to find & reuse data, models, experiments
Reward structure: Track your impact, build reputation
Self-learning: Learn from many experiments to help people
OpenML
S H A R E A N D R E U S E
M A C H I N E L E A R N I N G D A T A O N L I N E
www.openml.org
OpenML: Components
Flows: Pipelines/code that build ML models
Run locally (or wherever), auto-upload all results
Datasets: Auto-annotated, organized, well-formatted
Find the datasets you need, share your own
Tasks: Auto-generated, machine-readable
Everyone’s results are directly comparable
Runs: All results from running flows on tasks
All details needed for tracking and reproducibility
Evaluations can be queried, compared, reused
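For orientation, a minimal sketch of how these four components map onto the openml-python API (the same calls appear in the examples later in this deck):

import openml as oml
from sklearn import tree

# Datasets: download a dataset by id (1471 = eeg-eye-state, used below)
dataset = oml.datasets.get_dataset(1471)

# Tasks: download a machine-readable task (data + goal + evaluation procedure)
task = oml.tasks.get_task(14951)

# Flows + Runs: run a scikit-learn model on the task and publish all results
clf = tree.ExtraTreeClassifier()
flow = oml.flows.sklearn_to_flow(clf)
run = oml.runs.run_flow_on_task(task, flow)
run.publish()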
It starts with data
Data (tabular) easily uploaded or referenced (URL)
It starts with data
Data can remain in existing repositories
-> registered via URL, transparent to users
interoperability
For now: only tabular data
-> ARFF or CSV import (auto-annotate features)
-> FrictionlessData support in the works
auto-versioned, analysed, organised online
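A hedged sketch of what registering your own tabular data could look like with openml-python's create_dataset helper; the exact keyword arguments differ between client versions, so treat this as an illustration rather than the definitive API:

import openml as oml
import pandas as pd

# Hypothetical toy dataframe; 'label' is the target column.
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0],
                   "x2": [0.1, 0.2, 0.3, 0.4],
                   "label": ["a", "b", "a", "b"]})

new_data = oml.datasets.create_dataset(
    name="my-toy-dataset",
    description="Toy dataset registered via the API (illustration only).",
    creator=None, contributor=None, collection_date=None,
    language="English", licence="CC0",
    attributes="auto",                     # infer feature types from the dataframe
    data=df,
    default_target_attribute="label",
    ignore_attribute=None, citation=None)
new_data.publish()                         # uploads, versions and analyses it online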
Search (API)
import	openml	as	oml	
openml_list	=	oml.datasets.list_datasets()
Python, R, Java, C#
Search (API)
import	pandas	as	pd	
datalist = pd.DataFrame.from_dict(openml_list, orient='index')
datalist[datalist.NumberOfInstances>10000	
										].sort_values(['NumberOfInstances'])
Python, R, Java, C#
Search (API)
datalist.query('name	==	"eeg-eye-state"')
Python, R, Java, C#
data id
Get (API)
dataset	=	oml.datasets.get_dataset(1471)	
dataset.description[:500]	
Python, R, Java, C#
Get (API)
X, y, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute,  # request the target explicitly to also get y
    return_attribute_names=True)
Python, R, Java, C#
eeg	=	pd.DataFrame(X,	columns=attribute_names)	
eeg['class']	=	y
Get (API)
eeg.plot()	
pd.DataFrame(y).plot()	
Python, R, Java, C#
Fit (API) Python, R, Java, C#
from	sklearn	import	neighbors	
clf	=	neighbors.KNeighborsClassifier()	
clf.fit(X,	y)
Complete code to build a model,
automatically, anywhere
import openml as oml
from sklearn import neighbors

dataset = oml.datasets.get_dataset(1471)
# Request the target explicitly so we get both X and y.
X, y = dataset.get_data(target=dataset.default_target_attribute)
clf = neighbors.KNeighborsClassifier()
clf.fit(X, y)
Tasks contain data, goals, procedures.
Auto-build + evaluate models correctly
All evaluations are directly comparable
optimize accuracy
Predict target T
Tasks
benchmarking and collaboration
10-fold Xval
10-fold Xval
Predict target T
Collaborate in real time online
optimize accuracy
Search
task_list	=	oml.tasks.list_tasks(size=5000)
task id
Search
mytasks = pd.DataFrame.from_dict(task_list, orient='index')
mytasks.query('name=="eeg-eye-state"')
task id
Get
task	=	oml.tasks.get_task(14951)
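A short sketch of what the downloaded task object carries (openml-python; method names as in recent client versions), so that everyone evaluates on exactly the same splits:

# Features and target as defined by the task, plus the shared CV splits.
X, y = task.get_X_and_y()
n_repeats, n_folds, _ = task.get_split_dimensions()
train_idx, test_idx = task.get_train_test_split_indices(repeat=0, fold=0)
print(task.estimation_procedure["type"], "-", n_folds, "folds")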
Auto-run algorithms/workflows on any task
Integrated in many machine learning tools (+ APIs)
Flows
Run experiments locally, share them globally
Integrated in many machine learning tools (+ APIs)
import	openml	as	oml	
from	sklearn	import	tree	
task	=	oml.tasks.get_task(14951)	
clf	=	tree.ExtraTreeClassifier()	
flow	=	oml.flows.sklearn_to_flow(clf)	
run	=	oml.runs.run_flow_on_task(task,	flow)	
myrun	=	run.publish()
Fit and share (complete code)
Uploaded to http://www.openml.org/r/9204488
import	openml	as	oml	
from	sklearn	import	tree	
task	=	oml.tasks.get_task(14951)	
clf	=	tree.ExtraTreeClassifier()	
flow	=	oml.flows.sklearn_to_flow(clf)	
run	=	oml.runs.run_flow_on_task(task,	flow)	
myrun	=	run.publish()
Fit and share pipelines
Uploaded to http://www.openml.org/r/7943199
import openml as oml
from sklearn import pipeline, ensemble, preprocessing
from openml import tasks, runs, datasets

task = tasks.get_task(59)
pipe = pipeline.Pipeline(steps=[
            ('Imputer', preprocessing.Imputer()),
            ('OneHotEncoder', preprocessing.OneHotEncoder()),
            ('Classifier', ensemble.RandomForestClassifier())
        ])
flow = oml.flows.sklearn_to_flow(pipe)
run = oml.runs.run_flow_on_task(task, flow)
myrun = run.publish()
Fit and share deep learning models
import keras
from keras.models import Sequential
from keras.layers import (Dense, Dropout, Flatten, Conv2D,
                          MaxPooling2D, Reshape)
from keras.layers.core import Activation

model = Sequential()
model.add(Reshape((28, 28, 1), input_shape=(784,)))
model.add(Conv2D(20, (5, 5), padding="same", input_shape=(28, 28, 1),
          activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Conv2D(50, (5, 5), padding="same", activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Flatten())
model.add(Dense(500))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
Fit and share deep learning models
Uploaded to https://www.openml.org/r/9204337
task	=	tasks.get_task(3573)	#MNIST	
flow	=	oml.flows.keras_to_flow(model)	
run	=	oml.runs.run_flow_on_task(task,	flow)	
myrun	=	run.publish()
Compare to state-of-the-art
reproducible, linked to data, flows, authors
and all other experiments
Experiments auto-uploaded, evaluated online
Runs
Share and reuse results
Experiments auto-uploaded, evaluated online
Download, reuse runs
import seaborn as sns

myruns = oml.runs.list_runs(task=[14951])
# Per-run scores come from the evaluation listing ('value' = predictive accuracy).
evals = oml.evaluations.list_evaluations(function="predictive_accuracy",
                                         task=[14951], output_format="dataframe")
sns.violinplot(x="value", y="flow_name", data=evals)
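Individual runs can also be downloaded with every detail needed to reproduce them; a small sketch, assuming recent openml-python attribute names (9204488 is the run id uploaded earlier in this deck):

run = oml.runs.get_run(9204488)
print(run.flow_id, run.task_id)                     # which flow ran on which task
print(run.evaluations.get("predictive_accuracy"))   # server-side evaluation
print(run.parameter_settings[:3])                   # hyperparameters as uploaded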
Open and Automated Machine Learning
Publishing, impact tracking
OpenML Community
5600+ registered users,
120000+ yearly users
A U T O M L : M E T A - L E A R N I N G
• Find similar datasets (see the sketch below)
• 20,000+ versioned datasets, with 130+ meta-features
• Instead of starting from scratch, start from configurations
that worked well on similar datasets
• Auto-sklearn (AutoML challenge winner, NIPS 2016)
• Lookup similar datasets, start with best pipelines
Matthias Feurer et al. (2016) NIPS
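A toy sketch of the "find similar datasets" idea, using the dataset qualities OpenML already computes (only a few meta-features are used here, and they should be scaled in practice):

import openml as oml
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Dataset listing with simple meta-features (qualities).
datasets = pd.DataFrame.from_dict(oml.datasets.list_datasets(), orient="index")
meta = datasets[["did", "NumberOfInstances", "NumberOfFeatures",
                 "NumberOfClasses"]].dropna()

# Nearest neighbours in meta-feature space = "similar datasets";
# warm-start AutoML with the pipelines that did best on those neighbours.
nn = NearestNeighbors(n_neighbors=5).fit(meta.drop(columns="did"))
_, idx = nn.kneighbors(meta.drop(columns="did").iloc[[0]])
print(meta.iloc[idx[0]]["did"].tolist())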
A U T O M L : M E T A - L E A R N I N G
• Reuse (millions of) prior model evaluations:
• Benchmark new algorithms against state-of-the-art
• Meta-models: e.g. predict performance or training time (see the sketch below)
• MIT AutoML system (ICBD 2017)
• Uses and compares against OpenML results
• Runtime prediction
• Faster TPOT (in progress)
• Build meta-models (Random Forest works well)
• Focus on fast configurations first
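A sketch of such a meta-model, assuming openml-python's evaluation listing (the flow id 5891 is only an illustrative value): predict a flow's score on a dataset from simple meta-features.

import openml as oml
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Scores of one flow across many tasks, as a dataframe.
evals = oml.evaluations.list_evaluations(function="predictive_accuracy",
                                         flow=[5891], output_format="dataframe")

# Join with dataset meta-features and learn to predict the score.
data = pd.DataFrame.from_dict(oml.datasets.list_datasets(), orient="index")
merged = evals.merge(data, left_on="data_id", right_on="did")
X = merged[["NumberOfInstances", "NumberOfFeatures"]].fillna(0)
y = merged["value"]
meta_model = RandomForestRegressor(n_estimators=100).fit(X, y)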
A U T O M L : M E T A - L E A R N I N G
• Reuse results on many hyperparameter settings
• Surrogate models: predict best hyperparameter settings (see the sketch below)
• Study hyperparameter effects/importance
• Amazon's multi-task learning AutoML (NIPS 2017)
• Trains surrogate models per task
• On new tasks: learns how to combine them with a neural net
• Hyperparameter space design
• Use OpenML data to learn which hyperparameters to tune
Jan van Rijn et al. (2017) AutoML@ICML
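A minimal sketch of a surrogate model: given (hyperparameter setting, observed score) pairs, for example collected from shared OpenML runs, fit a regressor and query it for promising settings. The numbers below are purely hypothetical.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical results: (C, gamma) settings of an SVM and their cross-val accuracy.
settings = np.array([[1.0, 0.01], [10.0, 0.01], [1.0, 0.1], [100.0, 0.001]])
scores = np.array([0.81, 0.84, 0.78, 0.86])

surrogate = RandomForestRegressor(n_estimators=200).fit(settings, scores)

# Predict performance for candidate settings instead of running them all.
candidates = np.array([[50.0, 0.005], [5.0, 0.05]])
print(surrogate.predict(candidates))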
A U T O M L : M E T A - L E A R N I N G
• Never-ending Automatic Machine Learning:
• AutoML methods built on top of OpenML get increasingly
better as more meta-data is added
• Faster drug discovery (QSAR)
• Meta-learning to build better models that recommend drug
candidates for rare diseases
ChEMBL DB: 1.4M compounds, 10k proteins, 12.8M activities
Molecule representations:

MW        LogP     TPSA    b1  b2  b3  b4  b5  b6  b7  b8  b9  ...
377.435    3.883    77.85   1   1   0   0   0   0   0   0   0
341.361    3.411    74.73   1   1   0   1   0   0   0   0   0   ...
197.188   -2.089   103.78   1   1   0   1   0   0   0   1   0
346.813    4.705    50.70   1   0   0   1   0   0   0   0   0
...
16,000 regression datasets x 52 pipelines (on OpenML) feed a meta-model; given all data on a new protein, it selects optimal models to predict activity.
(Olier et al., Machine Learning 107(1), 2018)
Learning to learn
Bots that learn from all prior experiments
Automate drudge work, help people build models
Join us! (and change the world)
Active open source community
We need more bright people
- ML/DB experts
- Developers
- UX
Support is welcome!
Workshop sponsorship (hackathons 2x/year)
Donations: OpenML foundation
Compute time
Project ideas
E I N D H O V E N U N I V E R S I T Y
Looking for:
• PhD Students
• Scientific programmer
O P E N M L H A C K A T H O N
Paris, September 17-21, 2018
meet.openml.org
Co-located with COSEAL
Thank you!
谢谢 (Thank you)
@open_ml
OpenML
Questions?