Practical Machine Learning
Pipelines with MLlib
Joseph K. Bradley
March 18, 2015
Spark Summit East 2015
About Spark MLlib
Started in UC Berkeley AMPLab
•  Shipped with Spark 0.8
Currently (Spark 1.3)
•  Contributions from 50+ orgs, 100+ individuals
•  Good coverage of algorithms
classification, regression, clustering, recommendation, feature extraction and selection, frequent itemsets, statistics, linear algebra
MLlib’s Mission
How can we move beyond this list of algorithms and help users develop real ML workflows?
MLlib’s mission is to make practical
machine learning easy and scalable.
•  Capable of learning from large-scale datasets
•  Easy to build machine learning applications
Outline
ML workflows
Pipelines
Roadmap
Example: Text Classification
Goal: Given a text document, predict its topic.
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic
polish. It will help somewhat
but nothing will remove deep
scratches without making it
worse than it already is.
McQuires will do something...
Features: text, image, vector, ...
Label: CTR, inches of rainfall, ...
In this example: 1 = about science, 0 = not about science
Dataset: “20 Newsgroups”, from the UCI KDD Archive
Training & Testing
Training
Given labeled data: RDD of (features, label)
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic polish. It will help...
Label 0
Subject: RIPEM FAQ
RIPEM is a program which performs Privacy Enhanced...
Label 1
...
Learn a model.
Testing/Production
Given new unlabeled data: RDD of features
Subject: Apollo Training
The Apollo astronauts also trained at (in) Meteor...
Label 1
Subject: A demo of Nonsense
How can you lie about something that no one...
Label 0
Use model to make predictions.
Example ML Workflow
Training: Load data → [labels + plain text] → Extract features → [labels + feature vectors] → Train model → [labels + predictions] → Evaluate
Pain point
•  Create many RDDs
•  Explicitly unzip & zip RDDs
val labels: RDD[Double] =
  data.map(_.label)
val features: RDD[Vector]
val predictions: RDD[Double]
labels.zip(predictions).map { case (label, prediction) =>
  if (label == prediction) ...
}
Example ML Workflow
Training: Load data → [labels + plain text] → Extract features → [labels + feature vectors] → Train model → [labels + predictions] → Evaluate
Pain point: Write as a script
•  Not modular
•  Difficult to re-use workflow
Example ML Workflow
Training: Load data → [labels + plain text] → Extract features → [labels + feature vectors] → Train model → [labels + predictions] → Evaluate
Testing/Production: Load new data → [plain text] → Extract features → [feature vectors] → Predict using model → [predictions] → Act on predictions
Almost identical workflow
Example ML Workflow
Training: Load data → [labels + plain text] → Extract features → [labels + feature vectors] → Train model → [labels + predictions] → Evaluate
Pain point: Parameter tuning
•  Key part of ML
•  Involves training many models
•  For different splits of the data
•  For different sets of parameters
Pain Points
•  Create & handle many RDDs and data types
•  Write as a script
•  Tune parameters
Enter... Pipelines! in Spark 1.2 & 1.3
Outline
ML workflows
Pipelines
Roadmap
Key Concepts
DataFrame: The ML Dataset
Abstractions: Transformers, Estimators, & Evaluators
Parameters: API & tuning
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
Named columns with types
label: Double
text: String
words: Seq[String]
features: Vector
prediction: Double
label  text          words               features
0      This is ...   ["This", "is", …]   [0.5, 1.2, …]
0      When we ...   ["When", ...]       [1.9, -0.8, …]
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
Named columns with types
Domain-Specific Language
# Select science articles
sciDocs = data.filter(data["label"] == 1)
# Scale labels
data["label"] * 0.5
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
Named columns with types
Domain-Specific Language
•  Shipped with Spark 1.3
•  APIs for Python, Java & Scala (+R in dev)
•  Integration with Spark SQL
•  Data import/export
•  Internal optimizations
Pain point: Create & handle many RDDs and data types
BIG data
Abstractions
Training: Load data → Extract features → Train model → Evaluate
Abstraction: Transformer
Training: Extract features → Train model → Evaluate
def transform(DataFrame): DataFrame
Input: (label: Double, text: String)
Output: (label: Double, text: String, features: Vector)
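The Transformer contract can be illustrated with a minimal sketch in plain Python rather than Spark. Lists of dicts stand in for a DataFrame, and the class name `HashedTermCounts`, its `num_features` argument, and the use of `zlib.crc32` are assumptions made for this example, not part of MLlib.

```python
import zlib


class HashedTermCounts:
    """Hypothetical feature extractor with the shape transform(rows) -> rows."""

    def __init__(self, num_features=8):
        self.num_features = num_features

    def transform(self, rows):
        out = []
        for row in rows:
            counts = [0.0] * self.num_features
            for word in row["text"].lower().split():
                # crc32 is stable across runs, unlike Python's built-in hash().
                index = zlib.crc32(word.encode("utf-8")) % self.num_features
                counts[index] += 1.0
            # Copy the row and append the new column; existing columns are untouched.
            out.append({**row, "features": counts})
        return out


rows = [{"label": 0.0, "text": "It will help somewhat"}]
result = HashedTermCounts().transform(rows)
print(sorted(result[0].keys()))  # ['features', 'label', 'text']
```

The input rows are not modified, which mirrors the slide's picture of a transformer that takes one dataset and returns another with an extra column.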
Abstraction: Estimator
Training: Extract features → Train model → Evaluate
def fit(DataFrame): Model
Input: (label: Double, text: String, features: Vector)
Output: LogisticRegressionModel
Abstraction: Evaluator
Training: Extract features → Train model → Evaluate
def evaluate(DataFrame): Double
Input: (label: Double, text: String, features: Vector, prediction: Double)
Output: Metric (accuracy, AUC, MSE, ...)
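As a rough illustration of the Evaluator idea, the sketch below computes accuracy over rows that already carry a label and a prediction. It is plain Python, not Spark code: `AccuracyEvaluator`, its column-name arguments, and the sample rows are all made up for the example.

```python
class AccuracyEvaluator:
    """Hypothetical evaluator: evaluate(rows) reduces a dataset to one number."""

    def __init__(self, label_col="label", prediction_col="prediction"):
        self.label_col = label_col
        self.prediction_col = prediction_col

    def evaluate(self, rows):
        # Count the rows whose prediction equals the true label.
        correct = sum(
            1 for row in rows if row[self.label_col] == row[self.prediction_col]
        )
        return correct / len(rows)


# Illustrative data only: three of the four predictions match their labels.
scored = [
    {"label": 1.0, "prediction": 1.0},
    {"label": 0.0, "prediction": 1.0},
    {"label": 0.0, "prediction": 0.0},
    {"label": 1.0, "prediction": 1.0},
]
metric = AccuracyEvaluator().evaluate(scored)
print(metric)  # 0.75
```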
Abstraction: Model
Model is a type of Transformer
def transform(DataFrame): DataFrame
Testing/Production: Extract features → Predict using model → Act on predictions
Input: (text: String, features: Vector)
Output: (text: String, features: Vector, prediction: Double)
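The relationship between an Estimator and the Model it produces can be sketched in plain Python. This is not logistic regression and not Spark's API: a nearest-centroid classifier is swapped in because it fits in a few lines, and the class names and sample rows are assumptions for illustration.

```python
class NearestCentroid:
    """Hypothetical estimator: fit() learns one centroid per label."""

    def fit(self, rows):
        sums, counts = {}, {}
        for row in rows:
            label = row["label"]
            acc = sums.setdefault(label, [0.0] * len(row["features"]))
            for i, value in enumerate(row["features"]):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        centroids = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return NearestCentroidModel(centroids)


class NearestCentroidModel:
    """The fitted model is itself a transformer: it adds a prediction column."""

    def __init__(self, centroids):
        self.centroids = centroids

    def _predict(self, features):
        def squared_distance(label):
            return sum((a - b) ** 2 for a, b in zip(features, self.centroids[label]))

        return min(self.centroids, key=squared_distance)

    def transform(self, rows):
        return [{**row, "prediction": self._predict(row["features"])} for row in rows]


training = [
    {"label": 0.0, "features": [0.0, 1.0]},
    {"label": 0.0, "features": [0.0, 3.0]},
    {"label": 1.0, "features": [4.0, 0.0]},
]
model = NearestCentroid().fit(training)
scored = model.transform([{"features": [3.0, 0.5]}])
print(scored[0]["prediction"])  # 1.0
```

Training needs the label column, while the fitted model only needs features, which matches the schemas shown on the Estimator and Model slides.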
(Recall) Abstraction: Estimator
Training: Load data → Extract features → Train model → Evaluate
def fit(DataFrame): Model
Input: (label: Double, text: String, features: Vector)
Output: LogisticRegressionModel
Abstraction: Pipeline
Pipeline is a type of Estimator
def fit(DataFrame): Model
Training: Load data → Extract features → Train model → Evaluate
Input: (label: Double, text: String)
Output: PipelineModel
Abstraction: PipelineModel
PipelineModel is a type of Transformer
def transform(DataFrame): DataFrame
Testing/Production: Load data → Extract features → Predict using model → Act on predictions
Input: (text: String)
Output: (text: String, features: Vector, prediction: Double)
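How a Pipeline turns into a PipelineModel can be sketched as follows. This is a plain-Python approximation under stated assumptions, not Spark's implementation: lists of dicts stand in for DataFrames, and `Tokenizer`, `KeywordClassifier`, and `KeywordModel` are toy stages invented for the example.

```python
class Tokenizer:
    """Toy transformer: splits the text column into lowercase words."""

    def transform(self, rows):
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class KeywordClassifier:
    """Toy estimator: learns words that occur only in label-1.0 documents."""

    def fit(self, rows):
        positive = {w for r in rows if r["label"] == 1.0 for w in r["words"]}
        negative = {w for r in rows if r["label"] == 0.0 for w in r["words"]}
        return KeywordModel(positive - negative)


class KeywordModel:
    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            {**row, "prediction": 1.0 if self.keywords & set(row["words"]) else 0.0}
            for row in rows
        ]


class Pipeline:
    """An estimator made of an ordered list of stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # estimator becomes a fitted model
            rows = stage.transform(rows)  # output feeds the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    """A transformer that replays the fitted stages in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


training = [
    {"label": 1.0, "text": "Apollo astronauts trained"},
    {"label": 0.0, "text": "plastic polish scratches"},
]
model = Pipeline([Tokenizer(), KeywordClassifier()]).fit(training)
scored = model.transform([{"text": "The Apollo program"}, {"text": "deep scratches"}])
print([row["prediction"] for row in scored])  # [1.0, 0.0]
```

The same fitted object handles feature extraction and prediction on new text, which is the point of the "almost identical workflow" observation earlier in the deck.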
  
Abstractions: Summary
Training: Load data (DataFrame) → Extract features (Transformer) → Train model (Estimator) → Evaluate (Evaluator)
Testing: Load data (DataFrame) → Extract features (Transformer) → Predict using model (Transformer) → Evaluate (Evaluator)
Demo
Legend: DataFrame, Transformer, Estimator, Evaluator
Training: Load data → Tokenizer → HashingTF → LogisticRegression → BinaryClassificationEvaluator
Current data schema: (label: Double, text: String, words: Seq[String], features: Vector, prediction: Double)
Demo
Legend: DataFrame, Transformer, Estimator, Evaluator
Training: Load data → Tokenizer → HashingTF → LogisticRegression → BinaryClassificationEvaluator
Pain point: Write as a script
Parameters
Standard API
•  Typed
•  Defaults
•  Built-in doc
•  Autocomplete
> hashingTF.numFeatures
org.apache.spark.ml.param.IntParam =
numFeatures: number of features
(default: 262144)
> hashingTF.setNumFeatures(1000)
> hashingTF.getNumFeatures
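The properties listed on this slide (typed, defaults, built-in doc) can be mimicked in a few lines of plain Python. The sketch below is only an analogy to the shell session above: `Param` and `HashingTFSketch` are hypothetical classes written for this example and do not reproduce Spark's `org.apache.spark.ml.param` classes.

```python
class Param:
    """Hypothetical parameter descriptor carrying a type, a default, and a doc string."""

    def __init__(self, name, doc, type_, default):
        self.name = name
        self.doc = doc
        self.type_ = type_
        self.default = default

    def __repr__(self):
        return f"{self.name}: {self.doc} (default: {self.default})"


class HashingTFSketch:
    numFeatures = Param("numFeatures", "number of features", int, 262144)

    def __init__(self):
        self._values = {}

    def setNumFeatures(self, value):
        # Reject values of the wrong type instead of failing later.
        if not isinstance(value, self.numFeatures.type_):
            raise TypeError("numFeatures must be an int")
        self._values["numFeatures"] = value
        return self  # returning self allows chained setters

    def getNumFeatures(self):
        # Fall back to the documented default when nothing was set.
        return self._values.get("numFeatures", self.numFeatures.default)


tf = HashingTFSketch()
print(tf.numFeatures)                            # numFeatures: number of features (default: 262144)
print(tf.getNumFeatures())                       # 262144
print(tf.setNumFeatures(1000).getNumFeatures())  # 1000
```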
Parameter Tuning
Given:
•  Estimator
•  Parameter grid
•  Evaluator
Find best parameters
lr.regParam: {0.01, 0.1, 0.5}
hashingTF.numFeatures: {100, 1000, 10000}
CrossValidator
•  Estimator: Tokenizer → HashingTF → LogisticRegression
•  Evaluator: BinaryClassificationEvaluator
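The search described above (an estimator, a parameter grid, and an evaluator) can be sketched in plain Python. This is an illustration of grid search with k-fold cross-validation in general, not Spark's CrossValidator: the helper names, the fold assignment by `i % k`, and the `toy_score` function are assumptions made up for the example.

```python
from itertools import product


def param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def k_folds(rows, k):
    """Yield (train, validation) splits; row i is held out in fold i % k."""
    for fold in range(k):
        train = [r for i, r in enumerate(rows) if i % k != fold]
        valid = [r for i, r in enumerate(rows) if i % k == fold]
        yield train, valid


def cross_validate(fit_and_score, grid, rows, k=3):
    """Return the parameter combination with the best mean validation score."""
    best_params, best_score = None, float("-inf")
    for params in param_grid(grid):
        scores = [fit_and_score(params, train, valid) for train, valid in k_folds(rows, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score


# The grid from the slide: 3 x 3 = 9 combinations.
grid = {"lr.regParam": [0.01, 0.1, 0.5], "hashingTF.numFeatures": [100, 1000, 10000]}
print(len(param_grid(grid)))  # 9


def toy_score(params, train, valid):
    # Stand-in for "fit on train, evaluate on valid"; the peak location is arbitrary.
    return -abs(params["lr.regParam"] - 0.1) - abs(params["hashingTF.numFeatures"] - 1000) / 10000


best_params, best_score = cross_validate(toy_score, grid, list(range(6)))
print(best_params)  # {'hashingTF.numFeatures': 1000, 'lr.regParam': 0.1}
```

With 9 combinations and 3 folds, the scoring function runs 27 times, which is why the deck calls out multi-model training as an optimization target.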
Parameter Tuning
Given:
•  Estimator
•  Parameter grid
•  Evaluator
Find best parameters
CrossValidator
•  Estimator: Tokenizer → HashingTF → LogisticRegression
•  Evaluator: BinaryClassificationEvaluator
Pain point: Tune parameters
Pipelines: Recap
Pain points and what addresses them
•  Create & handle many RDDs and data types: DataFrame
•  Write as a script: Abstractions
•  Tune parameters: Parameter API
Also
•  Python, Scala, Java APIs
•  Schema validation
•  User-Defined Types*
•  Feature metadata*
•  Multi-model training optimizations*
* Groundwork done; full support WIP.
Inspirations
•  scikit-learn + Spark DataFrame, Param API
•  MLBase (Berkeley AMPLab): ongoing collaborations
Outline
ML workflows
Pipelines
Roadmap
Roadmap
spark.mllib: Primary ML package
spark.ml: High-level Pipelines API for algorithms in spark.mllib (experimental in Spark 1.2-1.3)
Near future
•  Feature attributes
•  Feature transformers
•  More algorithms under Pipeline API
Farther ahead
•  Ideas from AMPLab MLBase (auto-tuning models)
•  SparkR integration
Thank you!
Outline
•  ML workflows
•  Pipelines
•  DataFrame
•  Abstractions
•  Parameter tuning
•  Roadmap
Spark documentation
http://spark.apache.org/
Pipelines blog post
https://databricks.com/blog/2015/01/07
