Practical Machine Learning
Pipelines with Spark MLlib
Joseph K. Bradley
June 2015
Hadoop Summit
Who am I?
Joseph K. Bradley
Ph.D. in Machine Learning from CMU, postdoc at Berkeley
Apache Spark committer
Software Engineer @ Databricks Inc.
2
3
Concise APIs in Python, Java, Scala
… and R in Spark 1.4!
500+ enterprises using or planning
to use Spark in production (blog)
Spark
SparkSQL Streaming MLlib GraphX
Distributed computing engine
• Built for speed, ease of use,
and sophisticated analytics
• Apache open source
Beyond Hadoop
4
Early adopters (Data) Engineers
MapReduce &
functional API
Data Scientists
& Statisticians
Spark for Data Science
DataFrames
Intuitive manipulation of distributed structured data
5
Machine Learning Pipelines
Simple construction and tuning of ML workflows
Google Trends for “dataframe”
6
DataFrames
7
dept  age  name
Bio   48   H Smith
CS    54   A Turing
Bio   43   B Jones
Chem  61   M Kennedy
RDD API
DataFrame API
Data grouped into
named columns
DataFrames
8
dept  age  name
Bio   48   H Smith
CS    54   A Turing
Bio   43   B Jones
Chem  61   M Kennedy
Data grouped into named columns
DSL for common tasks
• Project, filter, aggregate, join, …
• Metadata
• UDFs
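The project, filter, and aggregate operations listed above can be sketched on the slide's example table. This is a plain-Python stand-in, not the Spark DataFrame API: the helper names (`project`, `filter_rows`, `average_by`) are hypothetical and exist only to show what each operation does to rows with named columns.

```python
from collections import defaultdict

# Rows from the slide's example table, as dicts keyed by column name.
rows = [
    {"dept": "Bio", "age": 48, "name": "H Smith"},
    {"dept": "CS", "age": 54, "name": "A Turing"},
    {"dept": "Bio", "age": 43, "name": "B Jones"},
    {"dept": "Chem", "age": 61, "name": "M Kennedy"},
]


def project(rows, columns):
    """Project: keep only the named columns of every row."""
    return [{c: r[c] for c in columns} for r in rows]


def filter_rows(rows, predicate):
    """Filter: keep only the rows for which the predicate holds."""
    return [r for r in rows if predicate(r)]


def average_by(rows, key, value):
    """Aggregate: group rows by `key` and average `value` per group."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum, count]
    for r in rows:
        acc = totals[r[key]]
        acc[0] += r[value]
        acc[1] += 1
    return {k: total / count for k, (total, count) in totals.items()}


# Names of everyone older than 45, then the average age per department.
names_over_45 = project(filter_rows(rows, lambda r: r["age"] > 45), ["name"])
avg_age_by_dept = average_by(rows, "dept", "age")
```

Because every step refers to columns by name, the same three calls keep working if columns are added or reordered, which is the point of grouping data into named columns.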
Spark DataFrames
9
API inspired by R and Python Pandas
• Python, Scala, Java (+ R in dev)
• Pandas integration
Distributed DataFrame
Highly optimized
10
Spark DataFrames are fast
[Bar chart: runtime of aggregating 10 million int pairs, in seconds (axis 0 to 10, lower is better), for RDD Scala, RDD Python, Spark Scala DF, and Spark Python DF]
Uses the SparkSQL Catalyst optimizer
11
Demo: DataFrames
in Databricks Cloud
Practical Distributed Machine Learning Pipelines on Hadoop
Spark for Data Science
DataFrames
• Structured data
• Familiar API based on R & Python Pandas
• Distributed, optimized implementation
18
Machine Learning Pipelines
Simple construction and tuning of ML workflows
About Spark MLlib
Started @ Berkeley
• Spark 0.8
Now (Spark 1.3)
• Contributions from 50+ orgs, 100+ individuals
• Growing coverage of distributed algorithms
Spark
SparkSQL Streaming MLlib GraphX
19
About Spark MLlib
Classification
• Logistic regression
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
Regression
• Ordinary least squares
• Ridge regression
• Lasso
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
Frequent itemsets
• FP-growth
20
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear algebra
• Local dense & sparse vectors &
matrices
• Distributed matrices
• Block-partitioned matrix
• Row matrix
• Indexed row matrix
• Coordinate matrix
• Matrix decompositions
Model import/export
Pipelines
Recommendation
• Alternating Least Squares
Feature extraction & selection
• Word2Vec
• Chi-Squared selection
• Hashing term frequency
• Inverse document frequency
• Normalizer
• Standard scaler
• Tokenizer
• One-Hot Encoder
• StringIndexer
• VectorIndexer
• VectorAssembler
• Binarizer
• Bucketizer
• ElementwiseProduct
• PolynomialExpansion
List based on upcoming release 1.4
ML Workflows are complex
21
Train model
Evaluate
Load data
Extract features
ML Workflows are complex
22
Train model
Evaluate
Datasource 1
Extract features
Datasource 2
Datasource 2
ML Workflows are complex
23
Train model
Evaluate
Datasource 1
Datasource 2
Datasource 2
Extract features
Extract features
Feature transform 1
Feature transform 2
Feature transform 3
ML Workflows are complex
24
Train model 1
Evaluate
Datasource 1
Datasource 2
Datasource 2
Extract features
Extract features
Feature transform 1
Feature transform 2
Feature transform 3
Train model 2
Ensemble
ML Workflows are complex
25
Specify pipeline
Inspect & debug
Re-run on new data
Tune parameters
Example: Text Classification
26
Goal: Given a text document, predict its topic.
Features:
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is. McQuires will do something...
Label:
1: about science
0: not about science
Dataset: “20 Newsgroups”, from the UCI KDD Archive
ML Workflow
27
Train model
Evaluate
Load data
Extract features
Load Data
28
Train model
Evaluate
Load data
Extract features
built-in external
{ JSON }
JDBC
and more …
Data sources for DataFrames
Load Data
29
Train model
Evaluate
Load data
Extract features
label: Int
text: String
Current data schema
Extract Features
30
Train model
Evaluate
Load data
Extract features
label: Int
text: String
Current data schema
Extract Features
31
Train model
Evaluate
Load data
label: Int
text: String
Current data schema
Tokenizer
Hashed Term Freq.
features: Vector
words: Seq[String]
Transformer
DataFrame
DataFrame
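The two feature-extraction steps on this slide can be sketched as functions that each append one column to every row. This is a toy under stated assumptions, not Spark's implementation: the CRC32 hash and the `add_column` helper are choices made here for a deterministic, self-contained example.

```python
import zlib


def tokenize(text):
    """Tokenizer: lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashed_term_freq(words, num_features):
    """Hashed term frequency: count words per hash bucket.

    CRC32 is an assumption chosen for determinism; any stable hash works.
    """
    features = [0.0] * num_features
    for word in words:
        index = zlib.crc32(word.encode("utf-8")) % num_features
        features[index] += 1.0
    return features


def add_column(rows, out_col, fn, in_col):
    """Transformer behaviour: return new rows with `out_col` appended."""
    return [dict(r, **{out_col: fn(r[in_col])}) for r in rows]


rows = [{"label": 1, "text": "Suggest plastic polish polish"}]
with_words = add_column(rows, "words", tokenize, "text")
with_features = add_column(
    with_words, "features", lambda ws: hashed_term_freq(ws, 16), "words"
)
```

Each step leaves the earlier columns in place, so `label`, `text`, and `words` are all still available next to `features` for inspection.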
Train a Model
32
Logistic Regression
Evaluate
label: Int
text: String
Current data schema
Tokenizer
Hashed Term Freq.
features: Vector
words: Seq[String]
prediction: Int
Load data
Estimator
DataFrame
Model
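The Estimator contract on this slide (data in, model out) can be sketched with a small logistic regression trained by batch gradient descent. The class names, the `reg_param`, `step`, and `iterations` parameters, and the training loop are assumptions made for illustration; Spark's own optimizer and API differ.

```python
import math


class LogisticRegressionModel:
    """Fitted model: a Transformer that appends a `prediction` column."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def probability(self, features):
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def transform(self, rows):
        return [
            dict(r, prediction=int(self.probability(r["features"]) >= 0.5))
            for r in rows
        ]


class LogisticRegression:
    """Estimator: `fit` takes labelled rows and returns a model."""

    def __init__(self, reg_param=0.01, step=0.5, iterations=200):
        self.reg_param = reg_param
        self.step = step
        self.iterations = iterations

    def fit(self, rows):
        n = len(rows[0]["features"])
        m = len(rows)
        model = LogisticRegressionModel([0.0] * n, 0.0)
        for _ in range(self.iterations):
            grad_w = [0.0] * n
            grad_b = 0.0
            for r in rows:
                # Gradient of the log loss with respect to the linear score.
                err = model.probability(r["features"]) - r["label"]
                grad_b += err
                for i, x in enumerate(r["features"]):
                    grad_w[i] += err * x
            # Averaged gradient step with L2 regularisation on the weights.
            model.weights = [
                w - self.step * (g / m + self.reg_param * w)
                for w, g in zip(model.weights, grad_w)
            ]
            model.bias -= self.step * grad_b / m
        return model


train = [
    {"label": 1, "features": [1.0, 0.0]},
    {"label": 0, "features": [0.0, 1.0]},
]
model = LogisticRegression().fit(train)
predicted = model.transform(train)
```

Splitting the unfitted algorithm (`LogisticRegression`) from the fitted result (`LogisticRegressionModel`) is what lets the model be treated as just another column-appending step afterwards.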
Evaluate the Model
33
Logistic Regression
Evaluate
label: Int
text: String
Current data schema
Tokenizer
Hashed Term Freq.
features: Vector
words: Seq[String]
prediction: Int
Load data
Evaluator
DataFrame
metric
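An evaluator reduces a table of predictions to a single number. The slide does not name the metric, so the sketch below assumes plain accuracy; the function name and column defaults are hypothetical.

```python
def accuracy(rows, label_col="label", prediction_col="prediction"):
    """Evaluator: fraction of rows whose prediction matches the label."""
    if not rows:
        raise ValueError("cannot evaluate an empty dataset")
    correct = sum(1 for r in rows if r[label_col] == r[prediction_col])
    return correct / len(rows)


scored = [
    {"label": 1, "prediction": 1},
    {"label": 0, "prediction": 0},
    {"label": 1, "prediction": 0},
    {"label": 0, "prediction": 0},
]
metric = accuracy(scored)
```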
Data Flow
34
Logistic Regression
Evaluate
label: Int
text: String
Current data schema
Tokenizer
Hashed Term Freq.
features: Vector
words: Seq[String]
prediction: Int
Load data
By default, always append new columns
• Can go back & inspect intermediate results
• Made efficient by DataFrames
ML Pipelines
35
Logistic Regression
Evaluate
Tokenizer
Hashed Term Freq.
Load data
Pipeline
Test data
Logistic Regression
Tokenizer
Hashed Term Freq.
Evaluate
Re-run exactly
the same way
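The idea of fitting a chain of stages once and then re-running it unchanged on test data can be sketched as follows. Everything here is a toy: `WordVote` is a hypothetical word-counting classifier standing in for the slide's logistic regression, and the `Pipeline` class only mimics the fit-then-transform behaviour described on the slide.

```python
from collections import Counter


class Tokenizer:
    """Transformer: appends a `words` column computed from `text`."""

    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class WordVoteModel:
    """Fitted Transformer: appends a `prediction` column."""

    def __init__(self, scores):
        self.scores = scores

    def transform(self, rows):
        return [
            dict(r, prediction=int(sum(self.scores[w] for w in r["words"]) > 0))
            for r in rows
        ]


class WordVote:
    """Toy Estimator: each word votes +1 or -1 by the label it appeared with."""

    def fit(self, rows):
        scores = Counter()
        for r in rows:
            for w in set(r["words"]):
                scores[w] += 1 if r["label"] == 1 else -1
        return WordVoteModel(scores)


class PipelineModel:
    """All stages fitted: applies them in order to any new data."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):  # Estimator: fit on the data so far
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)
        return PipelineModel(fitted)


train = [
    {"label": 1, "text": "plastic polish removes scratches"},
    {"label": 0, "text": "hockey game tonight"},
]
test = [
    {"label": 1, "text": "deep scratches"},
    {"label": 0, "text": "hockey scores"},
]
pipeline_model = Pipeline([Tokenizer(), WordVote()]).fit(train)
predictions = [r["prediction"] for r in pipeline_model.transform(test)]
```

The test rows go through exactly the tokenization that the training rows did, because the fitted pipeline carries its stages with it rather than relying on the caller to repeat them.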
Parameter Tuning
36
Logistic Regression
Evaluate
Tokenizer
Hashed Term Freq.
lr.regParam: {0.01, 0.1, 0.5}
hashingTF.numFeatures: {100, 1000, 10000}
Given:
• Estimator
• Parameter grid
• Evaluator
Find best parameters
CrossValidator
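The search described here (an estimator, a parameter grid, and an evaluator in; the best parameters out) can be sketched with k-fold cross-validation over every grid combination. The helper names, the fold scheme, and the `ThresholdClassifier` toy estimator are assumptions for illustration, not Spark's CrossValidator.

```python
from itertools import product


def param_grid(options):
    """Expand {name: [values]} into a list of every parameter combination."""
    names = sorted(options)
    return [dict(zip(names, values))
            for values in product(*(options[n] for n in names))]


def k_folds(rows, k):
    """Yield (training, validation) splits; every row is validated once."""
    for i in range(k):
        validation = rows[i::k]
        training = [r for j, r in enumerate(rows) if j % k != i]
        yield training, validation


def cross_validate(make_estimator, grid, evaluate, rows, k=3):
    """Return the parameters with the best mean validation score."""
    best_params, best_score = None, float("-inf")
    for params in grid:
        scores = []
        for training, validation in k_folds(rows, k):
            model = make_estimator(params).fit(training)
            scores.append(evaluate(model.transform(validation)))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score


class ThresholdClassifier:
    """Toy estimator: predicts 1 when x exceeds a fixed threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def fit(self, rows):
        return self  # nothing is learned in this toy

    def transform(self, rows):
        return [dict(r, prediction=int(r["x"] > self.threshold)) for r in rows]


def accuracy(rows):
    return sum(r["label"] == r["prediction"] for r in rows) / len(rows)


# The slide's grid: 3 x 3 = 9 combinations to try.
slide_grid = param_grid({"regParam": [0.01, 0.1, 0.5],
                         "numFeatures": [100, 1000, 10000]})

data = [{"x": x, "label": int(x > 3)} for x in range(1, 7)]
best_params, best_score = cross_validate(
    lambda p: ThresholdClassifier(p["threshold"]),
    param_grid({"threshold": [1, 3, 5]}),
    accuracy,
    data,
)
```

The grid size is the product of the option counts, so adding parameters or values multiplies the number of fits by the number of folds; that cost is worth keeping in mind when choosing the grid.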
37
Demo: ML Pipelines
in Databricks Cloud
Recap
DataFrames
• Structured data
• Familiar API based on R & Python Pandas
• Distributed, optimized implementation
Machine Learning Pipelines
• Integration with DataFrames
• Familiar API based on scikit-learn
• Simple parameter tuning
47
Composable & DAG Pipelines
Schema validation
User-defined Pipeline components
Looking Ahead
48
Spark 1.4
• SparkR
• Pipelines graduating from alpha
• Many more feature transformers
• More complete Python API
Future
• API for R DataFrames & Pipelines
• More ML algorithms & pluggability
• Improved model inspection
Learn more next week
at the Spark Summit!
spark-summit.org/2015
Databricks Inc.
49
Founded by the creators of Spark
& driving its development
Databricks Cloud: the best place to run Spark
Guess what…we’re hiring!
databricks.com/company/careers
Thank you!
Spark documentation
spark.apache.org
Pipelines blog post
databricks.com/blog/2015/01/07
DataFrames blog post
databricks.com/blog/2015/02/17
Databricks Cloud Platform
databricks.com/product
Spark MOOCs on edX
Intro to Spark & ML with Spark
Spark Packages
spark-packages.org


Editor's Notes

  • #4: Contributions plot from: https://databricks.com/blog/2015/03/31/spark-turns-five-years-old.html Daytona GraySort contest (100TB sort) (blog)
  • #5: TODO: REMOVE SLIDE?
  • #8: For those coming from Hadoop, this is a huge improvement: simpler code, runs on a laptop and on a huge cluster, very efficient. Can you spot the bug in the code using the RDD API?
  • #20: Contributions estimated from github commit logs, with some effort to de-duplicate entities.
  • #27: Dataset source: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html *Data from UCI KDD Archive, originally donated to archive by Tom Mitchell (CMU).
  • #38: TODO: Include schema validation in the demo? (Select wrong columns to pass to Pipeline.fit().)
  • #48: No time to mention: User-defined functions (UDFs) Optimizations: code gen, predicate pushdown