‹#›© Cloudera, Inc. All rights reserved.
Introduction to Apache Spark & Spark MLlib
Juliet Hougland
What are we doing here?
Spark Overview
Spark: Easy and Fast Big Data
• Easy to Develop
  • Rich APIs in Java, Scala, Python
  • Interactive shell
• Fast to Run
  • General execution graphs
  • In-memory storage
Apache Spark: Ecosystem
• DataFrames
• MLlib
• Streaming
Spark in CDH
[Diagram: components of the CDH stack: HDFS, HBase, YARN, MapReduce2, Hive, Pig, Search, Impala, and Spark with Spark Streaming, GraphX, MLlib, SQL, and PySpark]
Spark Execution Model
(sc.textFile("hdfs://…", 4)
   .map(to_series)
   .filter(has_outlier)
   .count())
What do you mean, lazy evaluation?
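One way to picture lazy evaluation is the pure-Python sketch below. It is not Spark code: `LazyDataset` is a hypothetical stand-in for an RDD that only records `map` and `filter` steps and runs them when an action such as `count()` is called. The `to_number` helper and the sample values are also made up for illustration.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: records steps, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source      # any iterable of records
        self._steps = steps        # tuple of ("map" | "filter", function)

    def map(self, fn):
        # Transformation: returns a new dataset and does no work yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Stream each record through the recorded steps, one at a time.
        for record in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record

    def count(self):
        # Action: this is the point where the pipeline actually executes.
        return sum(1 for _ in self._run())


calls = []  # records every time the map function really runs


def to_number(line):
    calls.append(line)
    return float(line)


pipeline = LazyDataset(["1.5", "20.0", "3.0"]).map(to_number).filter(lambda x: x > 2.0)
calls_before_action = len(calls)   # building the pipeline did no work
result = pipeline.count()          # the action triggers the work
calls_after_action = len(calls)
```

Building `pipeline` leaves `calls` empty; only `count()` causes `to_number` to run, once per input record.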
Transformations
• map
• flatMap
• filter
• distinct
• sample
• union
• intersection
• subtract
• cartesian
Actions
• collect()
• count()
• take(num)
• takeOrdered(num)(ordering)
• reduce(function)
• aggregate(zeroValue)(seqOp, combOp)
• foreach(function)
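Of the actions listed, `aggregate(zeroValue)(seqOp, combOp)` is probably the least obvious. The sketch below imitates its semantics in plain Python over a list of partitions; it does not call Spark, and `aggregate_partitions` and the sample data are hypothetical names chosen for illustration.

```python
from functools import reduce


def aggregate_partitions(partitions, zero_value, seq_op, comb_op):
    """Fold each partition with seq_op, then merge the partial results with comb_op."""
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    return reduce(comb_op, partials, zero_value)


# Hypothetical data, already split into two partitions.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0]]

# The accumulator is a (running_sum, running_count) pair.
def seq_op(acc, x):
    # Fold one element into a partition's accumulator.
    return (acc[0] + x, acc[1] + 1)


def comb_op(a, b):
    # Merge the accumulators of two partitions.
    return (a[0] + b[0], a[1] + b[1])


total, n = aggregate_partitions(partitions, (0.0, 0), seq_op, comb_op)
mean = total / n
```

Because the zero value is used once per partition and again in the merge, it has to be neutral for both functions, as `(0.0, 0)` is here.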
RDDs
(sc.textFile("hdfs://…", 4)
   .map(to_series)
   .filter(has_outlier)
   .count())
[Diagram, built up over five slides: HDFS → …RDD (Partition 1 to Partition 4) → …RDD (Partition 1 to Partition 4) → …RDD (Partition 1 to Partition 4) → Count]
Thanks: Kostas Sakellis
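The partition picture from the RDD slides can be mimicked in plain Python. This is only a sketch, not how Spark is implemented: `to_series` and `has_outlier` are named on the slides but never defined, so the versions here are hypothetical, as are `split_into_partitions`, the sample lines, and the threshold.

```python
def split_into_partitions(records, num_partitions):
    """Cut records into num_partitions contiguous slices of nearly equal size."""
    size, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


def to_series(line):
    # Hypothetical: parse a comma-separated line into a list of floats.
    return [float(x) for x in line.split(",")]


def has_outlier(series, threshold=100.0):
    # Hypothetical rule: any reading above the threshold counts as an outlier.
    return any(abs(x) > threshold for x in series)


lines = ["1,2,3", "4,500,6", "7,8,9", "10,11,1200", "13,14,15", "16,17,18"]
partitions = split_into_partitions(lines, 4)

# map and filter run independently inside each partition...
mapped = [[to_series(line) for line in part] for part in partitions]
filtered = [[s for s in part if has_outlier(s)] for part in mapped]

# ...and count() adds up one partial count per partition.
partial_counts = [len(part) for part in filtered]
total = sum(partial_counts)
```

Each stage keeps the same four partitions, which matches the three rows of partitions in the diagram; only the final count combines results across partitions.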
Complex, In-Memory Processing
[Diagram: execution graph over datasets A to F, connected by map, groupBy, join, filter, and take steps]
Predicting Churn: Machine Learning with Spark MLlib
Modeling Churn
The Dataset
KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
OH, 107, 415, 371-7191, no, yes, 26, 161.6, 123, 27.47, 195.5, 103, 16.62, 254.4, 103, 11.45, 13.7, 3, 3.7, 1, False.
NJ, 137, 415, 358-1921, no, no, 0, 243.4, 114, 41.38, 121.2, 110, 10.3, 162.6, 104, 7.32, 12.2, 5, 3.29, 0, False.
OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False.
OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, False
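Before records like these can go into a DataFrame, each line has to be split into typed fields. The sketch below does that with the Python standard library only. The slide shows raw values without a header, so every column name in `COLUMNS` is an assumption, and `parse_record` is a hypothetical helper rather than code from the talk.

```python
import csv
from io import StringIO

# Assumed column names; the slide shows only raw values.
COLUMNS = [
    "state", "account_length", "area_code", "phone_number",
    "intl_plan", "voice_mail_plan", "number_vmail_messages",
    "total_day_minutes", "total_day_calls", "total_day_charge",
    "total_eve_minutes", "total_eve_calls", "total_eve_charge",
    "total_night_minutes", "total_night_calls", "total_night_charge",
    "total_intl_minutes", "total_intl_calls", "total_intl_charge",
    "number_customer_service_calls", "churned",
]
TEXT_COLUMNS = {"state", "area_code", "phone_number"}
YES_NO_COLUMNS = {"intl_plan", "voice_mail_plan"}


def parse_record(line):
    """Turn one comma-separated line into a dict of typed fields."""
    values = next(csv.reader(StringIO(line), skipinitialspace=True))
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    record = {}
    for name, raw in zip(COLUMNS, values):
        raw = raw.strip()
        if name in TEXT_COLUMNS:
            record[name] = raw
        elif name in YES_NO_COLUMNS:
            record[name] = raw == "yes"
        elif name == "churned":
            # The label appears as "False." on most lines and "False" on the last.
            record[name] = raw.rstrip(".") == "True"
        else:
            record[name] = float(raw)
    return record


first = parse_record(
    "KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, "
    "244.7, 91, 11.01, 10, 3, 2.7, 1, False."
)
```

Area code and phone number are kept as text here because they identify rather than measure; that choice is part of the sketch, not something the slide states.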
Create DataFrame
Why DataFrames?
Modeling Lifecycle
Specify Feature Extraction
Specify Feature Extraction
Evaluating Classifiers: ROC
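As a reminder of what an ROC curve measures, the sketch below computes the curve and the area under it in plain Python. It is a generic illustration, not the evaluation code from the talk: `roc_points`, `auc`, and the labels and scores are all hypothetical.

```python
def roc_points(labels, scores):
    """Return ROC points (fpr, tpr), one per distinct score, starting at (0, 0)."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    # Visit examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example of a tied score group.
        is_last_of_group = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_group:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores: higher means "more likely to churn".
labels = [True, False, True, False]
scores = [0.9, 0.8, 0.7, 0.1]
points = roc_points(labels, scores)
area = auc(points)
```

The area equals the fraction of (positive, negative) pairs that the scores rank in the right order, which here is three of the four pairs.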
Model Evaluation
Evaluating Classifiers
Model Evaluation
Find the entire example at
github.com/jhlch/ds-for-telco
Thank You
Juliet Hougland
@j_houg
github.com/jhlch/ds-for-telco


Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Senior Data Scientist, Cloudera


Editor's Notes

  • #21: Each of these records is a person