Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Senior Data Scientist, Cloudera


This deck introduces Apache Spark and Spark MLlib, highlighting Spark's ease of development, speed, and rich APIs in Java, Scala, and Python. It covers the Spark execution model, RDDs, and DataFrames, then works through a churn-prediction example that includes feature extraction and model evaluation.


© Cloudera, Inc. All rights reserved.
Introduction to Apache Spark & Spark MLlib
Juliet Hougland

What are we doing here?

Spark Overview

Spark: Easy and Fast Big Data
• Easy to Develop
  • Rich APIs in Java, Scala, Python
  • Interactive shell
• Fast to Run
  • General execution graphs
  • In-memory storage

Apache Spark: Ecosystem
• DataFrames
• MLlib
• Streaming

Spark in CDH
[Diagram: Spark in the CDH stack alongside YARN, HDFS, HBase, Hive, Pig, MapReduce2, Impala, and Search; the Spark components shown are Spark Streaming, GraphX, MLlib, SQL, and PySpark]

Spark Execution Model
sc.textFile("hdfs://…", 4)
.map(to_series)
.filter(has_outlier)
.count()
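The slide names two functions, to_series and has_outlier, without showing their bodies. The sketch below is one hypothetical way they might look, run on a local list with Python's built-in map and filter rather than on a Spark cluster. The parsing format, the threshold, and the sample lines are all assumptions made for illustration.

```python
import statistics


def to_series(line):
    """Hypothetical helper: parse one comma-separated line into a list of floats."""
    return [float(field) for field in line.split(",")]


def has_outlier(series, threshold=2.0):
    """Hypothetical helper: True if any value lies more than `threshold`
    population standard deviations away from the mean of its series."""
    if len(series) < 2:
        return False
    mean = statistics.fmean(series)
    spread = statistics.pstdev(series)
    if spread == 0:
        return False
    return any(abs(value - mean) > threshold * spread for value in series)


# Made-up stand-in for the lines that sc.textFile would read from HDFS.
lines = ["1,1,1,1,1,1,1,1,1,50", "3,4,5,4,3", "2,2,2,2"]

# Same shape as the slide's pipeline: map, then filter, then count.
outlier_count = sum(1 for series in map(to_series, lines) if has_outlier(series))
```

Only the first sample line contains a value far from the rest, so it is the only one that passes the filter.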

What do you mean, lazy evaluation?
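As a rough analogy, Python's own map and filter are also lazy: they describe work without doing it until something consumes the result. The sketch below uses that behavior to illustrate the idea. It is plain Python, not the Spark API, and the tracked_double helper is a hypothetical name used only to observe when work happens.

```python
calls = []  # records every time the mapped function actually runs


def tracked_double(x):
    """Hypothetical helper that logs each invocation before doubling."""
    calls.append(x)
    return 2 * x


data = [1, 2, 3, 4, 5]

# "Transformations": these lines only build a pipeline; nothing is computed yet.
doubled = map(tracked_double, data)
large = filter(lambda value: value > 4, doubled)
calls_before_action = len(calls)

# "Action": consuming the pipeline forces every step to run.
result_count = sum(1 for _ in large)
calls_after_action = len(calls)
```

Before the final count, tracked_double has not been called at all; afterwards it has run once per input element. Spark applies the same principle to a whole graph of transformations, which lets it plan the job before executing it.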

Transformations
• map
• flatMap
• filter
• distinct
• sample
• union
• intersection
• subtract
• cartesian
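To make a few of the transformations listed above concrete, here is a plain-Python analogy on small local lists. It mirrors what each operation produces, not how Spark computes it, and the list ordering shown is an artifact of using Python lists; the sample values are made up.

```python
from itertools import chain, product

lines = ["a b", "c"]

# map: exactly one output element per input element (here, a list of words).
mapped = [line.split() for line in lines]

# flatMap: each input may yield several outputs, which are flattened together.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

left = [1, 2, 2, 3]
right = [3, 4]

distinct = sorted(set(left))                   # duplicates removed
union = left + right                           # concatenation; duplicates are kept
intersection = sorted(set(left) & set(right))  # elements present in both
subtract = [x for x in left if x not in set(right)]  # in left but not in right
cartesian = list(product([1, 2], ["x", "y"]))  # every pairing of the two inputs
```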

Actions
• collect()
• count()
• take(num)
• takeOrdered(num)(ordering)
• reduce(function)
• aggregate(zeroValue)(seqOp, combOp)
• foreach(function)
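Of the actions listed above, aggregate is the least obvious: seqOp folds the elements of each partition into an accumulator that starts at zeroValue, and combOp merges the per-partition accumulators. The sketch below imitates that on nested Python lists. The aggregate function here is a hypothetical local stand-in written for illustration, not Spark's implementation, and the partition contents are made up.

```python
from functools import reduce


def aggregate(partitions, zero_value, seq_op, comb_op):
    """Local stand-in: fold each partition with seq_op, then merge with comb_op."""
    partials = [reduce(seq_op, partition, zero_value) for partition in partitions]
    return reduce(comb_op, partials, zero_value)


# Three made-up partitions of one dataset.
partitions = [[1, 2], [3, 4], [5]]

# Accumulator is a (running_sum, running_count) pair.
total, count = aggregate(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),   # seqOp: add one element
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # combOp: merge two accumulators
)
mean = total / count
```

Carrying the sum and the count together lets the mean be computed in a single pass, which is a common reason to reach for aggregate instead of reduce.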

RDDs
sc.textFile("hdfs://…", 4)
.map(to_series)
.filter(has_outlier)
.count()
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Thanks: Kostas Sakellis

RDDs
sc.textFile("hdfs://…", 4)
.map(to_series)
.filter(has_outlier)
.count()
[Diagram: HDFS feeds three successive RDDs, each with Partitions 1 to 4, ending in Count]
Thanks: Kostas Sakellis
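The partition picture can be imitated locally: split the input into four chunks, run the map and filter steps on each chunk independently, count per chunk, and add the counts. The sketch below does that in plain Python. The round-robin split, the helper bodies, and the sample lines are assumptions made for illustration; Spark derives its partitions from the input files rather than by round-robin.

```python
def split_into_partitions(records, num_partitions):
    """Hypothetical round-robin split of a list into num_partitions chunks."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def to_series(line):
    """Hypothetical helper: parse a comma-separated line into floats."""
    return [float(field) for field in line.split(",")]


def has_outlier(series, limit=100.0):
    """Hypothetical helper: True if any value exceeds a fixed limit."""
    return any(value > limit for value in series)


# Made-up input lines standing in for a file on HDFS.
lines = [
    "1,2,3", "4,500,6", "7,8,9", "10,11,1200",
    "13,14,15", "160,17,18", "19,20,21", "22,23,24",
]

partitions = split_into_partitions(lines, 4)

# Each partition is processed on its own; only the small counts are combined.
per_partition_counts = [
    sum(1 for series in map(to_series, partition) if has_outlier(series))
    for partition in partitions
]
total_count = sum(per_partition_counts)
```

Because map and filter need no data from other partitions, each chunk could run on a different machine, and only one integer per partition has to travel back for the final count.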

Complex, In-Memory Processing
[Diagram: execution graph over datasets labeled A: through F:, connected by map, filter, groupBy, join, and take]

Predicting Churn: Machine Learning with Spark MLlib

Modeling Churn

The Dataset
KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
OH, 107, 415, 371-7191, no, yes, 26, 161.6, 123, 27.47, 195.5, 103, 16.62, 254.4, 103, 11.45, 13.7, 3, 3.7, 1, False.
NJ, 137, 415, 358-1921, no, no, 0, 243.4, 114, 41.38, 121.2, 110, 10.3, 162.6, 104, 7.32, 12.2, 5, 3.29, 0, False.
OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False.
OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, False
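A minimal sketch of reading one of these records in plain Python follows. It treats the final field as the label and strips the trailing period that some rows carry. The slide does not name the columns, so reading the last field as the churn label is an assumption, and parse_record is a hypothetical helper.

```python
import csv

# Two records copied from the slide.
raw_lines = [
    "KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, "
    "16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.",
    "OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, "
    "12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, False",
]


def parse_record(line):
    """Hypothetical helper: split a record into (feature fields, boolean label)."""
    # skipinitialspace drops the blank that follows each comma.
    fields = next(csv.reader([line], skipinitialspace=True))
    label_text = fields[-1].strip().rstrip(".")  # some rows end with a period
    return fields[:-1], label_text == "True"


records = [parse_record(line) for line in raw_lines]
first_features, first_label = records[0]
```

Keeping the features as strings at this stage leaves the choice of numeric versus categorical treatment to a later feature-extraction step.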

Create DataFrame

Why DataFrames?

Modeling Lifecycle

Specify Feature Extraction

Evaluating Classifiers: ROC
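For readers unfamiliar with ROC curves, the sketch below computes the curve's points and the area under it from a handful of labels and scores in plain Python. It is not MLlib's evaluator, and the labels, scores, and function names are made up for illustration. It assumes both classes appear in the labels.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, sweeping the
    decision threshold from the highest score down to the lowest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("ROC needs at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up example: 1 marks a positive example, 0 a negative one.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)
```

In this example 8 of the 9 positive-negative pairs are ranked correctly, which matches the computed area of 8/9. An area of 0.5 corresponds to random ranking and 1.0 to a perfect one.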

Model Evaluation

Evaluating Classifiers

Find the entire example at github.com/jhlch/ds-for-telco

Thank You
Juliet Hougland
@j_houg
github.com/jhlch/ds-for-telco

Editor's Notes

#21: Each of these records is a person