© Cloudera, Inc. All rights reserved.
Hadoop for the Data Scientist:
Spark in Cloudera 5.5
Anand Iyer | Senior Product Manager | Cloudera
Sandy Ryza | Senior Data Scientist | Cloudera
Agenda
• Apache Spark Overview
• Machine Learning with Hadoop and Spark
• Machine Learning Use Cases
• What’s Next
Cloudera Enterprise
Making Hadoop Fast, Easy, and Secure
A new kind of data platform:
• One place for unlimited data
• Unified, multi-framework data access
Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise
[Diagram: platform layers]
• OPERATIONS
• DATA MANAGEMENT
• PROCESS, ANALYZE, SERVE: BATCH, STREAM, SQL, SEARCH, SDK
• UNIFIED SERVICES: RESOURCE MANAGEMENT, SECURITY
• STORE: FILESYSTEM, RELATIONAL, NoSQL
• INTEGRATE: STRUCTURED, UNSTRUCTURED
One Platform, Many Workloads
Batch, Interactive, and Real-Time.
Leading performance and usability in one platform.
• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users
[Diagram: platform layers and their components]
• OPERATIONS: Cloudera Manager, Cloudera Director
• DATA MANAGEMENT: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer
• PROCESS, ANALYZE, SERVE
  • BATCH: Spark, Hive, Pig, MapReduce
  • STREAM: Spark
  • SQL: Impala
  • SEARCH: Solr
  • SDK: Kite
• UNIFIED SERVICES
  • RESOURCE MANAGEMENT: YARN
  • SECURITY: Sentry, RecordService
• STORE
  • FILESYSTEM: HDFS
  • RELATIONAL: Kudu
  • NoSQL: HBase
• INTEGRATE
  • STRUCTURED: Sqoop
  • UNSTRUCTURED: Kafka, Flume
Apache Spark
Flexible, in-memory data processing for Hadoop
Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell
Flexible Extensible API
• APIs for different types of workloads:
  • Batch
  • Streaming
  • Machine Learning
  • Graph
Fast Batch & Stream Processing
• In-Memory processing and caching
The Spark Ecosystem & Hadoop
[Diagram: the platform stack shown on the previous slide, with the BATCH & STREAM layer provided by Spark]
• Spark components: Spark Streaming, Spark SQL, DataFrames, MLlib, …
Easy Machine Learning
on data distributed over a large cluster of machines
Process Flow for ML Development
[Diagram: ML process flow]
• Phases: Traditional Data Management; Model Development Phase; Production Modeling; Production Scoring*
• Supporting layers: Metadata Management; Development Tools (IDEs, source control, notebooks); Scheduling, Workflow, Publishing
• Steps: Data Ingest; Data Prep; Feature Engineering; Visualization; Modeling (incl. hyperparameter search & model validation); Feature Generation + Model Building; Model Quality, Usage, Perf. Metrics; Experiments; Batch Scorer; Online Model Update Server + Scoring
*There may be further steps after scoring, such as aggregations, visualizations, reporting, etc.
What is MLlib?
Library of machine learning and data mining algorithms and utilities
• Implemented in Spark
• Invoked within Java, Scala, or Python Spark applications
MLlib applications are Spark applications
• Requires Spark knowledge to effectively run
• Recommended deployment on YARN
• MLlib apps require the same set of parameters Spark applications require (number of executors, memory per executor, etc.)
What Does MLlib Contain?
• Machine learning models for classification and regression
• Recommender System
• Clustering Algorithms
• Feature Engineering Algorithms and Utilities
• Data Mining Algorithms & Basic Statistical Analysis Utilities
Classification & Regression
Traditional Models
• Linear and Logistic Regression
• Naïve Bayes
• Decision Trees
• Support Vector Machines
Next-Gen Models
• Gradient Boosted Trees
• Random Forests
Clustering Algorithms
• K-Means
• Power Iteration Clustering (PIC)
• Gaussian Mixture Model
• Streaming K-Means
Textual data clustering, i.e., identifying “topics” from a corpus of documents:
• Latent Dirichlet Allocation (LDA)
Collaborative Filtering
For Building Recommender Systems
• Predicting the interests of a user by collecting a partial list of preferences from many users
• Predicting missing items of a user-item association matrix
• Algorithm used: Alternating Least Squares
• Admittedly limited choice of algorithms
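To make the idea of filling in a user-item matrix concrete, here is a minimal plain-Python sketch of Alternating Least Squares with a single latent factor. It is not the MLlib implementation: the function name `als_rank1`, the regularization value, and the toy ratings are all assumptions made for illustration.

```python
# Illustrative rank-1 Alternating Least Squares in plain Python.
# NOT the MLlib implementation; names, parameters, and data are hypothetical.

def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=50):
    """Approximate observed ratings r[u, i] by x[u] * y[i] (one latent factor)."""
    x = [1.0] * n_users  # user factors
    y = [1.0] * n_items  # item factors
    for _ in range(iterations):
        # Hold item factors fixed; solve a regularized least-squares problem per user.
        for u in range(n_users):
            num = sum(r * y[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(y[i] ** 2 for (uu, i) in ratings if uu == u)
            x[u] = num / den
        # Hold user factors fixed; solve per item.
        for i in range(n_items):
            num = sum(r * x[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(x[u] ** 2 for (u, ii) in ratings if ii == i)
            y[i] = num / den
    return x, y

# Toy user-item matrix with one missing entry: user 2 has not rated item 1.
observed = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0, (2, 0): 3.0}
user_f, item_f = als_rank1(observed, n_users=3, n_items=2)
predicted = user_f[2] * item_f[1]  # estimate for the missing (user 2, item 1) entry
print(round(predicted, 2))
```

The alternating structure is the point: each half-step is an ordinary least-squares solve, which is what makes the method easy to distribute across users and items.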
Feature Engineering & Modeling Utilities
• Feature Scaling & Normalization
• Statistical Correlation Functions (Pearson & Spearman’s)
• Tests of Statistical Significance
• Chi-Squared independence test for feature selection
• Evaluation metrics: Precision, Recall, AUROC, F1-Measure, etc.
Dimensionality Reduction:
• Principal Component Analysis (PCA)
• Singular Value Decomposition (SVD)
Textual Feature Generation:
• Word2Vec
• Term Frequency – Inverse Document Frequency (TF-IDF)
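As an illustration of textual feature generation, the following plain-Python sketch computes TF-IDF weights for a few tokenized documents. It does not call MLlib; the helper name `tf_idf`, the smoothed formula log((N + 1) / (df + 1)), and the sample documents are assumptions chosen for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumes a smoothed inverse document frequency: log((N + 1) / (df + 1)).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_freq = Counter(doc)  # raw term counts within this document
        weights.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in term_freq.items()
        })
    return weights

# Hypothetical toy corpus.
docs = [
    "spark runs on hadoop".split(),
    "spark streaming scores events".split(),
    "hadoop stores data".split(),
]
vectors = tf_idf(docs)
print(vectors[0])
```

Terms that appear in fewer documents ("runs") receive a higher weight than terms shared across documents ("spark"), which is the property a downstream classifier relies on.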
Data Mining: Frequent Pattern Mining
Data Mining Urban Legend:
Frequent Pattern Mining algorithm on supermarket purchase data revealed “Men who buy diapers have a very high likelihood of buying beer!”
Algorithms in MLlib:
• Frequent Pattern-Growth
• Association Rule Mining
• PrefixSpan
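The diapers-and-beer legend can be expressed as an association rule with a support and a confidence. The sketch below uses brute-force pair counting in plain Python rather than FP-Growth, and it is not the MLlib API; the function names and the five toy baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs found in at least min_support of baskets."""
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}

def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing the antecedent that also contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

# Hypothetical purchase data.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]
pairs = frequent_pairs(baskets, min_support=0.4)
rule_conf = confidence(baskets, "diapers", "beer")  # rule: diapers => beer
print(pairs, rule_conf)
```

Brute-force counting grows quadratically with basket size, which is one reason pattern-growth algorithms are preferred on large purchase logs.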
What about “Deep Learning”?
Deep Learning is an umbrella term for large, complex Multi-Layer Neural Networks
• MLlib contains a robust Multilayer Neural Network implementation
Pipeline API: Hooking the Pieces Together
• Inspired by scikit-learn pipelines
• ML involves running multiple sequential steps
E.g., Text Classification Pipeline
[Diagram stages: Bag of Words, Tokenize, TF-IDF, LDA, Scale & Normalize Features, Train Classifier]
Sequence is repeated during Training and Scoring
Hyper-Parameter Tuning: repeat the sequence with different parameter values
Overview of Pipeline API
• Create Pipeline as a sequence of Stages:
  • Transformers: Transform or augment features
  • Estimators: Fit a model
• Re-use Pipeline
  • Basic save and load functionality available
  • Invoke Pipeline with different set of parameters passed as ParamMap
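The distinction between Transformers and Estimators, and the reuse of one sequence for both training and scoring, can be sketched in a few lines of plain Python. This is a conceptual toy, not the Spark Pipeline API: the classes `Tokenizer`, `VocabularyCounter`, and `Pipeline` below are hypothetical stand-ins written for this example.

```python
class Tokenizer:
    """Transformer: turns each text string into a list of lowercase tokens."""
    def transform(self, rows):
        return [row.lower().split() for row in rows]

class VocabularyCounter:
    """Estimator: learns a vocabulary in fit(), then transforms rows into count vectors."""
    def fit(self, rows):
        self.vocab = sorted({tok for row in rows for tok in row})
        return self

    def transform(self, rows):
        return [[row.count(term) for term in self.vocab] for row in rows]

class Pipeline:
    """Runs stages in order; the same sequence is reused for training and scoring."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            if hasattr(stage, "fit"):  # estimators learn from the data first
                stage.fit(rows)
            rows = stage.transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

pipeline = Pipeline([Tokenizer(), VocabularyCounter()])
pipeline.fit(["Spark on YARN", "Spark Streaming"])      # training pass
features = pipeline.transform(["spark spark yarn"])     # scoring pass, same stages
print(features)
```

Packaging the stages as one object is what makes hyper-parameter tuning practical: the whole sequence can be re-fit with different parameter values without rewriting the glue code.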
Process Flow for ML Development
Score streaming events in Spark Streaming.
Machine Learning Use Case
Predicting Influencers at a Large Telco
• Customer loyalty difficult and expensive
• Aggressive competition
Social Churn
• Churn is not an isolated event!
• When influential subscribers leave, they take their friends with them
Casting This as a Data Science Problem
• Can we quantify: Which lost users were the most influential?
• Can we predict: Which current subscribers have as much influence?
The Challenge: Lots of Customers, Lots of Data
• Over 100 million customers
• Over 1 billion connections
Calculating Influencer Scores
• Connection: pair of users with communication both ways
• Influencer score: number of connected users that churn after user X churns
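The two definitions above translate directly into code. The following plain-Python sketch is an assumption about how they could be computed on a small scale; the function names, the call records, and the churn days are hypothetical, and a production job over a billion connections would run this logic as a distributed Spark job instead.

```python
from collections import defaultdict

def build_connections(calls):
    """Keep only pairs of users with communication in both directions."""
    directed = set(calls)
    neighbors = defaultdict(set)
    for caller, callee in directed:
        if (callee, caller) in directed:
            neighbors[caller].add(callee)
            neighbors[callee].add(caller)
    return neighbors

def influencer_score(user, neighbors, churn_day):
    """Count connected users who churned after `user` churned."""
    if user not in churn_day:
        return 0  # the user has not churned, so no historical score exists
    return sum(
        1 for other in neighbors.get(user, ())
        if other in churn_day and churn_day[other] > churn_day[user]
    )

# Hypothetical (caller, callee) records and churn days.
calls = [("ann", "bob"), ("bob", "ann"), ("ann", "cat"), ("cat", "ann"),
         ("ann", "dan"), ("dan", "eve"), ("eve", "dan")]
churn_day = {"ann": 10, "bob": 15, "cat": 5, "dan": 20}

neighbors = build_connections(calls)
score = influencer_score("ann", neighbors, churn_day)
print(score)
```

Here "dan" does not count toward the score for "ann" because the communication is one-way, and "cat" does not count because that churn happened earlier.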
Predicting Influencer Scores
MLlib!
• Regression model
  • Linear regression
  • Random forests
• Features
  • # of connections, # calls to connections
  • Internal vs. External
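As a rough picture of the regression step, the sketch below fits a one-feature linear model (number of connections against historical influencer score) by ordinary least squares in plain Python. It is not the MLlib regression API, and the function name and training values are invented toy data, not results from the telco project.

```python
def fit_simple_linear_regression(xs, ys):
    """Closed-form ordinary least squares for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: feature = # of connections, label = historical score.
num_connections = [2, 4, 6, 8]
historical_scores = [1.0, 2.0, 3.0, 4.0]

slope, intercept = fit_simple_linear_regression(num_connections, historical_scores)
predicted_score = slope * 10 + intercept  # prediction for a subscriber with 10 connections
print(predicted_score)
```

A real model would use several features at once, which is where a library implementation of linear regression or random forests becomes necessary.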
Breaking Down the Work
• Building User and Connection Tables
• Computing Historical Influencer Scores
• Feature Generation
• Model Fitting
• Model Evaluation
What’s Next
Roadmap Update
MANAGEMENT
• Initial Spark-on-YARN integration for shared resource management
• New metrics for easier diagnosis
• Improved Spark-on-YARN for better multi-tenancy, performance, ease of use
• Automated configurations to optimize over time
• Visibility into resource utilization
• Improved PySpark integration for Python access
SECURITY
• Kerberos-based authorization
• Fine-grained access control
• Auditing and lineage (Governance)
• Integration with Intel’s Advanced Encryption libraries
• Full PCI compliance
SCALE
• Improved integration with HDFS to enable scheduling
• Reduced memory pressure on larger jobs
• Dynamic resource utilization and prioritization
• Stress test at scale with mixed multi-tenant workloads
STREAMING
• Spark Streaming resiliency for zero data loss
• Data ingest integration for Kafka and Flume
• Improved state management for better performance
• Higher-level language extensions
Download Cloudera 5.5
cloudera.com/downloads
Data Science & Spark Training Courses
university.cloudera.com
Thank You
Spark Resources
• Learn Spark
• O’Reilly Advanced Analytics with Spark eBook (written by Clouderans)
• Cloudera Developer Blog: blog.cloudera.com/spark
• Spark Page: cloudera.com/spark
• Get Trained
• Cloudera Spark Training: university.cloudera.com
• Try it Out
• Cloudera Live Spark Tutorial: cloudera.com/live
• Download Cloudera 5.5: cloudera.com/downloads

More Related Content

What's hot (20)

PPTX
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Cloudera, Inc.
 
PPTX
Building Efficient Pipelines in Apache Spark
Jeremy Beard
 
PDF
One Hadoop, Multiple Clouds - NYC Big Data Meetup
Andrei Savu
 
PPTX
Cloudbreak - Technical Deep Dive
DataWorks Summit/Hadoop Summit
 
PDF
A Closer Look at Apache Kudu
Andriy Zabavskyy
 
PDF
Hadoop on Cloud: Why and How?
Cloudera, Inc.
 
PPTX
Successes, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
DataWorks Summit/Hadoop Summit
 
PPTX
Machine Learning Loves Hadoop
Cloudera, Inc.
 
PDF
Low latency high throughput streaming using Apache Apex and Apache Kudu
DataWorks Summit
 
PDF
Running Hadoop as Service in AltiScale Platform
InMobi Technology
 
PPTX
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
DataWorks Summit
 
PDF
Apache Flink & Kudu: a connector to develop Kappa architectures
Nacho García Fernández
 
PDF
Impala use case @ Zoosk
Cloudera, Inc.
 
PPTX
Enabling the Active Data Warehouse with Apache Kudu
Grant Henke
 
PPTX
Provisioning Big Data Platform using Cloudbreak & Ambari
DataWorks Summit/Hadoop Summit
 
PPTX
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
PPTX
Solr consistency and recovery internals
Cloudera, Inc.
 
PPTX
Intel and Cloudera: Accelerating Enterprise Big Data Success
Cloudera, Inc.
 
PDF
Hybrid is the New Normal
DataWorks Summit
 
PPTX
Hadoop AWS infrastructure cost evaluation
mattlieber
 
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Cloudera, Inc.
 
Building Efficient Pipelines in Apache Spark
Jeremy Beard
 
One Hadoop, Multiple Clouds - NYC Big Data Meetup
Andrei Savu
 
Cloudbreak - Technical Deep Dive
DataWorks Summit/Hadoop Summit
 
A Closer Look at Apache Kudu
Andriy Zabavskyy
 
Hadoop on Cloud: Why and How?
Cloudera, Inc.
 
Successes, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
DataWorks Summit/Hadoop Summit
 
Machine Learning Loves Hadoop
Cloudera, Inc.
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
DataWorks Summit
 
Running Hadoop as Service in AltiScale Platform
InMobi Technology
 
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
DataWorks Summit
 
Apache Flink & Kudu: a connector to develop Kappa architectures
Nacho García Fernández
 
Impala use case @ Zoosk
Cloudera, Inc.
 
Enabling the Active Data Warehouse with Apache Kudu
Grant Henke
 
Provisioning Big Data Platform using Cloudbreak & Ambari
DataWorks Summit/Hadoop Summit
 
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Solr consistency and recovery internals
Cloudera, Inc.
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Cloudera, Inc.
 
Hybrid is the New Normal
DataWorks Summit
 
Hadoop AWS infrastructure cost evaluation
mattlieber
 

Similar to Hadoop for the Data Scientist: Spark in Cloudera 5.5 (20)

PDF
Machine Learning and Hadoop: Present and Future
Data Science London
 
PPTX
Machine Learning and Hadoop: Present and future
Cloudera, Inc.
 
PPTX
Hadoop and Machine Learning
joshwills
 
PPTX
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Cloudera, Inc.
 
PDF
Data Science and Machine Learning for the Enterprise
Cloudera, Inc.
 
PPTX
Spark and Deep Learning Frameworks at Scale 7.19.18
Cloudera, Inc.
 
PDF
Machine Learning Model Deployment: Strategy to Implementation
DataWorks Summit
 
PPTX
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Uri Laserson
 
PPTX
Machine Learning Models: From Research to Production 6.13.18
Cloudera, Inc.
 
PPTX
Deep Learning with Cloudera
Cloudera, Inc.
 
PDF
Big Data LDN 2018: A LOOK INSIDE APPLIED MACHINE LEARNING
Matt Stubbs
 
PPTX
The Edge to AI Deep Dive Barcelona Meetup March 2019
Timothy Spann
 
PDF
Train, predict, serve: How to go into production your machine learning model
Cloudera Japan
 
PPTX
Apache Spark MLlib
Zahra Eskandari
 
PPTX
A practical guidance of the enterprise machine learning
Jesus Rodriguez
 
PDF
仕事ではじめる機械学習
Aki Ariga
 
PPTX
The Vision & Challenge of Applied Machine Learning
Cloudera, Inc.
 
PPTX
Yarn spark next_gen_hadoop_8_jan_2014
Vijay Srinivas Agneeswaran, Ph.D
 
PPTX
Practical Distributed Machine Learning Pipelines on Hadoop
DataWorks Summit
 
PPTX
Next-Gen ML/AI Platform
Josh Yeh
 
Machine Learning and Hadoop: Present and Future
Data Science London
 
Machine Learning and Hadoop: Present and future
Cloudera, Inc.
 
Hadoop and Machine Learning
joshwills
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Cloudera, Inc.
 
Data Science and Machine Learning for the Enterprise
Cloudera, Inc.
 
Spark and Deep Learning Frameworks at Scale 7.19.18
Cloudera, Inc.
 
Machine Learning Model Deployment: Strategy to Implementation
DataWorks Summit
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Uri Laserson
 
Machine Learning Models: From Research to Production 6.13.18
Cloudera, Inc.
 
Deep Learning with Cloudera
Cloudera, Inc.
 
Big Data LDN 2018: A LOOK INSIDE APPLIED MACHINE LEARNING
Matt Stubbs
 
The Edge to AI Deep Dive Barcelona Meetup March 2019
Timothy Spann
 
Train, predict, serve: How to go into production your machine learning model
Cloudera Japan
 
Apache Spark MLlib
Zahra Eskandari
 
A practical guidance of the enterprise machine learning
Jesus Rodriguez
 
仕事ではじめる機械学習
Aki Ariga
 
The Vision & Challenge of Applied Machine Learning
Cloudera, Inc.
 
Yarn spark next_gen_hadoop_8_jan_2014
Vijay Srinivas Agneeswaran, Ph.D
 
Practical Distributed Machine Learning Pipelines on Hadoop
DataWorks Summit
 
Next-Gen ML/AI Platform
Josh Yeh
 
Ad

More from Cloudera, Inc. (20)

PPTX
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
PPTX
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
PPTX
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
PPTX
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
PPTX
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
PPTX
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
PPTX
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
PPTX
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
PPTX
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
PPTX
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
PPTX
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
PPTX
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
PPTX
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
PPTX
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
PPTX
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
PPTX
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
PPTX
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
PPTX
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
PPTX
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
PPTX
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 
Ad

Recently uploaded (20)

PDF
"AI Transformation: Directions and Challenges", Pavlo Shaternik
Fwdays
 
PDF
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
PDF
IoT-Powered Industrial Transformation – Smart Manufacturing to Connected Heal...
Rejig Digital
 
PDF
Staying Human in a Machine- Accelerated World
Catalin Jora
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PDF
Using FME to Develop Self-Service CAD Applications for a Major UK Police Force
Safe Software
 
PDF
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
PDF
Reverse Engineering of Security Products: Developing an Advanced Microsoft De...
nwbxhhcyjv
 
PPTX
Designing Production-Ready AI Agents
Kunal Rai
 
PPTX
Webinar: Introduction to LF Energy EVerest
DanBrown980551
 
PDF
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
PDF
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 
PDF
Advancing WebDriver BiDi support in WebKit
Igalia
 
PDF
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
PDF
"Beyond English: Navigating the Challenges of Building a Ukrainian-language R...
Fwdays
 
PDF
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
PDF
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PPTX
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
PDF
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
"AI Transformation: Directions and Challenges", Pavlo Shaternik
Fwdays
 
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
IoT-Powered Industrial Transformation – Smart Manufacturing to Connected Heal...
Rejig Digital
 
Staying Human in a Machine- Accelerated World
Catalin Jora
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
Using FME to Develop Self-Service CAD Applications for a Major UK Police Force
Safe Software
 
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
Reverse Engineering of Security Products: Developing an Advanced Microsoft De...
nwbxhhcyjv
 
Designing Production-Ready AI Agents
Kunal Rai
 
Webinar: Introduction to LF Energy EVerest
DanBrown980551
 
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 
Advancing WebDriver BiDi support in WebKit
Igalia
 
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
"Beyond English: Navigating the Challenges of Building a Ukrainian-language R...
Fwdays
 
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
Mastering Financial Management in Direct Selling
Epixel MLM Software
 

Hadoop for the Data Scientist: Spark in Cloudera 5.5

  • 1. 1© Cloudera, Inc. All rights reserved. Hadoop for the Data Scientist: Spark in Cloudera 5.5 Anand Iyer | Senior Product Manager | Cloudera Sandy Ryza | Senior Data Scientist | Cloudera
  • 2. 2© Cloudera, Inc. All rights reserved. Agenda • Apache Spark Overview • Machine Learning with Hadoop and Spark • Machine Learning Use Cases • What’s Next
  • 3. 3© Cloudera, Inc. All rights reserved. Cloudera Enterprise Making Hadoop Fast, Easy, and Secure A new kind of data platform: • One place for unlimited data • Unified, multi-framework data access Cloudera makes it: • Fast for business • Easy to manage • Secure without compromise OPERATIONS DATA MANAGEMENT STRUCTURED UNSTRUCTURED PROCESS, ANALYZE, SERVE UNIFIED SERVICES RESOURCE MANAGEMENT SECURITY FILESYSTEM RELATIONAL NoSQL STORE INTEGRATE BATCH STREAM SQL SEARCH SDK
  • 4. 4© Cloudera, Inc. All rights reserved. One Platform, Many Workloads Batch, Interactive, and Real-Time. Leading performance and usability in one platform. • End-to-end analytic workflows • Access more data • Work with data in new ways • Enable new users OPERATIONS Cloudera Manager Cloudera Director DATA MANAGEMENT Cloudera Navigator Encrypt and KeyTrustee Optimizer STRUCTURED Sqoop UNSTRUCTURED Kafka, Flume PROCESS, ANALYZE, SERVE UNIFIED SERVICES RESOURCE MANAGEMENT YARN SECURITY Sentry, RecordService FILESYSTEM HDFS RELATIONAL Kudu NoSQL HBase STORE INTEGRATE BATCH Spark, Hive, Pig MapReduce STREAM Spark SQL Impala SEARCH Solr SDK Kite
  • 5. 5© Cloudera, Inc. All rights reserved. Apache Spark Flexible, in-memory data processing for Hadoop Easy Development Flexible Extensible API Fast Batch & Stream Processing • Rich APIs for Scala, Java, and Python • Interactive shell • APIs for different types of workloads: • Batch • Streaming • Machine Learning • Graph • In-Memory processing and caching
  • 6. 6© Cloudera, Inc. All rights reserved. The Spark Ecosystem & Hadoop STRUCTURED Sqoop UNSTRUCTURED Kafka, Flume UNIFIED SERVICES RESOURCE MANAGEMENT YARN SECURITY Sentry, RecordService FILESYSTEM HDFS RELATIONAL Kudu NoSQL HBase STORE INTEGRATE SQL Impala SEARCH Solr SDK Kite BATCH & STREAM Spark Spark Streaming Spark SQL DataFrames MLlib …
  • 7. 7© Cloudera, Inc. All rights reserved. Easy Machine Learning on data distributed over a large cluster of machines
  • 8. 8© Cloudera, Inc. All rights reserved. Process Flow for ML Development Traditional Data Management Model Development Phase Production Modeling Production Scoring* Metadata Management Development Tools (IDEs, source control, notebooks) Scheduling, Workflow, Publishing Data Ingest Data Prep Feature Engineering Visualization Modeling (incl. hyperparameter search & model validation) Feature Generation + Model Building Model Quality, Usage, Perf. Metrics Experiments Batch Scorer Online Model Update Server + Scoring *There may be further steps after scoring such as aggregations, visualizations, reporting, etc
  • 9. 9© Cloudera, Inc. All rights reserved. What is Mllib? Library of machine learning and data mining algorithms and utilities • Implemented in Spark • Invoked within Java, Scala, or Python Spark applications MLlib applications are Spark applications • Requires Spark knowledge to effectively run • Recommended deployment on YARN • MLlib apps require the same set of parameters Spark applications require (number of executors, memory per executor, etc)
  • 10. 10© Cloudera, Inc. All rights reserved. What Does MLlib Contain? • Machine learning models for classification and regression • Recommender System • Clustering Algorithms • Feature Engineering Algorithms and Utilities • Data Mining Algorithms & Basic Statistical Analysis Utilities
  • 11. 11© Cloudera, Inc. All rights reserved. Classification & Regression Traditional Models • Linear and Logistic Regression • Naïve Bayes • Decision Trees • Support Vector Machines
  • 12. 12© Cloudera, Inc. All rights reserved. Classification & Regression Traditional Models • Linear and Logistic Regression • Naïve Bayes • Decision Trees • Support Vector Machines Next-Gen Models • Gradient Boosted Trees • Random Forests
  • 13. 13© Cloudera, Inc. All rights reserved. Clustering Algorithms • K-Means • Power Iteration Clustering (PIC) • Gaussian Mixture Model • Streaming K-Means
  • 14. 14© Cloudera, Inc. All rights reserved. Clustering Algorithms • K-Means • Power Iteration Clustering (PIC) • Gaussian Mixture Model • Streaming K-Means Textual data clustering i.e. identifying “topics” from a corpus of documents: • Latent Dirichlet Allocation (LDA)
  • 15. 15© Cloudera, Inc. All rights reserved. • Predicting the interests of a user, by collecting partial list of preferences from many users • Predicting missing items of a user-item association matrix • Algorithm used: Alternating Least Squares • Admittedly limited choice of algorithms ? ? ? ? ? ? ? ? ? ? Collaborative Filtering For Building Recommender Systems
  • 18. Feature Engineering & Modeling Utilities
      • Feature Scaling & Normalization
      • Statistical Correlation Functions (Pearson’s & Spearman’s)
      • Tests of Statistical Significance
      • Chi-Squared independence test for feature selection
      • Evaluation metrics: Precision, Recall, AUROC, F1-Measure, etc.
      Dimensionality Reduction:
      • Principal Component Analysis (PCA)
      • Singular Value Decomposition (SVD)
      Textual Feature Generation:
      • Word2Vec
      • Term Frequency-Inverse Document Frequency (TF-IDF)
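As an illustration of the last item, the sketch below computes TF-IDF weights in plain Python. It uses one common smoothed IDF, log((N + 1) / (df + 1)); treat that choice as an assumption and check the MLlib documentation for the exact formula it applies. The three tiny documents are invented.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by its count in the document times a smoothed inverse document frequency."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear at least once?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]


docs = [
    ["spark", "runs", "on", "yarn"],
    ["spark", "mllib", "trains", "models"],
    ["yarn", "schedules", "spark", "jobs"],
]
weights = tf_idf(docs)
print(weights[1])
```

A term that occurs in every document ("spark" here) gets weight zero, while a term unique to one document ("mllib") gets the largest weight.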
  • 20. Data Mining: Frequent Pattern Mining
      Data mining urban legend: a frequent pattern mining algorithm run on supermarket purchase data revealed that “Men who buy diapers have a very high likelihood of buying beer!”
      Algorithms in MLlib:
      • Frequent Pattern-Growth
      • Association Rule Mining
      • PrefixSpan
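The diapers-and-beer story is a statement about an association rule, which is usually scored by support and confidence. The sketch below computes both by brute-force counting over a handful of invented baskets; it illustrates the measures only and is not FP-Growth or any MLlib routine.

```python
from typing import List, Set


def support(transactions: List[Set[str]], itemset: Set[str]) -> float:
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions: List[Set[str]],
               antecedent: Set[str], consequent: Set[str]) -> float:
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)


# Invented baskets for illustration.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"beer", "chips"},
]
pair_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(pair_support, rule_confidence)
```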
  • 21. What about “Deep Learning”?
      Deep Learning is an umbrella term for large, complex multi-layer neural networks
      • MLlib contains a robust Multilayer Neural Network implementation
  • 24. Pipeline API: Hooking the Pieces Together
      • Inspired by scikit-learn pipelines
      • ML involves running multiple sequential steps
      E.g., text classification pipeline (stage labels from the slide): Bag of Words, Tokenize, TF-IDF, LDA, Scale & Normalize Features, Train Classifier
      • The sequence is repeated during training and scoring
      • Hyper-parameter tuning → repeat the sequence with different parameter values
  • 25. Overview of Pipeline API
      • Create a Pipeline as a sequence of Stages:
        • Transformers: transform or augment features
        • Estimators: fit a model
      • Re-use Pipeline
      • Basic save and load functionality available
      • Invoke Pipeline with a different set of parameters passed as a ParamMap
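The Transformer/Estimator split can be sketched in a few lines of plain Python. Every class below is a hypothetical stand-in written for this illustration; none of it is Spark's Pipeline API. Transformers expose only `transform`, an estimator exposes `fit` and returns a fitted model that is itself a transformer, and the loop at the end re-runs the same sequence with different parameter values.

```python
from collections import Counter
from typing import Dict, List, Set


class Lowercase:
    """Transformer: stateless row-to-row mapping."""
    def transform(self, rows: List[str]) -> List[str]:
        return [r.lower() for r in rows]


class Tokenize:
    """Transformer: split each string into tokens."""
    def transform(self, rows: List[str]) -> List[List[str]]:
        return [r.split() for r in rows]


class WordVoteModel:
    """Fitted model: predicts the label whose vocabulary overlaps the tokens most."""
    def __init__(self, vocab: Dict[str, Set[str]]):
        self.vocab = vocab

    def transform(self, rows: List[List[str]]) -> List[str]:
        return [max(sorted(self.vocab), key=lambda lab: len(set(tokens) & self.vocab[lab]))
                for tokens in rows]


class WordVoteEstimator:
    """Estimator: fit() learns a per-label vocabulary and returns a model."""
    def __init__(self, min_count: int = 1):
        self.min_count = min_count

    def fit(self, rows: List[List[str]], labels: List[str]) -> WordVoteModel:
        counts: Dict[str, Counter] = {}
        for tokens, label in zip(rows, labels):
            counts.setdefault(label, Counter()).update(tokens)
        return WordVoteModel({label: {w for w, c in ctr.items() if c >= self.min_count}
                              for label, ctr in counts.items()})


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order when scoring."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Runs the stages in order, fitting estimators as it reaches them."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows, labels) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows, labels)  # estimator becomes a fitted model
            fitted.append(stage)
            rows = stage.transform(rows)
        return PipelineModel(fitted)


# Invented training data.
train_text = ["Spark runs fast", "Spark on YARN", "cats purr", "cats nap often"]
train_labels = ["tech", "tech", "pets", "pets"]
new_text = ["spark is fast", "my cats nap"]

# Re-run the same sequence with different parameter values (a stand-in for a ParamMap).
results = {}
for min_count in (1, 2):
    pipeline = Pipeline([Lowercase(), Tokenize(), WordVoteEstimator(min_count=min_count)])
    results[min_count] = pipeline.fit(train_text, train_labels).transform(new_text)
print(results)
```

Keeping the fitted stages together in one object is what lets the identical sequence run at scoring time.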
  • 26. Process Flow for ML Development
      [Diagram labels: Traditional Data Management; Model Development Phase; Production Modeling; Production Scoring*; Metadata Management; Development Tools (IDEs, source control, notebooks); Scheduling, Workflow, Publishing; Data Ingest; Data Prep; Feature Engineering; Visualization; Modeling (incl. hyperparameter search & model validation); Feature Generation + Model Building; Model Quality, Usage, Perf. Metrics; Experiments; Batch Scorer; Online Model Update Server + Scoring]
      *There may be further steps after scoring, such as aggregations, visualizations, reporting, etc.
  • 31. Process Flow for ML Development
      [Diagram labels: Traditional Data Management; Model Development Phase; Production Modeling; Production Scoring*; Metadata Management; Development Tools (IDEs, source control, notebooks); Scheduling, Workflow, Publishing; Data Ingest; Data Prep; Feature Engineering; Visualization; Modeling (incl. hyperparameter search & model validation); Feature Generation + Model Building; Model Quality, Usage, Perf. Metrics; Experiments; Batch Scorer; Online Model Update Server + Scoring]
      *There may be further steps after scoring, such as aggregations, visualizations, reporting, etc.
      Score streaming events in Spark Streaming.
  • 32. Machine Learning Use Case
  • 33. Predicting Influencers at a Large Telco
      • Customer loyalty is difficult and expensive
      • Aggressive competition
  • 34. Social Churn
      • Churn is not an isolated event!
      • When influential subscribers leave, they take their friends with them
  • 35. Casting This as a Data Science Problem
      • Can we quantify: which lost users were the most influential?
      • Can we predict: which current subscribers have as much influence?
  • 36. The Challenge: Lots of Customers, Lots of Data
      • Over 100 million customers
      • Over 1 billion connections
  • 38. Calculating Influencer Scores
      • Connection: a pair of users with communication both ways
      • Influencer score: the number of connected users that churn after user X churns
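The two definitions above translate directly into code. The sketch below is a small in-memory illustration with invented call records and churn days; the slide does not give a time window, so counting any later churn is an assumption, and a real job over a billion connections would run as distributed Spark code rather than Python sets.

```python
from typing import Dict, Set, Tuple


def mutual_connections(calls: Set[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Keep only pairs that communicated in both directions."""
    connections: Dict[str, Set[str]] = {}
    for caller, callee in calls:
        if caller != callee and (callee, caller) in calls:
            connections.setdefault(caller, set()).add(callee)
    return connections


def influencer_score(user: str, connections: Dict[str, Set[str]],
                     churn_day: Dict[str, int]) -> int:
    """Number of the user's connections that churned after the user did."""
    if user not in churn_day:
        return 0  # the user has not churned, so nobody churned "after" them
    return sum(1 for friend in connections.get(user, set())
               if friend in churn_day and churn_day[friend] > churn_day[user])


# Invented data: (caller, callee) pairs and the day on which each lost user churned.
calls = {("ann", "bob"), ("bob", "ann"), ("ann", "cat"), ("cat", "ann"),
         ("ann", "dan"), ("dan", "ann"), ("eve", "ann")}
churn_day = {"ann": 10, "bob": 12, "cat": 30, "dan": 5}

connections = mutual_connections(calls)
ann_score = influencer_score("ann", connections, churn_day)
print(ann_score)
```

Here "eve" only called "ann" one way, so she is not a connection, and "dan" churned before "ann", so he does not count toward her score.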
  • 39. Predicting Influencer Scores: MLlib!
      • Regression model
        • Linear regression
        • Random forests
      • Features
        • # of connections, # of calls to connections
        • Internal vs. external
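As a minimal picture of the regression step, the sketch below fits a one-feature ordinary least-squares line from number of connections to historical influencer score, then predicts a score for a current subscriber. The numbers are invented and the closed-form fit is a plain-Python stand-in, not MLlib's linear regression or the model used in the talk.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


# Invented training data: one feature (# of connections) per lost user.
num_connections = [2.0, 4.0, 6.0, 8.0]
historical_scores = [1.0, 2.0, 3.0, 4.0]

slope, intercept = fit_line(num_connections, historical_scores)
predicted_score = slope * 10.0 + intercept  # a current subscriber with 10 connections
print(slope, intercept, predicted_score)
```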
  • 40. Breaking Down the Work
      • Building User and Connection Tables
      • Computing Historical Influencer Scores
      • Feature Generation
      • Model Fitting
      • Model Evaluation
  • 41. What’s Next
  • 42. Roadmap Update
      MANAGEMENT
      • Initial Spark-on-YARN integration for shared resource management
      • New metrics for easier diagnosis
      • Improved Spark-on-YARN for better multi-tenancy, performance, ease of use
      • Automated configurations to optimize over time
      • Visibility into resource utilization
      • Improved PySpark integration for Python access
      SECURITY
      • Kerberos-based authorization
      • Fine-grained access control
      • Auditing and lineage (Governance)
      • Integration with Intel’s Advanced Encryption libraries
      • Full PCI compliance
      SCALE
      • Improved integration with HDFS to enable scheduling
      • Reduced memory pressure on larger jobs
      • Dynamic resource utilization and prioritization
      • Stress test at scale with mixed multi-tenant workloads
      STREAMING
      • Spark Streaming resiliency for zero data loss
      • Data ingest integration for Kafka and Flume
      • Improved state management for better performance
      • Higher-level language extensions
      [Seven ✔ marks appear on the slide; which items they apply to is not recoverable from this transcript.]
  • 43. Download Cloudera 5.5: cloudera.com/downloads
  • 44. Data Science & Spark Training Courses: university.cloudera.com
  • 45. Thank You
  • 46. Spark Resources
      • Learn Spark
        • O’Reilly Advanced Analytics with Spark eBook (written by Clouderans)
        • Cloudera Developer Blog: blog.cloudera.com/spark
        • Spark Page: cloudera.com/spark
      • Get Trained
        • Cloudera Spark Training: university.cloudera.com
      • Try it Out
        • Cloudera Live Spark Tutorial: cloudera.com/live
        • Download Cloudera 5.5: cloudera.com/downloads