Hadoop for the Data Scientist: Spark in Cloudera 5.5

© Cloudera, Inc. All rights reserved.
Hadoop for the Data Scientist:
Spark in Cloudera 5.5
Anand Iyer | Senior Product Manager | Cloudera
Sandy Ryza | Senior Data Scientist | Cloudera

Agenda
• Apache Spark Overview
• Machine Learning with Hadoop and Spark
• Machine Learning Use Cases
• What’s Next

Cloudera Enterprise
Making Hadoop Fast, Easy, and Secure
A new kind of data platform:
• One place for unlimited data
• Unified, multi-framework data access
Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise
[Platform diagram: Operations and Data Management alongside four layers: Integrate (structured, unstructured); Store (filesystem, relational, NoSQL); Unified Services (resource management, security); Process, Analyze, Serve (batch, stream, SQL, search, SDK).]

One Platform, Many Workloads
Batch, Interactive, and Real-Time.
Leading performance and usability in one platform.
• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users
Platform components:
• OPERATIONS: Cloudera Manager, Cloudera Director
• DATA MANAGEMENT: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer
• INTEGRATE
  • STRUCTURED: Sqoop
  • UNSTRUCTURED: Kafka, Flume
• STORE
  • FILESYSTEM: HDFS
  • RELATIONAL: Kudu
  • NoSQL: HBase
• UNIFIED SERVICES
  • RESOURCE MANAGEMENT: YARN
  • SECURITY: Sentry, RecordService
• PROCESS, ANALYZE, SERVE
  • BATCH: Spark, Hive, Pig, MapReduce
  • STREAM: Spark
  • SQL: Impala
  • SEARCH: Solr
  • SDK: Kite

Apache Spark
Flexible, in-memory data processing for Hadoop
Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell
Flexible Extensible API
• APIs for different types of workloads:
  • Batch
  • Streaming
  • Machine Learning
  • Graph
Fast Batch & Stream Processing
• In-memory processing and caching

The Spark Ecosystem & Hadoop
• INTEGRATE
  • STRUCTURED: Sqoop
  • UNSTRUCTURED: Kafka, Flume
• STORE
  • FILESYSTEM: HDFS
  • RELATIONAL: Kudu
  • NoSQL: HBase
• UNIFIED SERVICES
  • RESOURCE MANAGEMENT: YARN
  • SECURITY: Sentry, RecordService
• PROCESS, ANALYZE, SERVE
  • BATCH & STREAM: Spark (Spark Streaming, Spark SQL, DataFrames, MLlib, …)
  • SQL: Impala
  • SEARCH: Solr
  • SDK: Kite

Easy Machine Learning
on data distributed over a large cluster of machines

Process Flow for ML Development
Phases: Traditional Data Management → Model Development Phase → Production Modeling → Production Scoring*
Steps, in diagram order:
• Data Ingest
• Data Prep
• Feature Engineering
• Visualization
• Modeling (incl. hyperparameter search & model validation)
• Feature Generation + Model Building
• Model Quality, Usage, Perf. Metrics
• Experiments
• Batch Scorer
• Online Model Update Server + Scoring
Supporting layers:
• Metadata Management
• Development Tools (IDEs, source control, notebooks)
• Scheduling, Workflow, Publishing
*There may be further steps after scoring, such as aggregations, visualizations, reporting, etc.

What is MLlib?
Library of machine learning and data mining algorithms and utilities
• Implemented in Spark
• Invoked within Java, Scala, or Python Spark applications
MLlib applications are Spark applications
• Running them effectively requires Spark knowledge
• Recommended deployment on YARN
• MLlib apps require the same set of parameters Spark applications require (number of executors, memory per executor, etc.)
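
As a rough illustration of those parameters, the sketch below assembles a `spark-submit` command line for an MLlib application. The script name and the resource values are made up for the example, and the accepted `--master` values for YARN depend on the Spark version, so treat the whole command as an assumption to adapt rather than a recommended configuration.

```python
def spark_submit_command(app_path, master, num_executors, executor_memory, executor_cores):
    """Build a spark-submit argument list; an MLlib app is submitted like any Spark app."""
    return [
        "spark-submit",
        "--master", master,
        "--num-executors", str(num_executors),   # number of executors
        "--executor-memory", executor_memory,    # memory per executor
        "--executor-cores", str(executor_cores), # cores per executor
        app_path,
    ]

# Hypothetical application and sizing, for illustration only.
command = spark_submit_command("train_model.py", "yarn", 10, "4g", 2)
print(" ".join(command))
```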

What Does MLlib Contain?
• Machine learning models for classification and regression
• Recommender System
• Clustering Algorithms
• Feature Engineering Algorithms and Utilities
• Data Mining Algorithms & Basic Statistical Analysis Utilities

Classification & Regression
Traditional Models
• Linear and Logistic Regression
• Naïve Bayes
• Decision Trees
• Support Vector Machines
Next-Gen Models
• Gradient Boosted Trees
• Random Forests
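
To make one of the traditional models concrete, here is a minimal plain-Python sketch of logistic regression trained by batch gradient descent on a single feature. It is not MLlib code and does not use the MLlib API; the data, learning rate, and epoch count are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=0.5, epochs=500):
    """Fit weight w and bias b for P(y=1 | x) = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # gradient of the log loss w.r.t. the logit
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

# Toy data: negative x values are class 0, positive x values are class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w, b = train_logistic(xs, ys)
prob_positive = sigmoid(w * 1.5 + b)
prob_negative = sigmoid(w * -1.5 + b)
```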

Clustering Algorithms
• K-Means
• Power Iteration Clustering (PIC)
• Gaussian Mixture Model
• Streaming K-Means
Textual data clustering i.e. identifying “topics” from a corpus of documents:
• Latent Dirichlet Allocation (LDA)
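
For readers new to K-Means, the following is a small single-machine sketch of the basic iteration (assign each point to its nearest center, then move each center to the mean of its points). It is plain Python rather than MLlib, it seeds the centers with the first k points as a simplification, and the sample points are made up.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Basic K-Means on a list of equal-length numeric tuples."""
    centers = list(points[:k])  # simplistic initialization, assumed for the sketch
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members; keep it if the cluster is empty.
        centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = sorted(kmeans(points, k=2))
```

On this toy data the two centers settle near (0.33, 0.33) and (10.33, 10.33), one per group of points.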

Collaborative Filtering
For Building Recommender Systems
• Predicting the interests of a user by collecting a partial list of preferences from many users
• Predicting missing items of a user-item association matrix
• Algorithm used: Alternating Least Squares
• Admittedly limited choice of algorithms
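
The idea behind Alternating Least Squares can be sketched with one latent factor per user and per item: hold the item factors fixed and solve for each user factor in closed form, then swap roles and repeat. The plain-Python example below is a deliberately tiny rank-1 version, not MLlib's implementation; the ratings are made up, and one cell is left out so it can be predicted.

```python
def als_rank1(ratings, n_users, n_items, reg=0.0, iterations=100):
    """ratings: {(user, item): value} for observed cells only."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Fix item factors; each user factor is a one-dimensional least-squares solve.
        for i in range(n_users):
            seen = [(j, r) for (a, j), r in ratings.items() if a == i]
            u[i] = sum(r * v[j] for j, r in seen) / (sum(v[j] ** 2 for j, _ in seen) + reg)
        # Fix user factors; solve for each item factor the same way.
        for j in range(n_items):
            seen = [(i, r) for (i, b), r in ratings.items() if b == j]
            v[j] = sum(r * u[i] for i, r in seen) / (sum(u[i] ** 2 for i, _ in seen) + reg)
    return u, v

# Observed ratings follow user_weight * item_weight with weights (1, 2, 3) and (1, 2, 4).
# The cell (user 2, item 2) is withheld; its true value would be 3 * 4 = 12.
ratings = {
    (0, 0): 1.0, (0, 1): 2.0, (0, 2): 4.0,
    (1, 0): 2.0, (1, 1): 4.0, (1, 2): 8.0,
    (2, 0): 3.0, (2, 1): 6.0,
}
u, v = als_rank1(ratings, n_users=3, n_items=3)
predicted = u[2] * v[2]
```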

Feature Engineering & Modeling Utilities
• Feature Scaling & Normalization
• Statistical Correlation Functions
(Pearson & Spearman’s)
• Tests of Statistical Significance
• Chi-Squared independence test for feature selection
• Evaluation metrics: Precision, Recall, AUROC, F1-Measure, etc.
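
Two of the utilities above are easy to state in a few lines. The sketch below computes precision, recall, and F1 for binary labels, plus the Pearson correlation of two numeric series. It is plain Python for illustration, not the MLlib API, and the sample labels are invented; it assumes at least one positive label and one positive prediction so no denominator is zero.

```python
import math

def precision_recall_f1(labels, predictions):
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

labels      = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(labels, predictions)
correlation = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```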

Dimensionality Reduction:
• Principal Component Analysis (PCA)
• Singular Value Decomposition (SVD)
Textual Feature Generation:
• Word2Vec
• Term Frequency – Inverse Document
Frequency (TF-IDF)
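
As a small illustration of TF-IDF, the plain-Python sketch below weights each term by its count in a document times a smoothed inverse document frequency, log((N + 1) / (df + 1)). That smoothing is an assumption chosen for the example, and the three toy documents are made up; this is not MLlib code.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)  # raw term frequency
        weights.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in counts.items()
        })
    return weights

documents = [
    ["spark", "hadoop", "spark"],
    ["hadoop", "yarn"],
    ["hadoop", "hdfs"],
]
weights = tf_idf(documents)
```

Here "hadoop" appears in every document, so its weight is zero, while "spark" is both frequent in the first document and rare in the corpus, so it scores highest there.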

Data Mining: Frequent Pattern Mining
Data Mining Urban Legend:
Frequent Pattern Mining algorithm on supermarket purchase data revealed “Men
who buy diapers have a very high likelihood of buying beer!”
Algorithms in MLlib:
• Frequent Pattern-Growth
• Association Rule Mining
• PrefixSpan
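
The quantities behind a rule such as "diapers → beer" are support (how often an itemset occurs) and confidence (how often the consequent appears given the antecedent). The sketch below computes both by brute-force counting over a handful of invented baskets. It is not FP-Growth, which reaches the same counts far more efficiently on large data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Made-up baskets for illustration.
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]
rule_support = support(transactions, {"diapers", "beer"})
rule_confidence = confidence(transactions, {"diapers"}, {"beer"})
```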

What about “Deep Learning”?
Deep Learning is an umbrella term for large, complex multi-layer neural networks
• MLlib contains a multilayer neural network implementation (a multilayer perceptron classifier)

Pipeline API
Hooking the Pieces Together
• Inspired by scikit-learn pipelines
• ML involves running multiple sequential steps
E.g., Text Classification Pipeline:
Tokenize → Bag of Words → TF-IDF → LDA → Scale & Normalize Features → Train Classifier

Pipeline API: Hooking the pieces together
• Sequence is repeated during Training and Scoring
• Hyper-Parameter Tuning: repeat the sequence with different parameter values

Overview of Pipeline API
• Create Pipeline as a sequence of Stages:
• Transformers: Transform or augment features
• Estimators: Fit a model
• Re-use Pipeline
• Basic save and load functionality available
• Invoke Pipeline with different set of parameters passed as ParamMap
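
The transformer/estimator split can be sketched in plain Python. In the toy below, a transformer has `transform`, an estimator has `fit` and returns a fitted transformer, and a pipeline fits its stages in order so the same sequence can be replayed at scoring time. All class names and the keyword-based "classifier" are hypothetical stand-ins for illustration, not the Spark Pipeline API.

```python
class Tokenizer:
    """Transformer: adds a 'tokens' field to each row."""
    def transform(self, rows):
        return [dict(row, tokens=row["text"].lower().split()) for row in rows]

class KeywordModel:
    """Fitted transformer: adds a 'prediction' field."""
    def __init__(self, keywords):
        self.keywords = keywords
    def transform(self, rows):
        return [dict(row, prediction=int(any(t in self.keywords for t in row["tokens"])))
                for row in rows]

class KeywordClassifier:
    """Estimator: learns tokens seen only in positive rows."""
    def fit(self, rows):
        positive, negative = set(), set()
        for row in rows:
            (positive if row["label"] == 1 else negative).update(row["tokens"])
        return KeywordModel(positive - negative)

class PipelineModel:
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: fit it, keep the resulting model
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed transformed rows to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)

training = [{"text": "Spark is fast", "label": 1}, {"text": "Pigs are slow", "label": 0}]
model = Pipeline([Tokenizer(), KeywordClassifier()]).fit(training)
scored = model.transform([{"text": "spark streaming"}, {"text": "slow pigs"}])
```

Scoring reuses the fitted stages in the same order as training, which is the point of treating the whole sequence as one object.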

Process flow for ML development (diagram repeated), annotated: score streaming events in Spark Streaming.

Machine Learning Use Case

Predicting Influencers at a Large Telco
• Customer loyalty difficult and expensive
• Aggressive competition

Social Churn
• Churn is not an isolated event!
• When influential subscribers leave, they
take their friends with them

Casting This as a Data Science Problem
• Can we quantify: Which lost users were the most influential?
• Can we predict: Which current subscribers have as much influence?

The Challenge: Lots of Customers, Lots of Data
• Over 100 million customers
• Over 1 billion connections

Calculating Influencer Scores
• Connection: pair of users with communication both ways
• Influencer score: number of connected users that churn after user X churns
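
Those two definitions translate directly into code. The single-machine sketch below derives connections from directed call records (a connection requires communication both ways) and counts, for each churned user, how many connected users churned later. The call records, user names, and churn days are invented, and at the scale described above this would run as a distributed job rather than in plain Python.

```python
def connections(calls):
    """calls: iterable of (caller, callee). A connection needs calls in both directions."""
    directed = set(calls)
    return {frozenset((a, b)) for a, b in directed if a != b and (b, a) in directed}

def influencer_scores(calls, churn_day):
    """churn_day: {user: day the user churned}; users absent from it have not churned."""
    scores = {user: 0 for user in churn_day}
    for pair in connections(calls):
        a, b = tuple(pair)
        if a in churn_day and b in churn_day:
            if churn_day[b] > churn_day[a]:
                scores[a] += 1   # b churned after a
            elif churn_day[a] > churn_day[b]:
                scores[b] += 1   # a churned after b
    return scores

calls = [
    ("ann", "bob"), ("bob", "ann"),
    ("ann", "cat"), ("cat", "ann"),
    ("bob", "cat"), ("cat", "bob"),
    ("ann", "dan"),                  # one-way only, so not a connection
]
churn_day = {"ann": 10, "bob": 15, "cat": 20}
scores = influencer_scores(calls, churn_day)
```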

Predicting Influencer Scores
MLlib!
• Regression model
• Linear regression
• Random forests
• Features
• # of connections, # calls to connections
• Internal vs. External
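
As a minimal picture of the regression step, the sketch below fits a one-feature least-squares line that predicts an influencer score from the number of connections. The numbers are invented, the single feature is a simplification of the feature list above, and this is ordinary closed-form least squares in plain Python rather than the MLlib model used in the project.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: number of connections vs. historical influencer score.
num_connections = [2.0, 4.0, 6.0, 8.0]
historical_scores = [1.0, 2.0, 3.0, 4.0]
slope, intercept = fit_line(num_connections, historical_scores)
predicted_score = slope * 10.0 + intercept
```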

Breaking Down the Work
Building User and Connection
Tables
Computing Historical
Influencer Scores
Feature Generation
Model Fitting
Model Evaluation

What’s Next

Roadmap Update
MANAGEMENT
• Initial Spark-on-YARN integration for shared resource management
• New metrics for easier diagnosis
• Improved Spark-on-YARN for better multi-tenancy, performance, ease of use
• Automated configurations to optimize over time
• Visibility into resource utilization
• Improved PySpark integration for Python access
SECURITY
• Kerberos-based authorization
• Fine-grained access control
• Auditing and lineage (Governance)
• Integration with Intel’s Advanced Encryption libraries
• Full PCI compliance
SCALE
• Improved integration with HDFS to enable scheduling
• Reduced memory pressure on larger jobs
• Dynamic resource utilization and prioritization
• Stress test at scale with mixed multi-tenant workloads
STREAMING
• Spark Streaming resiliency for zero data loss
• Data ingest integration for Kafka and Flume
• Improved state management for better performance
• Higher-level language extensions

Download Cloudera 5.5
cloudera.com/downloads

Data Science & Spark Training Courses
university.cloudera.com

Thank You

Spark Resources
• Learn Spark
• O’Reilly Advanced Analytics with Spark eBook (written by Clouderans)
• Cloudera Developer Blog: blog.cloudera.com/spark
• Spark Page: cloudera.com/spark
• Get Trained
• Cloudera Spark Training: university.cloudera.com
• Try it Out
• Cloudera Live Spark Tutorial: cloudera.com/live
• Download Cloudera 5.5: cloudera.com/downloads
