Spark machine learning predicting customer churn

© 2017 MapR Technologies
Spark Machine Learning
Carol McDonald
@caroljmcdonald

Agenda
•  Introduction to Machine Learning Techniques
–  Classification
–  Clustering
•  Use Decision Tree to Predict Customer Churn

What is Machine Learning?
Data Build ModelTrain Algorithm
Finds patterns
New Data Use Model
(prediction function)
Predictions
Contains patterns Recognizes patterns

Examples of ML Algorithms
Supervised
•  Classification
–  Naïve Bayes
–  SVM
–  Random Decision
Forests
•  Regression
–  Linear
–  Logistic
Machine Learning
Unsupervised
•  Clustering
–  K-means
•  Dimensionality reduction
–  Principal Component
Analysis
–  SVD

Supervised Algorithms use labeled data
Data
features
Build Model
New Data
features
Predict
Use Model

Supervised Machine Learning: Classification & Regression
Classification
Identifies
category for item

Classification: Definition
Form of ML that:
•  Identifies which category an item belongs to
•  Uses supervised learning algorithms
–  Data is labeled
Sentiment

If it Walks/Swims/Quacks Like a Duck …… Then It Must Be a Duck
swims
walks
quacks
Features:
walks
quacks
swims
Features:

Car Insurance Fraud Example
•  What are we trying to predict?
–  This is the Label or Target outcome:
–  The amount of Fraud
•  What are the “if questions” or properties we can use to predict?
–  These are the Features:
–  The claim Amount

Label:
Amount of Fraud
Y
X
Feature: claimed amount
Data point: fraud amount,
claimed amount
AmntFraud = intercept + coeff * claimedAmnt
Car Insurance Fraud Regression Example

Credit Card Fraud Example
–  This is the Label:
–  The probability of Fraud
–  transaction amount, type of merchant, distance from and time since last transaction

Label
Probabilty
of Fraud 1
X
Features: trans amount, type of store,
Time Location difference last trans.
Fraud
0
Not Fraud
.5
Credit Card Fraud Logistic Regression Example

Supervised Learning: Classification & Regression
•  Classification:
–  identifies which category (eg fraud or not fraud)
•  Linear Regression:
–  predicts a value (eg amount of fraud)
•  Logistic Regression:
–  predicts a probability (eg probability of fraud)

Examples of ML Algorithms
Machine Learning
Unsupervised
•  Clustering
–  K-means
•  Dimensionality reduction
–  Principal Component
Analysis
–  SVD
Supervised
•  Classification
–  Naïve Bayes
–  SVM
–  Random Decision
Forests
•  Regression
–  Linear
–  Logistic

Unsupervised Algorithms use Unlabeled data
Customer GroupsBuild ModelTrain Algorithm
Finds patterns
New Customer
Purchase Data
Use Model
(prediction function) Predict Group
Contains patterns Recognizes patterns
Customer purchase
data

Unsupervised Machine Learning: Clustering
Clustering
group news articles into different categories

Clustering: Definition
•  Unsupervised learning task
•  Groups objects into clusters of high similarity

Clustering: Definition
•  Unsupervised learning task
•  Groups objects into clusters of high similarity
–  Search results grouping
–  Grouping of customers
–  Anomaly detection
–  Text categorization

Clustering: Example
•  Group similar objects

Clustering: Example
•  Use MLlib K-means algorithm
1.  Initialize coordinates to center
of clusters (centroid)
x
x
x
x
x

Clustering: Example
2.  Assign all points to nearest
centroid
x
x
x
x
x

Clustering: Example
centroid
3.  Update centroids to center of
points
x
x
x
x
x

Clustering: Example
centroid
3.  Update centroids to center of
points
4.  Repeat until conditions met
x
x
x
x
x

Predict Churn

ML Discovery Model Building
Model
Training/
Building
Training
Set
Test Model
Predictions
Test
Set
Evaluate Results
Historical
Data
Deployed
Model
Predictions
Data
Discovery,
Model
Creation
Production
Feature Extraction
Feature
Extraction
New
Data
Customer Data
Call Center
Records
Web
Clickstream
Server Logs
●  Churn Modelling

Telecom Customer Churn Data
•  State: string
•  Account length: integer
•  Area code: integer
•  International plan: string
•  Voice mail plan: string
•  Number vmail messages: integer
•  Total day minutes: double
•  Total day calls: integer
•  Total day charge: double
•  Total eve minutes: double
•  Total eve calls: integer
•  Total eve charge: double
•  Total night minutes: double
•  Total night calls: integer
•  Total night charge: double
•  Total intl minutes: double
•  Total intl calls: integer
•  Total intl charge: double
•  Customer service calls: integer

Customer Churn Example
–  This is the Label:
–  Did the customer churn? True or False
–  Number of Customer service calls, Total day minutes …

Decision Trees
•  Decision Tree for Classification
prediction
•  Represents tree with nodes
•  IF THEN ELSE questions using
features at each node
•  Answers branch to child nodes
If the number of customer
service calls < 3
If the total day
minutes > 200
Churned: T
If the total day
minutes < 200
Churned: F
T
Churned: T Churned: F
F
FF TT

Example Decision Tree

Spark ML workflow

Spark ML workflow with a Pipeline
Pipeline
Estimator
Extract
Features
Load
Data
Train
Model
Estimator
Data
frame
Transformer
Cross
Validate
Pipeline Model
TransformerTest
Data
frame
Evaluate
fit
Train
Load
Data
Evaluator
Predict
With model
Extract
Features Evaluator
transform

Zeppelin Notebook with Spark
Data
Engineer
Data
Scientist

Load the data into a Dataframe: Define the Schema
case class Account(state: String, len: Integer, acode: String,
intlplan: String, vplan: String, numvmail: Double,
tdmins: Double, tdcalls: Double, tdcharge: Double,
temins: Double, tecalls: Double, techarge: Double,
tnmins: Double, tncalls: Double, tncharge: Double,
timins: Double, ticalls: Double, ticharge: Double,
numcs: Double, churn: String)
Input CSV File sample:
KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False

Data
Frame
Load data
Load the data into a Dataset
val train: Dataset[Account] = spark.read.option("inferSchema", "false")
.schema(schema).csv("/user/user01/data/churn-bigml-80.csv").as[Account]

Dataset merged with Dataframe
in Spark 2.0, DataFrame APIs merged with Datasets APIs

Extract the Features
Image reference O’Reilly Learning Spark
+
+
̶+
̶ ̶
Feature Vectors and Label Model
Featurization Training
Model
Evaluation
Best Model
Label:
Churned=T
Features:
Number customer
Service calls
Number day minutes
Training Data
Label:
Churned=F
Features:
Number customer
Service calls
Number day minutes
+
+
̶+
̶ ̶
+
+
̶+
̶ ̶
+
+
̶+
̶ ̶
+
+
̶+
̶ ̶

Data
Frame
Add column
Use StringIndexer to map Strings to Numbers
val ipindexer = new StringIndexer()
.setInputCol("intlplan")
.setOutputCol("iplanIndex”)
Data
Frame

Data
Frame
Add column
Use StringIndexer to map churn True False to Numbers
Val labelindexer = new StringIndexer()
.setInputCol(”churn")
.setOutputCol(”label”)
Data
Frame

Data
Frame
Load data Add column DataFrame +
Features
Use VectorAssembler to put features in vector column
val featureCols = Array(”temins", "iplanIndex", "tdmins", "tdcalls”…)
val assembler = new VectorAssembler()
.setInputCols(featureCols)
.setOutputCol("features")

Data
Frame
Load data transform
Estimator
val dTree = new DecisionTreeClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
Create DecisionTree Estimator, Set Label and Features
DataFrame +
Features

val pipeline = new Pipeline()
.setStages(Array(ipindexer, labelindexer, assembler, dTree))
Put Feature Transformers and Estimator in Pipeline
Pipeline
ipIndexer
feature
transform
assembler
Dtree
estimatorlabelindexer
feature
transform
assemble
Features
Produce
model

Spark ML workflow with a Pipeline
Pipeline
Transfomers
Load
Data
estimator
Train model
Data
frame
Extract
Features
evaluator
Pipeline
Model
Test
Data
frame
evaluator
Use fitted model
Train
Load
Data
fit
transform

K-fold Cross-Validation Process
Data
Model
Training/
Building
Training
Set
Test Model
Predictions
Test
Set
data is randomly split into K partition training and test dataset pairs

Data
Model
Training
Training
Set
Test Model
Predictions
Test
Set
Train algorithm with training dataset

ML Cross-Validation Process
Data Model
Training
Set
Test Model
Predictions
Test
Set
Evaluate the model with the Test Set

Data
Model
Training/
Building
Training
Set
Test Model
Predictions
Test
Set
Train/Test loop K times
Repeat K times
select the Model produced by the best-performing set of parameters

Cross Validation transformation estimation pipeline
Pipeline
Cross Validator
evaluatorParameter Grid
fit
Set up a CrossValidator with:
•  Parameter grid
•  Estimator (pipeline)
•  Evaluator
Perform grid search based model
selection

Parameter Tuning with CrossValidator with a Paramgrid
CrossValidator
•  Given:
–  Estimator
–  Parameter grid
–  Evaluator
•  Find best parameters and
model
val paramGrid = new ParamGridBuilder()
.addGrid(dTree.maxDepth,
Array(2,3,4,5,6,7)).build()
val evaluator= new BinaryClassificationEvaluator()
.setLabelCol("label")
.setRawPredictionCol("prediction")
val crossval = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(3)

val cvModel = crossval.fit(ntrain)
Cross Validator fit a model to the data
Pipeline
Cross Validator
evaluatorParameter Grid
fit
Pipeline
Model
fit a model to the data
with provided parameter grid

Evaluate the fitted model
Pipeline
Transfomers
Load
Data
estimator
Train model
Data
frame
Extract
Features
evaluator
Pipeline
Model
Test
Data
frame
evaluator
transform
Train
Load
Data
Predict
With model
Extract
Features
fit

fitted
model
Evaluate the Predictions from DecisionTree Estimator
Evaluator
transform
Test
features
val predictions = cvModel.transform(test)
val accuracy = evaluator.evaluate(predictions)
evaluate
prediction accuracy

Area under the ROC curve
Accuracy is measured by the area under
the ROC curve.
The area measures correct classifications
•  An area of 1 represents a perfect test
•  an area of .5 represents a worthless
test

To Learn More:
•  Read about and download example code
•  https://mapr.com/blog/churn-prediction-sparkml/

To Learn More:
•  End to End Application for Monitoring Uber Data using Spark ML
•  https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-
learning-streaming-and-kafka-api-part-1/

To Learn More:
•  MapR Free ODT http://learn.mapr.com/

For Q&A :
•  https://community.mapr.com/
•  https://community.mapr.com/community/answers/pages/qa

Open Source Engines & Tools Commercial Engines & Applications
Enterprise-Grade Platform Services
DataProcessing
Web-Scale Storage
MapR-XD MapR-DB
Search
and
Others
Real Time Unified Security Multi-tenancy Disaster
Recovery
Global NamespaceHigh Availability
MapR Streams
Cloud
and
Managed
Services
Search and
Others
UnifiedManagementandMonitoring
Search
and
Others
Event StreamingDatabase
Custom
Apps
MapR Converged Data Platform
HDFS API POSIX, NFS Kakfa APIHBase API OJAI API

Q&A
ENGAGE WITH US

Spark machine learning predicting customer churn

Recommended

More Related Content

What's hot (20)

Similar to Spark machine learning predicting customer churn (20)

More from Carol McDonald (11)

Recently uploaded (20)

Spark machine learning predicting customer churn