Customer Churn Analytics
using Microsoft R Open
Malaysia R User Group Meet Up
16th February 2017
Poo Kuan Hoong
https://github.com/kuanhoong/churn-r
Disclaimer: The views and opinions expressed in these slides are those of
the author and do not necessarily reflect the official policy or position
of Nielsen Malaysia. The analyses performed within these slides
are examples only. They should not be utilized in real-world analytic
products as they are based only on very limited and dated open source
information. Assumptions made within the analysis are not reflective of
the position of Nielsen Malaysia.
Agenda
• Introduction
• Customer Churn Analytics
• Machine Learning Framework
• Microsoft R Open and Visual Studio
• Model Performance Comparison
• Demo
Malaysia R User Group (MyRUG)
• The Malaysia R User Group (MyRUG) was formed in June 2016.
• It is a diverse group that comes together to discuss anything related to
the R programming language.
• The main aim of MyRUG is to give members, ranging from
beginners to R professionals and experts, a place to share and learn about R
programming, gain competency, and exchange new ideas or
knowledge.
https://www.meetup.com/MY-RUserGroup/
https://www.facebook.com/rusergroupmalaysia/
Introduction
• Customer churn can be defined simply as the rate at which a
company is losing its customers.
• Imagine the business as a bucket with holes: the water flowing in from
the top is the growth rate, while the holes at the bottom are churn.
• While a certain level of churn is unavoidable, it is important to keep
it under control, as a high churn rate can potentially kill your business.
Churn analytics
• Predicting who will switch mobile operator
Customer churn - why do customers change operators?
• The top 3 reasons why subscribers change providers:
  • They want a new handset
  • They believe they pay too much for calls/data
  • Providers do not offer additional loyalty benefits
Machine Learning Framework
• Initialization Step: Data Collection → Data Preprocessing → Attributes selection (Attribute 1, Attribute 2, Attribute 3)
• Learn Step: Algorithm → Training Model → Score Model
• Apply Step: Apply Data / Test Data → Predicting Output
Correlation Matrix
• A correlation matrix is used to investigate the dependence between
multiple variables at the same time.
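As a minimal sketch of what a correlation matrix looks like in R, the snippet below builds one with base R's cor(). The data frame and its column names (tenure, monthly_charges, total_charges) are simulated and purely illustrative; they are assumptions, not the data set used in the talk.

```r
# Hypothetical toy data; column names are illustrative, not from the slides.
set.seed(42)
n <- 200
tenure <- runif(n, 1, 72)                        # months with the operator
monthly_charges <- rnorm(n, mean = 65, sd = 20)
total_charges <- tenure * monthly_charges + rnorm(n, sd = 50)

customers <- data.frame(tenure, monthly_charges, total_charges)

# Pearson correlation between every pair of numeric columns
corr <- cor(customers)
print(round(corr, 2))
```

Each cell holds the correlation between the row variable and the column variable, so the diagonal is 1 and the matrix is symmetric.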
Microsoft R Open
• Microsoft R Open, formerly known as Revolution R Open (RRO), is
the enhanced distribution of R from Microsoft Corporation.
• It is a complete open source platform for statistical analysis and
data science.
Key enhancements
• Multi-threaded math libraries that bring multi-threaded
computations to R.
• A high-performance default CRAN repository that provides a
consistent and static set of packages to all Microsoft R Open users.
• The checkpoint package, which makes it easy to share R code and
replicate results using specific R package versions.
R Tools for Visual Studio
• Turn Visual Studio into a powerful R development environment
• Download R Tools for Visual Studio
R Tools for Visual Studio
• Visual Studio IDE
• IntelliSense
• Enhanced multi-threaded math libs, cluster scale computing, and a
high performance CRAN repo with checkpoint capabilities.
• Learn more about R Tools from here:
https://microsoft.github.io/RTVS-docs/
Data Preprocessing
• Assign missing values as zero
• Detect outliers
• Remove unwanted variables
• Recode variables
• Convert categorical variables
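The five preprocessing steps above can be sketched in base R as follows. The data frame, its column names, and the 1.5 * IQR outlier rule are assumptions chosen for illustration; the slides do not say which outlier test or which columns were used.

```r
# Hypothetical raw data; names and values are illustrative only.
raw <- data.frame(
  customer_id     = c("A1", "A2", "A3", "A4", "A5", "A6"),
  monthly_charges = c(20, 25, 22, 24, 500, 23),
  total_charges   = c(100, NA, 220, 480, 900, NA),
  senior_citizen  = c(0, 1, 0, 0, 1, 0),
  contract        = c("Monthly", "Yearly", "Monthly", "Yearly", "Monthly", "Monthly"),
  stringsAsFactors = FALSE
)

clean <- raw

# 1. Assign missing values as zero
clean$total_charges[is.na(clean$total_charges)] <- 0

# 2. Detect outliers with the 1.5 * IQR rule (flag only, do not drop)
q <- unname(quantile(clean$monthly_charges, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
clean$charge_outlier <- clean$monthly_charges < q[1] - fence |
  clean$monthly_charges > q[2] + fence

# 3. Remove unwanted variables (an ID is not a useful predictor)
clean$customer_id <- NULL

# 4. Recode variables: 0/1 flag to readable labels
clean$senior_citizen <- ifelse(clean$senior_citizen == 1, "Yes", "No")

# 5. Convert categorical variables to factors
clean$senior_citizen <- factor(clean$senior_citizen)
clean$contract <- factor(clean$contract)

str(clean)
```

Replacing missing values with zero is the rule stated on the slide; whether zero is a sensible stand-in depends on the variable, so it is worth checking per column.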
Feature selection
• The process of selecting a subset of relevant features (variables,
predictors) for use in model construction.
• Feature selection techniques are used for three reasons:
  • simplification of models to make them easier to interpret by researchers/users,
  • shorter training times,
  • enhanced generalization by reducing overfitting.
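One simple way to select features is a correlation filter: when two predictors are almost perfectly correlated, keep only one of them. The sketch below is an assumption about how such a filter could look in base R; the slides do not state which selection technique was applied, and the variables x1, x2, x3 and the 0.9 cutoff are hypothetical.

```r
# Simulated predictors: x2 is a near-duplicate of x1, x3 is independent
set.seed(1)
n <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
x3 <- rnorm(n)
features <- data.frame(x1, x2, x3)

# Absolute pairwise correlations, keeping each pair once (lower triangle)
corr <- abs(cor(features))
corr[upper.tri(corr, diag = TRUE)] <- 0

# Drop the later column of any pair whose |correlation| exceeds the cutoff
cutoff <- 0.9
to_drop <- names(which(apply(corr, 1, max) > cutoff))
selected <- features[, setdiff(names(features), to_drop), drop = FALSE]

print(names(selected))
```

Fewer redundant columns means a smaller model to interpret and train, which matches the first two reasons listed above.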
Correlation Matrix
Models Performance Comparison
• Logistic Regression
  • A regression model where the dependent variable (DV) is categorical.
• Support Vector Machine (SVM)
  • A supervised learning model with associated learning algorithms that
analyze data for classification and regression analysis.
• RandomForest
  • An ensemble learning method for classification, regression and other tasks
that operates by constructing a multitude of decision trees at training time and
outputting the class that is the mode of the classes (classification) or the mean
prediction (regression) of the individual trees.
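To make the "mode of the classes" idea concrete, the sketch below aggregates the votes of five hypothetical trees for three hypothetical customers. It only illustrates the final voting step of a random forest, not how the trees are grown; the vote matrix is invented for the example.

```r
# Hypothetical votes: rows are trees, columns are customers
votes <- matrix(
  c("Yes", "No",  "No",
    "Yes", "No",  "Yes",
    "No",  "No",  "Yes",
    "Yes", "Yes", "No",
    "Yes", "No",  "No"),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("tree", 1:5), paste0("customer", 1:3))
)

# Classification: the forest outputs the most common class per customer
majority <- apply(votes, 2, function(v) names(which.max(table(v))))
print(majority)
```

For regression the same step would take the mean of the trees' numeric predictions instead of the most frequent class.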
Training Model and Algorithm
• Split the data set 80:20 into training and test sets
using library(caret)
• Apply the algorithms: GLM, SVM and RF
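A minimal sketch of the split-and-train step is shown below. To stay free of extra packages it uses base R's sample() for the 80:20 split and glm() for the logistic regression; the slide names caret for the split, and caret's createDataPartition() is one common choice, but which function the author used is an assumption. The data set and its columns are simulated, and the SVM and random forest fits are left out because they need additional packages.

```r
# Simulated churn data; columns and coefficients are illustrative only
set.seed(2017)
n <- 1000
churn_data <- data.frame(
  tenure = runif(n, 1, 72),
  monthly_charges = rnorm(n, 65, 20)
)
# Shorter tenure and higher charges raise the simulated churn odds
log_odds <- 0.5 - 0.06 * churn_data$tenure +
  0.02 * (churn_data$monthly_charges - 65)
churn_data$churn <- rbinom(n, 1, plogis(log_odds))

# 80:20 split by random row indices
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- churn_data[train_idx, ]
test  <- churn_data[-train_idx, ]

# Logistic regression, i.e. a GLM with a binomial family
fit <- glm(churn ~ tenure + monthly_charges, data = train, family = binomial)

# Predicted churn probability on the held-out 20%
test$prob <- predict(fit, newdata = test, type = "response")
print(summary(fit)$coefficients)
```

Keeping the test rows out of the fit is what makes the later scoring step an honest estimate of performance on unseen customers.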
Score Model
• Confusion Matrix: a table that is often used to describe the
performance of a classification model (or "classifier") on a set of test
data for which the true values are known.
  • true positives (TP): We predicted yes (they will churn), and they do churn.
  • true negatives (TN): We predicted no, and they don't churn.
  • false positives (FP): We predicted yes, but they don't actually churn. (Also
known as a "Type I error.")
  • false negatives (FN): We predicted no, but they actually do churn. (Also
known as a "Type II error.")
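As a small illustration of these four cells, the sketch below cross-tabulates actual and predicted labels with base R's table(). The ten labels are made up for the example.

```r
# Hypothetical actual and predicted churn labels for ten customers
actual <- factor(
  c("Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes", "No"),
  levels = c("No", "Yes")
)
predicted <- factor(
  c("Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No"),
  levels = c("No", "Yes")
)

# Rows are the true classes, columns are the predictions
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

TP <- cm["Yes", "Yes"]   # predicted churn, did churn
TN <- cm["No", "No"]     # predicted stay, did stay
FP <- cm["No", "Yes"]    # predicted churn, but stayed (Type I error)
FN <- cm["Yes", "No"]    # predicted stay, but churned (Type II error)
```

Fixing the factor levels up front keeps the table 2 x 2 even if one class never appears in the predictions.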
Confusion Matrix: Generalized Linear Model (glm)
n = 1407       Predicted: NO      Predicted: YES     Total
Actual: NO     TN = 919 (0.653)   FP = 115 (0.082)   1034
Actual: YES    FN = 167 (0.119)   TP = 206 (0.146)   373
Total          1086               321
Confusion Matrix: Support Vector Machine (SVM)
n = 1407       Predicted: NO      Predicted: YES     Total
Actual: NO     TN = 929 (0.660)   FP = 105 (0.075)   1034
Actual: YES    FN = 183 (0.130)   TP = 190 (0.135)   373
Total          1112               295
Confusion Matrix: RandomForest
n = 1407       Predicted: NO      Predicted: YES     Total
Actual: NO     TN = 939 (0.667)   FP = 95 (0.068)    1034
Actual: YES    FN = 181 (0.129)   TP = 192 (0.136)   373
Total          1120               287
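The three matrices are easier to compare once they are reduced to a few summary rates. The sketch below copies the counts from the matrices above and derives accuracy, precision, recall and specificity using their standard definitions; the choice of these four metrics is an assumption, since the slides report only the raw counts.

```r
# Counts copied from the three confusion matrices above (n = 1407 each)
results <- data.frame(
  model = c("GLM", "SVM", "RandomForest"),
  TN = c(919, 929, 939),
  FP = c(115, 105, 95),
  FN = c(167, 183, 181),
  TP = c(206, 190, 192)
)

n <- with(results, TN + FP + FN + TP)
results$accuracy    <- with(results, (TP + TN) / n)
results$precision   <- with(results, TP / (TP + FP))
results$recall      <- with(results, TP / (TP + FN))   # true positive rate
results$specificity <- with(results, TN / (TN + FP))   # true negative rate

metric_cols <- c("accuracy", "precision", "recall", "specificity")
print(cbind(results["model"], round(results[, metric_cols], 3)))
```

On these counts RandomForest has the highest accuracy (1131 of 1407 correct), while GLM catches the most actual churners (206 of 373).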
Receiver Operating Characteristic (ROC) curve
• A ROC curve is a graphical plot that illustrates the performance of a
binary classifier system as its discrimination threshold is varied. The
curve is created by plotting the true positive rate (TPR) against the
false positive rate (FPR) at various threshold settings.
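The construction described above can be sketched in base R by sweeping a threshold over the predicted scores and recording TPR and FPR at each step. The scores and labels below are invented for illustration, and the trapezoidal area under the curve (AUC) is added as a common one-number summary that the slides do not mention.

```r
# Hypothetical predicted churn probabilities and true labels (1 = churned)
scores <- c(0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05)
labels <- c(1,    1,    0,    1,    1,    0,    1,    0,    0,    0)

# Sweep the threshold from high to low; Inf gives the (0, 0) starting point
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) {
  sum(scores >= t & labels == 1) / sum(labels == 1)
})
fpr <- sapply(thresholds, function(t) {
  sum(scores >= t & labels == 0) / sum(labels == 0)
})

# Area under the curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

print(data.frame(threshold = thresholds, fpr = fpr, tpr = tpr))
print(auc)
```

Passing fpr and tpr to plot() draws the curve; a curve that hugs the top-left corner indicates a better classifier than one close to the diagonal.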
Models comparison
• ROC illustrates the performance of a binary classifier system as its
discrimination threshold is varied.
Microsoft R Open vs R
Predict test data
• Based on the trained models, select the best model to be used for
test data prediction
Thanks!
Questions?
@kuanhoong
https://www.linkedin.com/in/kuanhoong
kuanhoong@gmail.com
DEMO
