REPORT - IBM BUILD-A-THON
Telecom Customer Churn Prediction using Watson Auto AI
presented by
Vignesh K
email-id: [email protected]
in the month of
October 2020
TABLE OF CONTENTS
1 INTRODUCTION
1.1 Overview
1.2 Purpose
2 LITERATURE SURVEY
2.1 Existing Problem
2.2 Proposed Solution
3 THEORETICAL ANALYSIS
3.1 Block Diagram
3.2 Hardware / Software Designing
4 EXPERIMENTAL INVESTIGATIONS
5 FLOWCHART
6 RESULT
7 ADVANTAGES & DISADVANTAGES
8 APPLICATIONS
9 CONCLUSION
10 FUTURE SCOPE
11 BIBLIOGRAPHY
APPENDIX
A. Source Code
INTRODUCTION
1.1 OVERVIEW
Churn prediction is one of the most popular Big Data use cases in business. It consists of detecting customers who are likely to cancel a subscription to a service. This applies to telecom companies, SaaS companies, and any other business that sells a service for a monthly fee.
In the telecom industry, customers can choose from multiple service
providers and actively switch from one operator to another. In this highly
competitive market, the telecommunications industry experiences an
average of 15-25% annual churn rate. Given the fact that it costs 5-10 times
more to acquire a new customer than to retain an existing one, customer
retention has now become even more important than customer acquisition.
For many incumbent operators, retaining highly profitable customers is the number one business goal.
1.2 PURPOSE
Customer churn prediction can help you see which customers are about to leave your service so you can develop a proper strategy to re-engage them before it is too late. It is a vital tool in a business's arsenal when it comes to customer retention. Having the ability to accurately predict future
churn rates is essential because it helps your business gain a better
understanding of future expected revenue. Predicting churn rates can also
help your business identify and improve upon areas where customer
service is lacking.
To reduce customer churn, telecom companies need to predict which
customers are at high risk of churn.
In this project, we use the customer-level data of a leading telecom firm to build predictive models that identify customers who will stay with the company or leave it, based on a set of parameters.
THEORETICAL ANALYSIS
3.1 BLOCK DIAGRAM
The block diagram depicts the workflow of the entire system. Watson Studio acts as the central point of computation and is used for running Python notebooks and for creating, monitoring, and managing deployments. The runtime environment is powered by the Watson Machine Learning service. The UI is designed using HTML, and the backend process is automated using the Flask framework, which also facilitates deployment of the ML models via the scoring endpoint.
3.2 HARDWARE / SOFTWARE DESIGNING
The following are the hardware requirements for standard users
(commodity hardware) of the proposed system:
Processor: Core i5 (quad core)
RAM: 8 GB
The software specification for the proposed system is as follows:
IBM Watson Studio:
Watson ML Package - 'Lite'
Instance Type - 'v2'
Environment Definition - Default Python 3.6XS
Virtual Hardware Configuration - 2 vCPU 8GB RAM
COS Instance Region - 'London'
Python Flask Application:
HTML - 5.0
Flask - 1.1.2
Python Libraries required:
scikit-learn
pandas
numpy
seaborn
json
sklearn.preprocessing
sklearn.model_selection
sklearn.feature_selection
EXPERIMENTAL INVESTIGATIONS
3.1 LOADING THE DATASET
The dataset is provided in the project template. It contains details of a company's customers, and the target variable is a binary variable indicating whether the customer left the company or continues to be a customer.
IBM Watson Studio provides the option to add the dataset as an asset to the project. Project assets can then be inserted directly into the Python notebook through a simple process, and the access code is generated automatically.
The dataset is available here.
The metadata of the dataset is as follows:
● # of rows: 10,000
● # of columns: 14
● # of Input Variables: 13
● # of Output Variable(s): 1 - ["Exited"]
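For illustration, a minimal loading sketch (assuming a local copy of the CSV rather than the Cloud Object Storage access used in the appendix):

import pandas as pd

# Load a local copy of the churn dataset (the notebook itself reads the same
# file from IBM Cloud Object Storage; see the appendix)
customer_data = pd.read_csv("Churn_Modelling.csv")
print(customer_data.shape)   # expected: (10000, 14)
customer_data.head()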
3.2 INFORMATION ABOUT DATASET
The info() function gives some of the basic details about the dataset.
It gives the information about the following:
● Number of entries
● Null Value status
● Datatype
for each attribute of the dataframe.
3.3 CHECKING NULL VALUES
A null value is an entry for which no value exists at a particular position. Null values in a dataset can degrade the performance of a Machine Learning model, so they must be identified and removed (or imputed).
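A minimal sketch of both checks, assuming the dataset has been loaded into a DataFrame named customer_data as in the appendix:

# Entries, column datatypes, and non-null counts in one report
customer_data.info()

# Null-value count per column, and a single True/False summary
print(customer_data.isnull().sum())
print(customer_data.isnull().values.any())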
3.4 DESCRIPTIVE ANALYTICS FOR DATA
Descriptive analytics gives a general view of the historic data, providing a clear, straightforward picture of the company's operations. It is the interpretation of historical data to better understand changes that have occurred in a business, drawing comparisons across a range of historic data.
3.4.1 PRECISION SETUP
The pandas display options can be set so that numeric output is shown with the desired number of decimal places. The describe() function is used to get the descriptive statistics of the dataset. The output of this function includes the following:
● count
● mean
● standard deviation
● minimum
● maximum
● 25%, 50%, 75% values
for each attribute of the dataset.
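A minimal sketch of the precision setup and the describe() call (assuming the customer_data DataFrame from the appendix):

import numpy as np
import pandas as pd

pd.set_option('display.precision', 3)             # show three decimal places
print(customer_data.describe())                   # count, mean, std, min, quartiles, max
print(customer_data.describe(exclude=np.number))  # summary of the non-numeric columns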
3.4.2 IDENTIFYING CORRELATIONS
Correlation is a statistical measure that expresses the extent to which
two variables are linearly related (meaning they change together at a
constant rate). Correlation is a measure of the strength of a linear
relationship between two quantitative variables.
The Pearson coefficient is a type of correlation coefficient that
represents the relationship between two variables that are measured on the
same interval or ratio scale. The Pearson coefficient is a measure of the
strength of the association between two continuous variables.
The results of applying Pearson correlation to our dataset are as follows:
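For reference, a minimal sketch of computing this correlation matrix (again assuming the customer_data DataFrame):

# Pearson correlation over the numeric columns only
corr_matrix = customer_data.select_dtypes(include='number').corr(method='pearson')

# Correlation of every numeric attribute with the target variable
print(corr_matrix['Exited'].sort_values(ascending=False))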
3.5 DATA VISUALIZATION
Data visualization is the discipline of trying to understand data by
placing it in a visual context so that patterns, trends and correlations that
might not otherwise be detected can be exposed. Python offers multiple
great graphing libraries that come packed with lots of different features.
Data visualization is the graphical representation of data in order to
interactively and efficiently convey insights to clients, customers, and
stakeholders in general.
The different types of data visualization techniques used in analysing
the dataset are as follows:
3.5.1 COUNT PLOTS
The countplot() method is used to show the counts of observations in each categorical bin using bars. The countplot() function is applied to the following attributes of the dataset:
● Tenure
● Credit Score
● Geography
The output of a countplot is as follows:
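For illustration, a minimal sketch of generating one of these plots (Tenure), mirroring the appendix code:

import seaborn as sns
import matplotlib.pyplot as plt

# Bar chart of how many customers fall into each Tenure value
fig, ax = plt.subplots(figsize=(20, 10))
sns.countplot(x="Tenure", data=customer_data, ax=ax)
plt.show()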
3.5.2 PAIRGRID GENERATION
A pairgrid is a subplot grid for plotting pairwise relationships in a dataset. This object maps each variable in a dataset onto a column and row in a grid of multiple axes. Different axes-level plotting functions can be used to draw bivariate plots in the upper and lower triangles, and the marginal distribution of each variable can be shown on the diagonal.
The pairgrid shows the relationship between an attribute of the
dataset with any other attribute of the dataset. The dense spots indicate
the strong relationships between the attributes. The sparse spots indicate
the weak relationships between the attributes. The pairgrid that is
generated for the dataset under consideration is as follows:
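A minimal sketch of generating the pairgrid (histograms on the diagonal, scatter plots elsewhere), assuming the customer_data DataFrame:

import seaborn as sns
import matplotlib.pyplot as plt

grid = sns.PairGrid(customer_data)    # one row/column per numeric attribute
grid = grid.map_diag(plt.hist)        # univariate histograms on the diagonal
grid = grid.map_offdiag(plt.scatter)  # pairwise scatter plots off the diagonal
plt.show()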
3.6 DATA DISTRIBUTION
A data distribution is a function or a listing which shows all the
possible values (or intervals) of the data. It also (and this is important) tells
you how often each value occurs. Often, the data in a distribution will be
ordered from smallest to largest, and graphs and charts allow you to easily
see both the values and the frequency with which they appear.
From a distribution you can calculate the probability of any one
particular observation in the sample space, or the likelihood that an
observation will have a value which is less than (or greater than) a point of
interest. The function of a distribution that shows the density of the values
of our data is called a probability density function, and is sometimes
abbreviated pdf. The methods of data distributions used in the project are
as follows:
3.6.1 BOXPLOTS
A box plot (or box-and-whisker plot) shows the distribution of
quantitative data in a way that facilitates comparisons between variables or
across levels of a categorical variable. The box shows the quartiles of the
dataset while the whiskers extend to show the rest of the distribution,
except for points that are determined to be “outliers” using a method that is
a function of the inter-quartile range. The boxplot for the dataset is as
follows:
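A minimal sketch of drawing one vertical box per numeric attribute:

import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 6))
# fliersize controls the size of the outlier markers beyond the whiskers
sns.boxplot(data=customer_data.select_dtypes(include='number'),
            orient="v", palette="hls", fliersize=14, ax=ax)
plt.show()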
3.6.2 DISTRIBUTION PLOTS
Seaborn distplot lets you show a histogram with a line on it. We use
seaborn in combination with matplotlib, the Python plotting module. A
distplot plots a univariate distribution of observations. The distplot()
function combines the matplotlib hist function with the seaborn kdeplot()
and rugplot() functions.
The histogram groups the values of an attribute into buckets of data ranges called bins and counts how many values fall into each bin, from which the relative frequency of each bucket can be estimated. For continuous data, a smooth density (PDF) curve is also generated, which shows the distribution of the values as a continuous curve.
A sample output for a distplot is as follows:
The distplots are generated for the following attributes of the dataset:
● Balance
● # of Products
● Estimated Salary
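For illustration, a minimal sketch for one of these attributes (Balance); histplot with kde=True is the current seaborn equivalent of the distplot call used in the appendix:

import seaborn as sns
import matplotlib.pyplot as plt

# Histogram of Balance with a kernel density estimate overlaid
sns.histplot(customer_data["Balance"], kde=True)
plt.show()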
3.7 ONE-HOT ENCODING
Sometimes in datasets we encounter columns containing numbers that have no specific order of preference. The data in such a column usually denotes a category, or the value of a category after it has been label encoded. This can confuse the Machine Learning model, so to avoid it the data in the column should be one-hot encoded. One-hot encoding refers to splitting a column that contains numerical categorical data into many columns, one per category present in the original column. Each new column contains a "0" or "1" indicating whether the row belongs to that category.
For the non-pre-processed data, LabelEncoder() is used to convert the categorical values into integer labels. For the pre-processed data, the pd.get_dummies() function generates the one-hot encoding for the dataset.
An example for one-hot encoding is as follows:
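A minimal sketch of the one-hot encoding step on the independent variables (dropping the target column first, as in the appendix):

import pandas as pd

features = customer_data.drop("Exited", axis=1)   # independent variables only
encoded = pd.get_dummies(features)                # e.g. Geography becomes Geography_France, Geography_Germany, ...
print(encoded.shape)
print(encoded.head())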
3.8 OUTLIER DETECTION
The presence of outliers in a classification or regression dataset can
result in a poor fit and lower predictive modeling performance. Identifying
and removing outliers is challenging with simple statistical methods for
most machine learning datasets given the large number of input variables.
Instead, automatic outlier detection methods can be used in the modeling
pipeline and compared, just like other data preparation transforms that may
be applied to the dataset.
The methods used for outlier detection are as follows:
3.8.1 Z-SCORE
Z score is an important concept in statistics. The Z score is also called the standard score. This score helps to understand whether a data value is greater or smaller than the mean and how far away it is from the mean. More specifically, the Z score tells how many standard deviations away a data point is from the mean: Z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of the attribute. The Z-score method is applied to the 'EstimatedSalary' attribute, and it showed no presence of outliers.
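A minimal sketch of the modified Z-score check used in the appendix, which is based on the median and the median absolute deviation rather than the mean and standard deviation:

import numpy as np

salary = customer_data["EstimatedSalary"].to_numpy()
median = np.median(salary)
mad = np.median(np.abs(salary - median))        # median absolute deviation
modified_z = 0.6745 * (salary - median) / mad   # modified Z-score per observation
print(np.any(np.abs(modified_z) > 3))           # True would indicate at least one outlier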
3.8.2 INTER-QUARTILE RANGE METHOD
In descriptive statistics, the interquartile range, also called the
midspread or middle 50%, or technically H-spread, is a measure of
statistical dispersion, being equal to the difference between 75th and 25th
percentiles, or between the upper and lower quartiles, IQR = Q₃ − Q₁. In our dataset, the IQR (Inter-Quartile Range) method is applied to 'Balance' and it showed no evidence of outliers.
Under this method, values below Q₁ − 1.5 × IQR or above Q₃ + 1.5 × IQR are flagged as outliers.
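A minimal sketch of the IQR check on the Balance attribute:

import numpy as np

q1, q3 = np.percentile(customer_data["Balance"], [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # standard 1.5 x IQR fences
mask = (customer_data["Balance"] < lower) | (customer_data["Balance"] > upper)
print(mask.sum())                               # number of flagged outliers (0 here, per the text)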
3.9 FEATURE ENGINEERING
All machine learning algorithms use some input data to create
outputs. This input data comprises features, which are usually in the form of structured columns. Algorithms require features with specific characteristics to work properly; this is where the need for feature engineering arises. The features we use influence the result more than anything else, and no algorithm alone can supplement the information gain given by correct feature engineering. The method of feature engineering used in our project is "Polynomial Features".
3.9.1 POLYNOMIAL FEATURES
Polynomial features are those features created by raising existing
features to an exponent. For example, if a dataset had one input feature X,
then a polynomial feature would be the addition of a new feature (column)
where values were calculated by squaring the values in X, e.g. X^2. This
process can be repeated for each input variable in the dataset, creating a
transformed version of each. As such, polynomial features are a type of
feature engineering, e.g. the creation of new input features based on the
existing features. The “degree” of the polynomial is used to control the
number of features added, e.g. a degree of 3 will add two new variables for
each input variable. Typically a small degree is used such as 2 or 3.
Feature engineering produced 87 attributes in our dataset, and among them the best 25 were selected.
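A minimal sketch of the interaction-feature generation followed by selection of the best 25 features, mirroring the appendix (X is assumed to hold the one-hot-encoded inputs and y the 0/1 labels):

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest

poly = PolynomialFeatures(interaction_only=True, include_bias=False)  # pairwise interaction terms
X_inter = pd.DataFrame(poly.fit_transform(X))

select = SelectKBest(k=25)                     # keep the 25 highest-scoring columns
X_selected = select.fit_transform(X_inter, y)
print(X_inter.shape, X_selected.shape)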
3.10 FEATURE SCALING
Feature scaling is a method used to normalize the range of independent variables or features of data. It standardizes the independent features present in the data to a fixed range, and is performed during data pre-processing to handle highly varying magnitudes, values, or units. It is also known as data normalization.
For every feature, the minimum value of that feature gets transformed
into a 0, the maximum value gets transformed into a 1, and every other
value gets transformed into a decimal between 0 and 1. Min-max
normalization has one fairly significant downside: it does not handle
outliers very well.
The output for scaling the independent attributes is as follows:
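A minimal sketch using scikit-learn's MinMaxScaler for the min-max normalization described above (the appendix notebook uses StandardScaler instead, which standardizes to zero mean and unit variance):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                       # maps every feature into the [0, 1] range
X_scaled = scaler.fit_transform(X_selected)   # X_selected: the feature matrix from the previous step
print(X_scaled.min(), X_scaled.max())         # 0.0 and 1.0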
3.11 CREATING TRAIN AND TEST DATA
The training data and testing data are created from the pre-processed
dataset. The function of the training data is to train the model and improve
its understanding about the dataset and its attributes, across many epochs
and batches. The function of the test data is to evaluate the model's understanding of the problem.
In this project, we have split the data into training and test sets in a 2:1 ratio. The shapes of the train and test data are as follows:
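A minimal sketch of the split (test_size=0.33 gives the 2:1 ratio; random_state fixes the shuffle for reproducibility, as in the appendix):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.33, random_state=42)
print(X_train.shape, X_test.shape)   # roughly 6700 and 3300 rows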
3.12 MODEL CREATION
The Machine Learning model is created by invoking appropriate
functions that are available in "scikit-learn" package in Python. There are
various parameters which can be used under different scenarios for
creating the Machine Learning model. In our project, there are 4 different
models taken into consideration. They are as follows:
● Support Vector Classifier on non pre-processed data
● Support Vector Classifier on pre-processed data
● Logistic Regression
● Multi Layer Perceptron (Neural Network)
We can get a description of the model's parameters once we fit the model with the training data. The sample output for creating an ML model is as follows:
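A minimal sketch of fitting the three preprocessed-data models from the list above, assuming the X_train/y_train split from the previous step (max_iter=1000 is added here only to avoid a convergence warning; the appendix uses the scikit-learn defaults):

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

clf_svc = SVC(random_state=42)              # Support Vector Classifier
clf_lr = LogisticRegression(max_iter=1000)  # Logistic Regression
clf_mlp = MLPClassifier(verbose=0)          # Multi-layer Perceptron

for clf in (clf_svc, clf_lr, clf_mlp):
    clf.fit(X_train, y_train)               # printing the fitted object shows its parameters
    print(clf)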
3.13 AUTO AI
The AutoAI graphical tool in Watson Studio automatically analyzes data and generates candidate model pipelines customized for predictive modeling problems. These model pipelines are created iteratively as AutoAI analyzes your dataset and discovers the data transformations, algorithms, and parameter settings that work best for the problem setting. Results are displayed on a leaderboard, showing the automatically generated model pipelines ranked according to the problem's optimization objective. AutoAI enables end-to-end AI and ML lifecycle management.
The result of AutoAI in our dataset is as follows:
FLOW CHART
The flowchart depicts the sequential implementation of the proposed
system. The flowchart shows the dependencies, the independent and
dependent tasks. The flowchart helps to organize the system
functionalities.
The proposed system is implemented by creating a Watson Machine Learning instance. A Python notebook is essential here, as it is used for processes like dataset preparation, data pre-processing, data visualization, feature engineering, model creation, and model prediction.
Once the Machine Learning model is ready-to-use, the deployment
space is created, in which the deployment model is created and added to
the deployment assets. When the asset is deployed successfully, the
scoring endpoint is generated.
The flask application process consists of HTML form creation, flask
integration, and scoring endpoint integration. Once these processes are
done, the application can be executed and the ML model automation will be
complete.
RESULTS
6.1 MODEL EVALUATION
Model evaluation aims to estimate the generalization accuracy of a
model on future (unseen/out-of-sample) data. It helps to find the best
model that represents our data and how well the chosen model will work in
the future. Model evaluation metrics are used to assess the goodness of fit between model and data, to compare different models in the context of model selection, and to estimate how accurate the predictions of a specific model on a specific data set are expected to be. The three main metrics
used to evaluate a classification model are accuracy, precision, and recall.
Accuracy is defined as the percentage of correct predictions for the
test data. It can be calculated easily by dividing the number of correct
predictions by the number of total predictions.
Precision is defined as the fraction of relevant examples (true
positives) among all of the examples which were predicted to belong in a
certain class.
Recall is defined as the fraction of examples which were predicted to
belong to a class with respect to all of the examples that truly belong in the
class.
The sample output of obtaining the metrics is given as follows:
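A minimal sketch of how these metrics can be obtained with scikit-learn for one of the fitted classifiers (using the X_test/y_test split from earlier):

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_pred = clf_svc.predict(X_test)
print("Accuracy :", accuracy_score(y_test, y_pred))    # correct predictions / total predictions
print("Precision:", precision_score(y_test, y_pred))   # true positives / predicted positives
print("Recall   :", recall_score(y_test, y_pred))      # true positives / actual positives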
The following table shows the readings of the metrics as the output
of model evaluation:
6.3 BUILDING FLASK APPLICATION
IBM provides the option of deploying the ML models created in Watson Studio in real time by exposing dynamic scoring endpoint URLs. This enables users to create models and deploy them effectively.
The scoring endpoint URL can be obtained by creating a deployment model and adding it to the deployment space as an instance. This allows multiple models to be deployed simultaneously. I have deployed the model "Support Vector Classifier on pre-processed data" into a deployment space.
The scoring endpoint URL obtained by deploying the model is:
https://eu-gb.ml.cloud.ibm.com/ml/v4/deployments/a77fd05b-67a5-40d1-8ab1-17e160b261c8/predictions?version=2020-10-20
A flask application is built in order to perform automated deployment
of the ML model. The UI is built using HTML by creating a form to get the
independent variables of the dataset as the user inputs. The output of the
UI is given below:
After the UI is built, the Python script is written and executed. Executing the script deploys the Flask app on the local server (http://127.0.0.1:5000), on port 5000.
Once the user clicks the "Submit" button, the responses are recorded and the independent attributes of the dataset are transformed into the format in which they are sent to the ML model. The payload is created in the pattern of "[fields]:[values]" and is sent along with the URL as a POST request. The model present in the deployment makes the prediction and sends the result back to the local server in JSON format. The prediction of the model is printed on the result page.
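A minimal sketch of the scoring call made by the Flask backend, assuming an IAM access token (iam_token), the list of input column names (field_names), and one row of form values (form_values) have already been prepared; the payload follows the "input_data" fields/values layout expected by the Watson Machine Learning v4 scoring endpoint:

import requests

scoring_url = ("https://eu-gb.ml.cloud.ibm.com/ml/v4/deployments/"
               "a77fd05b-67a5-40d1-8ab1-17e160b261c8/predictions?version=2020-10-20")

headers = {"Content-Type": "application/json",
           "Authorization": "Bearer " + iam_token}     # IAM token obtained beforehand (placeholder)

payload = {"input_data": [{"fields": field_names,      # input column names (placeholder list)
                           "values": [form_values]}]}  # one row of values from the HTML form (placeholder)

response = requests.post(scoring_url, json=payload, headers=headers)
print(response.json())                                 # JSON prediction returned by the deployment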
6.4 AUTO AI PIPELINE DETAILS
The AutoAI experiment is run on IBM Watson Studio by feeding it the dataset. Pipelines are generated from two different algorithms, with different versions obtained by varying critical Machine Learning parameters. All the models are created and run automatically, and the results are provided with an ample set of metrics available for comparison.
The result of pipeline comparison is shown below:
The final list with the AutoAI algorithms into consideration along with
the computed metrics is given below:
ADVANTAGES & DISADVANTAGES
7.1 ADVANTAGES
● The services provided by IBM Cloud can be leveraged to perform
complex tasks of any scale with ease.
● The interface is easy to use, with tours guiding users through all the important aspects of the services.
● Python is easy and handy when it comes to data visualization and analysing data distributions.
● Access to a wide range of software assets which can be incorporated into the project in just a few clicks.
● Access to Auto AI has made a huge impact on the project. It enables even novice users to understand Machine Learning algorithms and many related techniques.
● Production deployments and automation using payload scoring give exposure to handling an end-to-end application.
7.2 DISADVANTAGES
● Exceeding the capacity unit hours (CUH) quota imposes a bottleneck on utilizing the service's capabilities.
APPLICATIONS
The proposed system will be helpful in predicting whether a customer will leave the company or continue with it. This will be beneficial for telecom companies, which suffer significant losses due to customer churn. The rate at which customers leave a company is called the churn rate. Churn prediction helps reduce the churn rate to a minimum, playing an important role in the company's turnover and reputation.
The use of an application to make ad hoc predictions will help the users of the application get instant results for their inputs. The parameters chosen for predicting customer churn are spot-on, and all of them are critical for predicting the churn rate.
Using churn prediction beforehand will enable the company to take countermeasures and try to retain more customers by introducing optimized plans, new offers, etc. It also helps the company avoid unnecessary losses and gain new customers through improved workflow strategies.
CONCLUSION
I would like to extend my gratitude to IBM India Pvt. Ltd. and SmartInternz - by SmartBridge Educational Services, for giving me the opportunity to use the resources, study materials, and tutorials provided by them, and for having me as a part of the IBM Build-a-thon.
I have built a project named "Telecom Churn Prediction using Watson Auto AI" and have been provided free Watson Studio Desktop access for 30 days. I think I have done justice to the opportunity and the resources provided to me.
It has been an enthralling experience working on this project for 3 weeks. I have recorded a video to demonstrate the working of the project. I have also added all the resources from my side to the Git repository.
Scoring Endpoint URL:
https://eu-gb.ml.cloud.ibm.com/ml/v4/deployments/a77fd05b-67a5-40d1-8ab1-17e160b261c8/predictions?version=2020-10-20
GitHub link to my project:
https://github.com/SmartPracticeschool/SPS-5382-Telecom-Customer-Churn-Prediction-using-Watson-Auto-AI
Link to the Project Demonstration Video:
https://drive.google.com/file/d/1zsHlecIcB76JRT8yOepPMlURsCig0nYi/view?usp=sharing
FUTURE SCOPE
The project can be enhanced from different viewpoints, namely:
● Optimized Machine Learning algorithms
● More feature engineering techniques
● Analysing vital parameters for targeted customers
● Flask UI with improved functionalities
● Multiple deployments for different business scenarios
BIBLIOGRAPHY
1. Essam Shaaban, Yehia Helmy, Ayman Khedr, Mona Nasr | International
Journal of Engineering Research and Applications (IJERA) | A Proposed
Churn Prediction Model
2. Sandra Mitrović, Bart Baesens, Wilfried Lemahieu, Jochen De Weerdt |
On the Operational Efficiency of Different Feature Types for Telco Churn
Prediction
3. Veronikha Effendy, Adiwijaya, Z.K.A. Baizal. | 2014 2nd International
Conference on Information and Communication Technology (ICoICT) |
Handling Imbalanced Data in Customer Churn Prediction Using Combined
Sampling and Weighted Random Forest.
4. Yiqing Huang, Fangzhou Zhu, Mingxuan Yuan, Ke Deng, Yanhua Li, Bing
Ni, Wenyuan Dai, Qiang Yang, Jia Zeng | Advancing Computing as a Science
& Profession | Telco Churn Prediction with Big Data.
5. Sandra Mitrović, Bart Baesens, Wilfried Lemahieu, Jochen De Weerdt |
Churn Prediction using Dynamic RFM-Augmented node2vec.
APPENDIX
A. SOURCE CODE
"""# Customer Churn Prediction
## 1. Loading Libraries
"""
import json
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm
from itertools import combinations
from sklearn.preprocessing import PolynomialFeatures, LabelEncoder, StandardScaler
import sklearn.feature_selection
from sklearn.model_selection import train_test_split
from collections import defaultdict
from sklearn import metrics
import pickle
"""### The Dataset
From a telecommunications company. It includes information about:
- Customers who left within the last month – the column is called Churn
- Services that each customer has signed up for – phone, multiple lines,
internet, online security, online backup, device protection, tech support, and
streaming TV and movies
- Customer account information – how long they’ve been a customer,
contract, payment method, paperless billing, monthly charges, and total
charges
- Demographic info about customers – gender, age range, and if they have
partners and dependents
### 2. Loading Our Dataset
Click on the cell below to highlight it.
Then go to the `Files` section to the right of this notebook and click `Insert
to code` for the data you have uploaded. Choose `Insert pandas
DataFrame`.
"""
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3
def __iter__(self): return 0
# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_b874c30054d441ffacbe02cbcc8859e6 = ibm_boto3.client(
    service_name='s3',
    ibm_api_key_id='***',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.eu-geo.objectstorage.service.networklayer.com')

body = client_b874c30054d441ffacbe02cbcc8859e6.get_object(
    Bucket='telcochurnprediciton-donotdelete-pr-2use5r9izvml7k',
    Key='Churn_Modelling.csv')['Body']

# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"):
    body.__iter__ = types.MethodType(__iter__, body)
df_data_2 = pd.read_csv(body)
df_data_2.head()
customer_data = df_data_2
# Checking that everything is correct
pd.set_option('display.max_columns', 30)
customer_data.head(10)
"""### 3. Get some info about our Dataset and whether we have missing
values"""
# After running this cell we will see that we have no missing values
customer_data.info()
customer_data.shape
# Drop the identifier columns (RowNumber, CustomerId, Surname)
customer_data = customer_data.drop('RowNumber', axis=1)
customer_data = customer_data.drop('CustomerId', axis=1)
customer_data = customer_data.drop('Surname', axis=1)
customer_data.head(5)
# Check if we have any NaN values
customer_data.isnull().values.any()
customer_data.info()
"""### 4. Descriptive analytics for our data"""
# Describe columns with numerical values
pd.set_option('precision', 3)
customer_data.describe()
# Describe columns with objects
customer_data.describe(exclude=np.number)
# Find correlations
customer_data.corr(method='pearson')
"""### 5. Visualize our Data to understand it better
#### Plot Relationships
"""
# Plot Tenure Frequency count
sns.set(style="darkgrid")
sns.set_palette("hls", 3)
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.countplot(x="Tenure", data=customer_data)
# Plot CreditScore Frequency count
sns.set(style="darkgrid")
sns.set_palette("hls", 3)
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.countplot(x="CreditScore", data=customer_data)
# Plot Geography Frequency count
sns.set(style="darkgrid")
sns.set_palette("hls", 3)
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.countplot(x="Geography", data=customer_data)
# Create Grid for pairwise relationships
gr = sns.PairGrid(customer_data, size=5)
gr = gr.map_diag(plt.hist)
gr = gr.map_offdiag(plt.scatter)
gr = gr.add_legend()
"""#### Understand Data Distribution"""
# Set up plot size
fig, ax = plt.subplots(figsize=(6,6))
# Attributes distribution
a = sns.boxplot(orient="v", palette="hls", data=customer_data.iloc[:], fliersize=14)
# Balance data distribution
histogram = sns.distplot(customer_data.iloc[:, 5], hist=True)
plt.show()
# NumOfProducts data distribution
histogram = sns.distplot(customer_data.iloc[:, 6], hist=True)
plt.show()
# HasCrCard data distribution
histogram = sns.distplot(customer_data.iloc[:, 7], hist=True)
plt.show()
customer_data1 = customer_data
customer_data1 = customer_data1.drop('Exited', axis=1)
customer_data1.head(5)
"""### 6. Encode string values in data into numerical values"""
# Use pandas get_dummies
customer_data_encoded = pd.get_dummies(customer_data1)
print(customer_data_encoded.head(10))
customer_data_encoded.shape
"""### 7. Create Training Set and Labels"""
# Create training data for non-preprocessed approach
X_npp = customer_data.iloc[:, :-1].apply(LabelEncoder().fit_transform)
pd.DataFrame(X_npp).head(5)
# Create training data that will undergo preprocessing
X = customer_data_encoded
X.head()
print(X.shape)
# Extract labels
y_unenc = customer_data['Exited']
# Encode the target labels as binary values of 0 or 1 (Exited is already numeric)
le = preprocessing.LabelEncoder()
le.fit(y_unenc)
y_le = le.transform(y_unenc)
pd.DataFrame(y_le)
"""### 8. Detect outliers in numerical values"""
# Calculate the Z-score using median value and median absolute deviation
# for more robust calculations
# Working on EstimatedSalary column
threshold = 3
median = np.median(X['EstimatedSalary'])
median_absolute_deviation = np.median([np.abs(x - median) for x in X['EstimatedSalary']])
modified_z_scores = [0.6745 * (x - median) / median_absolute_deviation for x in X['EstimatedSalary']]
results = np.abs(modified_z_scores) > threshold
print(np.any(results))
# Do the same for Balance column but using the interquartile method
quartile_1, quartile_3 = np.percentile(X['Balance'], [25, 75])
iqr = quartile_3 - quartile_1
lower_bound = quartile_1 - (iqr * 1.5)
upper_bound = quartile_3 + (iqr * 1.5)
print(np.where((X['Balance'] > upper_bound) | (X['Balance'] < lower_bound)))
print(X)
X.shape
# Find interactions between current features and append them to the dataframe
def add_interactions(dataset):
    # Get feature names
    comb = list(combinations(list(dataset.columns), 2))
    col_names = list(dataset.columns) + ['_'.join(x) for x in comb]
    # Find interactions
    poly = PolynomialFeatures(interaction_only=True, include_bias=False)
    dataset = poly.fit_transform(dataset)
    dataset = pd.DataFrame(dataset)
    dataset.columns = col_names
    # Remove interactions with 0 values
    no_inter_indexes = [i for i, x in enumerate(list((dataset == 0).all())) if x]
    dataset = dataset.drop(dataset.columns[no_inter_indexes], axis=1)
    return dataset
print(X)
X.shape
X_inter = add_interactions(X)
X_inter.head(15)
# Select best features
select = sklearn.feature_selection.SelectKBest(k=25)
selected_features = select.fit(X_inter, y_le)
indexes = selected_features.get_support(indices=True)
col_names_selected = [X_inter.columns[i] for i in indexes]
X_selected = X_inter[col_names_selected]
X_selected.head(10)
"""### 10. Split our dataset into train and test datasets
#### Split non-preprocessed data
"""
X_train_npp, X_test_npp, y_train_npp, y_test_npp = train_test_split(
    X_npp, y_unenc, test_size=0.33, random_state=42)
print(X_train_npp.shape, y_train_npp.shape)
print(X_test_npp.shape, y_test_npp.shape)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_unenc, test_size=0.33, random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
X_test.head()
"""#### Trying to send data to the endpoint will return predictions with
probabilities
### 11. Scale our data
"""
# Use StandardScaler
scaler = preprocessing.StandardScaler().fit(X_train, y_train)
X_train_scaled = scaler.transform(X_train)
pd.DataFrame(X_train_scaled, columns=X_train.columns).head()
pd.DataFrame(y_train).head()
"""### 12. Start building a classifier
#### Support Vector Machines on non-preprocessed data
"""
from sklearn.svm import SVC
# Run classifier
clf_svc_npp = svm.SVC(random_state=42)
clf_svc_npp.fit(X_train_npp, y_train_npp)
"""#### Support Vector Machines on preprocessed data"""
from sklearn.linear_model import LogisticRegression
# Run classifier
clf_svc = svm.SVC(random_state=42)
clf_svc.fit(X_train_scaled, y_train)
"""#### Logestic Regression on preprocessed data"""
from sklearn.linear_model import LogisticRegression
clf_lr = LogisticRegression()
model = clf_lr.fit(X_train_scaled, y_train)
model
"""#### Multilayer Perceptron (Neural Network) on preprocessed data"""
from sklearn.neural_network import MLPClassifier
clf_mlp = MLPClassifier(verbose=0)
clf_mlp.fit(X_train_scaled, y_train)
# Note: MLP as a NN can use data without the feature engineering step,
# as the NN will handle that automatically
"""### 13. Evaluate our model"""
# Use the scaler fit on trained data to scale our test data
X_test_scaled = scaler.transform(X_test)
pd.DataFrame(X_test_scaled, columns=X_train.columns).head()
"""#### Evaluate SVC on non-preprocessed data"""
# Predict confidence scores for data
y_score_svc_npp = clf_svc_npp.decision_function(X_test_npp)
pd.DataFrame(y_score_svc_npp)
# Get accuracy score
from sklearn.metrics import accuracy_score
y_pred_svc_npp = clf_svc_npp.predict(X_test_npp)
acc_svc_npp = accuracy_score(y_test_npp, y_pred_svc_npp)
print(acc_svc_npp)
# Get Precision vs. Recall score
from sklearn.metrics import average_precision_score
average_precision_svc_npp = average_precision_score(y_test_npp,
y_score_svc_npp)
print('Average precision-recall score: {0:0.2f}'.format(
average_precision_svc_npp))
"""#### Evaluate SVC on preprocessed data"""
# Get model confidence of predictions
y_score_svc = clf_svc.decision_function(X_test_scaled)
y_score_svc
# Get accuracy score
y_pred_svc = clf_svc.predict(X_test_scaled)
acc_svc = accuracy_score(y_test, y_pred_svc)
print(acc_svc)
# Get Precision vs. Recall score
average_precision_svc = average_precision_score(y_test, y_score_svc)
print('Average precision-recall score: {0:0.2f}'.format(
average_precision_svc))
"""#### Evaluate Logistic Regression on preprocessed data"""
y_score_lr = clf_lr.decision_function(X_test_scaled)
y_score_lr
y_pred_lr = clf_lr.predict(X_test_scaled)
acc_lr = accuracy_score(y_test, y_pred_lr)
print(acc_lr)
average_precision_lr = average_precision_score(y_test, y_score_lr)
print('Average precision-recall score: {0:0.2f}'.format(
average_precision_lr))
"""#### Evaluate MLP on preprocessed data"""
y_score_mlp = clf_mlp.predict_proba(X_test_scaled)[:, 1]
y_score_mlp
y_pred_mlp = clf_mlp.predict(X_test_scaled)
acc_mlp = accuracy_score(y_test, y_pred_mlp)
print(acc_mlp)
average_precision_mlp = average_precision_score(y_test, y_score_mlp)
print('Average precision-recall score: {0:0.2f}'.format(
average_precision_mlp))
"""### 14. ROC Curve and models comparisons"""
# Plot SVC ROC Curve
plt.figure(0, figsize=(20,15)).clf()
fpr_svc_npp, tpr_svc_npp, thresh_svc_npp = metrics.roc_curve(y_test_npp,
y_score_svc_npp)
auc_svc_npp = metrics.roc_auc_score(y_test_npp, y_score_svc_npp)
plt.plot(fpr_svc_npp, tpr_svc_npp, label="SVC Non-Processed, auc=" +
str(auc_svc_npp))
fpr_svc, tpr_svc, thresh_svc = metrics.roc_curve(y_test, y_score_svc)
auc_svc = metrics.roc_auc_score(y_test, y_score_svc)
plt.plot(fpr_svc, tpr_svc, label="SVC Processed, auc=" + str(auc_svc))
fpr_mlp, tpr_mlp, thresh_mlp = metrics.roc_curve(y_test, y_score_mlp)
auc_mlp = metrics.roc_auc_score(y_test, y_score_mlp)
plt.plot(fpr_mlp, tpr_mlp, label="MLP, auc=" + str(auc_mlp))
fpr_lr, tpr_lr, thresh_lr = metrics.roc_curve(y_test, y_score_lr)
auc_lr = metrics.roc_auc_score(y_test, y_score_lr)
plt.plot(fpr_lr, tpr_lr, label="Logistic Regression, auc=" + str(auc_lr))
plt.legend(loc=0)
filename = 'clf_svc.pkl'
pickle.dump(clf_svc, open(filename, 'wb'))
#!mkdir C:\Users\Palani\Downloads\model
!cp clf_svc.pkl C:\Users\Palani\Downloads
!tar -zcvf clf_svc.tar.gz clf_svc.pkl
from ibm_watson_machine_learning import APIClient
wml_credentials = {
    "url": "https://eu-gb.ml.cloud.ibm.com",
    "apikey": "***"
}
client = APIClient(wml_credentials)
metadata = {
    client.spaces.ConfigurationMetaNames.NAME: "Telco Churn DS",
    client.spaces.ConfigurationMetaNames.DESCRIPTION: "To predict customers who exit the company",
    client.spaces.ConfigurationMetaNames.STORAGE: {
        "type": "bmcos_object_storage",
        "resource_crn": "***"
    },
    client.spaces.ConfigurationMetaNames.COMPUTE: {
        "name": "WatsonMachineLearning",
        "crn": "***"
    },
}
space_details = client.spaces.store(meta_props=metadata)
space_details
space_id = space_details["metadata"]["id"]
space_id
#space_id = "***"
client.set.default_space(space_id)
client.software_specifications.list()
import sklearn
sklearn.__version__
spec_id = client.software_specifications.get_id_by_name("scikit-learn_0.20-py3.6")
#spec_id = "***"
model_details = client.repository.store_model(model=clf_svc, meta_props={
    client.repository.ModelMetaNames.NAME: "Churn Prediction",
    client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: spec_id,
    client.repository.ModelMetaNames.TYPE: "scikit-learn_0.20"
})
model_id = model_details["metadata"]["id"]
model_id
#model_id = "***"
deployment_metadata = {
    client.deployments.ConfigurationMetaNames.NAME: "Churn Prediction Deployment",
    client.deployments.ConfigurationMetaNames.ONLINE: {}
}
deployment_details = client.deployments.create(
    artifact_uid=model_id, meta_props=deployment_metadata)
deployment_id = deployment_details["metadata"]["id"]
col = X.columns
col = list(col)
col
score_list = ['589','39','6','163520.37','3','1','0','75238.55','0','1','0','1','0']
payload = {
    client.deployments.ScoringMetaNames.INPUT_DATA: [{
        "fields": col,
        "values": [score_list],
    }]
}
deployment_details = client.deployments.score(
    deployment_id=deployment_id, meta_props=payload)
deployment_details