S&ML Unit 6 - Q & A

The document covers classification models, focusing on decision trees and Naïve Bayes classifiers. It explains the structure and construction of decision trees, feature selection, and differences between classification and regression trees, alongside examples and calculations. Additionally, it discusses the working of Naïve Bayes classifiers and their advantages and disadvantages.

Unit VI: Classification Models (08 Hours)

Decision tree representation, Constructing Decision Trees, Classification and Regression Trees, hypothesis space search in decision tree learning, Bayes' Theorem, Working of Naïve Bayes' Classifier, Types of Naïve Bayes Model, Advantages, Disadvantages and Applications of Naïve Bayes Model.
#Exemplar/Case Studies: Explore a decision tree model for customer churn
Questions and Answers (from previous question papers)
1 What is Decision tree? Write various steps for constructing a decision tree?
How feature selection can be done using decision tree? [8]
A Decision Tree is a tree-based supervised machine learning algorithm used for classification and regression tasks. It models
decision-making using a series of hierarchical rules based on feature values.
Structure of a Decision Tree
A Decision Tree consists of:
1. Root Node – The starting point containing the entire dataset.
2. Internal Nodes (Decision Nodes) – Points where data is split based on feature conditions.
3. Branches – Connections between nodes representing decisions or outcomes.
4. Leaf Nodes – Final nodes that contain the prediction (class label or regression value).

Example (loan approval):
If Credit Score > 700, the loan is Approved.
If Credit Score ≤ 700, the decision depends on Income:
If Income > 50K, the loan is Approved; otherwise it is Not Approved.
Steps to Construct a Decision Tree
1. Data Preparation: Import libraries, load data, preprocess missing values, and split into training and testing sets.
2. Choose Splitting Criteria: Use metrics such as Information Gain (IG), Gini Impurity, or Variance Reduction.
3. Build the Tree: Compute entropy, split on features with highest Information Gain, and create child nodes.
4. Pruning (Optional): Prevent overfitting using pre-pruning (max depth) or post-pruning (removing less significant branches).
5. Evaluate the Tree: Use metrics such as Accuracy (Classification) and Mean Squared Error (Regression).
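For reference, a minimal Python/scikit-learn sketch of these five steps (the file name customers.csv and the column name target are placeholders, and the features are assumed to be numeric already):

# Minimal sketch of the construction steps above (placeholder file/column names).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Data preparation
df = pd.read_csv("customers.csv")      # hypothetical dataset
df = df.dropna()                       # simplest way to handle missing values
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2-4. Splitting criterion (entropy -> Information Gain), tree building, pre-pruning via max_depth
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# 5. Evaluation
print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))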

Feature Selection in Decision Trees


Feature selection is automatically performed in decision trees by selecting the most informative features. Common methods
include:
1. Information Gain (IG) – Measures reduction in entropy after a split.
2. Gini Impurity – Measures impurity in classification tasks.
3. Variance Reduction – Used in regression trees to minimize output variance.
Example: Feature Selection using Information Gain
Consider a dataset predicting customer churn based on 'Age' and 'Monthly Bill'.
Dataset: Predicting Customer Churn
Compute entropy before splitting (H(S)); Compute weighted entropy after splitting on 'Age' and 'Monthly Bill'; Compute
Information Gain (IG = H(S) - weighted entropy); Select the feature with the highest Information Gain for splitting.

Customer Age Monthly Bill Churn (Yes/No)
C1 25 50 No
C2 45 200 Yes
C3 32 70 No
C4 60 150 Yes
C5 50 100 No
Entropy(S) = - ∑ P(i) log₂ P(i)
- P(i) is the probability of class i in the dataset
- log₂ is the logarithm to base 2
Information Gain = Entropy(Parent) - Weighted Entropy(Children)
Weighted Entropy = ∑ ( |Si| / |S| ) * Entropy(Si)
Where:
- Si is a subset after splitting
- |Si| is the number of elements in subset Si
- |S| is the total number of elements before splitting
Step 1: Calculate Parent Entropy (before splitting)
Total instances: 5 (3 No, 2 Yes)
Entropy = -P(Yes)log2(P(Yes)) - P(No)log2(P(No))
Entropy = 0.971
Step 2: Calculate Information Gain for 'Age'
Splitting into Age ≤ 40 and Age > 40:
For Age ≤ 40:
Instances: 2 (All 'No')
Entropy = -(0/2)log2(0/2) - (2/2)log2(2/2) = 0
For Age > 40:
Instances: 3 (1 No, 2 Yes)
Entropy = -(2/3)log2(2/3) - (1/3)log2(1/3)
Entropy(Age > 40) = 0.918
Weighted Entropy Calculation:
Weighted Entropy = (2/5)*0 + (3/5)*0.918
Weighted Entropy = 0.551
Information Gain(Age) = 0.971 - 0.551 = 0.420
Step 3: Calculate Information Gain for 'Monthly Bill'
Splitting into Monthly Bill ≤ 100 and > 100:
For Bill ≤ 100:
Instances: 3 (All 'No')
Entropy = -(0/3)log2(0/3) - (3/3)log2(3/3) = 0
For Bill > 100:
Instances: 2 (All 'Yes')
Entropy = -(2/2)log2(2/2) - (0/2)log2(0/2) = 0
Weighted Entropy Calculation:
Weighted Entropy = (3/5)*0 + (2/5)*0
Weighted Entropy = 0.000
Information Gain(Bill) = 0.971 - 0.000 = 0.971
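For reference, a short plain-Python sketch that reproduces these numbers (0.971, 0.420 and 0.971):

# Recomputes the entropy and Information Gain values from the worked example above.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values() if c)

def information_gain(parent, subsets):
    weighted = sum(len(s) / len(parent) * entropy(s) for s in subsets)
    return entropy(parent) - weighted

churn = ["No", "Yes", "No", "Yes", "No"]          # C1..C5
age   = [25, 45, 32, 60, 50]
bill  = [50, 200, 70, 150, 100]

by_age  = [[c for a, c in zip(age, churn) if a <= 40],
           [c for a, c in zip(age, churn) if a > 40]]
by_bill = [[c for b, c in zip(bill, churn) if b <= 100],
           [c for b, c in zip(bill, churn) if b > 100]]

print(f"{entropy(churn):.3f}")                    # 0.971
print(f"{information_gain(churn, by_age):.3f}")   # 0.420
print(f"{information_gain(churn, by_bill):.3f}")  # 0.971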
Decision Tree (from the calculations above, the split is made on 'Monthly Bill' since it has the highest Information Gain):
        Monthly Bill ≤ 100?
        /               \
      Yes                No
 (Churn = No)       (Churn = Yes)

A Decision Tree is a powerful tool for classification and regression that works by selecting features based on splitting criteria
such as Information Gain or Gini Impurity. Feature selection happens naturally as the tree prioritizes informative attributes,
making it an efficient method for decision-making.
2 What is the difference between a classification tree and a regression tree? [8]
Comparison of Classification Trees and Regression Trees
Feature | Classification Trees | Regression Trees
Purpose | Classifies data into predefined categories | Predicts a continuous target variable
Target Variable | Categorical (e.g., 'Yes' or 'No') | Continuous (e.g., numerical values)
Output | Class label (e.g., Spam or Not Spam) | Continuous value (e.g., price, temperature)
Evaluation Metric | Accuracy, Precision, Recall, F1-score | Mean Squared Error (MSE), R-squared
Leaf Nodes | Contain class labels | Contain the average (predicted) target value
Splitting Criterion | Maximizes class separation using Gini impurity (Gini = 1 - Σ P(i)²) or Entropy (Entropy = -Σ P(i) log₂ P(i)) | Minimizes variance using least squares: MSE = (1/n) Σ (yᵢ - ŷᵢ)²
Sensitivity | More sensitive to small changes in data | Less sensitive but still affected by outliers
Outliers Effect | Can significantly influence splits | Outliers can still affect accuracy
Complexity | Simpler for categorical data interpretation | Can become complex with many features
Best for | Handling categorical data | Modeling non-linear relationships
Interpretability | Easier due to categorical splits | Harder due to continuous splits
Example 1 | Spam Filtering: classifies emails as spam or not | Stock Price Prediction: predicts future stock prices
Example 2 | Fraud Detection: detects fraudulent transactions | Customer Churn Prediction: predicts the likelihood of a customer leaving
Example 3 | Customer Segmentation: groups customers based on behavior | Real Estate Valuation: predicts house prices
3 Explain Classification and Regression Trees (CART) with an example. [8]
Classification and Regression Trees (CART)
1. Introduction to CART
Classification and Regression Trees (CART) is a machine learning algorithm used for both classification and regression problems.
It constructs a decision tree that recursively splits the dataset into smaller subsets to make predictions. The main objective of
CART is to find the best splits that minimize impurity in classification or reduce error in regression.
2. CART for Classification
In classification problems, CART splits the data based on a chosen metric such as the Gini Index or Entropy (Information Gain).
2.1 Gini Index Formula
The Gini Index measures the impurity of a node:
Gini = 1 - Σ (p_i^2)

where:
- p_i is the probability of class i in the node.
Example Calculation:
Consider a dataset with two classes (Yes, No) split into two groups:
Class Count

Yes 6

No 4

Gini_parent = 1 - (6/10)^2 - (4/10)^2 = 0.48


After splitting:
Group Yes No Total
Left 5 1 6
Right 1 3 4
Gini_left = 1 - (5/6)² - (1/6)² = 0.278
Gini_right = 1 - (1/4)² - (3/4)² = 0.375
Weighted Gini_split = (6/10)(0.278) + (4/10)(0.375) = 0.317
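For reference, a quick plain-Python check of these Gini values:

# Verifies Gini_parent, Gini_left, Gini_right and the weighted split value above.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

parent = gini([6, 4])                              # 0.480
left, right = gini([5, 1]), gini([1, 3])           # 0.278, 0.375
weighted = (6 / 10) * left + (4 / 10) * right      # 0.317
print(f"{parent:.3f} {left:.3f} {right:.3f} {weighted:.3f}")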
3. CART for Regression
For regression tasks, CART minimizes errors using metrics like Mean Squared Error (MSE) or Mean Absolute Error (MAE).
3.1 Mean Squared Error (MSE) Formula
MSE = (1/N) Σ (y_i - ŷ)^2
Consider predicting house prices based on size:
Size (sqft) Price ($1000s)
1000 150
1200 180
1400 210
1600 250
1800 270
Splitting at Size = 1500:
MSE_left = 600, MSE_right = 100
Weighted MSE_split = 400
4. Decision Tree Example
Classification Tree for Purchase Prediction:
        Age < 30?
       /         \
     Yes          No
 (No Purchase)  Salary < 70K?
                /           \
              Yes            No
        (No Purchase)    (Purchase)
Regression Tree for House Price Prediction:
        Size < 1500?
       /            \
     Yes             No
  (Avg = 180)    (Avg = 260)
5. Advantages and Disadvantages
Advantages | Disadvantages
Easy to interpret | Prone to overfitting
Handles numerical & categorical data | Sensitive to small data changes
No assumptions about data distribution | Requires pruning for better generalization

CART is a powerful decision tree algorithm that works for both classification and regression problems. It recursively splits the data to maximize homogeneity, minimizing the Gini Index for classification or the MSE for regression. However, overfitting is a concern that can be mitigated using pruning techniques.
3.2 Detailed MSE Calculation

The Mean Squared Error (MSE) for the left and right nodes after splitting at Size = 1500 is computed as follows.
Dataset:
Size (sqft) Price ($1000s)
1000 150
1200 180
1400 210
1600 250
1800 270
Splitting at Size = 1500:
Left Node (Size < 1500): 1000, 1200, 1400 -> Mean Price = (150 + 180 + 210) / 3 = 180
Right Node (Size >= 1500): 1600, 1800 -> Mean Price = (250 + 270) / 2 = 260
MSE_left = ( (150 - 180)^2 + (180 - 180)^2 + (210 - 180)^2 ) / 3
= (900 + 0 + 900) / 3 = 600
MSE_right = ( (250 - 260)^2 + (270 - 260)^2 ) / 2
= (100 + 100) / 2 = 100
Weighted MSE_split = ( (3/5) * 600 ) + ( (2/5) * 100 )
= 360 + 40 = 400
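For reference, a short plain-Python check of the split MSE values above:

# Verifies MSE_left = 600, MSE_right = 100 and the weighted split MSE = 400.
sizes  = [1000, 1200, 1400, 1600, 1800]
prices = [150, 180, 210, 250, 270]

def mse(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

left  = [p for s, p in zip(sizes, prices) if s < 1500]    # [150, 180, 210]
right = [p for s, p in zip(sizes, prices) if s >= 1500]   # [250, 270]
weighted = len(left) / 5 * mse(left) + len(right) / 5 * mse(right)
print(mse(left), mse(right), weighted)                    # 600.0 100.0 400.0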
4 Explain hypothesis space search in decision tree learning. Give suitable example.
Hypothesis Space in Decision Trees:
The hypothesis space in decision tree learning refers to the set of all possible decision trees that can be generated from a given
dataset. The learning process searches this space to find the best tree that predicts the target variable accurately while generalizing
well to unseen data.
Key Characteristics of Hypothesis Space Search:
• It is a Greedy Search: The algorithm selects the best feature locally at each step instead of searching for the best global tree.
• No Backtracking: Once a split is made, the tree does not reconsider previous decisions.
• It is Hierarchical: The search starts from the root node and proceeds recursively down the tree.
• Stopping Conditions: The process stops when the tree reaches a predefined depth, the node contains only one class, or splitting no longer improves accuracy.

Step-by-Step Process of Hypothesis Space Search


➤Start with the entire dataset at the root node (initial hypothesis).
➤Evaluate all possible features to determine the best split using Information Gain, Gini Impurity, or Variance Reduction.
➤Choose the best feature for splitting based on the highest Information Gain or lowest impurity.
➤Recursively repeat the process for each child node, choosing the best feature at each level.
➤Stop splitting when a predefined stopping condition is met (pure nodes, max depth, minimum sample size).
➤Optionally prune the tree to remove unnecessary branches and prevent overfitting.

Example: Predicting Customer Purchases Using a Decision Tree


Imagine we have a dataset containing customer information (Age, Income, Location) and whether they buy a product or not. The
decision tree algorithm searches the hypothesis space to find the best way to classify new customers.
Sample Dataset:
Customer Age Income Buys Product
C1 25 Low No
C2 50 High Yes
C3 25 Medium No
C4 50 High Yes
C5 50 Medium No
Step-by-Step Decision Tree Construction
➤Evaluate all features (Age, Income, Location) to determine the best split.
➤Choose 'Income' as the first split since it has the highest Information Gain.
➤Split data into Low, Medium, and High Income groups.

➤If a child node is still impure, split it further on the next-best feature (such as Age); in this example, the Income split already yields pure groups (Low → No, Medium → No, High → Yes), so no further split is needed.
➤Stop when the stopping criteria are met, giving the final decision tree.
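For reference, a minimal plain-Python sketch of one greedy step of this search on the sample dataset (Location is omitted because it is not listed in the table):

# Evaluates the Information Gain of each available feature at the root node
# and greedily picks the best one, with no backtracking.
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values() if c)

def info_gain(feature_values, labels):
    groups = defaultdict(list)
    for v, y in zip(feature_values, labels):
        groups[v].append(y)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

buys   = ["No", "Yes", "No", "Yes", "No"]            # C1..C5
age    = [25, 50, 25, 50, 50]
income = ["Low", "High", "Medium", "High", "Medium"]

gains = {"Age": info_gain(age, buys), "Income": info_gain(income, buys)}
best = max(gains, key=gains.get)                     # greedy local choice
print(gains, "-> split on", best)                    # Income has the higher gain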

Key Takeaways from Hypothesis Space Search


• Stepwise Greedy Approach: The tree picks the best feature at each step instead of searching for the best full tree.
• Recursive Partitioning: The process continues until stopping conditions are met.
• Feature Selection is Automatic: The algorithm naturally chooses the most important features.
• Pruning Helps Generalization: Removing unnecessary branches prevents overfitting to the training data.

The hypothesis space search in decision tree learning explores different ways to split data, aiming to find the best decision tree
structure. The greedy nature of the algorithm ensures quick decision-making but does not guarantee the globally optimal tree.
Pruning techniques help refine the final tree for better accuracy on unseen data.
For reference:

https://colab.research.google.com/drive/1gAjhSD2f2bd3cagsyjJdLEHClMATL4h0?usp=sharing

5 Explain working of Naive Bayes Classifier? [7]


Bayes' theorem is a fundamental concept in probability and statistics used in the context of machine learning, particularly for Naive Bayes classifiers.
• Bayes' theorem, which is fundamental to Naive Bayes classifiers, is a method for calculating the conditional probability of an event (hypothesis) occurring, given that another event (evidence) has already happened.
• This theorem is powerful because it allows us to update our beliefs about an event (B) after considering new evidence (A). This is the core principle behind Naive Bayes classifiers, which utilize Bayes' theorem to make predictions based on features (evidence) and class labels (events).
• The theorem states:
P(B|A) = ( P(A|B) * P(B) ) / P(A)
where:
P(B | A) is the posterior probability of the event (class) B given that the evidence A has been observed. (This is what we want to find.)
P(A | B) is the likelihood of observing the evidence A given that event B has occurred.
P(B) is the prior probability of event B, independent of the evidence.
P(A) is the prior (marginal) probability of observing the evidence A.
• A Naive Bayes classifier applies this theorem with the 'naive' assumption that the features A₁, ..., Aₙ are conditionally independent given the class, so P(B | A₁, ..., Aₙ) ∝ P(B) · P(A₁|B) · ... · P(Aₙ|B), and it assigns the class B with the highest posterior probability.
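A tiny numeric illustration with made-up numbers (B = 'email is spam', A = 'email contains the word free'):

# Bayes' theorem with assumed numbers: P(B)=0.2, P(A|B)=0.6, P(A|not B)=0.1.
p_b = 0.2               # prior probability of spam
p_a_given_b = 0.6       # likelihood of the word given spam
p_a_given_not_b = 0.1   # likelihood of the word given not spam

p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)   # total probability = 0.20
p_b_given_a = p_a_given_b * p_b / p_a                   # posterior = 0.60
print(p_b_given_a)      # seeing the word raises the spam probability from 0.2 to 0.6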
6 What are advantages and disadvantages of Naive Bayes (NB) model. Describe applications of Naive Bayes model. [8]
Advantages and Disadvantages of Naïve Bayes Model
Advantages | Disadvantages
Simplicity and Efficiency: Easy to understand and implement, with low computational cost. | Independence Assumption: Assumes feature independence, which is often unrealistic.
Performance: Works well with large datasets and performs surprisingly well despite its simplicity. | Zero-Frequency Problem: Assigns zero probability to unseen features unless smoothing is applied.
Scalability: Can efficiently handle large-scale data due to fast computation. | Sensitivity to Feature Scaling: Large-range features can dominate unless normalized.
Probability Outputs: Provides probability estimates useful for decision-making. | Limited Interpretability: Becomes less interpretable when many features are involved.
Handling Missing Data: Can handle missing values without complex imputation. | Not Ideal for Continuous Features: Performs better with categorical rather than continuous data.
Strong with Categorical Features: Works well with categorical data like text classification. | Average Performance on Complex Problems: May be outperformed by advanced models in complex tasks.

Applications of Naïve Bayes Model


Application Area | Description | Example
Text Classification | Classifies text into categories such as spam detection and sentiment analysis. | Spam filtering in Gmail.
Real-time Predictions | Performs fast classifications for real-time applications. | Instant email spam detection.
Recommendation Systems | Helps in product or content recommendations. | Amazon product recommendations.
Credit Scoring | Assesses loan applicants' creditworthiness based on income and payment history. | Bank loan approvals.
Medical Diagnosis | Used for preliminary diagnosis based on symptoms and patient history. | Predicting diseases based on symptoms.
Fraud Detection | Identifies fraudulent transactions using past patterns. | Credit card fraud detection.
Email Spam Filtering | Filters spam emails based on content analysis. | Spam detection in Yahoo Mail.
8 Explain working of Naive Bayes models, such as Gaussian, Multinomial, and Bernoulli? [8]
Working of Naïve Bayes Models: Gaussian, Multinomial, and Bernoulli
Overview of Naïve Bayes
Naïve Bayes is a probabilistic machine learning algorithm based on Bayes' Theorem. It is widely used for classification tasks,
including spam detection, sentiment analysis, and medical diagnosis. The model assumes that features are conditionally
independent given the class label, making it computationally efficient and easy to interpret.
Working Process of Naïve Bayes
Step 1: Training the Model
• The algorithm is trained on a labeled dataset where each instance has a set of features and a corresponding class label.
• It calculates the prior probability of each class based on its frequency in the training data.
• It learns the conditional probability of each feature given a class (i.e., how likely a feature appears in a specific category).
• The exact probability distribution depends on the type of Naïve Bayes classifier: Gaussian (continuous features), Multinomial (discrete frequency-based features), or Bernoulli (binary features).

Step 2: Classification of a New Data Point


• For an unseen instance, Naïve Bayes calculates the probability that it belongs to each class using Bayes' Theorem.
• The posterior probability is computed for each class, considering both the prior probability (likelihood of the class occurring) and the likelihood (how well the given features match each class).
• The instance is assigned to the class with the highest probability.

Example: Spam Email Classification


Imagine a spam filter that classifies emails as Spam or Not Spam. Features might include word frequencies, presence of specific
terms like 'free', 'offer', or 'urgent'. Naïve Bayes calculates how frequently these words appear in spam vs. non-spam emails and
uses this information to classify new messages.
Types of Naïve Bayes Classifiers
1. Gaussian Naïve Bayes (GNB)

The probability of a continuous feature x given a class y is computed using the Gaussian (Normal) distribution formula:
P(x | y) = ( 1 / √(2πσ²) ) · exp( -(x - μ)² / (2σ²) )
where:
μ = Mean of the feature in the class
σ² = Variance of the feature in the class
Best for: Continuous numerical data
Assumption: Each feature follows a Gaussian (Normal) distribution
Example: Classification using continuous numerical attributes, such as predicting a house price category from square footage.

Strengths: Works well with continuous data, computationally efficient, handles high-dimensional data.
Weaknesses: Sensitive to outliers, assumes normality which may not always hold.
2. Multinomial Naïve Bayes (MNB)

Unlike Gaussian Naïve Bayes, which handles continuous distributions, Multinomial Naïve Bayes models the frequency of occurrence of discrete features. The likelihood of a feature given a class is estimated from frequency counts with Laplace smoothing:
P(xᵢ | y) = ( count(xᵢ, y) + α ) / ( count(y) + α·N )
where:
xᵢ = Feature (e.g., a word in a document); count(xᵢ, y) = how often xᵢ occurs in class y; count(y) = total feature count in class y
α = Smoothing parameter (Laplace Smoothing to avoid zero probability issues)
N = Number of unique features in the dataset
Best for: Discrete features, such as word counts in text classification.
Assumption: Features follow a Multinomial distribution (categorical data with frequency counts).
Example: Spam email detection based on word frequency.
Strengths: Efficient for large text datasets, handles feature frequency well.
Weaknesses: Assumes feature independence, sensitive to feature imbalance.
3. Bernoulli Naïve Bayes (BNB)

Best for: Binary features (Yes/No, 0/1).


Assumption: Each feature is a Boolean (binary) variable.
Example: Image classification (e.g., detecting if an image contains a cat or not).
Strengths: Ideal for binary data, computationally efficient, simple and interpretable.
Weaknesses: Limited to binary features, assumes feature independence.
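For reference, a minimal scikit-learn sketch of the three variants on tiny made-up arrays (values chosen purely for illustration):

# Gaussian NB for continuous features, Multinomial NB for counts, Bernoulli NB for binary flags.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

X_cont = np.array([[1200.0], [1400.0], [2500.0], [2800.0]])   # one continuous feature
print(GaussianNB().fit(X_cont, y).predict([[2600.0]]))        # expected: class 1

X_counts = np.array([[3, 0], [2, 1], [0, 4], [1, 5]])         # word-count style features
print(MultinomialNB(alpha=1.0).fit(X_counts, y).predict([[0, 3]]))   # expected: class 1

X_bin = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])            # binary (0/1) features
print(BernoulliNB().fit(X_bin, y).predict([[0, 1]]))          # expected: class 1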
Comparison of Naïve Bayes Variants
Naïve Bayes Model | Best for | Data Type | Example Use Case | Key Assumption
Gaussian (GNB) | Continuous numerical data | Continuous | House price prediction | Features follow a normal distribution
Multinomial (MNB) | Text classification, word counts | Discrete | Spam detection, sentiment analysis | Features follow a multinomial distribution
Bernoulli (BNB) | Binary features (0/1, Yes/No) | Binary | Image recognition, fraud detection | Features are independent binary variables
Naïve Bayes is a simple yet powerful classifier that applies Bayes' theorem to estimate the probability of a class given a set of
features. The choice of Naïve Bayes variant depends on the type of data: Gaussian NB for continuous data, Multinomial NB for
text data, and Bernoulli NB for binary data. Despite its independence assumption, Naïve Bayes performs well in many real-world
applications, particularly in text classification, medical diagnosis, and fraud detection.
9 How can a Decision Tree model be applied to predict customer churn?
Explain the key features used in the model. [8 Marks]
Definition of Customer Churn: It is the rate at which customers stop doing business with a company.
Customer churn (also known as customer attrition) refers to the percentage of customers who stop using a company’s product
or service over a specific period. It is a critical business metric, especially for subscription-based and service-oriented industries.
Types of Customer Churn:
1. Voluntary Churn – When a customer actively decides to leave (e.g., canceling a subscription due to dissatisfaction or
switching to a competitor).
2. Involuntary Churn – When a customer leaves due to reasons beyond their control (e.g., payment failures, relocation).
Churn Rate Formula:
Churn Rate (%) = ( Customers lost during the period / Customers at the start of the period ) × 100
Why churn prediction matters:
• Helps businesses identify at-risk customers and take proactive steps to retain them.
• Improves customer satisfaction by addressing pain points before customers leave.
• Increases revenue and profitability by reducing customer acquisition costs (retaining a customer is cheaper than acquiring a new one).
Customer Churn Prediction Using a Decision Tree:

Problem Statement:
A telecom company wants to predict whether a customer will churn based on three features:
1. Monthly Bill (High/Low)
2. Contract Type (Long-term/Short-term)
3. Customer Support Calls (Few/Many)
The company collected the following customer data:
Customer Monthly Bill Contract Type Support Calls Churn (Yes/No)
1 High Short-term Many Yes
2 Low Long-term Few No
3 High Long-term Few No
4 Low Short-term Many Yes
5 High Short-term Few Yes
The goal is to build a Decision Tree to classify new customers as either "Churn" or "No Churn."
Step 1: Calculate Entropy (Before Splitting)
The entropy formula is: H(S) = -p(Yes)log₂p(Yes) - p(No)log₂p(No)
• p(Yes) = Probability of "Churn" = 3/5 = 0.6
• p(No) = Probability of "No Churn" = 2/5 = 0.4
Calculating: H(S) = -0.6log₂(0.6) - 0.4log₂(0.4) = -0.6 × (-0.737) - 0.4 × (-1.322) = 0.442 + 0.529 = 0.971 bits
Step 2: Choose the Best Feature to Split
For "Monthly Bill":
• High Bill Customers (3 customers: 2 Yes, 1 No)
  o Entropy = -2/3 × log₂(2/3) - 1/3 × log₂(1/3) = 0.918 bits
• Low Bill Customers (2 customers: 1 Yes, 1 No)
  o Entropy = -1/2 × log₂(1/2) - 1/2 × log₂(1/2) = 1 bit
Weighted entropy: H(Monthly Bill) = (3/5 × 0.918) + (2/5 × 1) = 0.951 bits
Information Gain: IG(Monthly Bill) = H(S) - H(Monthly Bill) = 0.971 - 0.951 = 0.02 bits
For "Contract Type" (as you mentioned it gives the highest IG):
• Long-term (2 customers: 0 Yes, 2 No)
  o Entropy = 0 bits (pure node)
• Short-term (3 customers: 3 Yes, 0 No)
  o Entropy = 0 bits (pure node)
Weighted entropy: H(Contract Type) = (2/5 × 0) + (3/5 × 0) = 0 bits
Information Gain: IG(Contract Type) = H(S) - H(Contract Type) = 0.971 - 0 = 0.971 bits
For "Support Calls":
• Many Support Calls (2 customers: 2 Yes, 0 No)
  o Entropy = 0 bits (pure node)
• Few Support Calls (3 customers: 1 Yes, 2 No)
  o Entropy = -1/3 × log₂(1/3) - 2/3 × log₂(2/3) = 0.918 bits
Weighted entropy: H(Support Calls) = (2/5 × 0) + (3/5 × 0.918) = 0.551 bits
Information Gain: IG(Support Calls) = H(S) - H(Support Calls) = 0.971 - 0.551 = 0.420 bits
To summarize all Information Gains:
• IG(Monthly Bill) = 0.02 bits
• IG(Contract Type) = 0.971 bits
• IG(Support Calls) = 0.420 bits
Step 3: Construct the Decision Tree
Since "Contract Type" gives the highest Information Gain (a perfect split), we split on it first:
• Long-term: 2 customers, both No Churn → Stop (pure node)
• Short-term: 3 customers, all Churn → Stop (pure node)
Both child nodes are pure, so no further splitting is required.
Final Decision Tree:
        Contract Type?
        /            \
  Long-term        Short-term
 (No Churn)         (Churn)

Step 4: Predict for a New Customer


A new customer has:
• Short-term contract
• Many support calls
• High monthly bill
Following the tree:
1. Short-term contract → the Short-term branch is a pure (all Churn) node → Predict Churn (Yes)
The other features (support calls, monthly bill) are not needed for this prediction.
Thus, the model predicts that this customer will churn.
Final Answer:
• Entropy before splitting: 0.971
• Best first split: Contract Type
• Final prediction for the new customer: Churn (Yes)
This method provides an easy and interpretable way to predict customer churn using Decision Trees. If needed, this model can be
further improved using pruning techniques or advanced ensemble methods like Random Forest.
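For reference, a minimal scikit-learn sketch of this churn example, with the three features hand-encoded as 0/1 (High bill = 1, Short-term contract = 1, Many calls = 1):

# Builds the churn tree with entropy as the splitting criterion and prints its rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([           # [high_bill, short_term, many_calls] for customers 1..5
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
])
y = np.array([1, 0, 0, 1, 1])          # Churn: Yes = 1, No = 0

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["high_bill", "short_term", "many_calls"]))

# New customer: high bill, short-term contract, many support calls
print(tree.predict([[1, 1, 1]]))       # expected: [1] (Churn)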
10 What are the advantages of using Decision Trees for churn prediction application? [8 Marks]

Advantage | Description
Interpretability | Provides a clear, visual representation of decision-making, making it easy to understand why a customer is predicted to churn.
Handles Non-Linearity | Captures complex, non-linear relationships between customer attributes and churn behavior.
Feature Importance | Ranks features based on their significance, helping businesses identify key churn factors.
Handles Categorical & Numerical Data | Works with both categorical (e.g., customer type) and numerical (e.g., monthly spending) data without extensive preprocessing.
Resistant to Missing Values | Can handle missing values effectively by splitting data using available features.
Fast Training & Prediction | Low computational cost allows efficient training and prediction, even on large datasets.
No Need for Feature Scaling | Unlike logistic regression or SVM, decision trees do not require normalization or standardization of input features.

Additional Questions
11 A company wants to predict whether a customer will churn (Y = 1) or stay (Y = 0) based on a given feature X using
Logistic Regression. The predicted probabilities and classifications for 10 customers are recorded.
1. Construct a confusion matrix based on the given actual and predicted values.
2. Calculate the following performance metrics:
o Accuracy
o Precision (PPV)
o Recall (Sensitivity)

o Specificity (TNR)
o Negative Predictive Value (NPV)
o False Positive Rate (FPR)
o False Negative Rate (FNR)
o F1 Score
o ROC AUC Score
o Balanced Accuracy
3. Explain the significance of the calculated F1 Score and Balanced Accuracy in evaluating the logistic regression
model.

Logistic Regression Results

X Y (Actual) Predicted Probability Y_pred (Class)


1 0 0.3205 0
2 0 0.3579 0
3 0 0.3971 0
4 0 0.4377 0
5 0 0.4791 0
6 1 0.5209 1
7 1 0.5623 1
8 1 0.6029 1
9 1 0.6421 1
10 0 0.6795 1

Confusion Matrix
Actual / Predicted Predicted Negative (0) Predicted Positive (1)
Actual Negative (0) 3 (TN) 2 (FP)
Actual Positive (1) 2 (FN) 3 (TP)

Logistic Regression Metrics


Description | Equation | Calculation
Logistic Regression Equation | P(y=1|X) = 1 / (1 + e^-(0.17X - 0.92)) | Computed for each X
Predicted Probability | P = 1 / (1 + exp(-(mX + c))) | Computed for each X
Accuracy | (TP + TN) / (TP + TN + FP + FN) | (3+3) / (3+3+2+2) = 0.60
Precision (PPV) | TP / (TP + FP) | 3 / (3+2) = 0.60
Recall (Sensitivity) | TP / (TP + FN) | 3 / (3+2) = 0.60
Specificity (TNR) | TN / (TN + FP) | 3 / (3+2) = 0.60
Negative Predictive Value (NPV) | TN / (TN + FN) | 3 / (3+2) = 0.60
False Positive Rate (FPR) | FP / (FP + TN) | 2 / (2+3) = 0.40
False Negative Rate (FNR) | FN / (FN + TP) | 2 / (2+3) = 0.40
F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | 2 × (0.60 × 0.60) / (0.60 + 0.60) = 0.60
ROC AUC | Area under the ROC curve | Computed as 0.64
Balanced Accuracy | (Recall + Specificity) / 2 | (0.60 + 0.60) / 2 = 0.60
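For reference, a small Python sketch that computes the listed metrics directly from the confusion-matrix counts used above (TP = 3, TN = 3, FP = 2, FN = 2):

# Derives each metric from the four confusion-matrix counts.
TP, TN, FP, FN = 3, 3, 2, 2

accuracy    = (TP + TN) / (TP + TN + FP + FN)               # 0.60
precision   = TP / (TP + FP)                                # 0.60
recall      = TP / (TP + FN)                                # 0.60 (sensitivity)
specificity = TN / (TN + FP)                                # 0.60
npv         = TN / (TN + FN)                                # 0.60
fpr         = FP / (FP + TN)                                # 0.40
fnr         = FN / (FN + TP)                                # 0.40
f1          = 2 * precision * recall / (precision + recall) # 0.60
balanced_accuracy = (recall + specificity) / 2              # 0.60
print(accuracy, f1, balanced_accuracy)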

Significance of the Calculated F1 Score and Balanced Accuracy in Evaluating the Logistic Regression Model
1. F1 Score (0.60)
• Interpretation:
o The F1 Score balances the trade-off between Precision (how many predicted positives are actually correct) and
Recall (how many actual positives are correctly identified).
o A value of 0.60 indicates that the model maintains a moderate balance between these two aspects.
o If the cost of False Positives (FP) or False Negatives (FN) is high, the F1 Score helps in making a better evaluation rather than just using Accuracy.
o A higher F1 Score means a better model for imbalanced datasets where class distribution is skewed.
2. Balanced Accuracy (0.60)
• Interpretation:
o It accounts for both true positives (Sensitivity) and true negatives (Specificity), making it useful when the dataset
is imbalanced.
o The value 0.60 suggests that the model has moderate performance in identifying both churned and non-churned
customers correctly.
o A Balanced Accuracy closer to 1 would indicate a strong model, whereas closer to 0.5 suggests that the model is
making near-random predictions.
Since both F1 Score and Balanced Accuracy are 0.60, the logistic regression model is performing moderately well, but there is
room for improvement.
To improve the model, techniques such as feature engineering, hyperparameter tuning, or using more advanced algorithms
(e.g., Random Forest, SVM) can be considered.
