S&ML Unit 6- Q & A
Decision tree representation, Constructing Decision Trees, Classification and Regression Trees, hypothesis space search in
decision tree learning, Bayes' Theorem, Working of Naïve Bayes' Classifier, Types of Naïve Bayes Model, Advantages,
Disadvantages and Application of Naïve Bayes Model.
#Exemplar/Case Studies: Explore a decision tree model for customer churn
Questions and Answers (from previous question papers)
1 What is Decision tree? Write various steps for constructing a decision tree?
How feature selection can be done using decision tree? [8]
A Decision Tree is a tree-based supervised machine learning algorithm used for classification and regression tasks. It models
decision-making using a series of hierarchical rules based on feature values.
Structure of a Decision Tree
A Decision Tree consists of:
1. Root Node – The starting point containing the entire dataset.
2. Internal Nodes (Decision Nodes) – Points where data is split based on feature conditions.
3. Branches – Connections between nodes representing decisions or outcomes.
4. Leaf Nodes – Final nodes that contain the prediction (class label or regression value).
Customer Age Monthly Bill Churn (Yes/No)
C1 25 50 No
C2 45 200 Yes
C3 32 70 No
C4 60 150 Yes
C5 50 100 No
Entropy(S) = - ∑ P(i) log₂ P(i)
- P(i) is the probability of class i in the dataset
- log₂ is the logarithm to base 2
Information Gain = Entropy(Parent) - Weighted Entropy(Children)
Weighted Entropy = ∑ ( |Si| / |S| ) * Entropy(Si)
Where:
- Si is a subset after splitting
- |Si| is the number of elements in subset Si
- |S| is the total number of elements before splitting
Step 1: Calculate Parent Entropy (before splitting)
Total instances: 5 (3 No, 2 Yes)
Entropy = -P(Yes)log2(P(Yes)) - P(No)log2(P(No)) = -(2/5)log2(2/5) - (3/5)log2(3/5)
Entropy = 0.971
Step 2: Calculate Information Gain for 'Age'
Splitting into Age ≤ 40 and Age > 40:
For Age ≤ 40:
Instances: 2 (All 'No')
Entropy = -(0/2)log2(0/2) - (2/2)log2(2/2) = 0
For Age > 40:
Instances: 3 (1 No, 2 Yes)
Entropy = -(2/3)log2(2/3) - (1/3)log2(1/3)
Entropy(Age > 40) = 0.918
Weighted Entropy Calculation:
Weighted Entropy = (2/5)*0 + (3/5)*0.918
Weighted Entropy = 0.551
Information Gain(Age) = 0.971 - 0.551 = 0.420
Step 3: Calculate Information Gain for 'Monthly Bill'
Splitting into Monthly Bill ≤ 100 and > 100:
For Bill ≤ 100:
Instances: 3 (All 'No')
Entropy = -(0/3)log2(0/3) - (3/3)log2(3/3) = 0
For Bill > 100:
Instances: 2 (All 'Yes')
Entropy = -(2/2)log2(2/2) - (0/2)log2(0/2) = 0
Weighted Entropy Calculation:
Weighted Entropy = (3/5)*0 + (2/5)*0
Weighted Entropy = 0.000
Information Gain(Bill) = 0.971 - 0.000 = 0.971
Decision Tree: Since 'Monthly Bill' has the higher Information Gain (0.971 vs 0.420 for 'Age'), it is chosen as the root split: Monthly Bill ≤ 100 leads to 'No' (all three such customers stay), and Monthly Bill > 100 leads to 'Yes' (both such customers churn).
A Decision Tree is a powerful tool for classification and regression that works by selecting features based on splitting criteria
such as Information Gain or Gini Impurity. Feature selection happens naturally as the tree prioritizes informative attributes,
making it an efficient method for decision-making.
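To verify these numbers, the calculation can be reproduced in a few lines of Python. This is a minimal sketch assuming the five customers listed above; the entropy and information_gain helpers are written here for illustration, not taken from any library.
```python
import math

# Hypothetical customers C1-C5 from the table above
data = [
    {"age": 25, "bill": 50,  "churn": "No"},
    {"age": 45, "bill": 200, "churn": "Yes"},
    {"age": 32, "bill": 70,  "churn": "No"},
    {"age": 60, "bill": 150, "churn": "Yes"},
    {"age": 50, "bill": 100, "churn": "No"},
]

def entropy(labels):
    """Shannon entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(rows, feature, threshold):
    """Entropy(parent) minus the weighted entropy of the <= / > threshold split."""
    parent = [r["churn"] for r in rows]
    left   = [r["churn"] for r in rows if r[feature] <= threshold]
    right  = [r["churn"] for r in rows if r[feature] > threshold]
    weighted = (len(left) / len(rows)) * entropy(left) \
             + (len(right) / len(rows)) * entropy(right)
    return entropy(parent) - weighted

print(information_gain(data, "age", 40))    # ~0.420
print(information_gain(data, "bill", 100))  # ~0.971
```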
2 What is the difference between a classification tree and a regression tree? [8]
Comparison of Classification Trees and Regression Trees
Feature Classification Trees Regression Trees
Purpose Classifies data into predefined categories Predicts a continuous target variable
Target Variable Categorical (e.g., 'Yes' or 'No') Continuous (e.g., numerical values)
Output Class label (e.g., Spam or Not Spam) Continuous value (e.g., price, temperature)
Evaluation Metric Accuracy, Precision, Recall, F1-score Mean Squared Error (MSE), R-squared
Leaf Nodes Contain class labels Contain average or predicted target values
Splitting Criterion Maximizes separation using Entropy = - Σ P(i) log₂ P(i) or Gini impurity = 1 - Σ P(i)² Minimizes variance using least squares: MSE = (1/n) Σ (yᵢ - ŷᵢ)²
Sensitivity More sensitive to small changes in data Less sensitive but still affected by outliers
Outliers Effect Can significantly influence splits Outliers can still affect accuracy
Complexity Simpler for categorical data interpretation Can become complex with many features
Best for Handling categorical data Modeling non-linear relationships
Interpretability Easier due to categorical splits Harder due to continuous splits
Example 1 Spam Filtering: classifies emails as spam or not Stock Price Prediction: predicts future stock prices
Example 2 Fraud Detection: detects fraudulent transactions Customer Churn Prediction: predicts the likelihood of a customer leaving
Example 3 Customer Segmentation: groups customers based on behavior Real Estate Valuation: predicts house prices
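To make the contrast concrete, here is a small scikit-learn sketch (scikit-learn is assumed to be installed); the classification data is a toy set invented for illustration, and the regression data reuses the house-size example worked out in Q3 below.
```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: categorical target (1 = spam, 0 = not spam) on toy binary features
X_cls = [[0, 1], [1, 1], [0, 0], [1, 0]]
y_cls = [0, 1, 0, 1]
clf = DecisionTreeClassifier(max_depth=2).fit(X_cls, y_cls)
print(clf.predict([[1, 1]]))   # -> [1], a class label

# Regression tree: continuous target (price in $1000s) vs. house size in sqft
X_reg = [[1000], [1200], [1400], [1600], [1800]]
y_reg = [150, 180, 210, 250, 270]
reg = DecisionTreeRegressor(max_depth=1).fit(X_reg, y_reg)
print(reg.predict([[1700]]))   # -> [260.], the mean price of the matching leaf
```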
3 Explain Classification and Regression Trees (CART) with an example. [8]
Classification and Regression Trees (CART)
1. Introduction to CART
Classification and Regression Trees (CART) is a machine learning algorithm used for both classification and regression problems.
It constructs a decision tree that recursively splits the dataset into smaller subsets to make predictions. The main objective of
CART is to find the best splits that minimize impurity in classification or reduce error in regression.
2. CART for Classification
In classification problems, CART splits the data based on a chosen metric such as the Gini Index or Entropy (Information Gain).
2.1 Gini Index Formula
The Gini Index measures the impurity of a node:
Gini = 1 - Σ (p_i^2)
where:
- p_i is the probability of class i in the node.
Example Calculation:
Consider a node containing two classes (Yes, No) with the following counts:
Class Count
Yes 6
No 4
Gini = 1 - (P(Yes)² + P(No)²) = 1 - (0.6² + 0.4²) = 1 - 0.52 = 0.48
CART is a powerful decision tree algorithm that works for both classification and regression problems. It recursively splits the
data to maximize homogeneity and minimize error, using the Gini Index (classification) or MSE (regression). However, overfitting is a concern that can be
mitigated using pruning techniques.
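The Gini calculation above can be checked with a short sketch in plain Python; the gini helper is defined here only for illustration.
```python
def gini(counts):
    """Gini impurity from class counts: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([6, 4]))  # 1 - (0.6**2 + 0.4**2) = 0.48
```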
3.2 Detailed MSE Calculation
Mean Squared Error (MSE) for the left and right nodes after splitting at Size = 1500.
Dataset:
Size (sqft) Price ($1000s)
1000 150
1200 180
1400 210
1600 250
1800 270
Splitting at Size = 1500:
Left Node (Size < 1500): 1000, 1200, 1400 -> Mean Price = (150 + 180 + 210) / 3 = 180
Right Node (Size >= 1500): 1600, 1800 -> Mean Price = (250 + 270) / 2 = 260
MSE_left = ( (150 - 180)^2 + (180 - 180)^2 + (210 - 180)^2 ) / 3
= (900 + 0 + 900) / 3 = 600
MSE_right = ( (250 - 260)^2 + (270 - 260)^2 ) / 2
= (100 + 100) / 2 = 100
Weighted MSE_split = ( (3/5) * 600 ) + ( (2/5) * 100 )
= 360 + 40 = 400
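The weighted MSE above can be reproduced with a short sketch in plain Python (no libraries assumed); the mse helper is written here for illustration.
```python
# Checking the weighted MSE for the split at Size = 1500 (data from the table above)
sizes  = [1000, 1200, 1400, 1600, 1800]
prices = [150, 180, 210, 250, 270]

def mse(values):
    """Mean squared error of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

left  = [p for s, p in zip(sizes, prices) if s < 1500]
right = [p for s, p in zip(sizes, prices) if s >= 1500]
weighted = (len(left) / len(prices)) * mse(left) + (len(right) / len(prices)) * mse(right)
print(mse(left), mse(right), weighted)  # 600.0 100.0 400.0
```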
4 Explain hypothesis space search in decision tree learning. Give suitable example.
Hypothesis Space in Decision Trees:
The hypothesis space in decision tree learning refers to the set of all possible decision trees that can be generated from a given
dataset. The learning process searches this space to find the best tree that predicts the target variable accurately while generalizing
well to unseen data.
Key Characteristics of Hypothesis Space Search:
It is a Greedy Search: The algorithm selects the best feature locally at each step instead of searching for the best global tree.
No Backtracking: Once a split is made, the tree does not reconsider previous decisions.
It is Hierarchical: The search starts from the root node and proceeds recursively down the tree.
Stopping Conditions: The process stops when the tree reaches a predefined depth, the node contains only one class, or
splitting no longer improves accuracy.
➤In the example, the data is first split on Income; the Medium and High Income groups are then further split based on Age, as it provides the next best split.
➤Continue splitting until stopping criteria are met, forming a final decision tree.
The hypothesis space search in decision tree learning explores different ways to split data, aiming to find the best decision tree
structure. The greedy nature of the algorithm ensures quick decision-making but does not guarantee the globally optimal tree.
Pruning techniques help refine the final tree for better accuracy on unseen data.
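As an illustration of how the hypothesis space can be restricted in practice, the sketch below (assuming scikit-learn; its bundled breast-cancer dataset is used only because it requires no download) compares a fully grown greedy tree with one constrained by max_depth and cost-complexity pruning (ccp_alpha).
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unrestricted greedy search: the tree grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Restricting the hypothesis space (max_depth) plus cost-complexity pruning (ccp_alpha)
pruned_tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01,
                                     random_state=0).fit(X_tr, y_tr)

print("full  :", full_tree.get_depth(), full_tree.score(X_te, y_te))
print("pruned:", pruned_tree.get_depth(), pruned_tree.score(X_te, y_te))
```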
For reference:
https://colab.research.google.com/drive/1gAjhSD2f2bd3cagsyjJdLEHClMATL4h0?usp=sharing
(Advantages and Disadvantages of the Naïve Bayes model, continued)
Advantage: Handles missing data without complex imputation. Disadvantage: Better suited to categorical rather than continuous data.
Advantage: Strong with Categorical Features: works well with categorical data like text classification. Disadvantage: Average Performance on Complex Problems: may be outperformed by advanced models in complex tasks.
1. Gaussian Naïve Bayes (GNB)
The probability of a continuous feature x given a class is computed using the Gaussian (Normal) distribution formula:
P(x | class) = (1 / √(2πσ²)) · exp( -(x - μ)² / (2σ²) )
where:
μ = Mean of the feature in the class
σ² = Variance of the feature in the class
Best for: Continuous numerical data
Assumption: Each feature follows a Gaussian (Normal) distribution
Example: Classification tasks with numerical attributes, such as predicting a house's price category from its square footage.
Strengths: Works well with continuous data, computationally efficient, handles high-dimensional data.
Weaknesses: Sensitive to outliers, assumes normality which may not always hold.
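A minimal Gaussian Naïve Bayes sketch with scikit-learn follows; the feature values and the two price-band class labels are invented for illustration.
```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical continuous features: [square footage, number of rooms]
X = np.array([[1000, 2], [1200, 3], [1100, 2],
              [2500, 4], [3000, 5], [2800, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])   # 0 = lower price band, 1 = higher price band (assumed labels)

model = GaussianNB().fit(X, y)
print(model.predict([[2600, 4]]))        # expected: [1]
print(model.predict_proba([[2600, 4]]))  # class probabilities under the Gaussian assumption
```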
2. Multinomial Naïve Bayes (MNB)
Best for: Discrete count data, such as word frequencies in documents
Assumption: Features are counts generated from a multinomial distribution
Example: Text classification tasks such as spam filtering or topic categorization.
Strengths: Works well for high-dimensional, sparse text data; fast to train.
Weaknesses: Not suited to continuous features; assumes feature counts are conditionally independent given the class.
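A minimal Multinomial Naïve Bayes sketch for text, assuming scikit-learn; the tiny corpus and its spam/ham labels are hypothetical.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-corpus: word counts are the multinomial features
docs   = ["win a free prize now", "meeting agenda for monday",
          "free lottery win now", "project status meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)          # document-term count matrix
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vec.transform(["free prize now"])))  # expected: ['spam']
```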
Exemplar/Case Study: Decision Tree Model for Customer Churn
Why predict churn?
Helps businesses identify at-risk customers and take proactive steps to retain them.
Improves customer satisfaction by addressing pain points before customers leave.
Increases revenue and profitability by reducing customer acquisition costs (retaining a customer is cheaper than
acquiring a new one).
Customer Churn Prediction Using a Decision Tree:
Problem Statement:
A telecom company wants to predict whether a customer will churn based on three features:
1. Monthly Bill (High/Low)
2. Contract Type (Long-term/Short-term)
3. Customer Support Calls (Few/Many)
The company collected the following customer data:
Customer Monthly Bill Contract Type Support Calls Churn (Yes/No)
(five customers in total: three who churned and two who did not, as used in the calculations below)
The goal is to build a Decision Tree to classify new customers as either "Churn" or "No Churn."
Step 1: Calculate Entropy (Before Splitting)
The entropy formula is: H(S) = -p(Yes)log₂p(Yes) - p(No)log₂p(No)
p(Yes) = Probability of "Churn" = 3/5 = 0.6
p(No) = Probability of "No Churn" = 2/5 = 0.4
Calculating: H(S) = -0.6log₂(0.6) - 0.4log₂(0.4) = -0.6 × (-0.737) - 0.4 × (-1.322) = 0.442 + 0.529 = 0.971 bits
Step 2: Choose the Best Feature to Split
For "Monthly Bill":
High Bill Customers (3 customers: 2 Yes, 1 No)
o Entropy = -2/3 × log₂(2/3) - 1/3 × log₂(1/3) = 0.918 bits
Low Bill Customers (2 customers: 1 Yes, 1 No)
o Entropy = -1/2 × log₂(1/2) - 1/2 × log₂(1/2) = 1 bit
Weighted entropy: H(Monthly Bill) = (3/5 × 0.918) + (2/5 × 1) = 0.951 bits
Information Gain: IG(Monthly Bill) = H(S) - H(Monthly Bill) = 0.971 - 0.951 = 0.02 bits
For "Contract Type" (as you mentioned it gives the highest IG):
Long-term (2 customers: 0 Yes, 2 No)
o Entropy = 0 bits (pure node)
Short-term (3 customers: 3 Yes, 0 No)
o Entropy = 0 bits (pure node)
Weighted entropy: H(Contract Type) = (2/5 × 0) + (3/5 × 0) = 0 bits
Information Gain: IG(Contract Type) = H(S) - H(Contract Type) = 0.971 - 0 = 0.971 bits
Step 3: Construct the Decision Tree
Since Contract Type gives the highest IG, we split on it first:
Long-term: (2 customers, both No Churn) → Stop (pure node)
Short-term: (3 customers, all Churn) → Stop (pure node)
Information Gain for "Support Calls"
Let's assume we have the following distribution:
High Support Calls (3 customers: 3 Yes, 0 No)
o Entropy = -3/3 × log₂(3/3) - 0/3 × log₂(0/3) = 0 bits (pure node)
Low Support Calls (2 customers: 0 Yes, 2 No)
o Entropy = -0/2 × log₂(0/2) - 2/2 × log₂(2/2) = 0 bits (pure node)
Weighted entropy: H(Support Calls) = (3/5 × 0) + (2/5 × 0) = 0 bits
Information Gain: IG(Support Calls) = H(S) - H(Support Calls) = 0.971 - 0 = 0.971 bits
This means that both "Contract Type" and "Support Calls" give the same Information Gain of 0.971 bits. This happens because
both features perfectly separate the data into pure nodes.
When multiple features have the same Information Gain, we can choose either one as the first split. If we choose "Support Calls":
High Support Calls: (3 customers, all Churn) → Stop (pure node)
Low Support Calls: (2 customers, both No Churn) → Stop (pure node)
This creates a different tree structure but with the same predictive power as the one that splits on "Contract Type" first.
To summarize all Information Gains:
IG(Monthly Bill) = 0.02 bits
IG(Contract Type) = 0.971 bits
IG(Support Calls) = 0.971 bits
Since both "Contract Type" and "Support Calls" give perfect splits (maximum possible Information Gain), either would be a good
choice for the first split in the decision tree.
Final Decision Tree:
          Contract Type?
         /              \
   Long-term          Short-term
  (No Churn)            (Churn)
An equivalent tree with the same predictive power splits on Support Calls first: Few -> No Churn, Many -> Churn.
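The same tree can be grown programmatically. Since the data rows were not reproduced above, the five customers below are a hypothetical reconstruction consistent with the class counts used in the entropy calculations (three short-term, many-calls churners and two long-term, few-calls non-churners); pandas and scikit-learn are assumed.
```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical reconstruction of the 5 customers, consistent with the counts used above
df = pd.DataFrame({
    "monthly_bill":  ["High", "High", "Low", "High", "Low"],
    "contract_type": ["Short-term", "Short-term", "Short-term", "Long-term", "Long-term"],
    "support_calls": ["Many", "Many", "Many", "Few", "Few"],
    "churn":         ["Yes", "Yes", "Yes", "No", "No"],
})

X = pd.get_dummies(df[["monthly_bill", "contract_type", "support_calls"]])  # one-hot encode
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, df["churn"])

print(export_text(tree, feature_names=list(X.columns)))  # prints the learned split rules
```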
Advantages of using a decision tree for churn prediction:
Handles Non-Linearity: Captures complex, non-linear relationships between customer attributes and churn behavior.
Feature Importance: Ranks features based on their significance, helping businesses identify key churn factors.
Handles Categorical & Numerical Data: Works with both categorical (e.g., customer type) and numerical (e.g., monthly spending) data without extensive preprocessing.
Resistant to Missing Values: Can handle missing values effectively by splitting data using available features.
No Need for Feature Scaling: Unlike logistic regression or SVM, decision trees do not require normalization or standardization of input features.
Additional Questions
11 A company wants to predict whether a customer will churn (Y = 1) or stay (Y = 0) based on a given feature X using
Logistic Regression. The predicted probabilities and classifications for 10 customers are recorded.
1. Construct a confusion matrix based on the given actual and predicted values.
2. Calculate the following performance metrics:
o Accuracy
o Precision (PPV)
o Recall (Sensitivity)
o Specificity (TNR)
o Negative Predictive Value (NPV)
o False Positive Rate (FPR)
o False Negative Rate (FNR)
o F1 Score
o ROC AUC Score
o Balanced Accuracy
3. Explain the significance of the calculated F1 Score and Balanced Accuracy in evaluating the logistic regression
model.
Confusion Matrix
Actual / Predicted Predicted Negative (0) Predicted Positive (1)
Actual Negative (0) 3 (TN) 2 (FP)
Actual Positive (1) 2 (FN) 3 (TP)
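For part 2, every requested metric except the ROC AUC follows directly from this confusion matrix; a short sketch in plain Python:
```python
# Metrics recomputed from the confusion matrix above
TP, TN, FP, FN = 3, 3, 2, 2

accuracy     = (TP + TN) / (TP + TN + FP + FN)               # 0.60
precision    = TP / (TP + FP)                                # 0.60 (PPV)
recall       = TP / (TP + FN)                                # 0.60 (sensitivity)
specificity  = TN / (TN + FP)                                # 0.60 (TNR)
npv          = TN / (TN + FN)                                # 0.60
fpr          = FP / (FP + TN)                                # 0.40
fnr          = FN / (FN + TP)                                # 0.40
f1           = 2 * precision * recall / (precision + recall) # 0.60
balanced_acc = (recall + specificity) / 2                    # 0.60

print(accuracy, precision, recall, specificity, npv, fpr, fnr, f1, balanced_acc)
# The ROC AUC score needs the per-customer predicted probabilities (not reproduced here),
# so it cannot be recomputed from the confusion matrix alone.
```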
Significance of the Calculated F1 Score and Balanced Accuracy in Evaluating the Logistic Regression Model
1. F1 Score (0.60)
Interpretation:
o The F1 Score balances the trade-off between Precision (how many predicted positives are actually correct) and
Recall (how many actual positives are correctly identified).
o A value of 0.60 indicates that the model maintains a moderate balance between these two aspects.
o If the cost of False Positives (FP) or False Negatives (FN) is high, the F1 Score helps in making a better evaluation rather than just using Accuracy.
o A higher F1 Score means a better model for imbalanced datasets where class distribution is skewed.
2. Balanced Accuracy (0.60)
Interpretation:
o It accounts for both true positives (Sensitivity) and true negatives (Specificity), making it useful when the dataset
is imbalanced.
o The value 0.60 suggests that the model has moderate performance in identifying both churned and non-churned
customers correctly.
o A Balanced Accuracy closer to 1 would indicate a strong model, whereas closer to 0.5 suggests that the model is
making near-random predictions.
Since both F1 Score and Balanced Accuracy are 0.60, the logistic regression model is performing moderately well, but there is
room for improvement.
To improve the model, techniques such as feature engineering, hyperparameter tuning, or using more advanced algorithms
(e.g., Random Forest, SVM) can be considered.
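As one concrete example of the suggested improvements, here is a hyperparameter-tuning sketch using scikit-learn's GridSearchCV; the bundled breast-cancer dataset is a stand-in, since the original 10-customer data is not reproduced here.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # stand-in binary classification dataset

# Grid-search the regularization strength C with 5-fold CV, scoring on F1
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(pipe,
                    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10, 100]},
                    scoring="f1", cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```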