DM Lab 04
Decision trees
1) Useful Concepts
1. Basic Concepts of Classification
Classification is a supervised learning task where the objective is to predict the label (or class) of
a given input based on its features. In classification, the output is discrete (e.g., Yes/No, 0/1, or
categories).
Example:
• Problem: Predict whether a student will pass or fail based on study hours.
• Features: Study hours, Attendance.
• Target: Pass or Fail.
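To make this concrete, the student example could be encoded in Python roughly as follows; the numbers below are invented purely for illustration:

# Hypothetical encoding of the student example: each row is one student,
# columns are [study hours, attendance (%)], and y holds the class labels.
X = [
    [2.0, 60],   # studied 2 hours, 60% attendance
    [8.5, 95],   # studied 8.5 hours, 95% attendance
    [5.0, 80],
]
y = ["Fail", "Pass", "Pass"]  # discrete target: Pass or Fail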
2. Introduction to Decision Trees
A decision tree is a flowchart-like structure used for classification (and regression). It breaks
down the dataset into smaller subsets by asking a sequence of questions based on feature values,
ultimately leading to a decision.
• Internal Nodes: Represent a decision based on a feature.
• Branches: Represent the outcome of that decision.
• Leaf Nodes: Represent the final class or prediction.
3. Building a Decision Tree
Decision trees are built by recursively partitioning the data based on the feature that best
separates the classes. The objective is to minimize impurity within each subset.
• Impurity: A measure of how mixed the classes are within a node.
o Gini Impurity
o Entropy
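As a rough sketch (not part of the lab code), both measures can be computed from the class proportions within a node:

import math

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    props = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(p ** 2 for p in props)

def entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)) over the class proportions p_i."""
    n = len(labels)
    props = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in props)

node = ["Pass", "Pass", "Fail", "Fail"]  # a perfectly mixed node
print(gini(node))     # 0.5 (maximum for two classes)
print(entropy(node))  # 1.0 (maximum for two classes)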
4. Hunt’s Algorithm
Hunt’s Algorithm is one of the earliest methods to build decision trees. It follows a recursive
partitioning approach.
Steps of Hunt's Algorithm:
1. If all records at a node belong to the same class, label it as a leaf node.
2. If records are mixed, select a feature to split them.
3. Recur for each subset created by the split.
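A simplified sketch of this recursion in Python is given below; choose_split is a hypothetical helper that picks a feature test and partitions the records, so this illustrates the idea rather than a complete implementation:

def hunts(records, labels, choose_split):
    """Simplified sketch of Hunt's Algorithm.

    records      : list of feature dicts
    labels       : list of class labels, one per record
    choose_split : hypothetical helper that picks a feature test and
                   partitions the records into subsets
    """
    # Step 1: all records belong to the same class -> label it a leaf node
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}

    # Step 2: classes are mixed -> select a feature test to split on
    test, subsets = choose_split(records, labels)

    # Step 3: recur for each subset created by the split
    return {
        "test": test,
        "children": [hunts(r, l, choose_split) for r, l in subsets],
    }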
5. Attribute Conditions in Decision Trees
Attributes (or features) can be either categorical or numerical. The conditions to split data differ
based on the attribute type:
• For Categorical Features: The condition checks for specific category membership.
o Example: If color == 'red'.
• For Numerical Features: The condition checks for a threshold value.
o Example: If age > 30.
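To illustrate the two condition types, here is a small sketch using pandas boolean masks on made-up data:

import pandas as pd

# Made-up data purely for illustration
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "age":   [25, 42, 31, 19],
})

# Categorical feature: test for membership in a specific category
left  = df[df["color"] == "red"]   # records satisfying the condition
right = df[df["color"] != "red"]   # the remaining records

# Numerical feature: test against a threshold value
older   = df[df["age"] > 30]
younger = df[df["age"] <= 30]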
6. Best Split and Decision Tree Algorithm
The best split is the one that maximizes class separation (reduces impurity the most). Commonly
used metrics to find the best split are:
• Gini Index: Measures how impure (mixed) a subset is; lower values mean purer subsets.
• Entropy: Measures the amount of uncertainty in the subset.
• Information Gain: The reduction in entropy after the split.
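As a sketch, information gain for a candidate split can be computed from the entropy function shown earlier (repeated here so the snippet is self-contained); the labels are invented for illustration:

import math

def entropy(labels):
    n = len(labels)
    props = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in props)

def information_gain(parent_labels, child_label_lists):
    """Information gain = entropy(parent) - weighted average entropy of the children."""
    n = len(parent_labels)
    weighted = sum(len(c) / n * entropy(c) for c in child_label_lists)
    return entropy(parent_labels) - weighted

parent   = ["Yes", "Yes", "No", "No"]
children = [["Yes", "Yes"], ["No", "No"]]   # a perfect split
print(information_gain(parent, children))   # 1.0: entropy drops from 1.0 to 0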
7. Characteristics of Decision Tree Induction
• Easy to interpret and visualize.
• No need for data scaling.
• Prone to overfitting (pruning may be required).
• Handles categorical and numerical data.
8. Using Decision Trees in Python (Scikit-Learn)
We will use Python’s Scikit-learn library to build a decision tree.
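A minimal sketch of this workflow, reusing the made-up student data from Section 1 (the values are invented, not lab data):

from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training set: columns are [study hours, attendance (%)]
X = [[2.0, 60], [8.5, 95], [5.0, 80], [1.0, 40], [7.0, 90], [3.0, 70]]
y = ["Fail", "Pass", "Pass", "Fail", "Pass", "Fail"]

# Fit a shallow tree using Gini impurity as the split criterion
clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
clf.fit(X, y)

print(clf.predict([[6.0, 85]]))  # predict the class of a new student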
Activity 2: Loan Approval Prediction
Problem: Predict whether a loan application will be approved based on various features
like income, loan amount, credit history, and marital status.
Dataset Features:
• Applicant Income
• Loan Amount
• Credit History (0: Bad, 1: Good)
• Marital Status (Single, Married)
• Loan Status (Approved or Not Approved - Target Variable)
Importing Libraries and Dataset:
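One plausible way to import the libraries and construct a small in-memory stand-in for this dataset is sketched below; the column names and values are assumptions for illustration, not the lab's actual data file:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical in-memory stand-in for the loan approval dataset
data = pd.DataFrame({
    "ApplicantIncome": [4500, 3000, 6000, 2500, 8000, 3500],
    "LoanAmount":      [120, 100, 200, 90, 250, 110],
    "CreditHistory":   [1, 0, 1, 0, 1, 1],            # 0: Bad, 1: Good
    "MaritalStatus":   ["Married", "Single", "Married",
                        "Single", "Married", "Single"],
    "LoanStatus":      ["Approved", "Not Approved", "Approved",
                        "Not Approved", "Approved", "Approved"],
})

# Encode the categorical feature and separate features from the target
X = pd.get_dummies(data.drop(columns="LoanStatus"), columns=["MaritalStatus"])
y = data["LoanStatus"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)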
Graded Task: Implement a Decision Tree Classifier using Scikit-Learn
Objective: In this task, you will implement a decision tree classifier using the scikit-learn library
in Python. You will work with the Iris dataset, which is a well-known dataset for classification
tasks. The goal is to train and evaluate your model, visualize the decision tree, and report the
model's accuracy.
Task Steps:
• Test the classifier on the test dataset and calculate the accuracy.
• Print the accuracy of the model.
• Experiment with different values for max_depth and min_samples_split to see how they
impact the model's accuracy and structure.
• Comment on the effects of these parameters.
• Provide an interpretation of your results.
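One possible starting skeleton for this task is sketched below; the parameter values (max_depth=3, min_samples_split=2) are only starting points for your own experiments:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load the Iris dataset and split it into training and test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Train a decision tree; vary max_depth and min_samples_split and compare
clf = DecisionTreeClassifier(max_depth=3, min_samples_split=2, random_state=42)
clf.fit(X_train, y_train)

# Test the classifier on the test set and print its accuracy
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Visualize the fitted tree
plot_tree(clf, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True)
plt.show()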