Decision_Tree
Fall 23
a) The decision tree is considered an unstable classifier. Explain this statement. How can this
problem be fixed?
Explanation: Decision trees are unstable because small changes in the training data can
lead to significant changes in the structure of the learned tree. This is due to the
hierarchical nature of the tree and the fact that each split is dependent on the previous
splits. A slight change in the data might lead to a different feature or threshold being
chosen at a particular node, propagating down the tree and altering subsequent splits.
Fix: Ensemble methods like Bagging (Bootstrap Aggregating) and Random Forests can
mitigate instability. Bagging trains multiple trees on different bootstrap samples of the
training data and then combines their predictions (e.g., through majority voting).
Random Forests add further randomization by considering only a random subset of
features at each split. This averaging process reduces variance and improves stability.
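For illustration, a minimal sketch using scikit-learn (assumed available; the synthetic dataset and the parameter values are arbitrary) that compares a single tree against bagged trees and a random forest by cross-validated accuracy:

```python
# Sketch: stabilizing a single decision tree with bagging and a random forest.
# scikit-learn assumed; the synthetic dataset is only illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(n_estimators=100, random_state=0)   # default base estimator is a decision tree
forest = RandomForestClassifier(n_estimators=100, random_state=0)    # also randomizes the features tried at each split

for name, model in [("single tree", single_tree), ("bagging", bagged_trees), ("random forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```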
Spring 23
e) What is the main disadvantage of the decision boundaries generated by the decision
trees especially for real-valued features?
Answer: The main disadvantage is that decision trees, in their basic form, create axis-
aligned decision boundaries when using real-valued features. This means that the splits
are always perpendicular to one of the feature axes. This can be suboptimal when the
true decision boundary is not aligned with the axes, requiring many splits to
approximate a diagonal or curved boundary, potentially leading to a complex and
overfitted tree.
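A small sketch (assuming scikit-learn and NumPy; the data are synthetic) that makes this concrete: when the true boundary is the diagonal x0 = x1, an unpruned tree needs many axis-aligned splits, and hence many leaves, to approximate a single line:

```python
# Sketch: an axis-aligned tree needs many splits to approximate a diagonal boundary.
# scikit-learn and NumPy assumed; the data are synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 2))
y = (X[:, 0] > X[:, 1]).astype(int)   # true boundary is the diagonal x0 = x1

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("leaves used to approximate one diagonal line:", tree.get_n_leaves())
```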
Fall 22
(3) Why do decision trees belong to the class of unstable classifiers? How can this problem
be overcome?
Answer: (Same as Fall 23, a)) Decision trees are unstable because small changes in the
training data can result in significantly different tree structures. This is because the
decisions made at each node are highly dependent on the data that reaches that node,
and early splits can propagate down the tree, affecting subsequent splits.
Overcome: Ensemble methods like Bagging and Random Forests can improve stability
by training multiple trees on different subsets of the data and combining their
predictions.
Spring 21
Answer: (Same as previous answers) Decision trees are unstable because small changes
in the training data can lead to significant changes in the tree structure. This is due to
the hierarchical and sequential nature of the tree-building process, where each split
depends on the previous ones.
1. Question: Are decision trees linear or non-linear classifiers? Answer: Decision trees are non-
linear classifiers.
3. Question: What are the two possible outcomes of a query at an internal node of an OBCT?
Answer: The two possible outcomes are YES and NO, corresponding to whether the feature
value satisfies the condition (e.g., xi ≤ α).
4. Question: What is the main goal of the splitting criterion used in decision tree construction?
Answer: The main goal is to create descendant subsets that are more class-homogeneous
compared to the parent node, meaning the data points within each subset belong
predominantly to a single class.
6. Question: In the context of decision trees, what does it mean for a node to be "pure"?
Answer: A pure node is one in which all of the data points assigned to it belong to the same
class.
7. Question: Why are decision trees considered "non-metric" classifiers when dealing with
nominal data? Answer: Because they can classify data based on discrete attributes without
requiring a notion of distance or similarity between the attributes. The questions asked are
based on attribute values directly.
8. Question: What is CART an acronym for? Answer: Classification and Regression Trees.
9. Question: What is one advantage of using decision trees in terms of interpretability?
Answer: Decision trees are highly interpretable as the sequence of decisions can be easily
expressed as logical rules or visualized as a tree structure.
10. Question: In the context of decision tree splitting, what does "node impurity" measure?
Answer: Node impurity measures the heterogeneity of class labels within the data points at
that node. A high impurity indicates a mix of classes, while a low impurity indicates that the
node is mostly composed of a single class.
Open-Ended Questions:
11. Question: Describe the key design elements that define a decision tree. Answer: The key
design elements of a decision tree include:
Node Splitting: Each internal node is associated with a subset of the training data and
a query (typically involving a single feature and a threshold). This query splits the data
into two disjoint subsets based on the answer (YES/NO).
Splitting Criterion: A metric used to determine the best split at each node. This
criterion aims to maximize the homogeneity of the resulting child nodes (e.g., by
maximizing the decrease in impurity).
Stop-Splitting Criterion: A rule that determines when to stop splitting a node and
declare it a leaf. This prevents the tree from becoming too complex and overfitting the
data. Common criteria include reaching a minimum number of samples in a node or a
minimum decrease in impurity.
Class Assignment Rule: A method for assigning a class label to each terminal leaf node.
This is often based on the majority class of the training samples that reach that leaf.
Set of Questions (Queries): The type of conditions tested at each node. In OBCTs, these
are typically comparisons of a feature value to a threshold.
12. Question: Explain the concept of "node impurity" and how the "decrease in node impurity"
is used as a splitting criterion in decision tree construction. Answer:
Node Impurity: Node impurity quantifies the heterogeneity of class labels within the
data points at a particular node. A high impurity value means the node contains a mix
of different classes, while a low impurity value indicates that the node is dominated by
a single class. Common measures of impurity include the Gini index and entropy. In the
slides, the impurity measure is given by I(t) = 1 - Σ [P(ωi|t)]², where P(ωi|t) is the
proportion of data points of class ωi at node t. Another measure mentioned is based
on entropy: I(t) = - Σ P(ωi|t) log₂[P(ωi|t)].
Decrease in Node Impurity: The decrease in node impurity (ΔI) measures how much
the impurity is reduced after splitting a node. It is calculated as ΔI = I(t) - [NY/Nt * I(tY)
+ NN/Nt * I(tN)], where I(t) is the impurity of the parent node t, I(tY) and I(tN) are the
impurities of the child nodes resulting from the split (YES and NO branches,
respectively), Nt is the number of data points at the parent node, and NY and NN are the
numbers of data points sent to the YES and NO children. (A small code sketch of these
formulas follows this answer.)
Use as Splitting Criterion: During the training process, for each node, the algorithm
considers all possible splits based on different features and thresholds. The split that
results in the highest decrease in node impurity is chosen as the best split for that
node. The rationale is that a larger decrease in impurity indicates that the split is more
effective in separating the classes, leading to more homogeneous child nodes. The goal
is to recursively partition the data such that the leaf nodes eventually become as pure
as possible.
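To make these formulas concrete, here is a small sketch in Python (NumPy assumed; the class counts are hypothetical) that computes the entropy and Gini impurities and the decrease in impurity ΔI for a candidate split:

```python
# Sketch of the impurity formulas from this answer; NumPy assumed, class counts hypothetical.
import numpy as np

def entropy(counts):
    """I(t) = - sum_i P(w_i|t) * log2 P(w_i|t), with 0*log(0) treated as 0."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

def gini(counts):
    """I(t) = 1 - sum_i P(w_i|t)^2."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, yes_child, no_child, impurity=entropy):
    """Delta I = I(t) - [N_Y/N_t * I(t_Y) + N_N/N_t * I(t_N)]."""
    n_t, n_y, n_n = sum(parent), sum(yes_child), sum(no_child)
    return impurity(parent) - (n_y / n_t) * impurity(yes_child) - (n_n / n_t) * impurity(no_child)

# Hypothetical class counts [class 1, class 2] at the parent and the two children:
print(impurity_decrease(parent=[8, 8], yes_child=[7, 1], no_child=[1, 7]))
```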
13. Question: Summarize the algorithmic scheme for constructing an Ordinary Binary
Classification Tree (OBCT). Answer: The algorithmic scheme for constructing an OBCT
typically involves the following steps:
1. Start with the root node: Place all training data points at the root node.
2. Evaluate splitting criteria: For the current node, examine all possible splits based on
each feature and a set of potential threshold values.
3. Select the best split: Choose the split that maximizes the decrease in node impurity (or
minimizes some other impurity measure). This involves selecting the feature and
threshold that best separates the classes in the current node's data.
4. Create child nodes: Create two child nodes corresponding to the chosen split (YES
branch and NO branch). Distribute the data points from the parent node to the child
nodes based on whether they satisfy the split condition.
5. Check stopping criteria: Evaluate if any of the stop-splitting criteria are met for the
child nodes. These criteria might include:
The node is pure (all data points belong to the same class).
The number of data points in the node is below a threshold.
The maximum depth of the tree has been reached.
The decrease in impurity from the best split is below a certain threshold.
6. Declare leaf nodes: If a stopping criterion is met, declare the node as a terminal leaf
node and assign it a class label (e.g., the majority class of the data points in that leaf).
7. Recursively split: If no stopping criteria are met, recursively apply steps 2-6 to each
child node.
8. Tree completion: The process continues until all nodes are either pure or meet a
stopping criterion, resulting in a fully grown decision tree.
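A compact sketch of this scheme in Python (NumPy assumed; an illustration under simplified stopping rules, not a full CART implementation): it searches all axis-aligned splits with an entropy criterion, stops on purity or small nodes, and labels leaves by majority vote.

```python
# Sketch of the OBCT growing scheme described above; NumPy assumed, not production CART.
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Return (feature, threshold, impurity_decrease) of the best axis-aligned split."""
    best = (None, None, 0.0)
    n = len(y)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            mask = X[:, j] <= thr
            if mask.all() or not mask.any():
                continue
            gain = entropy(y) - (mask.sum() / n) * entropy(y[mask]) - ((~mask).sum() / n) * entropy(y[~mask])
            if gain > best[2]:
                best = (j, thr, gain)
    return best

def grow(X, y, min_samples=2):
    """Recursively grow the tree; stop on purity or small nodes, label leaves by majority vote."""
    values, counts = np.unique(y, return_counts=True)
    if len(values) == 1 or len(y) < min_samples:
        return {"label": values[np.argmax(counts)]}          # leaf node
    feature, thr, gain = best_split(X, y)
    if feature is None or gain <= 0.0:
        return {"label": values[np.argmax(counts)]}          # no useful split -> leaf
    mask = X[:, feature] <= thr
    return {"feature": feature, "threshold": thr,            # internal node: "x_j <= thr?"
            "yes": grow(X[mask], y[mask], min_samples),
            "no": grow(X[~mask], y[~mask], min_samples)}
```

Classifying a new point would simply follow the YES/NO answers down the nested dictionaries until a node carrying a "label" key is reached.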
14. Question: Discuss the concept of decision tree instability and how "bagging" can be used to
overcome this issue. Answer:
Decision Tree Instability: Decision trees are considered unstable classifiers because
small changes in the training data can lead to significant changes in the structure of the
learned tree. This high variance means that different trees trained on slightly different
subsets of the data can produce different predictions. This sensitivity to the training
data can limit the generalization performance of a single decision tree.
Bagging (Bootstrap Aggregating): Bagging reduces this variance by training many trees
on resampled versions of the data and combining their outputs, as follows:
1. Bootstrap Sampling: Create multiple (B) bootstrap samples from the original
training dataset. A bootstrap sample is created by randomly sampling data points
with replacement from the original dataset, resulting in datasets of the same size
as the original but with some data points repeated and others omitted.
2. Individual Tree Training: Train a separate decision tree on each of the B bootstrap
samples. Each tree is grown without pruning (or with minimal pruning) to
encourage diversity among the trees.
3. Aggregation (Majority Voting): To make a prediction for a new instance, each of
the B trained trees makes its own prediction. For classification tasks, the final
prediction is determined by a majority vote among the predictions of the
individual trees. For regression tasks, the predictions are typically averaged.
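A brief sketch of these three steps written out by hand in Python (NumPy and scikit-learn assumed for sampling and the base trees; purely illustrative, since BaggingClassifier already implements this):

```python
# Sketch of bagging by hand: bootstrap samples, one unpruned tree per sample, majority vote.
# NumPy and scikit-learn assumed; labels are assumed to be non-negative integers.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    trees, n = [], len(y)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)              # 1. bootstrap sample (with replacement)
        tree = DecisionTreeClassifier()               # 2. fully grown (unpruned) tree
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    votes = np.stack([t.predict(X) for t in trees])   # 3. collect each tree's prediction
    # majority vote per test point (columns of the vote matrix)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```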
15. Question: Explain how decision trees can be used for classifying non-metric data, providing
an example of a scenario where this approach would be particularly useful. Answer:
Decision trees are well-suited for classifying non-metric data because they can make
decisions based on the values of discrete attributes without requiring a notion of distance
or similarity. The queries at each node involve checking the value of a specific attribute.
Process for Non-Metric Data: Instead of comparing feature values to thresholds, the
splits in a decision tree for non-metric data involve checking if a categorical attribute
has a specific value or belongs to a subset of values. For example, if classifying fruits
based on attributes like color (red, green, yellow), texture (smooth, rough), and taste
(sweet, sour), a split might be "Is color = red?" or "Is texture in {smooth, shiny}?".
Example Scenario: Consider a medical diagnosis system where patient symptoms are
recorded as categorical attributes (e.g., fever: yes/no, cough: dry/wet, rash:
present/absent). There's no natural ordering or distance metric between these
symptoms. A decision tree can be effectively used to classify the possible diseases
based on the presence or absence of these symptoms or their specific types. For
instance, the root node might ask "Does the patient have a fever?". The branches would
lead to further questions based on other symptoms, ultimately leading to a predicted
diagnosis at the leaf nodes.
In this scenario, traditional metric-based classifiers might struggle to handle the categorical
nature of the data without complex feature engineering or distance metric definitions.
Decision trees, on the other hand, can directly utilize these categorical attributes to build a
classification model that is both interpretable and effective.
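As a toy illustration (the attributes and rules below are hypothetical, not from the slides), such a tree is just a nested sequence of equality and membership questions:

```python
# Toy sketch: a hand-written decision tree over categorical attributes.
# The attributes and rules are hypothetical, only to show that no distance metric is needed.
def classify_fruit(fruit: dict) -> str:
    if fruit["color"] == "red":                  # "Is color = red?"
        return "apple" if fruit["taste"] == "sweet" else "cherry"
    if fruit["texture"] in {"smooth", "shiny"}:  # "Is texture in {smooth, shiny}?"
        return "grape"
    return "lemon" if fruit["taste"] == "sour" else "pear"

print(classify_fruit({"color": "green", "texture": "smooth", "taste": "sweet"}))  # -> grape
```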
16. Question: Describe the recursive tree-growing process in the CART (Classification and
Regression Trees) algorithm. What are the criteria for deciding whether to split a node
further or declare it a leaf node? Answer: The recursive tree-growing process in CART
involves repeatedly partitioning the training dataset into subsets until certain stopping
criteria are met. Here's a breakdown:
1. Start at the Root: The process begins with all the training data residing at the root
node of the tree.
2. Find the Best Split: At each node, the algorithm searches for the best split among all
possible features and their possible split points (thresholds for continuous features, or
subsets of values for categorical features). The "best" split is determined by a splitting
criterion, typically one that maximizes the reduction in impurity (e.g., Gini impurity or
entropy).
3. Create Child Nodes: Once the best split is found, the data at the current node is
partitioned into two (for binary trees) or more child nodes based on the outcome of
the split condition.
4. Recursion: The process then recursively calls itself on each of the child nodes. This
means steps 2 and 3 are repeated for each child node, attempting to find the best split
for their respective subsets of data.
5. Stopping Criteria (Leaf Node Declaration): The recursion stops, and a node is declared
a leaf node when one or more stopping criteria are met. Common stopping criteria
include:
Purity: All samples at the node belong to the same class (impurity is zero).
Minimum Samples: The number of samples at the node is below a specified
minimum threshold. This prevents the creation of very small, potentially noisy leaf
nodes.
Minimum Impurity Decrease: The best possible split at the node results in an
impurity decrease below a certain threshold. This indicates that further splitting
does not significantly improve class separation.
Maximum Depth: The tree has reached a predefined maximum depth. This limits
the complexity of the tree and helps prevent overfitting.
6. Leaf Node Labeling: Once a node is declared a leaf, it is assigned a class label. For
classification trees, this is typically the majority class of the samples residing in that leaf.
For regression trees, it might be the average value of the target variable for the
samples in the leaf.
The recursive nature of the CART algorithm allows it to build complex decision boundaries
by repeatedly partitioning the feature space based on the most informative features at each
step. The stopping criteria are crucial for controlling the size and complexity of the tree and
preventing overfitting.
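As a sketch of how these criteria appear in practice, the same stopping rules map directly onto the hyperparameters of scikit-learn's CART-based trees (scikit-learn assumed; the values shown are arbitrary examples):

```python
# Sketch: the stopping criteria above expressed as scikit-learn CART hyperparameters.
# scikit-learn assumed; the parameter values are arbitrary examples.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf = DecisionTreeClassifier(
    criterion="gini",            # splitting criterion (or "entropy")
    min_samples_leaf=5,          # minimum samples: no leaf smaller than 5 points
    min_impurity_decrease=1e-3,  # minimum impurity decrease required to keep splitting
    max_depth=10,                # maximum depth of the tree
)
reg = DecisionTreeRegressor(max_depth=10)  # regression trees label leaves with the mean target value
```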
17. Question: Consider the example provided with 16 data points and the entropy impurity
measure. Explain how the first split is determined in that example (the split at x1 = 7.5).
What calculations would be involved in making this decision? Answer: To determine the first
split at x1 = 7.5, the CART algorithm would have evaluated all possible splits on both
features (x0 and x1). Here's a conceptual breakdown of the calculations involved, focusing
on the split at x1 = 7.5:
1. Initial Impurity: First, the entropy impurity of the root node (containing all 16 data
points) is calculated. Let's assume there are two classes (as visually implied by different
colors in a typical example of this kind). You would count the number of points
belonging to each class in the root node and calculate the entropy using the formula
I(t) = - Σ P(ωi|t) log₂[P(ωi|t)].
2. Evaluate Splits on x1: The algorithm would then consider all possible split points along
the x1 axis. A potential split point is between any two adjacent data points sorted by
their x1 value. In this case, x1 = 7.5 is one such potential split point.
3. Calculate the Child Node Impurities:
Left Child (x1 ≤ 7.5): Determine the data points that fall into this left child node.
Count the number of points from each class in this node and calculate its entropy
I(tY) = - Σ P(ωi|tY) log₂[P(ωi|tY)].
Right Child (x1 > 7.5): Do the same for the data points falling into the right child
node to obtain I(tN).
4. Calculate Weighted Average Impurity: Calculate the weighted average impurity of the
child nodes after the split: (NY/Nt) * I(tY) + (NN/Nt) * I(tN).
5. Calculate Information Gain (Decrease in Impurity): The information gain for the split
at x1 = 7.5 is the difference between the impurity of the parent node and the weighted
average impurity of the child nodes: ΔI = I(t) - [(NY/Nt) * I(tY) + (NN/Nt) * I(tN)].
6. Repeat for All Possible Splits: Steps 2-5 would be repeated for all other potential split
points on the x1 axis and also for all potential split points on the x0 axis.
7. Select the Best Split: The split that yields the highest information gain (largest decrease
in entropy impurity) is chosen as the first split. In the example, the split at x1 = 7.5
resulted in the highest information gain compared to all other possible splits, making it
the chosen split for the root node. The red values in the provided example slide likely
represent the calculated entropy impurity for the nodes after the split.
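A small sketch of this procedure for a single feature in Python (NumPy assumed; the data below are hypothetical, not the 16 points from the slide): enumerate candidate thresholds at midpoints between adjacent sorted values, compute the weighted child entropies, and keep the threshold with the largest information gain.

```python
# Sketch of steps 1-7 for one feature; NumPy assumed, data hypothetical (not the slide's 16 points).
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(x, y):
    """Try midpoints between adjacent sorted values of x; return (threshold, information gain)."""
    parent = entropy(y)                       # step 1: impurity of the parent node
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_thr, best_gain = None, 0.0
    for a, b in zip(xs[:-1], xs[1:]):
        if a == b:
            continue
        thr = (a + b) / 2                     # step 2: candidate split point between neighbours
        left, right = ys[xs <= thr], ys[xs > thr]   # step 3: child node memberships
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)  # step 4
        gain = parent - weighted              # step 5: decrease in impurity
        if gain > best_gain:                  # steps 6-7: keep the best split seen so far
            best_thr, best_gain = thr, gain
    return best_thr, best_gain

x1 = np.array([2, 3, 4, 6, 8, 9, 10, 11], dtype=float)   # hypothetical x1 values
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
thr, gain = best_threshold(x1, labels)
print(f"best threshold on x1: {thr}, information gain: {gain:.3f}")
```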