Assignment 4
Based on Module 4B: Decision Trees and Module 5: Gaussian Mixture Modeling
1. A dataset contains 200 samples classified into two classes: 120 positive and 80 negative.
a. Compute the Gini index before splitting.
b. If a split results in subsets:
Left: (50 positive, 10 negative)
Right: (70 positive, 70 negative)
Compute the weighted Gini index and determine whether the split improves purity.
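Hint: the hand calculation for Q1 can be sanity-checked with a few lines of plain Python (an optional sketch, not a substitute for showing your work):

    # Gini index: 1 - p_pos^2 - p_neg^2 from a node's class counts
    def gini(pos, neg):
        total = pos + neg
        return 1 - (pos / total) ** 2 - (neg / total) ** 2

    parent = gini(120, 80)                    # Gini before splitting
    left, right = gini(50, 10), gini(70, 70)  # Gini of each child node
    weighted = (60 / 200) * left + (140 / 200) * right
    print(parent, weighted)  # the split improves purity if weighted < parent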
2. Consider the given dataset with two independent variables (𝑥1, 𝑥2) and one dependent variable (𝑦):

    𝑥1   𝑥2    𝑦
     1    5   10
     2    6   12
     3    8   15
     4   10   18
     5   12   21
     6   15   25
     7   18   28
     8   20   30

a. Use the sum of squared errors (SSE) to determine the best splitting point for 𝑥1.
b. Construct the first split of a regression tree using SSE as the impurity measure.
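Hint for Q2(a): a brute-force check over the candidate thresholds (midpoints between consecutive 𝑥1 values) can confirm your best split; this is an optional plain-Python sketch:

    x1 = [1, 2, 3, 4, 5, 6, 7, 8]
    y  = [10, 12, 15, 18, 21, 25, 28, 30]

    # SSE of a set of target values around their mean
    def sse(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    # total SSE (left + right) for each candidate threshold on x1
    best = min((sse([yi for xi, yi in zip(x1, y) if xi <= t]) +
                sse([yi for xi, yi in zip(x1, y) if xi > t]), t)
               for t in [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5])
    print("best threshold on x1:", best[1], "total SSE:", round(best[0], 2))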
3. Consider a 2-dimensional feature space with a dataset of 𝑁 = 10 points. A vector quantization (VQ) system maps these points into 𝐾 = 3 clusters using a codebook. The distortion function is the squared Euclidean distance between the original points and their
assigned cluster centroids. Given the following initial cluster centroids:
𝐶1 = (2,3), 𝐶2 = (5,8), 𝐶3 = (9,4)
Assign the following data points to their closest centroid using squared Euclidean distance:
(1,2), (3,4), (6,7), (8,3), (5,5)
a. Compute the new centroids after one iteration of vector quantization.
b. Show whether the distortion decreases after this iteration.
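Hint for Q3: one full iteration (assignment step, centroid update, then distortion comparison) can be checked with this optional plain-Python sketch:

    points    = [(1, 2), (3, 4), (6, 7), (8, 3), (5, 5)]
    centroids = [(2, 3), (5, 8), (9, 4)]

    def d2(p, c):  # squared Euclidean distance
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

    # assignment step: map each point to its nearest centroid
    assign = [min(range(3), key=lambda k: d2(p, centroids[k])) for p in points]

    # update step: each centroid moves to the mean of its assigned points
    new_centroids = []
    for k in range(3):
        members = [p for p, a in zip(points, assign) if a == k]
        new_centroids.append(
            (sum(p[0] for p in members) / len(members),
             sum(p[1] for p in members) / len(members)) if members else centroids[k])

    old_D = sum(d2(p, centroids[a]) for p, a in zip(points, assign))
    new_D = sum(d2(p, new_centroids[a]) for p, a in zip(points, assign))
    print(assign, new_centroids, old_D, new_D)  # new_D should not exceed old_D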
4. Show that if we maximize the expected complete-data log-likelihood

    E_Z[ln p(X, Z | μ, Σ, π)] = ∑_{n=1}^{N} ∑_{k=1}^{K} γ(z_nk) { ln π_k + ln 𝒩(x_n | μ_k, Σ_k) }

with respect to Σ_k and π_k while keeping the responsibilities γ(z_nk) fixed, we obtain the closed-form solutions

    Σ_k = (1/N_k) ∑_{n=1}^{N} γ(z_nk)(x_n − μ_k)(x_n − μ_k)^T   and   π_k = N_k / N,   where N_k = ∑_{n=1}^{N} γ(z_nk).

5. Consider a mixture distribution

    p(x) = ∑_{k=1}^{K} π_k p(x | k)

and suppose that we partition the vector x into two parts so that x = (x_a, x_b). Show that the conditional density p(x_b | x_a) is itself a mixture distribution and find expressions for the mixing coefficients and component densities.
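An optional numerical check for Q4 (assuming NumPy is available): fix random responsibilities on toy 1-D data, apply the closed forms, and verify that perturbing the variances can only lower the objective. This sketch is for self-checking only, not part of the required proof:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(12, 1))                 # toy 1-D data, N = 12
    gamma = rng.dirichlet(np.ones(2), size=12)   # fixed responsibilities, K = 2

    Nk = gamma.sum(axis=0)                       # effective counts N_k
    pi = Nk / len(x)                             # closed-form mixing coefficients
    mu = (gamma.T @ x) / Nk[:, None]             # closed-form means
    s2 = np.array([(gamma[:, k] * (x[:, 0] - mu[k, 0]) ** 2).sum() / Nk[k]
                   for k in range(2)])           # closed-form (1-D) variances

    def Q(pi, mu, s2):  # expected complete-data log-likelihood
        logN = -0.5 * (np.log(2 * np.pi * s2) + (x - mu.T) ** 2 / s2)
        return (gamma * (np.log(pi) + logN)).sum()

    print(Q(pi, mu, s2) >= Q(pi, mu, 1.3 * s2))  # True: the closed form is optimal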
6. Consider a mixture of Gaussian distributions given by

    p(x | Θ) = ∑_{k=1}^{K} π_k 𝒩(x | μ_k, Σ_k)

where:
    K: number of Gaussian components
    π_k: mixing coefficients such that ∑_{k=1}^{K} π_k = 1 and π_k > 0
    𝒩(x | μ_k, Σ_k): Gaussian density with mean μ_k and covariance Σ_k
    Θ = {π_k, μ_k, Σ_k}_{k=1}^{K} represents the parameters of the model.
a. Write down the complete log-likelihood function for a dataset {𝑥1 , 𝑥2 , … , 𝑥𝑁 } assuming
that the data points are drawn independently from the mixture model.
b. Derive the Maximum Likelihood Estimation (MLE) update rules for 𝜋𝑘 , 𝜇𝑘 and Σ𝑘
assuming that the component that generated each data point is known.
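For Q6(b), note that with known component memberships the MLE decouples into per-component sample statistics; this optional NumPy sketch can be used to check your derived formulas on synthetic data:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(-2, 1.0, 40), rng.normal(3, 0.5, 60)])
    z = np.array([0] * 40 + [1] * 60)        # known component labels

    for k in range(2):
        xk = x[z == k]
        print(k,
              len(xk) / len(x),              # pi_k = N_k / N
              xk.mean(),                     # mu_k = component sample mean
              ((xk - xk.mean()) ** 2).mean())  # Sigma_k = biased MLE variance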
Programming Questions:
7. Write code to obtain a fully grown regression tree for the data given in Q2 and visualize the regression tree.
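A possible starting point for Q7, assuming scikit-learn and matplotlib are installed (the data are taken from the Q2 table):

    import matplotlib.pyplot as plt
    from sklearn.tree import DecisionTreeRegressor, plot_tree

    X = [[1, 5], [2, 6], [3, 8], [4, 10], [5, 12], [6, 15], [7, 18], [8, 20]]
    y = [10, 12, 15, 18, 21, 25, 28, 30]

    # max_depth=None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=None, random_state=0).fit(X, y)
    plot_tree(tree, feature_names=["x1", "x2"], filled=True)
    plt.show()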
8. Binary classification tree:
a. Train a fully grown binary classification tree based on Gini impurity using the dataset
A4_train.csv and visualize it.
b. Compute the Sum of Squared Errors (SSE) on the test dataset (A4_test.csv) at each depth and
plot the variation of SSE with depth.
c. Determine the optimal pruning depth by selecting the smallest depth beyond which the SSE changes only minimally.
d. Visualize the pruned tree.
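A possible skeleton for Q8(a)-(b), assuming pandas and scikit-learn are available; the target column name "y" is a guess, so adjust it to the actual schema of A4_train.csv (labels are assumed numeric for the SSE computation):

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.tree import DecisionTreeClassifier, plot_tree

    train, test = pd.read_csv("A4_train.csv"), pd.read_csv("A4_test.csv")
    X_tr, y_tr = train.drop(columns="y"), train["y"]   # "y" is a guessed name
    X_te, y_te = test.drop(columns="y"), test["y"]

    full = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_tr, y_tr)

    # retrain at each depth and record the test-set SSE
    sse = {}
    for d in range(1, full.get_depth() + 1):
        clf = DecisionTreeClassifier(criterion="gini", max_depth=d,
                                     random_state=0).fit(X_tr, y_tr)
        sse[d] = ((clf.predict(X_te) - y_te) ** 2).sum()

    plt.plot(list(sse), list(sse.values()), marker="o")
    plt.xlabel("depth"); plt.ylabel("test SSE"); plt.show()
    # parts (c)-(d): pick the elbow depth from the plot, refit with
    # max_depth set to it, and visualize with plot_tree as in Q7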