
EE708: Fundamentals of Data Science and Machine Intelligence

Assignment 4
Based on Module 4B: Decision Trees and Module 5: Gaussian Mixture Modeling

1. A dataset contains 200 samples classified into two classes: 120 positive and 80 negative.
a. Compute the Gini index before splitting.
b. If a split results in subsets:
Left: (50 positive, 10 negative)
Right: (70 positive, 70 negative)
Compute the weighted Gini index and determine whether the split improves purity.
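
For reference, a minimal Python sketch (not part of the assignment statement) showing how the Gini computation in Question 1 can be checked numerically; the counts are the ones given in the question:

    # Sanity check for Question 1: Gini index before and after a split.
    def gini(pos, neg):
        total = pos + neg
        p, n = pos / total, neg / total
        return 1.0 - p**2 - n**2

    parent = gini(120, 80)                    # Gini before splitting
    left, right = gini(50, 10), gini(70, 70)  # Gini of each child subset
    weighted = (60 / 200) * left + (140 / 200) * right

    print(f"parent={parent:.4f}, weighted={weighted:.4f}")
    print("split improves purity" if weighted < parent else "no improvement")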
2. Consider the given dataset with two independent variables (x1, x2) and one dependent variable (y):

    x1   x2    y
     1    5   10
     2    6   12
     3    8   15
     4   10   18
     5   12   21
     6   15   25
     7   18   28
     8   20   30

a. Use the sum of squared errors (SSE) to determine the best splitting point for x1.
b. Construct the first split of a regression tree using SSE as the impurity measure.
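
A small illustrative sketch (assuming NumPy is available) of how candidate split points on x1 can be scored by SSE; the thresholds are simply midpoints between consecutive x1 values:

    # Illustration for Question 2: SSE of candidate split points on x1.
    import numpy as np

    x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8])
    y  = np.array([10, 12, 15, 18, 21, 25, 28, 30])

    def sse(v):
        return float(np.sum((v - v.mean())**2)) if len(v) else 0.0

    # Candidate thresholds halfway between consecutive x1 values.
    for t in (x1[:-1] + x1[1:]) / 2:
        left, right = y[x1 <= t], y[x1 > t]
        print(f"split at x1 <= {t}: total SSE = {sse(left) + sse(right):.2f}")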
3. Consider a 2-dimensional feature space with a dataset of N = 10 points. A vector quantization (VQ) system maps these points into K = 3 clusters using a codebook. The distortion function is the squared Euclidean distance between the original points and their assigned cluster centroids. Given the following initial cluster centroids:
𝐶1 = (2,3), 𝐶2 = (5,8), 𝐶3 = (9,4)
Assign the following data points to their closest centroid using squared Euclidean distance:
(1,2), (3,4), (6,7), (8,3), (5,5)
a. Compute the new centroids after one iteration of vector quantization.
b. Show whether the distortion decreases after this iteration.
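
For reference, a NumPy sketch of one such iteration, using the centroids and points given above; how empty clusters are handled is an assumption made here (they keep their old centroid):

    # Illustration for Question 3: one iteration of vector quantization.
    import numpy as np

    points    = np.array([[1, 2], [3, 4], [6, 7], [8, 3], [5, 5]], dtype=float)
    centroids = np.array([[2, 3], [5, 8], [9, 4]], dtype=float)

    def assign_and_distort(pts, cents):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((pts[:, None, :] - cents[None, :, :])**2).sum(axis=2)
        labels = d2.argmin(axis=1)
        return labels, d2[np.arange(len(pts)), labels].sum()

    labels, distortion_before = assign_and_distort(points, centroids)
    # New centroids: mean of the points assigned to each cluster
    # (a cluster that receives no points keeps its old centroid here).
    new_centroids = np.array([points[labels == k].mean(axis=0)
                              if np.any(labels == k) else centroids[k]
                              for k in range(len(centroids))])
    _, distortion_after = assign_and_distort(points, new_centroids)
    print(labels, distortion_before, distortion_after)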
4. Show that if we maximize the expected complete-data log-likelihood (the first equation below) with respect to Σ_k and π_k, while keeping the responsibilities γ(z_nk) fixed and enforcing the constraint ∑_{k=1}^{K} π_k = 1, we obtain the closed-form solutions given by the following equations:

    E_Z[ln p(X, Z | μ, Σ, π)] = ∑_{n=1}^{N} ∑_{k=1}^{K} γ(z_nk) ( ln π_k + ln 𝒩(x_n | μ_k, Σ_k) )

    Σ_k = (1/N_k) ∑_{n=1}^{N} γ(z_nk) (x_n − μ_k)(x_n − μ_k)^T

    π_k = N_k / N

where N_k = ∑_{n=1}^{N} γ(z_nk) is the effective number of points assigned to component k.
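
As a numeric companion (not the requested derivation), a short sketch that evaluates these closed-form updates with the responsibilities held fixed; the data and responsibilities below are made up purely for illustration:

    # Numeric check of the closed-form M-step updates in Question 4,
    # with responsibilities gamma (N x K) held fixed. Data are made up.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 2))                 # N = 10 points in 2-D
    gamma = rng.dirichlet(np.ones(3), size=10)   # rows sum to 1, K = 3

    N_k  = gamma.sum(axis=0)                     # effective counts N_k
    pi_k = N_k / len(X)                          # pi_k = N_k / N
    mu_k = (gamma.T @ X) / N_k[:, None]          # weighted component means

    # Sigma_k = (1/N_k) * sum_n gamma_nk (x_n - mu_k)(x_n - mu_k)^T
    Sigma = []
    for k in range(3):
        d = X - mu_k[k]
        Sigma.append((gamma[:, k, None] * d).T @ d / N_k[k])
    print(pi_k, mu_k, Sigma[0], sep="\n")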
5. Consider a density model given by a mixture distribution

    p(x) = ∑_{k=1}^{K} π_k p(x | k)

and suppose that we partition the vector x into two parts so that x = (x_a, x_b). Show that the conditional density p(x_b | x_a) is itself a mixture distribution, and find expressions for the mixing coefficients and component densities.
6. Consider a mixture of Gaussian distributions given by

    p(x | Θ) = ∑_{k=1}^{K} π_k 𝒩(x | μ_k, Σ_k)

where:
    K: number of Gaussian components
    π_k: mixing coefficients, such that ∑_{k=1}^{K} π_k = 1 and π_k > 0
    𝒩(x | μ_k, Σ_k): Gaussian density with mean μ_k and covariance Σ_k
    Θ = {π_k, μ_k, Σ_k}_{k=1}^{K}: the parameters of the model

a. Write down the complete log-likelihood function for a dataset {x_1, x_2, …, x_N}, assuming that the data points are drawn independently from the mixture model.
b. Derive the Maximum Likelihood Estimation (MLE) update rules for π_k, μ_k, and Σ_k, assuming that the component that generated each data point is known.
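
For orientation, a short sketch (assuming SciPy is available) that numerically evaluates the marginal mixture log-likelihood ∑_n ln ∑_k π_k 𝒩(x_n | μ_k, Σ_k) on made-up data and parameters; the derivations asked for above are left to the reader:

    # Evaluating the mixture log-likelihood of Question 6 numerically.
    import numpy as np
    from scipy.stats import multivariate_normal

    X = np.random.default_rng(1).normal(size=(10, 2))   # made-up data
    pi = np.array([0.5, 0.5])
    mus = [np.zeros(2), np.ones(2)]
    Sigmas = [np.eye(2), 2 * np.eye(2)]

    # Component densities evaluated at every point, stacked column-wise.
    dens = np.column_stack([multivariate_normal.pdf(X, m, S)
                            for m, S in zip(mus, Sigmas)])
    log_lik = np.log(dens @ pi).sum()
    print(f"log-likelihood: {log_lik:.4f}")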

Programming Questions:
7. Write code to obtain a fully grown regression tree for the data given in Q2 and visualize the regression tree.
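
One possible starting point, a sketch assuming scikit-learn and Matplotlib are available; with no depth limit, DecisionTreeRegressor grows the tree fully:

    # Sketch for Question 7: fully grown regression tree on the Q2 data.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.tree import DecisionTreeRegressor, plot_tree

    X = np.array([[1, 5], [2, 6], [3, 8], [4, 10],
                  [5, 12], [6, 15], [7, 18], [8, 20]])   # (x1, x2) from Q2
    y = np.array([10, 12, 15, 18, 21, 25, 28, 30])

    tree = DecisionTreeRegressor(criterion="squared_error")  # grow until leaves are pure
    tree.fit(X, y)

    plot_tree(tree, feature_names=["x1", "x2"], filled=True)
    plt.show()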
8. Binary classification tree:
a. Train a fully grown binary classification tree based on Gini impurity using the dataset
A4_train.csv and visualize it.
b. Compute the Sum of Squared Errors (SSE) on the test dataset (A4_test.csv) at each depth and
plot the variation of SSE with depth.
c. Determine the optimal pruning depth by selecting the depth where SSE change is minimal.
d. Visualize the pruned tree.
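
A hedged outline of one way to approach Question 8, assuming scikit-learn, pandas, and Matplotlib; the column layout of A4_train.csv and A4_test.csv is not specified here, so the sketch assumes the last column holds a numeric class label:

    # Sketch for Question 8 (assumes the last CSV column is a numeric label).
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.tree import DecisionTreeClassifier, plot_tree

    train, test = pd.read_csv("A4_train.csv"), pd.read_csv("A4_test.csv")
    Xtr, ytr = train.iloc[:, :-1], train.iloc[:, -1]
    Xte, yte = test.iloc[:, :-1], test.iloc[:, -1]

    # (a) fully grown tree based on Gini impurity
    full = DecisionTreeClassifier(criterion="gini").fit(Xtr, ytr)
    plot_tree(full, filled=True); plt.show()

    # (b) test-set SSE at each depth
    depths = list(range(1, full.get_depth() + 1))
    sse = [np.sum((yte - DecisionTreeClassifier(criterion="gini", max_depth=d)
                   .fit(Xtr, ytr).predict(Xte))**2)
           for d in depths]
    plt.plot(depths, sse, marker="o")
    plt.xlabel("depth"); plt.ylabel("test SSE"); plt.show()

    # (c)/(d) one simple heuristic: pick the depth where the SSE change
    # between consecutive depths is smallest, then re-fit and visualize.
    best_d = depths[int(np.argmin(np.abs(np.diff(sse)))) + 1]
    pruned = DecisionTreeClassifier(criterion="gini", max_depth=best_d).fit(Xtr, ytr)
    plot_tree(pruned, filled=True); plt.show()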
