10 EST Solution
Q1(a) Name and briefly outline any two measures to compute the dissimilarity
between objects described by the following: Asymmetric binary attributes and
Numeric attributes.
Asymmetric Binary Attributes - Jaccard Dissimilarity
For asymmetric binary attributes, where each object is described by binary attributes (present or absent) and only presences are informative, the Jaccard dissimilarity is used. It ignores the 0-0 (joint absence) matches: the numerator counts the attribute positions in which the two objects disagree, and the denominator normalizes by the total number of attributes that are present in at least one of A and B.
Numeric Attributes - Euclidean (Minkowski) Distance
For numeric attributes, the Euclidean distance (the Minkowski distance with h = 2) is the standard dissimilarity measure: the square root of the sum of squared differences between the corresponding attribute values of the two objects.
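In symbols, the two measures can be sketched as below. The notation is introduced here for clarity and is not taken from the question: f01, f10 and f11 are the numbers of attribute positions where objects A and B take the values 0/1, 1/0 and 1/1 respectively, and h is the Minkowski order.

% Jaccard dissimilarity for asymmetric binary attributes
d_{Jaccard}(A,B) = \frac{f_{01} + f_{10}}{f_{01} + f_{10} + f_{11}}

% Minkowski distance for numeric attributes; h = 2 gives the Euclidean distance
d_{Minkowski}(x,y) = \Big( \sum_{k=1}^{n} \lvert x_k - y_k \rvert^{h} \Big)^{1/h}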
(b) Compute the Hamming Distance and Jaccard Similarity between the following
two binary vectors
x=<0 1 0 1 0 1 0 0 0 1> y=<0 1 0 0 0 1 1 0 0 0>
Solution:
Hamming distance = number of different bits = 3
Jaccard Similarity = number of 1-1 matches / (number of bits - number of 0-0 matches) =
2/5 = 0.4
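These two values can be verified with a few lines of Python (standard library only):

x = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]

# Hamming distance: number of positions where the bits differ
hamming = sum(xi != yi for xi, yi in zip(x, y))

# Jaccard similarity: 1-1 matches / (total bits - 0-0 matches)
m11 = sum(xi == 1 and yi == 1 for xi, yi in zip(x, y))
m00 = sum(xi == 0 and yi == 0 for xi, yi in zip(x, y))
jaccard = m11 / (len(x) - m00)

print(hamming)   # 3
print(jaccard)   # 0.4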
(d) Suppose that you are comparing how similar two organisms of different species
are in terms of the number of genes they share. Describe which measure, Hamming
or Jaccard, you think would be more appropriate for comparing the genetic makeup
of two organisms. Explain. (Assume that each animal is represented as a binary
vector, where each attribute is 1 if a particular gene is present in the organism and 0
otherwise.)
Solution: Jaccard is more appropriate for comparing the genetic makeup of two
organisms, since we want to see how many genes these two organisms share; the many genes that neither organism has (0-0 matches) should not make them look similar.
(e) If you wanted to compare the genetic makeup of two organisms of the same
species, e.g., two human beings, would you use the Hamming distance, the Jaccard
coefficient, or a different measure of similarity or distance? Explain. (Note that two
human beings share > 99.9% of the same genes.)
Solution: Two human beings share > 99.9% of the same genes. If we want to compare
the genetic makeup of two human beings, we should focus on their differences. Thus, the
Hamming distance / Simple Matching Coefficient is more appropriate in this situation.
Q2 Consider a dataset comprising ten samples, where each sample has two
features: Feature A and Feature B. The samples belong to two classes, labelled class
0 and class 1. The data is given in the table below:
(a) Use a k-nearest neighbor classifier (Euclidean distance measure, k = 3) to
classify the given test sample <Feature A = 43.6, Feature B = 40>.
Solution: Given the dataset and the new test instance, we need to find the distance from the
new test instance to every training example, using the Euclidean distance formula. The
table below shows the calculated distance from the test example to each training instance.
Feature A   Feature B   Label   Distance to test sample   Rank (3 nearest)
31          56          1       20.37
33.6        50          1       14.14                     2
26.6        30          0       19.72
23.4        40          0       20.20
43.1        67          0       27.00
35.3        23          1       18.92
35.9        67          1       28.08
36.7        45          1       8.52                      1
25.7        46          0       18.88                     3
23.3        29          0       23.09
For example, for the first row the distance is √((43.6-31)^2+(40-56)^2) = 20.37.
Once you calculate the distances, the next step is to find the nearest neighbors based on
the value of k. In this case, the value of k is 3. Hence we need to find the 3 nearest
neighbors (marked in the Rank column above).
Now, we apply the majority voting technique to decide the resulting label for the new
example. Here the 1st and 2nd nearest neighbors have target label 1 and the 3rd nearest
neighbor has target label 0. Target label 1 has the majority, hence the new example is
classified as class 1.
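The same classification can be reproduced with a short brute-force sketch in Python (standard library only); the training samples are hard-coded from the table above:

from math import dist  # Euclidean distance, Python 3.8+

# (Feature A, Feature B, label) for the ten training samples
train = [
    (31, 56, 1), (33.6, 50, 1), (26.6, 30, 0), (23.4, 40, 0), (43.1, 67, 0),
    (35.3, 23, 1), (35.9, 67, 1), (36.7, 45, 1), (25.7, 46, 0), (23.3, 29, 0),
]
test = (43.6, 40)
k = 3

# Sort the training samples by Euclidean distance to the test point
neighbors = sorted(train, key=lambda s: dist(s[:2], test))[:k]

# Majority vote over the k nearest labels
labels = [label for _, _, label in neighbors]
prediction = max(set(labels), key=labels.count)
print(neighbors)    # [(36.7, 45, 1), (33.6, 50, 1), (25.7, 46, 0)]
print(prediction)   # 1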
(b) What is the impact of k on model performance (in terms of overfitting and
underfitting). Explain.
Solution: In kNN, choosing the value of k is crucial.
A small value of k means that noise has a higher influence on the result, while a very
large value makes prediction more expensive and overly smooth.
If we choose k = 1, the model overfits: it fits the training data, including its noise, exactly
and produces a highly non-smooth decision surface.
As k increases, the decision surface gets smoother; if k is chosen very large, the model
underfits, the decision surface becomes almost flat, and eventually every point is assigned
to the majority class of the dataset.
So k should be chosen carefully (for example, by cross-validation) so that the model
neither overfits nor underfits.
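As a rough illustration (scikit-learn assumed; the synthetic two-moons dataset, its noise level and the particular k values are arbitrary choices), training and test accuracy can be compared for a small, a moderate and a very large k:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 15, 300):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # k=1 tends to give near-perfect training accuracy but weaker test accuracy
    # (overfitting); a very large k pushes everything toward the majority class
    # (underfitting); an intermediate k usually balances the two.
    print(k, knn.score(X_tr, y_tr), knn.score(X_te, y_te))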
(c) List and explain any three possible ways to reduce the inference cost associated
with k-nearest neighbor classifiers.
Solution:
1. Subsampling: A very simple, yet often very effective way of reducing the inference
cost is to subsample the data. For example, we might have an available dataset of, say,
10M data points, but we can do a good enough job with 100K data points.
2. Dimension reduction: For some distances, such as L2 and cosine distance, it’s
possible to reduce the dimension of the data points while approximately preserving the
distance between a query point and any data point. The quality of the approximation
depends only on the output dimension: a small dimension means a crude approximation,
yet it is often possible to obtain a good enough approximation of the distances while
substantially reducing the dimension. The main disadvantage of this method is that the
output is dense, so for highly sparse data, or data that had a low dimension to begin with,
this might not be the best technique.
3. Avoiding Regions Quickly: A common approach for disqualifying data points quickly
is through clustering. If the center of a cluster is far away from the query we can
disqualify the entire cluster without looking into all of its data points. For this technique,
the data must be pre-processed to obtain m << n centers, typically with k-means
clustering. Then, when a query arrives we compute its distance to all of the centers, and
disregard all points that belong to clusters with centers far away from the query.
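A rough sketch of this cluster-based pruning idea, assuming NumPy and scikit-learn are available (the data, the number of centers m and the number of retained clusters p are illustrative choices):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 16))       # n data points
query = rng.normal(size=16)

m = 100                                    # m << n cluster centers
km = KMeans(n_clusters=m, n_init=10, random_state=0).fit(data)

# Distance from the query to every center; keep only the p nearest clusters
p = 5
center_dist = np.linalg.norm(km.cluster_centers_ - query, axis=1)
kept = np.argsort(center_dist)[:p]
candidates = data[np.isin(km.labels_, kept)]

# Exact nearest-neighbor search only over the surviving candidates
nearest = candidates[np.argmin(np.linalg.norm(candidates - query, axis=1))]
print(len(candidates), "candidates instead of", len(data))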
Q3(a) Suppose you have been given the following five data points: A= (6, 8), B= (5,
7), C= (8, 4), D= (11, 10), E= (12, 8). Apply agglomerative clustering with single
linkage method to group the given data. Use Euclidean measure to compute the
initial distance matrix and show output at each step.
Solution:
        A       B       C       D       E
A       0
B       1.41    0
C       4.47    4.24    0
D       5.38    6.71    6.71    0
E       6.00    7.07    5.65    2.23    0
The smallest entry is 1.41 (between A and B), so A and B are merged first. With single
linkage, the distance from {A, B} to any other point is the minimum of its distances to A and to B.
After iteration 1 (merge A and B):
        A,B     C       D       E
A,B     0
C       4.24    0
D       5.38    6.71    0
E       6.00    5.65    2.23    0
After iteration 2 (merge D and E, distance 2.23):
        A,B     C       D,E
A,B     0
C       4.24    0
D,E     5.38    5.65    0
After iteration 3 (merge {A, B} and C, distance 4.24):
        A,B,C   D,E
A,B,C   0
D,E     5.38    0
Finally, {A, B, C} and {D, E} are merged at distance 5.38, leaving a single cluster.
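The merge order can be cross-checked with SciPy's hierarchical clustering (assuming SciPy is available); each row of the returned linkage matrix records one merge and its distance:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

points = np.array([[6, 8], [5, 7], [8, 4], [11, 10], [12, 8]])  # A, B, C, D, E
Z = linkage(pdist(points), method='single')
print(Z)
# Merges in order: A with B, then D with E, then {A,B} with C,
# and finally {A,B,C} with {D,E}, matching the iterations above.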
Q3(b) Consider a database with five transactions given below. Let minimum support
= 60% and minimum confidence = 80%. Find all frequent itemsets using the Apriori
algorithm and list all strong association rules, along with their support and
confidence values.
TID Items_bought
T1 {M, O, N, K, E, Y}
T2 {D, O, N, K, E, Y}
T3 {M, A, K, E}
T4 {M, U, C, K, Y}
T5 {C, O, O, K, I, E}
Solution: Minimum support = 60% of 5 transactions = 3.
L1 = {E(4), K(5), M(3), O(3), Y(3)}
C2 = {EK, EM, EO, EY, KM, KO, KY, MO, MY, OY} with counts EK = 4, EM = 2, EO = 3,
EY = 2, KM = 3, KO = 3, KY = 3, MO = 1, MY = 2, OY = 2.
L2 = {EK(4), EO(3), KM(3), KO(3), KY(3)}
C3 = {EKO}; every other candidate 3-itemset contains an infrequent 2-subset and is pruned.
EKO occurs in T1, T2 and T5, so L3 = {E, K, O} with support 3/5 = 60%.
Strong association rules (minimum confidence 80%) generated from the largest frequent itemset {O, K, E}:
{O, K} → E   support = 60%, confidence = 3/3 = 100%
{O, E} → K   support = 60%, confidence = 3/3 = 100%
O → {K, E}   support = 60%, confidence = 3/3 = 100%
The remaining rules ({E, K} → O and E → {K, O} at 75%, K → {O, E} at 60%) do not meet
the 80% confidence threshold and are therefore not strong.
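The itemset counts above can be double-checked with a short brute-force script (Python standard library only); this verifies the frequent itemsets rather than implementing Apriori's candidate-generation step:

from itertools import combinations

transactions = [
    set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE"),
]
min_support = 3  # 60% of 5 transactions

items = sorted(set().union(*transactions))
for size in (1, 2, 3):
    # Count how many transactions contain each candidate itemset
    frequent = {
        itemset: count
        for itemset in combinations(items, size)
        if (count := sum(set(itemset) <= t for t in transactions)) >= min_support
    }
    print(size, frequent)
# size 1: E, K, M, O, Y
# size 2: (E,K), (E,O), (K,M), (K,O), (K,Y)
# size 3: (E,K,O)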
Dataset Splitting:
● The dataset is divided into K subsets or folds. Common values for K are 5
or 10, but the choice depends on the size of the dataset and the desired level
of granularity for evaluation.
Iteration:
● The model is trained and evaluated K times. In each iteration, a different
fold is used as the test set, while the remaining folds are used as the training
set.
Performance Evaluation:
● The model's performance is measured on the test set for each iteration,
typically using metrics such as accuracy, precision, recall, or F1 score,
depending on the nature of the problem (classification, regression, etc.).
Average Performance:
● The performance metrics from each iteration are averaged to obtain a single
performance estimate for the model.
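A minimal sketch of this K-fold procedure, assuming scikit-learn is available; the dataset (load_iris) and the decision-tree classifier are arbitrary illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold, then the averaged estimate
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kf)
print(scores)
print("mean accuracy:", scores.mean())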
Here's how bagging works, specifically in the context of the Random Forest algorithm:
Bootstrap Sampling:
● Bagging involves creating multiple subsets of the original dataset through
bootstrap sampling. Bootstrap sampling is a random sampling method
where data points are selected with replacement from the original dataset.
This means that some examples may be repeated in a subset while others
may be omitted.
Building Multiple Trees:
● For each subset created through bootstrap sampling, a decision tree is
trained. In a Random Forest these are not ordinary decision trees: each split
also considers only a random subset of the features, and the trees are
typically grown deep, so an individual tree can overfit the training data.
Voting or Averaging:
● The predictions of each tree are combined through a voting mechanism for
classification tasks or averaging for regression tasks. In the case of Random
Forests, each tree gets an equal say in the final prediction.
Reducing Variance:
● By training multiple trees on different subsets of the data and combining
their predictions, bagging helps to reduce the variance of the model. It
makes the overall model more robust and less sensitive to fluctuations or
outliers in the training data.
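As a rough illustration of these steps, the sketch below (scikit-learn assumed; the synthetic dataset and ensemble sizes are arbitrary) trains a generic bagging ensemble of decision trees and a Random Forest side by side:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Generic bagging of deep decision trees (bootstrap=True resamples with replacement)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        bootstrap=True, random_state=0).fit(X_tr, y_tr)

# Random Forest: bagging plus a random feature subset at every split
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("bagging accuracy:      ", bag.score(X_te, y_te))
print("random forest accuracy:", rf.score(X_te, y_te))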
● Grid Search: Grid search involves defining a grid of hyperparameter values and
training the model for each combination of values. It exhaustively searches
through all possible combinations.
● Random Search: Random search involves randomly sampling hyperparameter
values from predefined ranges. It is less computationally intensive compared to
grid search but still effective.
● Bayesian Optimization: Bayesian optimization is an advanced technique that builds a
probabilistic surrogate model of the objective, i.e., of the model's performance as a
function of its hyperparameters. It uses this surrogate to decide which configuration to
try next, and so explores the hyperparameter space efficiently to find a near-optimal
configuration.
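A sketch of the first two strategies with scikit-learn (the kNN estimator, the parameter ranges and n_iter are illustrative choices; Bayesian optimization needs a separate library and is only noted in a comment):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"n_neighbors": list(range(1, 31)), "weights": ["uniform", "distance"]}

# Grid search tries every combination; random search samples a fixed number of them
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5).fit(X, y)
rand = RandomizedSearchCV(KNeighborsClassifier(), param_grid, n_iter=15,
                          cv=5, random_state=0).fit(X, y)

print("grid search best:  ", grid.best_params_, grid.best_score_)
print("random search best:", rand.best_params_, rand.best_score_)
# Bayesian optimization would require an extra library (e.g. scikit-optimize or
# Optuna) and is therefore not shown here.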
Q.5 The table given below shows a dataset with 10 instances, 3 features <Past Trend,
Open Interest, and Trading Volume>, and one target variable <Return>. Apply the
CART algorithm and build the decision tree. Show all intermediate steps. Make
assumptions if any.
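For reference, all of the calculations below use the Gini index; the following is the standard formulation (the per-class counts themselves come from the data table in the question, which is not reproduced here):

% Gini index of a node t, where p_i is the proportion of class i in the node
\mathrm{Gini}(t) = 1 - \sum_{i} p_i^{2}

% Weighted Gini index of a split of n samples into partitions v of size n_v
\mathrm{Gini}_{split} = \sum_{v} \frac{n_v}{n} \, \mathrm{Gini}(v)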
Solution: Since Past Trend is positive 6 times out of 10 and negative 4 times, its weighted
Gini index is calculated from these counts.
For Open Interest, the value is high 4 times and low 6 times out of the total 10, and its
weighted Gini index is calculated in the same way.
Trading Volume is high 7 times and low 3 times, and is calculated likewise.
The winner is the Past Trend feature because its weighted Gini index is the lowest; hence
it is chosen as the root node of the decision tree.
Gini Index of Open Interest for positive Past Trend
Under the positive Past Trend branch, Open Interest is high 2 times out of 6 and low 4
times out of 6, and its Gini index is computed from these counts.
Trading Volume under the positive branch is high 4 out of 6 times and low 2 out of 6
times; both partitions are pure, so:
Weighted Gini index for Trading Volume = (4/6)·0 + (2/6)·0 = 0
We will use the Trading Volume feature as the next node under the positive Past Trend
branch, as it has the minimum Gini index.
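The weighted-Gini arithmetic used throughout this answer can be reproduced with a small helper; the per-class counts in the example call are assumed, chosen as pure partitions because the zero Gini values quoted above imply purity:

def gini(counts):
    """Gini index of a node given its per-class sample counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(partitions):
    """Weighted Gini index of a split, given per-class counts of each partition."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Trading Volume under Past Trend = Positive: 4 samples 'high', 2 samples 'low';
# the class counts [4, 0] and [0, 2] are assumed pure partitions.
print(weighted_gini([[4, 0], [0, 2]]))   # (4/6)*0 + (2/6)*0 = 0.0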
Q6 Consider the following neural network architecture with three inputs X1, X2 and
X3 having values 1, 4 and 5, two hidden neurons h1 and h2, two output neurons o1
and o2 as shown in the figure given below. The target values of the network are 0.8
and 0.2 respectively. The biases b1 and b2 are for the hidden and output layers
respectively.
Calculate the following in reference to the information provided for the given
network.
Solution:
a) Net h1 = x1*w1 + x2*w3 + x3*w4 + b1*1
          = 1*0.1 + 4*0.3 + 5*0.5 + 0.5
          = 0.1 + 1.2 + 2.5 + 0.5
          = 4.3
   Out h1 = 1/(1 + e^(-4.3)) ≈ 0.986
b) Net h2 (computed in the same way from the corresponding weights in the figure) = 5.3
   Out h2 = 1/(1 + e^(-5.3)) ≈ 0.995
c) Net O1 = 0.986*0.7 + 0.995*0.9 + 0.5 = 2.0857
   Out O1 = 1/(1 + e^(-2.0857)) ≈ 0.8895
   Net O2 = 0.986*0.8 + 0.995*0.1 + 0.5 = 1.3883
   Out O2 = 1/(1 + e^(-1.3883)) ≈ 0.8003
d) Total error E = ½(0.8 - 0.8895)^2 + ½(0.2 - 0.8003)^2 ≈ 0.004 + 0.180 = 0.184
e) ∂E/∂W7 = (Out O1 - 0.8) * Out O1 * (1 - Out O1) * Out h1 ≈ 0.0086
   Similarly, ∂E/∂W8 = (Out O2 - 0.2) * Out O2 * (1 - Out O2) * Out h1 ≈ 0.0946
   Taking the learning rate η = 0.5 (consistent with the updated weights):
   W7new = W7 - η*∂E/∂W7 = 0.7 - 0.5*0.0086 = 0.6957
   W8new = W8 - η*∂E/∂W8 = 0.8 - 0.5*0.0946 = 0.7527
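The numbers above can be verified with a short NumPy sketch. It assumes a sigmoid activation on every neuron, hidden-layer net inputs of 4.3 and 5.3 (as computed above), output-layer weights of 0.7 and 0.9 into O1 and 0.8 and 0.1 into O2, an output bias of 0.5, targets 0.8 and 0.2, and a learning rate of 0.5 inferred from the updated weights; all of these come from the worked numbers above rather than from the original figure.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden-layer net inputs taken from the worked solution above
net_h1, net_h2 = 4.3, 5.3
out_h1, out_h2 = sigmoid(net_h1), sigmoid(net_h2)        # ≈ 0.986, 0.995

b2 = 0.5                      # output-layer bias (assumed 0.5, as used above)
w_h1_o1, w_h2_o1 = 0.7, 0.9   # weights into O1 (W7 is the h1 -> O1 weight)
w_h1_o2, w_h2_o2 = 0.8, 0.1   # weights into O2 (W8 is the h1 -> O2 weight)
t1, t2 = 0.8, 0.2             # target outputs

net_o1 = out_h1 * w_h1_o1 + out_h2 * w_h2_o1 + b2        # ≈ 2.0857
net_o2 = out_h1 * w_h1_o2 + out_h2 * w_h2_o2 + b2        # ≈ 1.3883
out_o1, out_o2 = sigmoid(net_o1), sigmoid(net_o2)        # ≈ 0.8895, 0.8003

# Squared-error loss E = 1/2 (t1 - out_o1)^2 + 1/2 (t2 - out_o2)^2
E = 0.5 * (t1 - out_o1) ** 2 + 0.5 * (t2 - out_o2) ** 2  # ≈ 0.184

# Gradients w.r.t. the two h1 -> output weights (chain rule, sigmoid derivative)
dE_dw7 = (out_o1 - t1) * out_o1 * (1 - out_o1) * out_h1  # ≈ 0.0087 (0.0086 above)
dE_dw8 = (out_o2 - t2) * out_o2 * (1 - out_o2) * out_h1  # ≈ 0.0946

eta = 0.5  # learning rate (assumed; consistent with the updated weights above)
print("E      =", round(E, 3))
print("W7_new =", round(w_h1_o1 - eta * dE_dw7, 4))      # ≈ 0.6957
print("W8_new =", round(w_h1_o2 - eta * dE_dw8, 4))      # ≈ 0.7527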