
EST Solution

Course Code and Name: UML501 (Machine Learning)


Date: December 12, 2023
Course Instructors: Dr. Jatin, Dr. Harpreet, Dr. Ashutosh, Dr. Jyoti, Dr. Anjula, Dr.
Arun, Dr. Sumit

Q1(a) Name and briefly outline any two measures to compute the dissimilarity
between objects described by the following: Asymmetric binary attributes and
Numeric attributes.
Asymmetric Binary Attributes - Jaccard Dissimilarity
For asymmetric binary attributes, where each object is represented by a set of binary attributes (present or absent), the Jaccard dissimilarity ignores 0-0 matches. With f11 = number of attributes present in both objects, and f10, f01 = number of attributes present in only one of the two objects, the Jaccard dissimilarity is
d(A, B) = (f01 + f10) / (f01 + f10 + f11).
The numerator counts the asymmetric differences between the two objects, and the denominator normalizes by the total number of attributes present in either A or B.

Numeric Attributes - Manhattan, Euclidean, Supremum and Minkowski distances
The Minkowski distance d(x, y) = (sum_i |x_i - y_i|^p)^(1/p) covers all of these: p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and p -> infinity the supremum (Chebyshev) distance.

(b) Compute the Hamming Distance and Jaccard Similarity between the following
two binary vectors
x=<0 1 0 1 0 1 0 0 0 1> y=<0 1 0 0 0 1 1 0 0 0>
Solution:
Hamming distance = number of bits that differ = 3
Jaccard Similarity = number of 1-1 matches / (number of bits - number of 0-0 matches) = 2/(10 - 5) = 2/5 = 0.4
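
For reference, a short Python sketch (assumed helper code, not part of the exam answer) that verifies these two values:

```python
x = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]

hamming = sum(a != b for a, b in zip(x, y))          # bits that differ -> 3
m11 = sum(a == 1 and b == 1 for a, b in zip(x, y))   # 1-1 matches -> 2
m00 = sum(a == 0 and b == 0 for a, b in zip(x, y))   # 0-0 matches -> 5
jaccard = m11 / (len(x) - m00)                       # 2 / 5 = 0.4
print(hamming, jaccard)
```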

(c) Which approach, Jaccard or Hamming distance, is more similar to the Simple
Matching Coefficient, and which approach is more similar to the cosine measure?
Explain giving suitable examples.
Solution: The Hamming distance is similar to the SMC. In fact, SMC = 1 - (Hamming distance / number of bits), since the SMC counts matching bits while the Hamming distance counts mismatching bits.
The Jaccard measure is similar to the cosine measure because both ignore 0-0 matches.

(d) Suppose that you are comparing how similar two organisms of different species
are in terms of the number of genes they share. Describe which measure, Hamming
or Jaccard, you think would be more appropriate for comparing the genetic makeup
of two organisms. Explain. (Assume that each animal is represented as a binary
vector, where each attribute is 1 if a particular gene is present in the organism and 0
otherwise.)
Solution: Jaccard is more appropriate for comparing the genetic makeup of two
organisms; since we want to see how many genes these two organisms share.

(e) If you wanted to compare the genetic makeup of two organisms of the same
species, e.g., two human beings, would you use the Hamming distance, the Jaccard
coefficient, or a different measure of similarity or distance? Explain. (Note that two
human beings share > 99.9% of the same genes.)
Solution: Two human beings share > 99.9% of the same genes. If we want to compare
the genetic makeup of two human beings, we should focus on their differences. Thus, the
Hamming distance / Simple Matching Coefficient is more appropriate in this situation.

Q2 Consider a dataset comprising ten samples, where each sample has two
features: Feature A and Feature B. The samples belong to two classes, labelled class
0 and class 1. The data is given in the table below:

Feature A Feature B Class Label


33.6 50 1
26.6 30 0
23.4 40 0
43.1 67 0
35.3 23 1
35.9 67 1
36.7 45 1
25.7 46 0
23.3 29 0
31 56 1

(a) Use the k-nearest neighbor classifier (Euclidean distance measure, k=3) to
classify the given test sample <Feature A = 43.6, Feature B = 40>.

Solution: Given the dataset and the new test instance, we need to find the distance from the
test instance to every training example. Here we use the Euclidean distance formula.
The next table shows the calculated distance from the test example to each training
instance.
Feature A Feature B Label Formula Distance

33.6 50 1 √((43.6-33.6)^2+(40-50)^2) 14.14
26.6 30 0 √((43.6-26.6)^2+(40-30)^2) 19.72
23.4 40 0 √((43.6-23.4)^2+(40-40)^2) 20.20
43.1 67 0 √((43.6-43.1)^2+(40-67)^2) 27.00
35.3 23 1 √((43.6-35.3)^2+(40-23)^2) 18.92
35.9 67 1 √((43.6-35.9)^2+(40-67)^2) 28.08
36.7 45 1 √((43.6-36.7)^2+(40-45)^2) 8.52
25.7 46 0 √((43.6-25.7)^2+(40-46)^2) 18.88
23.3 29 0 √((43.6-23.3)^2+(40-29)^2) 23.09
31 56 1 √((43.6-31)^2+(40-56)^2) 20.37

Once you calculate the distance, the next step is to find the nearest neighbors based on
the value of k. In this case, the value of k is 3. Hence we need to find 3 nearest neighbors.

Feature A Feature B label Distance Rank

33.6 50 1 14.14 2

26.6 30 0 19.72

23.4 40 0 20.20

43.1 67 0 27.00

35.3 23 1 18.92
35.9 67 1 28.08

36.7 45 1 8.52 1

25.7 46 0 18.88 3

23.3 29 0 23.09

31 56 1 20.37

Now, we need to apply the majority voting technique to decide the resulting label from
the new example. Here the 1st and 2nd nearest neighbors have target label 1 and the 3rd
nearest neighbor has target label 0. Target label 1 has the majority. Hence the new
example is classified as 1.

Test Example Feature A=43.6, Feature B=40, Label =1
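
A minimal Python sketch (assumed helper code, not part of the graded answer) that reproduces this 3-NN classification with plain NumPy:

```python
import numpy as np

X = np.array([[33.6, 50], [26.6, 30], [23.4, 40], [43.1, 67], [35.3, 23],
              [35.9, 67], [36.7, 45], [25.7, 46], [23.3, 29], [31.0, 56]])
y = np.array([1, 0, 0, 0, 1, 1, 1, 0, 0, 1])
test = np.array([43.6, 40.0])

k = 3
distances = np.linalg.norm(X - test, axis=1)   # Euclidean distance to every training sample
nearest = np.argsort(distances)[:k]            # indices of the 3 smallest distances
votes = np.bincount(y[nearest])                # majority vote among the 3 neighbours
print(distances.round(2), y[nearest], votes.argmax())   # predicted label: 1
```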

(b) What is the impact of k on model performance (in terms of overfitting and
underfitting). Explain.
Solution: In KNN, choosing the value of k is crucial.
A small value of k means that noise has a higher influence on the result, while a very
large value makes prediction computationally expensive and overly coarse.
If we choose k = 1, the model tends to overfit: it produces a non-smooth decision
surface that follows every training point, including noise.
As k increases, the decision surface gets smoother. If k is chosen very large, the model
underfits: the decision surface becomes so smooth that everything is assigned to one
class, the majority class of the dataset.

So k should be chosen carefully so that the model neither overfits nor underfits.

(c) List and explain any three possible ways to reduce the inference cost associated
with k-nearest neighbor classifiers.
Solution:
1. Subsampling: A very simple, yet often very effective way of reducing the inference
cost is to subsample the data. For example, we might have an available dataset of, say,
10M data points, but we can do a good enough job with 100K data points.
2. Dimension reduction: For some distances, such as L2 and cosine distance, it is
possible to reduce the dimension of the data points while approximately preserving the
distance between a query point and any data point. The quality of the approximation
depends only on the output dimension: a small dimension gives a crude approximation,
yet it is often possible to obtain a good enough approximation of the distances while
substantially reducing the dimension. The main disadvantage of this method is that the
output is dense, so for highly sparse data, or data that had a low dimension to begin
with, it might not be the best technique.

3. Avoiding Regions Quickly: A common approach for disqualifying data points quickly
is clustering (see the sketch below). If the center of a cluster is far away from the query,
we can disqualify the entire cluster without examining its data points. For this technique,
the data must be pre-processed to obtain m << n centers, typically with k-means
clustering. Then, when a query arrives, we compute its distance to all of the centers and
disregard all points that belong to clusters whose centers are far away from the query.
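
As a rough illustration of the third idea, the sketch below (assuming scikit-learn is available, and using synthetic data) pre-clusters the training set with k-means and then searches only the few clusters whose centers are closest to the query:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 32))     # stand-in for a large training set
query = rng.normal(size=32)

m = 50                                # m << n cluster centers (preprocessing step)
km = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X)

# At query time: rank clusters by distance from the query to their centers,
# then run exact nearest-neighbour search only inside the closest few clusters.
center_dist = np.linalg.norm(km.cluster_centers_ - query, axis=1)
candidate_clusters = np.argsort(center_dist)[:3]
mask = np.isin(km.labels_, candidate_clusters)
candidates = X[mask]
nearest = candidates[np.argmin(np.linalg.norm(candidates - query, axis=1))]
print(nearest)
```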

Q3(a) Suppose you have been given the following five data points: A= (6, 8), B= (5,
7), C= (8, 4), D= (11, 10), E= (12, 8). Apply agglomerative clustering with single
linkage method to group the given data. Use Euclidean measure to compute the
initial distance matrix and show output at each step.

Solution:

A B C D E
A 0
B 1.41 0
C 4.47 4.24 0
D 5.38 6.71 6.71 0
E 6 7.07 5.65 2.23 0

After iteration 1

A, B C D E
A,B 0

C 4.24 0
D 5.38 6.71 0
E 6 5.65 2.23 0

After iteration 2

A, B C D, E
A,B 0

C 4.24 0
D, E 5.38 5.65 0

After iteration 3

A, B, C D, E
A, B, C 0

D, E 5.38 0
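
The same single-linkage merge sequence can be checked with SciPy; the sketch below (assuming SciPy is available) prints the initial Euclidean distance matrix and the linkage matrix, whose rows record each merge (A with B, then D with E, then C with {A, B}, then the final two clusters) together with the merge distance:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

points = np.array([[6, 8], [5, 7], [8, 4], [11, 10], [12, 8]])  # A, B, C, D, E

print(np.round(squareform(pdist(points)), 2))   # initial Euclidean distance matrix
Z = linkage(points, method="single", metric="euclidean")
print(np.round(Z, 2))   # each row: the two clusters merged, merge distance, cluster size
```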

Q3(b) Consider a database with five transactions given below. Let minimum support
= 60% and minimum confidence = 80%. Find all frequent itemsets using the Apriori
algorithm and list all strong association rules, along with their support and
confidence values.
TID Items_bought
T1 {M, O, N, K, E, Y}
T2 {D, O, N, K, E, Y}
T3 {M, A, K, E}
T4 {M, U, C, K, Y}
T5 {C, O, O, K, I, E}

Solution: With minimum support = 60% (3 of 5 transactions), the itemsets found by Apriori are:

L1 = {E, K, M, O, Y}    C2 = {EK, EM, EO, EY, KM, KO, KY, MO, MY, OY}

L2 = {EK, EO, KM, KO, KY}    C3 = {EKO}    L3 = {EKO}

Strong association rules (confidence ≥ 80%) from the frequent 3-itemset {E, K, O} (support = 3/5 = 60%):
{O} -> {E, K}: confidence = 3/3 = 100%
{E, O} -> {K}: confidence = 3/3 = 100%
{K, O} -> {E}: confidence = 3/3 = 100%
Strong rules can likewise be read off the frequent 2-itemsets, e.g. E -> K (support = 4/5 = 80%, confidence = 4/4 = 100%) and K -> E (support = 4/5 = 80%, confidence = 4/5 = 80%).
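
A brute-force Python sketch (an assumed helper, not an Apriori implementation itself) that verifies the supports and confidences above by enumerating candidate itemsets directly:

```python
from itertools import combinations

transactions = [
    {"M", "O", "N", "K", "E", "Y"},
    {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"},
    {"M", "U", "C", "K", "Y"},
    {"C", "O", "K", "I", "E"},
]
n = len(transactions)
min_support, min_confidence = 0.6, 0.8

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / n

items = sorted(set().union(*transactions))
# Enumerate all candidate itemsets up to size 3 and keep the frequent ones.
frequent = [frozenset(c)
            for k in (1, 2, 3)
            for c in combinations(items, k)
            if support(frozenset(c)) >= min_support]

# Generate strong rules A -> B from every frequent itemset of size >= 2.
for itemset in (f for f in frequent if len(f) >= 2):
    for k in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, k)):
            conf = support(itemset) / support(antecedent)
            if conf >= min_confidence:
                consequent = itemset - antecedent
                print(sorted(antecedent), "->", sorted(consequent),
                      f"support={support(itemset):.0%} confidence={conf:.0%}")
```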

Q4(a) Explain K-fold cross validation giving suitable examples.


Solution: K-fold cross-validation is a technique used in machine learning to assess the
performance and generalization ability of a model. It involves dividing a dataset into K
subsets, or folds, and then training and evaluating the model K times, each time using a
different fold as the test set and the remaining folds as the training set. The results are
then averaged to provide a more robust performance estimate.
For example, for 5-fold cross validation, the dataset would be split into 5 groups, and the
model would be trained and tested 5 separate times so each group would get a chance to
be the test set.
Here's a step-by-step explanation of K-fold cross-validation:

1. Dataset Splitting:
● The dataset is divided into K subsets or folds. Common values for K are 5 or 10, but the choice depends on the size of the dataset and the desired level of granularity for evaluation.
2. Iteration:
● The model is trained and evaluated K times. In each iteration, a different fold is used as the test set, while the remaining folds are used as the training set.
3. Performance Evaluation:
● The model's performance is measured on the test set for each iteration, typically using metrics such as accuracy, precision, recall, or F1 score, depending on the nature of the problem (classification, regression, etc.).
4. Average Performance:
● The performance metrics from each iteration are averaged to obtain a single performance estimate for the model.
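
A brief scikit-learn sketch of 5-fold cross-validation (the dataset and model here are placeholders chosen purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=42)   # 5 folds, each used once as the test set
scores = cross_val_score(model, X, y, cv=kf)            # accuracy on each of the 5 test folds
print(scores, scores.mean())                            # averaged estimate of generalization
```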

Q4 (b) Explain the concept of bagging in reference to random forest classifier


Bagging, or Bootstrap Aggregating, is a machine learning ensemble technique designed
to improve the stability and accuracy of models by reducing variance and minimizing
overfitting. The concept of bagging is commonly associated with the Random Forest
classifier.

Here's how bagging works, specifically in the context of the Random Forest algorithm:

1. Bootstrap Sampling:
● Bagging involves creating multiple subsets of the original dataset through bootstrap sampling. Bootstrap sampling is a random sampling method where data points are selected with replacement from the original dataset. This means that some examples may be repeated in a subset while others may be omitted.
2. Building Multiple Trees:
● For each subset created through bootstrap sampling, a decision tree is trained. These trees are typically grown deep and can individually overfit their training data. In a Random Forest, each tree additionally considers only a random subset of the features at every split, which further decorrelates the trees.
3. Voting or Averaging:
● The predictions of each tree are combined through a voting mechanism for classification tasks or averaging for regression tasks. In the case of Random Forests, each tree gets an equal say in the final prediction.
4. Reducing Variance:
● By training multiple trees on different subsets of the data and combining their predictions, bagging reduces the variance of the model. It makes the overall model more robust and less sensitive to fluctuations or outliers in the training data.
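
A short scikit-learn sketch of bagging inside a Random Forest (the synthetic dataset and parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # 100 trees, each trained on its own bootstrap sample
    bootstrap=True,       # sampling with replacement (the "bagging" step)
    max_features="sqrt",  # random feature subset considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # majority vote of the individual trees
```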

Q4(c) What are hyper-parameters? Explain any one possible way of hyper-parameter
tuning in Support Vector Machines (SVM), giving suitable examples.
Solution: Hyperparameters are external configurations for machine learning models that
need to be set before the training process begins and cannot be directly learned from the
data. They influence the overall behavior and performance of a model. In the context of
Support Vector Machines (SVM), common hyperparameters include:

● C (Regularization Parameter): It controls the trade-off between achieving a low
training error and a low testing error. A smaller C encourages a smoother decision
boundary, while a larger C allows for a more complex boundary that fits the
training data more closely.
● Kernel Type (kernel): SVMs can use different kernel functions (e.g., linear,
polynomial, radial basis function) to map the input data into a higher-dimensional
space.
● Gamma (gamma): This parameter influences the shape of the decision boundary.
A small gamma leads to a more gradual change in the decision boundary, while a
large gamma results in a more complex decision boundary, potentially causing
overfitting.

Three possible ways to perform hyperparameter tuning in SVM are:

● Grid Search: Grid search involves defining a grid of hyperparameter values and
training the model for each combination of values, exhaustively searching through
all possible combinations (sketched below).
● Random Search: Random search involves randomly sampling hyperparameter
values from predefined ranges. It is less computationally intensive compared to
grid search but still effective.
● Bayesian Optimization: An advanced technique that builds a probabilistic surrogate
model of how the model's performance varies with the hyperparameters, and uses it
to explore the hyperparameter space efficiently and find the optimal configuration.
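
A compact scikit-learn sketch of grid search for SVM hyperparameters (the dataset and the grid values are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {
    "C": [0.1, 1, 10, 100],            # regularization strength
    "kernel": ["linear", "rbf"],       # kernel type
    "gamma": ["scale", 0.01, 0.1, 1],  # RBF kernel width (ignored by the linear kernel)
}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")  # 5-fold CV per combination
search.fit(X, y)
print(search.best_params_, search.best_score_)
```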

Q.5 The table given below shows a dataset with 10 instances, 3 features <Past Trend,
Open Interest, and Trading Volume>, and one target variable <Return>. Apply the
CART algorithm and build the decision tree. Show all intermediate steps. Make
assumptions if any.

Past Trend Open Interest Trading Volume Return

Positive Low High Up
Negative High Low Down
Positive Low High Up
Positive High High Up
Negative Low High Down
Positive Low Low Down
Negative High High Down
Negative Low High Down
Positive Low Low Down
Positive High High Up
Solution- Decision tree using CART

Gini Index for past trend

Since the past trend is Positive 6 times out of 10 and Negative 4 times, the calculation is as follows:

P(Past Trend=Positive): 6/10, P(Past Trend=Negative): 4/10


· If (Past Trend = Positive & Return = Up), probability = 4/6
· If (Past Trend = Positive & Return = Down), probability = 2/6

Gini Index = 1 - ((4/6)^2 + (2/6)^2) = 0.45


· If (Past Trend = Negative & Return = Up), probability = 0/4
· If (Past Trend = Negative & Return = Down), probability = 4/4

Gini Index = 1 - ((0/4)^2 + (4/4)^2) = 0


· Weighted sum of the Gini Indices:

Gini Index for Past Trend = (6/10)0.45 + (4/10)0 = 0.27

Gini Index for open interest

Coming to open interest, it is High 4 times and Low 6 times out of the total 10, and the calculation is as follows:

P(Open Interest=High): 4/10, P(Open Interest=Low): 6/10


· If (Open Interest = High & Return = Up), probability = 2/4
· If (Open Interest = High & Return = Down), probability = 2/4

Gini Index = 1 - ((2/4)^2 + (2/4)^2) = 0.5


· If (Open Interest = Low & Return = Up), probability = 2/6
· If (Open Interest = Low & Return = Down), probability = 4/6
Gini Index = 1 - ((2/6)^2 + (4/6)^2) = 0.45
· Weighted sum of the Gini Indices:

Gini Index for Open Interest = (4/10)0.5 + (6/10)0.45 = 0.47

Gini Index for trading volume

Trading volume is 7 times high and 3 times low and is calculated as follows:

P(Trading Volume=High): 7/10, P(Trading Volume=Low): 3/10


· If (Trading Volume = High & Return = Up), probability = 4/7
· If (Trading Volume = High & Return = Down), probability = 3/7

Gini Index = 1 - ((4/7)^2 + (3/7)^2) = 0.49


· If (Trading Volume = Low & Return = Up), probability = 0/3
· If (Trading Volume = Low & Return = Down), probability = 3/3

Gini Index = 1 - ((0/3)^2 + (3/3)^2) = 0


· Weighted sum of the Gini Indices: Gini Index for Trading Volume = (7/10)0.49
+ (3/10)0 = 0.34

The Past Trend feature wins because its weighted Gini Index (0.27) is the lowest; hence it is chosen as the root node of the decision tree. We now repeat the computation within the Past Trend = Positive branch (the Negative branch is already pure, since all its samples are Down).
Gini Index of open interest for positive past trend

Open interest for positive past trend is high 2 times out of 6 and low 4 times out of 6 and
the Gini Index of open interest for positive past trend is calculated as follows:

P(Open Interest=High): 2/6, P(Open Interest=Low): 4/6


· If (Open Interest = High & Return = Up), probability = 2/2
· If (Open Interest = High & Return = Down), probability = 0

Gini Index = 1 - ((2/2)^2 + (0/2)^2) = 0


· If (Open Interest = Low & Return = Up), probability = 2/4
· If (Open Interest = Low & Return = Down), probability = 2/4

Gini Index = 1 - ((2/4)^2 + (2/4)^2) = 0.50


· Weighted sum of the Gini Indices, Gini Index for Open Interest = (2/6)0 +
(4/6)0.50 = 0.33

Gini Index of trading volume for positive past trend

For the positive past trend samples, the trading volume is high 4 out of 6 times and low 2 out of 6 times and is calculated as follows:

P(Trading Volume=High): 4/6, P(Trading Volume=Low): 2/6

· If (Trading Volume = High & Return = Up), probability = 4/4


· If (Trading Volume = High & Return = Down), probability = 0/4

Gini Index = 1 - ((4/4)^2 + (0/4)^2) = 0

· If (Trading Volume = Low & Return = Up), probability = 0/2


· If (Trading Volume = Low & Return = Down), probability = 2/2

Gini Index = 1 - ((0/2)^2 + (2/2)^2) = 0

· Weighted sum of the Gini Indices, Gini Index for Trading Volume = (4/6)0 + (2/6)0
=0

We will use the Trading Volume feature as the next node under the Positive branch, as it has the minimum Gini Index (0 versus 0.33 for Open Interest).
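
A small Python sketch (assumed helper code, not part of the exam answer) that reproduces the weighted Gini Index values computed above from the 10-row table:

```python
from collections import Counter

rows = [  # (Past Trend, Open Interest, Trading Volume, Return)
    ("Positive", "Low", "High", "Up"), ("Negative", "High", "Low", "Down"),
    ("Positive", "Low", "High", "Up"), ("Positive", "High", "High", "Up"),
    ("Negative", "Low", "High", "Down"), ("Positive", "Low", "Low", "Down"),
    ("Negative", "High", "High", "Down"), ("Negative", "Low", "High", "Down"),
    ("Positive", "Low", "Low", "Down"), ("Positive", "High", "High", "Up"),
]
features = {"Past Trend": 0, "Open Interest": 1, "Trading Volume": 2}

def gini(labels):
    """Gini impurity 1 - sum(p_i^2) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in counts.values())

def weighted_gini(rows, feature_idx):
    """Weighted Gini Index of splitting `rows` on the given feature column."""
    total = len(rows)
    values = {r[feature_idx] for r in rows}
    return sum(
        len(subset) / total * gini([r[3] for r in subset])
        for v in values
        for subset in [[r for r in rows if r[feature_idx] == v]]
    )

for name, idx in features.items():
    print(f"{name}: {weighted_gini(rows, idx):.2f}")
# Expected: Past Trend 0.27, Open Interest 0.47, Trading Volume 0.34 -> Past Trend is the root.
```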
Q6 Consider the following neural network architecture with three inputs X1, X2 and
X3 having values 1, 4 and 5, two hidden neurons h1 and h2, two output neurons o1
and o2 as shown in the figure given below. The target values of the network are 0.8
and 0.2 respectively. The bias b1 and b2 are for the hidden and output layer
respectively.
Calculate the following in reference to the information provided for the given
network.

a) Total net input to each hidden neuron (h1 and h2).


b) Net output after applying activation function to receive net input at each
hidden neuron. (Use Sigmoid function)
c) Net Input to each output neuron.
d) Net output after applying activation function at each output neuron.
e) Total error at the output layer
f) Use calculated errors and backpropagation to find the updated value of
weights W7 and W8.

Solution:-

a) Net h1 = x1*w1 + x2*w3 + x3*w4 + b1*1

= 1*0.1 + 4*0.3 + 5*0.5 + 0.5

= 0.1 + 1.2 + 2.5 + 0.5

= 4.3

Similarly, Net h2 = 1*0.2 + 4*0.4 + 5*0.6 + 0.5 = 0.2 + 1.6 + 3.0 + 0.5

= 5.3

b) Output h1 = 1/(1 + e^(-Net h1)) = 1/(1 + e^(-4.3)) = 0.986

Output h2 = 1/(1 + e^(-Net h2)) = 1/(1 + e^(-5.3)) = 0.995

c) Net O1 = 0.986*0.7 + 0.995*0.9 + 0.5

= 2.0857

Net O2 = 0.986*0.8 + 0.995*0.1 + 0.5

= 1.3883

d) Output O1 = 1/(1 + e^(-Net O1)) = 1/(1 + e^(-2.0857)) = 0.8895

Output O2 = 1/(1 + e^(-Net O2)) = 1/(1 + e^(-1.3883)) = 0.8003

e) Total Error = Σ (1/2)(Ti - Oi)^2 = (1/2)(0.8 - 0.8895)^2 + (1/2)(0.2 - 0.8003)^2

= 0.184

f) ∂E/∂W7 = ∂E/∂(Output O1) * ∂(Output O1)/∂(Net O1) * ∂(Net O1)/∂W7

= (Output O1 - T1) * Output O1 * (1 - Output O1) * Output h1

= 0.0895 * 0.0982 * 0.986

= 0.0086

Similarly, ∂E/∂W8 = (Output O2 - T2) * Output O2 * (1 - Output O2) * Output h1 = 0.0946

W7new = W7old - α*∂E/∂W7

Taking learning rate α = 0.5:

W7new = 0.7 - 0.5*0.0086 = 0.6957

W8new = 0.8 - 0.5*0.0946 = 0.7527
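
A NumPy sketch that re-runs the forward and backward pass above; since the figure is not reproduced here, the weight and bias values are inferred from the arithmetic in the solution and should be treated as assumptions:

```python
import numpy as np

x = np.array([1.0, 4.0, 5.0])          # inputs X1, X2, X3
t = np.array([0.8, 0.2])               # targets for o1, o2
W_hidden = np.array([[0.1, 0.3, 0.5],  # assumed weights into h1
                     [0.2, 0.4, 0.6]]) # assumed weights into h2
W_out = np.array([[0.7, 0.9],          # assumed weights into o1 (from h1, h2)
                  [0.8, 0.1]])         # assumed weights into o2 (from h1, h2)
b1, b2 = 0.5, 0.5                      # hidden-layer and output-layer biases

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

net_h = W_hidden @ x + b1              # [4.3, 5.3]
out_h = sigmoid(net_h)                 # [0.986, 0.995]
net_o = W_out @ out_h + b2             # [2.0857, 1.3883]
out_o = sigmoid(net_o)                 # [0.8895, 0.8003]
total_error = np.sum(0.5 * (t - out_o) ** 2)   # ~0.184

# Backpropagation for W7 (h1 -> o1) and W8 (h1 -> o2):
delta_o = (out_o - t) * out_o * (1 - out_o)    # dE/d(net_o) for each output neuron
dE_dW7 = delta_o[0] * out_h[0]                 # ~0.0086
dE_dW8 = delta_o[1] * out_h[0]                 # ~0.0946

alpha = 0.5
W7_new = W_out[0, 0] - alpha * dE_dW7          # ~0.6957
W8_new = W_out[1, 0] - alpha * dE_dW8          # ~0.7527
print(total_error, W7_new, W8_new)
```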
