PA Answers
Q.1 Given the points A(1, 2), B(2, 3), C(2, 1), D(4, 5), E(5, 4), F(6, 6), G(8, 7), H(9, 9), and I(10, 8),
find the core points, outliers, and border points using DBSCAN. Take Eps = 2 and MinPts = 2. The
distance matrix showing Euclidean distance between each pair of the given points is shown below:
(8 marks)
Given:
Points: A(1, 2), B(2, 3), C(2, 1), D(4, 5), E(5, 4), F(6, 6), G(8, 7), H(9, 9), I(10, 8)
Parameters: Eps = 2, MinPts = 2
Definitions
• Core Point: At least MinPts points (including itself) within radius Eps.
• Border Point: Fewer than MinPts points within Eps, but lies within Eps of a core point.
• Outlier (Noise): Neither core nor border.
Process
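From the coordinates, compute the Euclidean distance between each pair and count, for each point, how many points (including itself) lie within Eps = 2.
Pairs within Eps: d(A, B) = d(A, C) = √2 ≈ 1.41, d(B, C) = 2, d(D, E) = √2 ≈ 1.41, d(H, I) = √2 ≈ 1.41.
Nearest misses: d(D, F) = d(E, F) = d(F, G) = d(G, H) = d(G, I) = √5 ≈ 2.24 > 2, and d(B, D) = √8 ≈ 2.83 > 2.
Resulting clusters:
• Cluster 1: A, B, C
• Cluster 2: D, E
• Cluster 3: H, I
Results:
• Core Points: A, B, C, D, E, H, I (each has at least MinPts = 2 points, counting itself, within Eps = 2)
• Border Points: None (with MinPts = 2, any point within Eps of a core point already has 2 points in its own neighbourhood and is therefore core itself)
• Outliers: F, G (no other point lies within Eps = 2 of either)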
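The classification can be checked directly from the coordinates given in the question. The following is a minimal sketch, not a full DBSCAN implementation: it only labels points as core, border, or noise, using the "including itself" convention from the definitions above. The small tolerance `TOL` is an added assumption so that a distance of exactly 2 counts as within Eps.

```python
from math import dist

points = {
    "A": (1, 2), "B": (2, 3), "C": (2, 1),
    "D": (4, 5), "E": (5, 4), "F": (6, 6),
    "G": (8, 7), "H": (9, 9), "I": (10, 8),
}
EPS, MIN_PTS = 2.0, 2
TOL = 1e-9  # assumption: treat d == Eps as "within Eps"


def neighbours(name):
    """Names of all points within EPS of `name`, including the point itself."""
    centre = points[name]
    return {q for q, xy in points.items() if dist(centre, xy) <= EPS + TOL}


# Core: at least MIN_PTS points (itself included) in the Eps-neighbourhood.
core = {p for p in points if len(neighbours(p)) >= MIN_PTS}
# Border: not core, but inside the Eps-neighbourhood of some core point.
border = {p for p in points if p not in core and neighbours(p) & core}
# Noise: everything else.
noise = set(points) - core - border

print("core:  ", sorted(core))
print("border:", sorted(border))
print("noise: ", sorted(noise))
```

Running it lists A, B, C, D, E, H, I as core, no border points, and F, G as noise.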
Q.2 With respect to Deep Neural Networks answer the following questions. (3+3 = 6 marks)
1. Select an appropriate activation function for the output layers of its network and provide a
comprehensive explanation of their functionality and suitability. - Healthcare Diagnosis System:
The goal is to predict the presence or absence of various medical conditions based on patient
symptoms and medical history. For instance, the system aims to classify patients into categories
such as "Diabetes," "heart disease," and "Cancer." Clearly mention any assumptions made.
2. Select an appropriate activation function for the output layers of its network and provide a
comprehensive explanation of their functionality and suitability. - Social Media Sentiment
Analysis: The objective is to analyze user sentiment based on their posts and interactions on social
media platforms. The system categorizes each post into multiple sentiment categories such as
"Positive," "Negative," and "Neutral" to provide insights into user behavior and preferences.
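One possible reading of the two scenarios, stated as an assumption: a patient may have several conditions at once (multi-label), while each post receives exactly one sentiment (multi-class). Under that assumption, an independent sigmoid per output fits the first case and a softmax over the three outputs fits the second. The sketch below only illustrates the difference between the two functions; the logit values are hypothetical.

```python
import math


def sigmoid(z):
    """Squash one logit to (0, 1); each output is an independent probability."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output-layer logits for three classes.
logits = [2.0, -1.0, 0.5]

# Multi-label (e.g. Diabetes / Heart disease / Cancer): probabilities need not sum to 1.
multi_label = [sigmoid(z) for z in logits]

# Multi-class (e.g. Positive / Negative / Neutral): probabilities sum to 1.
multi_class = softmax(logits)

print("sigmoid per output:", [round(p, 3) for p in multi_label])
print("softmax:           ", [round(p, 3) for p in multi_class])
```

If the diagnosis task were instead framed as exactly one condition per patient, softmax would apply there too.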
Q.3 Imagine a scenario where Principal Component Analysis (PCA) is utilized to analyze a dataset
encompassing various attributes related to customer satisfaction in a retail setting. The dataset
comprises variables such as purchase frequency, average transaction amount, customer feedback
scores, and loyalty program engagement. The loading vectors generated for PC1 and PC2 are
displayed below. (7 marks)
[Table: loading vectors V1 and V2]
Interpret the loading vectors concerning PC1 and PC2. Discuss the implications of a data point with high
PC1 and one with high PC2 regarding the customer segment it represents. Provide insights into the
similarities or differences among various characteristics of the customers indicated by the features.
Interpretation:
• PC1: High on Customer feedback (0.7), Purchase frequency (0.6)
⇒ Reflects Overall Satisfaction/Engagement
• PC2: High on Loyalty (0.6), but negative on Avg. transaction (-0.7)
⇒ Represents Loyalty vs Spend Tradeoff
Implications:
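• High PC1: Highly satisfied customers who purchase frequently.
• High PC2: Loyal, engaged customers with low average transaction amounts.
• Low (strongly negative) PC2: The opposite profile, with high spend per transaction but little loyalty program engagement.
• PC2 therefore separates two types of valued customers: high spenders vs highly loyal.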
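A principal component score is the dot product of a customer's standardized feature values with the loading vector. The sketch below illustrates this using only the four loadings quoted in the interpretation above. The remaining loadings are not given in this copy and are set to 0.0 purely as placeholders, and both customers are hypothetical.

```python
FEATURES = ["purchase_frequency", "avg_transaction", "feedback_score", "loyalty_engagement"]

# Loadings quoted in the interpretation; 0.0 marks values NOT given in the source.
PC1 = {"purchase_frequency": 0.6, "avg_transaction": 0.0,
       "feedback_score": 0.7, "loyalty_engagement": 0.0}
PC2 = {"purchase_frequency": 0.0, "avg_transaction": -0.7,
       "feedback_score": 0.0, "loyalty_engagement": 0.6}


def score(customer, loadings):
    """PC score = sum over features of (standardized value * loading)."""
    return sum(customer[f] * loadings[f] for f in FEATURES)


# Hypothetical customers, expressed as z-scores (0 = average customer).
satisfied_frequent = {"purchase_frequency": 1.5, "avg_transaction": 0.0,
                      "feedback_score": 1.5, "loyalty_engagement": 0.0}
loyal_low_spender = {"purchase_frequency": 0.0, "avg_transaction": -1.5,
                     "feedback_score": 0.0, "loyalty_engagement": 1.5}

for name, c in [("satisfied_frequent", satisfied_frequent),
                ("loyal_low_spender", loyal_low_spender)]:
    print(name, "PC1 =", round(score(c, PC1), 2), "PC2 =", round(score(c, PC2), 2))
```

The first customer scores high on PC1 only, the second on PC2 only, which matches the two segments described above.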
Q.4 Compare and contrast the performance of K-means clustering with K-means++ initialization
against the traditional K-means algorithm with random initialization. Give example to support your
answer. (4 marks)
Traditional K-Means:
• Initializes centroids randomly.
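• Can converge slowly or settle in a poor local minimum, so results vary from run to run.
K-Means++:
• Picks the first centroid at random, then each further centroid with probability proportional to its squared distance from the nearest centroid already chosen, so the initial centroids are well spread out.
• Typically converges in fewer iterations and to a lower within-cluster sum of squares than random initialization.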
Example:
Suppose we cluster customer purchase patterns.
• Random K-means might initialize centroids close together → Poor separation.
• K-means++ spreads centroids → Better cluster separation and convergence.
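The difference between the two initializations can be sketched in a few lines. This is a minimal illustration of the seeding step only, not a full K-means implementation. The data set, the function names, and the "two groups split at x = 5" check are all hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def random_init(points, k, rng):
    """Traditional seeding: k distinct points chosen uniformly at random."""
    return rng.sample(points, k)


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: later seeds are drawn with probability ~ D(x)^2."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


# Hypothetical toy data: two tight groups of customers in a 2-D feature space.
data = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]


def split_rate(init, trials=200):
    """Fraction of trials in which the two seeds land in different groups."""
    hits = 0
    for seed in range(trials):
        a, b = init(data, 2, random.Random(seed))
        hits += (a[0] < 5) != (b[0] < 5)
    return hits / trials


print("random init:", split_rate(random_init))
print("k-means++:  ", split_rate(kmeans_pp_init))
```

On this toy data, random seeding puts both centroids in the same group in a sizeable share of trials, while the distance-weighted draw almost always places one seed in each group.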
Q.5 Answer the following questions with respect to the following figure. (10 Marks)
a) Suppose the figure above shows decision boundaries for a KNN model and a Logistic Regression model
applied to a 2D dataset. State which decision boundary (A or B) belongs to which algorithm (KNN
or Logistic Regression). Explain why. [4 Marks]
b) What function is denoted by the following equation? In which machine learning algorithm (KNN
or Logistic regression) is it used? [1 Mark]
c) What would be the equation of the decision boundary for the ML algorithm referred to in question
b)? Explain [2 Marks]
d) With reference to question b), explain the function and its necessity in the context of classification
problems. [3 Marks]
a) Which boundary belongs to which model? (4 marks)
• Decision Boundary A: KNN (non-linear, jagged)
• Decision Boundary B: Logistic Regression (smooth, linear)
Explanation:
• The KNN decision boundary depends on the labels of nearby training points → irregular, locally varying shape.
• Logistic Regression thresholds a linear combination of the inputs (w·x + b = 0) → smooth, straight-line boundary.
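b) Function and algorithm (1 mark)
• Function: the equation is missing from this copy; for Logistic Regression it would be the sigmoid (logistic) function, σ(z) = 1 / (1 + e^(−z)).
• Used in: Logistic Regression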
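Assuming the function in part b) is the sigmoid, parts c) and d) can be illustrated as follows. The sigmoid maps the linear score z = w1·x1 + w2·x2 + b to a probability in (0, 1), and the prediction flips where that probability crosses 0.5, that is, where z = 0. The decision boundary is therefore the line w1·x1 + w2·x2 + b = 0. The weights below are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted parameters; the boundary is w1*x1 + w2*x2 + b = 0, i.e. x1 = x2.
w1, w2, b = 1.0, -1.0, 0.0


def predict_proba(x1, x2):
    """Estimated P(class = 1 | x) under the logistic regression model."""
    return sigmoid(w1 * x1 + w2 * x2 + b)


def predict(x1, x2, threshold=0.5):
    """Class label obtained by thresholding the probability."""
    return int(predict_proba(x1, x2) >= threshold)


print(predict_proba(3, 3))        # a point on the boundary
print(predict(4, 1), predict(1, 4))  # points on either side of it
```

Without the sigmoid, the linear score would be unbounded and could not be read as a class probability, which is why it is needed for classification.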
Q.6 Suppose you are building a multiple linear regression model to predict the selling price of cars
based on the following predictor variables: (5 Marks)
• Age of the car (in years)
• Mileage (in thousands of miles)
• Engine size (in liters)
• Horsepower
• Fuel efficiency (in miles per gallon)
You calculate the Variance Inflation Factors (VIF) for each predictor variable, and you obtain the
following values:
• Age of the car: VIF = 1.5
• Mileage: VIF = 1.8
• Engine size: VIF = 8.2
• Horsepower: VIF = 9.1
• Fuel efficiency: VIF = 7.6
Using this information, answer the following questions:
1. Based on the VIF values, which predictors show signs of multicollinearity? Explain why. Clearly
mention any assumptions. (2 marks)
2. Suggest the next steps to address this issue of multicollinearity in this model. Suggest a few
approaches to handle the issue of multicollinearity in linear regression. (3 marks)
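For part 1, the VIF values listed above can be read through the identity VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the other predictors. The sketch below converts each VIF to its implied R_j² and flags predictors above a cutoff. The cutoff of 5 is an assumption (a common rule of thumb; some texts use 10).

```python
# VIF values as given in the question.
vif = {
    "Age of the car": 1.5,
    "Mileage": 1.8,
    "Engine size": 8.2,
    "Horsepower": 9.1,
    "Fuel efficiency": 7.6,
}

THRESHOLD = 5.0  # assumption: rule-of-thumb cutoff; 10 is also widely used


def implied_r2(v):
    """R^2 of predictor j regressed on the others, from VIF = 1 / (1 - R^2)."""
    return 1.0 - 1.0 / v


flagged = [name for name, v in vif.items() if v > THRESHOLD]

for name, v in vif.items():
    mark = "high" if name in flagged else "ok"
    print(f"{name:16s} VIF={v:4.1f}  implied R^2={implied_r2(v):.3f}  {mark}")
```

With this cutoff, Engine size, Horsepower, and Fuel efficiency are flagged: each has an implied R² above 0.85, meaning most of its variance is explained by the other predictors.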