PA Answers

The document covers various topics in data science, including clustering techniques like DBSCAN, activation functions for neural networks, PCA for customer satisfaction analysis, and comparisons between K-means and K-means++. It also discusses the implications of decision boundaries in KNN and Logistic Regression, as well as addressing multicollinearity in multiple linear regression models. Each section provides insights into the methodologies and their applications in real-world scenarios.

Uploaded by sndpkmar172
© All Rights Reserved


Q.1 Given the points A(1, 2), B(2, 3), C(2, 1), D(4, 5), E(5, 4), F(6, 6), G(8, 7), H(9, 9), and I(10, 8),
find the core points, outliers, and border points using DBSCAN. Take Eps = 2 and MinPts = 2. The
distance matrix showing Euclidean distance between each pair of the given points is shown below:
(8 marks)

Given:
Points: A(1, 2), B(2, 3), C(2, 1), D(4, 5), E(5, 4), F(6, 6), G(8, 7), H(9, 9), I(10, 8)
Parameters: Eps = 2, MinPts = 2
Definitions
• Core Point: At least MinPts points (including itself) within radius Eps.
• Border Point: Less than MinPts within Eps but reachable from a core point.
• Outlier (Noise): Neither core nor border.
Process
From the coordinates (which determine the distance matrix given in the question), the Eps = 2 neighbourhoods are:
• A: {B, C}, B: {A, C}, C: {A, B} — pairwise distances √2 ≈ 1.41 and 2.0
• D: {E}, E: {D} — distance √2; F has no neighbour (its nearest points D, E, G are at √5 ≈ 2.24)
• H: {I}, I: {H} — distance √2; G has no neighbour (its nearest points F, H, I are at √5 ≈ 2.24)
Results:
• Core Points: A, B, C, D, E, H, I (each has at least MinPts = 2 points, counting itself, within Eps = 2)
• Border Points: None (F and G have no core point within Eps)
• Outliers (Noise): F, G
• Clusters: {A, B, C}, {D, E}, {H, I}
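Worked directly from the coordinates, a short pure-Python sketch classifies the points (MinPts counts the point itself, the standard DBSCAN convention):

```python
from math import dist

# Coordinates and parameters from the question
points = {"A": (1, 2), "B": (2, 3), "C": (2, 1), "D": (4, 5), "E": (5, 4),
          "F": (6, 6), "G": (8, 7), "H": (9, 9), "I": (10, 8)}
EPS, MIN_PTS = 2.0, 2

# Eps-neighbourhood of each point (includes the point itself)
neigh = {p: {q for q in points if dist(points[p], points[q]) <= EPS}
         for p in points}

# Core: at least MinPts points (incl. itself) within Eps
core = {p for p in points if len(neigh[p]) >= MIN_PTS}
# Border: not core, but within Eps of some core point
border = {p for p in points if p not in core and neigh[p] & core}
# Noise: everything else
noise = set(points) - core - border

print(sorted(core), sorted(border), sorted(noise))
# → ['A', 'B', 'C', 'D', 'E', 'H', 'I'] [] ['F', 'G']
```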

Q.2 With respect to Deep Neural Networks answer the following questions. (3+3 = 6 marks)
1. Select an appropriate activation function for the output layer of the network and provide a
comprehensive explanation of its functionality and suitability. - Healthcare Diagnosis System:
The goal is to predict the presence or absence of various medical conditions based on patient
symptoms and medical history. For instance, the system aims to classify patients into categories
such as "Diabetes," "heart disease," and "Cancer." Clearly mention any assumptions made.

2. Select an appropriate activation function for the output layer of the network and provide a
comprehensive explanation of its functionality and suitability. - Social Media Sentiment
Analysis: The objective is to analyze user sentiment based on their posts and interactions on social
media platforms. The system categorizes each post into multiple sentiment categories such as
"Positive," "Negative," and "Neutral" to provide insights into user behavior and preferences.

a) Healthcare Diagnosis System (Multi-Class Classification)


Goal: Predict exclusive conditions like Diabetes, Heart Disease, or Cancer.
Recommended Activation Function: Softmax
Justification:
• Used for multi-class classification where each input belongs to exactly one class.
• Outputs probabilities summing to 1 across classes.
• Converts logits z_i to probabilities via softmax(z_i) = e^{z_i} / Σ_j e^{z_j}, so the largest logit receives the highest probability.
Assumption: Each patient has only one condition.

b) Social Media Sentiment Analysis (Multi-Label Classification)


Goal: Classify a post as possibly Positive, Negative, and/or Neutral.
Recommended Activation Function: Sigmoid
Justification:
• Appropriate for multi-label classification, where outputs are independent.
• Outputs an independent probability for each sentiment: σ(z_k) = 1 / (1 + e^{-z_k}) per label k.

Assumption: A post may express multiple sentiments.
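A minimal sketch contrasting the two output activations (the logit values below are illustrative, not from a trained network):

```python
import math

def softmax(z):
    """Exclusive multi-class probabilities; the outputs sum to 1."""
    m = max(z)                                  # subtract max for stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_each(z):
    """Independent per-label probabilities for multi-label outputs."""
    return [1 / (1 + math.exp(-v)) for v in z]

logits = [2.0, 1.0, 0.1]        # hypothetical output-layer scores

diag = softmax(logits)          # diagnosis head: exactly one condition
sent = sigmoid_each(logits)     # sentiment head: labels may co-occur

# Softmax probabilities sum to 1; independent sigmoid outputs need not
print(round(sum(diag), 6), round(sum(sent), 6))
```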

Q.3 Imagine a scenario where Principal Component Analysis (PCA) is utilized to analyze a dataset
encompassing various attributes related to customer satisfaction in a retail setting. The dataset
comprises variables such as purchase frequency, average transaction amount, customer feedback
scores, and loyalty program engagement. The loading vectors generated for PC1 and PC2 are
displayed below. (7 marks)
Feature                        V1 (PC1)   V2 (PC2)
Purchase frequency               0.6        0.3
Average transaction amount       0.4       -0.7
Customer feedback scores         0.7        0.2
Loyalty program engagement       0.5        0.6

Interpret the loading vectors concerning PC1 and PC2. Discuss the implications of a data point with high
PC1 and one with high PC2 regarding the customer segment it represents. Provide insights into the
similarities or differences among various characteristics of the customers indicated by the features.
Interpretation:
• PC1: All loadings are positive, largest for Customer feedback (0.7) and Purchase frequency (0.6)
⇒ Reflects Overall Satisfaction/Engagement
• PC2: High on Loyalty (0.6) but strongly negative on Avg. transaction (-0.7)
⇒ Represents a Loyalty-vs-Spend Tradeoff
Implications:
• High PC1: Highly satisfied, loyal customers making frequent purchases.
• High PC2: Low spenders but high loyalty.
• Shows two types of valued customers: high spenders vs highly loyal.
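To see the interpretation numerically, we can project two hypothetical standardized customer profiles (z-scores invented for illustration) onto the loading vectors from the question:

```python
# Loadings from the question: feature -> (PC1, PC2)
loadings = {
    "purchase_frequency": (0.6, 0.3),
    "avg_transaction":    (0.4, -0.7),
    "feedback_score":     (0.7, 0.2),
    "loyalty_engagement": (0.5, 0.6),
}

def pc_scores(profile):
    """Project a standardized profile onto PC1 and PC2."""
    pc1 = sum(profile[f] * v[0] for f, v in loadings.items())
    pc2 = sum(profile[f] * v[1] for f, v in loadings.items())
    return pc1, pc2

# Hypothetical customers (values are z-scores, illustrative only)
satisfied = {"purchase_frequency": 1.2, "avg_transaction": 0.5,
             "feedback_score": 1.5, "loyalty_engagement": 0.8}
loyal_saver = {"purchase_frequency": 0.2, "avg_transaction": -1.5,
               "feedback_score": 0.3, "loyalty_engagement": 1.4}

pc1_a, pc2_a = pc_scores(satisfied)    # expect a large PC1 score
pc1_b, pc2_b = pc_scores(loyal_saver)  # expect a large PC2 score
```

The frequent, satisfied buyer scores high on PC1, while the loyal low spender scores high on PC2, matching the two customer segments described above.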

Q.4 Compare and contrast the performance of K-means clustering with K-means++ initialization
against the traditional K-means algorithm with random initialization. Give an example to support your
answer. (4 marks)
Traditional K-Means:
• Initializes centroids randomly.
• Can result in poor convergence or local minima.
K-Means++:
• Selects centroids with better spacing.
• Improves clustering accuracy and speed.
Example:
Suppose we cluster customer purchase patterns.
• Random K-means might initialize centroids close together → Poor separation.
• K-means++ spreads centroids → Better cluster separation and convergence.
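The difference is easiest to see in the seeding step itself. Below is a pure-Python sketch of k-means++ seeding (the two "purchase pattern" blobs are invented for illustration); random initialization would instead sample all k centroids uniformly and can land both in the same blob:

```python
import random
from math import dist

def kmeanspp_init(points, k, rng):
    """k-means++ seeding: each new centroid is drawn with probability
    proportional to its squared distance from the nearest chosen centroid."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        d2 = [min(dist(p, c) ** 2 for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

# Two well-separated blobs of customer purchase patterns (illustrative)
blob_a = [(1, 1), (1, 2), (2, 1), (2, 2)]
blob_b = [(9, 9), (9, 10), (10, 9), (10, 10)]
data = blob_a + blob_b

c1, c2 = kmeanspp_init(data, 2, random.Random(0))
# The squared-distance weighting makes the second centroid land in the
# opposite blob almost always, so the seeds start well spread.
```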

Q.5 Answer the following questions with respect to the following figure. (10 Marks)

a) Suppose the above figure shows decision boundaries for a KNN and a Logistic Regression model
applied on a 2D dataset. State which decision boundary (A or B) belongs to which algorithm (KNN
or Logistic Regression). Explain why. [4 Marks]

b) What function is denoted by the following equation? In which machine learning algorithm (KNN
or Logistic regression) is it used? [1 Mark]

c) What would be the equation of the decision boundary for the ML algorithm referred to in question
b)? Explain [2 Marks]

d) With reference of question b), explain the function and its necessity in the context of classification
problems? [3 Marks]
a) Which boundary belongs to which model? (4 marks)
• Decision Boundary A: KNN (non-linear, jagged)
• Decision Boundary B: Logistic Regression (smooth, linear)
Explanation:
• KNN decision boundary depends on local neighbours → irregular shape.
• Logistic Regression uses linear combinations → smooth boundary.

b) What function is shown? Which algorithm? (1 mark)


• Function: Sigmoid Function, σ(z) = 1 / (1 + e^{-z})


• Used in: Logistic Regression

c) Equation of Decision Boundary? (2 marks)


At the decision threshold σ(z) = 0.5, which holds exactly when z = 0.
Then the decision boundary is: w^T x + b = 0, i.e. w1x1 + w2x2 + b = 0, a straight line in the 2D feature space.

d) Function and Necessity? (3 marks)


• The sigmoid maps any real-valued input into the (0, 1) range.
• Helps interpret the output as the probability of belonging to a class.
• Essential for classification decisions (e.g., assign class 1 if σ(z) > 0.5).
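A small sketch of the sigmoid and the equivalent decision rule (the weights are hypothetical, not learned from data):

```python
import math

def sigmoid(z):
    """Maps any real score into (0, 1), interpretable as P(y = 1 | x)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients of a 2-feature logistic regression
w, b = [1.5, -2.0], 0.5

def predict(x):
    z = w[0] * x[0] + w[1] * x[1] + b     # linear score
    return 1 if sigmoid(z) > 0.5 else 0   # same test as z > 0

# Points on opposite sides of the line 1.5*x1 - 2*x2 + 0.5 = 0
print(predict([1.0, 0.0]), predict([0.0, 1.0]))  # → 1 0
```

Because σ(z) > 0.5 exactly when z > 0, thresholding the probability at 0.5 is the same as testing which side of the linear boundary the point falls on.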

Q.6 Suppose you are building a multiple linear regression model to predict the selling price of cars
based on the following predictor variables: (5 Marks)
• Age of the car (in years)
• Mileage (in thousands of miles)
• Engine size (in liters)
• Horsepower
• Fuel efficiency (in miles per gallon)
You calculate the Variance Inflation Factors (VIF) for each predictor variable, and you obtain the
following values:
• Age of the car: VIF = 1.5
• Mileage: VIF = 1.8
• Engine size: VIF = 8.2
• Horsepower: VIF = 9.1
• Fuel efficiency: VIF = 7.6
Using this information, answer the following questions:
1. Based on the VIF values, which predictors show signs of multicollinearity? Explain why. Clearly
mention any assumptions. (2 marks)
2. Suggest the next steps to address this issue of multicollinearity in this model. Suggest a few
approaches to handle the issue of multicollinearity in linear regression. (3 marks)

a) Identify Multicollinearity from VIFs (2 marks)


Variable          VIF   Interpretation
Age               1.5   OK
Mileage           1.8   OK
Engine Size       8.2   Moderate to high multicollinearity
Horsepower        9.1   High multicollinearity
Fuel Efficiency   7.6   Moderate multicollinearity
Assumption: VIF > 5 signals moderate and VIF > 10 severe multicollinearity; here predictors with VIF > 5 are flagged as problematic.

b) Addressing Multicollinearity (3 marks)


Recommended Solutions:
1. Remove one of the correlated variables (e.g., Horsepower or Engine Size).
2. Combine predictors (e.g., create power-to-weight ratio).
3. Use Ridge Regression (L2 penalty) to reduce impact of correlation.
4. Apply PCA or Factor Analysis for dimensionality reduction.
Goal: Ensure more stable, interpretable regression coefficients.
