ml_exam_answers

The document discusses various machine learning concepts including Cross-Validation, Least Squares Regression, Ridge and Lasso Regression, Multivariate and Regularized Regression, and the Perceptron model. It explains techniques like PCA and SVD for dimensionality reduction and provides insights into the Backpropagation algorithm for training neural networks. Each section includes definitions, examples, advantages, and conclusions to summarize the key points of the methods discussed.

Uploaded by

rahul
Copyright
© © All Rights Reserved

Machine Learning Exam Answers
Q4. What is Cross-Validation and why is it used?
Definition:
Cross-Validation is a model evaluation technique used to assess how well a machine learning model performs on unseen
(test) data. It helps us check if the model is generalizing properly or if it is overfitting/underfitting.

Instead of using just one train-test split, cross-validation splits the data into multiple parts and tests the model on each
part. This gives a more reliable estimate of model performance.

Why is Cross-Validation Used?


1. To Detect Overfitting

Cross-validation reveals whether the model is just memorizing the training data.

It evaluates the model on different subsets of the data.

2. More Reliable Performance Estimation

Instead of relying on one random test set, it averages results across multiple validations.

Gives a better estimate of accuracy, precision, etc.

3. Uses Data Efficiently

All the data gets used for both training and testing, just in different rounds.

This is helpful when data is limited.

4. Helps with Model Selection

It helps choose the best model or parameters based on consistent performance across folds.

How It Works (K-Fold Cross-Validation):


1. The dataset is split into K equal parts (folds).
2. The model is trained on K-1 folds and tested on the remaining fold.

3. This process is repeated K times, with each fold used once for testing.

4. The performance is averaged across all folds.
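
The four steps above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the names `k_fold_indices`, `cross_validate`, and the `train_and_score` callback are assumptions made for this example, and the folds are taken in order without shuffling.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n_samples, k, train_and_score):
    """Average train_and_score(train_idx, test_idx) over the k folds."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i, then test on fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k


# Hypothetical scorer: it only reports the training-set size, to show the split.
average_train_size = cross_validate(10, 5, lambda train, test: len(train))
```

In practice the samples would usually be shuffled (or stratified by class) before splitting.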

Types of Cross-Validation:
1. K-Fold Cross-Validation – Most common type.

2. Stratified K-Fold – Ensures same class proportion in each fold (good for classification).

3. Leave-One-Out Cross-Validation (LOOCV) – Each fold contains only one sample, so K equals the number of samples.

4. Repeated K-Fold – K-Fold repeated several times with different random splits for a more stable estimate.

Advantages:
Reduces model evaluation bias.
Provides a better estimate of model performance.
Useful in parameter tuning and model comparison.

Disadvantages:
More computation time (as model is trained multiple times).
Can be slower for large datasets.

Diagram:
K-Fold Cross-Validation Flow (Just write 'Diagram' in answer sheet)

Conclusion:
Cross-validation is an essential tool in machine learning for testing model performance and detecting overfitting. It
estimates how well the model will work on new, unseen data and helps in selecting the best model for the problem.

Q5. Find the Least Squares Regression line and estimate a value
Definition:
Least Squares Regression is a method used to find the best-fitting straight line through a set of points in such a way that
the sum of the squared vertical distances (errors) between the actual points and the line is minimized.
This line is called the line of best fit or regression line.

Equation of the Line:


The general form of a linear regression line is:

y = a + bx

Where:

y = predicted value

x = input value

a = y-intercept

b = slope of the line

Formulas to Calculate a and b:


Let's say we have n data points: (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)

Then,

b = [n(Σxy) - (Σx)(Σy)] / [n(Σx²) - (Σx)²]

a = (Σy - bΣx) / n

Example:
Let's consider a small dataset:

| x | y |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 4 | 5 |

Now calculate:

Σx = 1 + 2 + 4 = 7

Σy = 2 + 3 + 5 = 10

Σxy = (1×2) + (2×3) + (4×5) = 2 + 6 + 20 = 28

Σx² = (1² + 2² + 4²) = 1 + 4 + 16 = 21

n = 3

Now plug into the formulas:

b = [3(28) - (7)(10)] / [3(21) - (7)²]

b = (84 - 70) / (63 - 49) = 14 / 14 = 1

a = (10 - 1×7) / 3 = (10 - 7) / 3 = 3 / 3 = 1


So, the regression line is:

y = 1 + 1x = x + 1

Estimate a Value:
Let's estimate y when x = 3:

y = x + 1 = 3 + 1 = 4

So, when x = 3, the predicted value of y is 4.
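
The formulas for a and b translate directly into code. The sketch below assumes a hypothetical helper name, `least_squares_line`, and reuses the three-point dataset from the example.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x using the summation formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


a, b = least_squares_line([1, 2, 4], [2, 3, 5])
y_at_3 = a + b * 3
print(a, b, y_at_3)  # 1.0 1.0 4.0
```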

Conclusion:
Least Squares Regression is a foundational method for modeling linear relationships between variables. It helps in
making predictions based on past data using a simple linear formula.

Q6. Differentiate between Ridge and Lasso Regression
| Point No. | Ridge Regression | Lasso Regression |
|---|---|---|
| 1 | Also called L2 Regularization | Also called L1 Regularization |
| 2 | Adds penalty equal to square of coefficients | Adds penalty equal to absolute value of coefficients |
| 3 | Keeps all features in the model, shrinks them closer to zero | Can reduce some coefficients exactly to zero |
| 4 | Useful when many features have small but significant effect | Useful when only few features are important |
| 5 | Does not perform feature selection | Performs automatic feature selection |
| 6 | Better when data has multicollinearity | Good when data is sparse or for feature elimination |
| 7 | Regularization term: λ Σ w² | Regularization term: λ Σ \|w\| |
| 8 | All variables remain in final model | Some variables may be removed |
| 9 | More stable but less interpretable | More interpretable due to feature selection |
| 10 | Used in scenarios where we don't want to lose any features | Used when we want to reduce dimensionality |

Q7. Explain Multivariate and Regularized Regression
1. Multivariate Regression:
It involves predicting multiple dependent variables using multiple independent variables.

It is an extension of linear regression where more than one output is predicted.

Equation: Y = a + b₁x₁ + b₂x₂ + ... + bₙxₙ, where Y is a vector of multiple outputs, so a and each bᵢ are vectors with one entry per output.

Example: Predicting both price and demand of a product using features like cost, marketing spend, etc.

2. Regularized Regression:
Regularization is used to prevent overfitting in regression models.

It adds a penalty term to the loss function to shrink the model coefficients.

Two main types:

Ridge Regression (L2): Penalizes square of coefficients.

Lasso Regression (L1): Penalizes absolute values and can eliminate features.

Regularized regression ensures better generalization on unseen data by reducing model complexity.
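
The difference between the two penalties is easiest to see with a single feature. The sketch below is a simplified illustration under stated assumptions: one feature, no intercept, and the losses Σ(y − b·x)² + λb² (ridge) and Σ(y − b·x)² + λ|b| (lasso). The function names and the tiny dataset are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2 for one feature, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_slope(xs, ys, lam):
    """Minimizer of sum((y - b*x)^2) + lam * |b| (soft-thresholding)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2, 0.0)  # pull toward zero, stop at zero
    return (shrunk if sxy >= 0 else -shrunk) / sxx


xs, ys = [-1, 0, 1], [-1, 0, 1]  # unregularized slope is 1
for lam in (0, 2, 4, 6):
    print(lam, ridge_slope(xs, ys, lam), lasso_slope(xs, ys, lam))
```

On this data the ridge slope keeps shrinking but never reaches zero, while the lasso slope becomes exactly zero once λ reaches 4.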

Conclusion:
Multivariate regression deals with multiple outputs, while regularized regression controls overfitting using penalties.

Both are extensions of traditional linear regression to handle real-world challenges better.

Q8. Explain Least Squares Regression for Classification
Least Squares is mainly for regression tasks but can also be used for classification by encoding class labels numerically
(e.g., 0 and 1).

The model fits a line or plane through the input data by minimizing the sum of squared differences between predicted and
actual values.

Steps:
1. Encode class labels: For binary classification, assign y = 0 for class A, y = 1 for class B.

2. Use linear regression formula: y = a + bx

3. If output > 0.5, predict class B; else, class A.


Limitations:
Not ideal for classification as it can predict values outside the [0,1] range.

Logistic regression is preferred because it produces probabilities between 0 and 1.

Example:
Dataset with hours studied (x) and exam result (pass=1, fail=0). Use linear regression to fit a line and classify students.
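
A minimal sketch of this example follows. The six data points are invented for illustration, and `fit_line` and `classify` are hypothetical helper names. It also shows the limitation noted above: the raw scores fall outside the [0, 1] range.

```python
def fit_line(xs, ys):
    """Least squares fit of y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


def classify(x, a, b, threshold=0.5):
    """Predict class 1 (pass) if the fitted line exceeds the threshold."""
    return 1 if a + b * x > threshold else 0


hours = [1, 2, 3, 4, 5, 6]    # hypothetical hours studied
passed = [0, 0, 0, 1, 1, 1]   # fail = 0, pass = 1

a, b = fit_line(hours, passed)
raw_scores = [a + b * h for h in hours]
predictions = [classify(h, a, b) for h in hours]
```

Here the predictions match the labels, but the raw score is negative for 1 hour and above 1 for 6 hours, so it cannot be read as a probability.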

Conclusion:
Least Squares Regression can be adapted for classification but is not optimal. It's better suited for regression tasks. For
classification, logistic regression or other methods are more accurate.

Q9. What is the Curse of Dimensionality? Explain PCA with an example
Curse of Dimensionality:
It refers to the problems that arise when we work with high-dimensional data (data with many features).

As the number of dimensions increases:

The data becomes sparse (spread out).

Distances between points increase and become nearly equal, so nearest and farthest neighbours are hard to tell apart.

Algorithms like clustering and classification become less effective.

Model becomes more complex and prone to overfitting.
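
The effect on distances can be observed with a small experiment. This is a rough sketch under assumptions: points are drawn uniformly from the unit hypercube with a fixed seed, and `relative_contrast` is a hypothetical name for (farthest − nearest) / nearest distance from one query point.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


contrast_low = relative_contrast(2)     # few dimensions: neighbours differ a lot
contrast_high = relative_contrast(500)  # many dimensions: distances look alike
print(contrast_low, contrast_high)
```

A small contrast in high dimensions means distance-based methods such as k-nearest neighbours or clustering have little to work with.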

Problems due to Curse of Dimensionality:


1. Increased computational cost.

2. Poor model accuracy.

3. Difficult visualization.

4. Data becomes less meaningful.

PCA (Principal Component Analysis):


PCA is a technique used for dimensionality reduction.
It transforms the data into a new set of axes called principal components.

These components capture the maximum variance in the data using fewer dimensions.

PCA helps in:

Reducing overfitting

Improving model performance

Better visualization of data

Steps in PCA:
1. Center the data (mean = 0); standardizing also scales each feature to unit variance.

2. Compute the covariance matrix.

3. Find eigenvalues and eigenvectors.

4. Select the top k eigenvectors to form a new dataset.

5. Project the original data onto this new space.

Example:
If you have 10 features and PCA finds that 2 components explain 95% of the variance, you can reduce the dataset from
10D to 2D.

Diagram:
PCA transformation steps and variance plot

Conclusion:
PCA helps overcome the curse of dimensionality by reducing unnecessary features while keeping the important patterns.

Q10. Apply PCA on a dataset and find the principal component
Let's apply PCA step-by-step on a small dataset:

Dataset:

| X1 | X2 |
|---|---|
| 2 | 4 |
| 3 | 6 |
| 4 | 8 |
| 5 | 10 |

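Step 1: Center the data


Find the mean of each feature and subtract it from each value. (Full standardization would also divide by the standard deviation; this example only centers.)

Mean of X1 = 3.5

Mean of X2 = 7

Centered values:

| X1' | X2' |
|---|---|
| -1.5 | -3 |
| -0.5 | -1 |
| 0.5 | 1 |
| 1.5 | 3 |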

Step 2: Covariance Matrix


Using the sample covariance (dividing by n − 1 = 3): Cov(X1', X1') = 1.67, Cov(X2', X2') = 6.67, Cov(X1', X2') = 3.33

Covariance matrix:

| | X1 | X2 |
|---|---|---|
| X1 | 1.67 | 3.33 |
| X2 | 3.33 | 6.67 |

Step 3: Eigenvalues and Eigenvectors


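Eigenvalues: 8.33 and 0 (the second is exactly zero because X2 = 2 × X1, so the covariance matrix has determinant 0)

Principal component corresponds to the largest eigenvalue → 8.33

Step 4: Principal Component


The first principal component is the eigenvector corresponding to 8.33, which is [1, 2] / √5 ≈ [0.447, 0.894].

It explains all the variance in this dataset.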
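
The four steps can be checked with a short script. This is a sketch for two features only: it uses the closed-form eigenvalues of a symmetric 2×2 matrix and assumes a non-zero covariance between the features. The function name `pca_2d` is hypothetical.

```python
import math


def pca_2d(points):
    """Return (eigenvalues, first principal component, 1D scores) for 2D data."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(p[0] - mean_x, p[1] - mean_y) for p in points]

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] (divide by n - 1).
    sxx = sum(cx * cx for cx, _ in centered) / (n - 1)
    syy = sum(cy * cy for _, cy in centered) / (n - 1)
    sxy = sum(cx * cy for cx, cy in centered) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root

    # Eigenvector for lam1 (valid when sxy != 0), scaled to unit length.
    vx, vy = sxy, lam1 - sxx
    norm = math.hypot(vx, vy)
    pc1 = (vx / norm, vy / norm)

    # Project the centered data onto the first principal component.
    scores = [cx * pc1[0] + cy * pc1[1] for cx, cy in centered]
    return (lam1, lam2), pc1, scores


(lam1, lam2), pc1, scores = pca_2d([(2, 4), (3, 6), (4, 8), (5, 10)])
print(round(lam1, 2), round(abs(lam2), 2), [round(c, 3) for c in pc1])
```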

Diagram:
Showing data before and after PCA projection
Conclusion:
PCA reduces the data from 2D to 1D; because X2 is exactly 2 × X1 here, no information is lost.

Q11. Explain SVD (Singular Value Decomposition) and its applications
What is SVD?
Singular Value Decomposition (SVD) is a technique used to factorize a matrix into three components:

A = U × Σ × Vᵀ

Where:

A = original matrix

U = left singular vectors

Σ = diagonal matrix with singular values

Vᵀ = transpose of right singular vectors

Purpose:
To reduce data dimensions

Extract important patterns

Solve systems of linear equations

Handle noise in data

Applications of SVD:
1. Dimensionality Reduction

Similar to PCA, SVD helps compress large datasets into fewer features.

2. Image Compression

SVD can reduce image size while maintaining quality.

3. Latent Semantic Analysis (NLP)

Used in text analysis to find hidden relationships between words and documents.
4. Recommender Systems

Helps find hidden patterns in user preferences.

5. Data Noise Reduction

Removes small singular values that represent noise.

Example:
Given a 3×2 matrix A:

| 2 4 |
| 1 3 |
| 0 0 |

Using SVD, we factor A into U, Σ, and Vᵀ. We can use just the largest singular values to reconstruct A with minimal error.
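
For a matrix with two columns, the singular values can be found by hand as the square roots of the eigenvalues of AᵀA. The sketch below does this for the example matrix and builds the rank-1 approximation from the largest singular value. It is a hand-rolled illustration (the name `rank1_svd_summary` is hypothetical and the columns are assumed not to be orthogonal), not a general SVD routine.

```python
import math


def rank1_svd_summary(A):
    """For an m x 2 matrix A, return (s1, s2, rank-1 approximation, error)."""
    # Entries of the symmetric 2x2 matrix G = A^T A.
    g11 = sum(r[0] * r[0] for r in A)
    g12 = sum(r[0] * r[1] for r in A)
    g22 = sum(r[1] * r[1] for r in A)

    # Eigenvalues of G; singular values are their square roots.
    half_trace = (g11 + g22) / 2
    root = math.sqrt(((g11 - g22) / 2) ** 2 + g12 ** 2)
    lam1 = half_trace + root
    lam2 = max(half_trace - root, 0.0)
    s1, s2 = math.sqrt(lam1), math.sqrt(lam2)

    # Top right singular vector v1 (assumes g12 != 0), unit length.
    vx, vy = g12, lam1 - g11
    norm = math.hypot(vx, vy)
    v1 = (vx / norm, vy / norm)

    # Rank-1 approximation: project every row of A onto v1.
    approx = []
    for r in A:
        proj = r[0] * v1[0] + r[1] * v1[1]
        approx.append([proj * v1[0], proj * v1[1]])

    # Frobenius norm of the reconstruction error.
    err = math.sqrt(sum((r[j] - a[j]) ** 2 for r, a in zip(A, approx) for j in range(2)))
    return s1, s2, approx, err


s1, s2, approx, err = rank1_svd_summary([[2.0, 4.0], [1.0, 3.0], [0.0, 0.0]])
print(round(s1, 3), round(s2, 3), round(err, 3))
```

The reconstruction error of the rank-1 approximation equals the discarded singular value, which is why dropping small singular values loses little.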

Diagram:
SVD matrix breakdown and compression

Conclusion:
SVD is a powerful tool used across many fields to simplify complex data, reduce storage, and improve understanding of
patterns.

Q12. Explain the Perceptron model with bias
What is a Perceptron?
A perceptron is the simplest type of artificial neuron used in machine learning. It is a part of neural networks and is used
for binary classification tasks.

Structure of Perceptron:
1. Inputs (x₁, x₂, ..., xₙ)

2. Weights (w₁, w₂, ..., wₙ)

3. Bias (b)

4. Summation unit: Calculates weighted sum + bias.

5. Activation function: Applies a threshold to decide output (0 or 1).


Output Formula:
y = f(w·x + b)

Where:

w·x = weighted sum

b = bias

f = activation function (usually step function)

Why Bias is Important:


Bias helps shift the decision boundary away from the origin.

Without bias, the decision boundary is forced to pass through the origin.

Example:
Let x₁ = 1, x₂ = 0

Weights: w₁ = 2, w₂ = -1

Bias: b = 1

Weighted sum = (2×1) + (-1×0) + 1 = 3

Activation: Step function gives output = 1 (class A)
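
The same calculation as a small function. This is a sketch that assumes a step activation firing at net ≥ 0; the names `step` and `perceptron` are chosen for this example.

```python
def step(net):
    """Step activation: 1 if the net input is non-negative, else 0."""
    return 1 if net >= 0 else 0


def perceptron(inputs, weights, bias):
    """Return (net input, output) for y = f(w·x + b)."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return net, step(net)


net, y = perceptron([1, 0], [2, -1], 1)
print(net, y)  # 3 1
```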

Diagram:
Perceptron model showing input, weights, summation, bias, and activation

Conclusion:
The perceptron with bias is a fundamental building block in neural networks. It can learn simple decision boundaries and
is the basis for more advanced models.

Q13. Implement the OR logic gate using single-layer Perceptron with learning rate and weights
OR Gate Truth Table:
Input X1 Input X2 Output Y
0 0 0
0 1 1
1 0 1
1 1 1
Perceptron Algorithm:
Initial Weights: w1 = 0, w2 = 0

Bias: b = 0

Learning Rate (η): 1

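Activation Function: Step function, f(net) = 1 if net > 0, else 0

Formula: y = f(w1·x1 + w2·x2 + b)

Update rule: w ← w + η × (target − output) × x, b ← b + η × (target − output)

Training Steps:
Epoch 1:

1. Input (0,0) → net = 0 → output = 0 → target = 0 → no weight change

2. Input (0,1) → net = 0 → output = 0 → target = 1

Error = 1 → w2 = 0 + 1×1 = 1, b = 0 + 1×1 = 1

3. Input (1,0) → net = 1 → output = 1 → correct → no change

4. Input (1,1) → net = 2 → output = 1 → correct → no change

After this pass w1 = 0, w2 = 1, b = 1, but input (0,0) now gives net = 1 → output = 1, which is wrong, so training continues.

Epoch 2:

1. Input (0,0) → net = 1 → output = 1 → target = 0 → Error = −1 → b = 1 − 1 = 0

2. Input (0,1) → net = 1 → output = 1 → correct → no change

3. Input (1,0) → net = 0 → output = 0 → target = 1 → Error = 1 → w1 = 0 + 1×1 = 1, b = 0 + 1×1 = 1

4. Input (1,1) → net = 3 → output = 1 → correct → no change

Epoch 3:

1. Input (0,0) → net = 1 → output = 1 → target = 0 → Error = −1 → b = 1 − 1 = 0

2. Inputs (0,1), (1,0), (1,1) → net = 1, 1, 2 → all correct → no change

Epoch 4: all four inputs are classified correctly, so training stops.

Final Weights:
w1 = 1

w2 = 1

b = 0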
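
A minimal sketch of the perceptron learning rule on the OR data. It assumes a step function that fires only when the net input is strictly positive, and it repeats passes over the four inputs until a full pass makes no update. The function names are hypothetical.

```python
def step(net):
    # Assumption: output 1 only when the net input is strictly positive.
    return 1 if net > 0 else 0


def train_perceptron(samples, lr=1, max_epochs=20):
    """Train from zero weights; return (w1, w2, b, epochs used)."""
    w1 = w2 = b = 0
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for (x1, x2), target in samples:
            output = step(w1 * x1 + w2 * x2 + b)
            error = target - output
            if error != 0:
                # Perceptron rule: move the weights toward the target.
                w1 += lr * error * x1
                w2 += lr * error * x2
                b += lr * error
                errors += 1
        if errors == 0:  # a full pass with no mistakes: converged
            return w1, w2, b, epoch
    raise RuntimeError("did not converge within max_epochs")


OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w1, w2, b, epochs = train_perceptron(OR_DATA)
print(w1, w2, b, epochs)  # 1 1 0 4
```

Checking every input after training matters: a weight set that is correct for the inputs seen late in a pass can still misclassify (0,0).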

Conclusion:
Using the perceptron learning algorithm, the OR gate can be implemented successfully.

Diagram:
Perceptron architecture with two inputs and OR logic output

Q14. Draw and explain the Error Backpropagation Algorithm with flowchart
What is Backpropagation?
Backpropagation is an algorithm used to train multilayer neural networks.

It minimizes the error by updating weights using gradient descent.

It works by sending error backward from output to input layer.

Steps:
1. Forward Pass:

Compute output from input layer to output layer.

2. Compute Error:

Compare predicted output with actual output using loss function.

3. Backward Pass:

Calculate gradient of error w.r.t. each weight.

Propagate error from output to hidden layers.

4. Update Weights:

Adjust weights using learning rate and calculated gradients.

5. Repeat:

Perform steps for all training examples until error is minimized.

Mathematics Behind:
Error function: E = ½ (target – output)²

Weight update: w_new = w_old – η × (∂E/∂w)
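
The error function and weight update above can be traced on the smallest possible network. The sketch below assumes one input, one sigmoid hidden neuron, one sigmoid output neuron, and no biases. The names and starting values are hypothetical.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def forward(x, w1, w2):
    """Forward pass: input -> hidden -> output."""
    h = sigmoid(w1 * x)
    o = sigmoid(w2 * h)
    return h, o


def error(x, t, w1, w2):
    """E = 1/2 (target - output)^2."""
    _, o = forward(x, w1, w2)
    return 0.5 * (t - o) ** 2


def backprop_step(x, t, w1, w2, lr):
    """One gradient-descent update; returns new weights and the gradients."""
    h, o = forward(x, w1, w2)
    delta_o = (o - t) * o * (1 - o)       # dE/d(net input of output neuron)
    grad_w2 = delta_o * h                 # dE/dw2
    delta_h = delta_o * w2 * h * (1 - h)  # error propagated back to hidden
    grad_w1 = delta_h * x                 # dE/dw1
    return w1 - lr * grad_w1, w2 - lr * grad_w2, grad_w1, grad_w2


x, t, w1, w2 = 1.0, 1.0, 0.5, -0.3
new_w1, new_w2, grad_w1, grad_w2 = backprop_step(x, t, w1, w2, lr=0.5)
error_before = error(x, t, w1, w2)
error_after = error(x, t, new_w1, new_w2)
```

Comparing the analytic gradients with a finite-difference estimate is a common way to check a backpropagation implementation.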

Diagram:
Flowchart showing forward pass, error calculation, backpropagation, and weight update

Conclusion:
Backpropagation is essential for training deep neural networks by optimizing weights using error gradients.

Q15. Explain Artificial Neural Network (ANN) architecture and working
What is ANN?
An Artificial Neural Network (ANN) is a machine learning model inspired by the human brain. It is made up of
interconnected nodes (neurons) organized in layers.

ANN Architecture:
1. Input Layer:

Takes input features. Each neuron corresponds to one feature.

2. Hidden Layers:

One or more layers between input and output. They perform computations using weights and activation functions.

3. Output Layer:

Gives the final prediction or classification result.

4. Weights and Biases:

Control the strength of connections.

5. Activation Function:

Adds non-linearity to the network (e.g., sigmoid, ReLU).

Working:
Inputs are multiplied by weights, summed, and passed through activation.

This process continues layer by layer.

The network is trained using backpropagation to minimize error.

Diagram:
ANN with input, hidden, and output layers

Conclusion:
ANNs are powerful tools used in image recognition, language translation, and more. Their layered structure allows them
to learn complex patterns.
Q16. Explain Delta Learning Rule (LMS / Widrow-Hoff) with training process
What is Delta Rule?
Also known as Least Mean Square (LMS) or Widrow-Hoff Rule

It is used to update weights in neural networks to minimize error.

Formula:
Δw = η × (t – o) × x

Where:

η = learning rate

t = target output

o = actual output

x = input

Steps in Training:
1. Initialize weights and bias.

2. For each input, calculate output.

3. Compare with target to find error.

4. Update weights using delta rule.

5. Repeat until error is minimal.

Example:
If:

x = 1, target t = 1, output o = 0.6

η = 0.1

Then: Δw = 0.1 × (1 – 0.6) × 1 = 0.04

New weight = old weight + 0.04
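
A short sketch of the update and of repeated training. It assumes a linear unit with output o = w × x and a starting weight of 0.6, so that the first output matches the example. Both are assumptions for illustration.

```python
def delta_rule_update(weight, x, target, output, lr):
    """Return the weight after one delta-rule step: w + lr * (t - o) * x."""
    return weight + lr * (target - output) * x


# One step with the numbers from the example (starting weight assumed 0.6).
w = 0.6
w_after_one_step = delta_rule_update(w, x=1, target=1, output=0.6, lr=0.1)

# Repeating the step for a linear unit o = w * x drives the output to the target.
for _ in range(200):
    output = w * 1
    w = delta_rule_update(w, x=1, target=1, output=output, lr=0.1)
```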


Conclusion:
Delta rule is a basic but effective learning algorithm used in simple neural models.

Q17. What is Hebbian learning? Implement OR logic using Hebb Net
What is Hebbian Learning?
Hebbian learning is based on the rule: "Neurons that fire together, wire together."

If two neurons are active at the same time, their connection is strengthened.

Learning Rule:
Δw = η × x × y

Where:

x = input

y = output

η = learning rate

Implementing OR Gate:
OR Truth Table:

x1 x2 Output
0 0 0
0 1 1
1 0 1
1 1 1

Training Using Hebbian Rule:


Initial weights = 0

Learning rate = 1

For each input pair where output = 1:

Δw1 = η × x1 × y

Δw2 = η × x2 × y
Update weights accordingly:

For (0,1) → Δw = (0,1)

For (1,0) → Δw = (1,0)

For (1,1) → Δw = (1,1)

Final weights: w1 = 2, w2 = 2
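
The weight updates can be reproduced in a few lines. The firing threshold of 1 used to check the result is an assumption added for this sketch, since the Hebbian rule itself only produces the weights.

```python
def hebb_train(samples, lr=1):
    """Apply Δw = lr * x * y once per sample, starting from zero weights."""
    w1 = w2 = 0
    for (x1, x2), y in samples:
        w1 += lr * x1 * y
        w2 += lr * x2 * y
    return w1, w2


OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w1, w2 = hebb_train(OR_DATA)

# Assumed threshold: fire when the weighted sum is at least 1.
outputs = [1 if w1 * x1 + w2 * x2 >= 1 else 0 for (x1, x2), _ in OR_DATA]
print(w1, w2, outputs)  # 2 2 [0, 1, 1, 1]
```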

Diagram:
Hebbian network showing input, weight connections, and output

Conclusion:
Hebbian learning is a biologically inspired unsupervised learning rule that strengthens connections based on co-activation
of neurons.

Q18. What are Activation Functions? Explain Binary, Bipolar, Continuous, and Ramp types
What are Activation Functions?
Activation functions decide whether a neuron should be activated or not. They add non-linearity to the model, allowing
the network to learn complex patterns.

1. Binary Step Function


Output: 0 or 1

Formula: f(x) = 1 if x ≥ 0; f(x) = 0 if x < 0

Used in: Simple perceptron

Limitation: Non-differentiable

2. Bipolar Step Function


Output: -1 or +1

Formula: f(x) = 1 if x ≥ 0; f(x) = -1 if x < 0


Suitable when negative values are needed

3. Continuous (Sigmoid) Function


Output: Between 0 and 1

Formula: f(x) = 1 / (1 + e^(-x))

Smooth and differentiable

Used in deep learning

4. Ramp Function
Linearly increases within a range

Formula: f(x) = 0 if x < 0; f(x) = x if 0 ≤ x ≤ 1; f(x) = 1 if x > 1
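
The four formulas written as plain Python functions. This is a direct transcription for illustration, and the function names are chosen for this sketch.

```python
import math


def binary_step(x):
    """1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0


def bipolar_step(x):
    """+1 if x >= 0, else -1."""
    return 1 if x >= 0 else -1


def sigmoid(x):
    """Smooth output in (0, 1); very large negative x would overflow exp."""
    return 1 / (1 + math.exp(-x))


def ramp(x):
    """0 below 0, x between 0 and 1, 1 above 1."""
    if x < 0:
        return 0
    return x if x <= 1 else 1


for x in (-2, 0, 0.3, 2):
    print(x, binary_step(x), bipolar_step(x), round(sigmoid(x), 3), ramp(x))
```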

Conclusion:
Activation functions are crucial for learning and generalization in neural networks.

Diagram:
Graphs of all four activation functions

Q19. Discuss activation functions with formulas, graphs, and ranges
1. Sigmoid Function
Formula: f(x) = 1 / (1 + e^(-x))

Range: (0, 1)

Smooth curve

Used in: Binary classification

2. Tanh Function
Formula: f(x) = (e^x – e^(–x)) / (e^x + e^(–x))

Range: (–1, 1)
Zero centered

3. ReLU (Rectified Linear Unit)


Formula: f(x) = max(0, x)

Range: [0, ∞)

Used in: Deep networks

Fast and simple

4. Leaky ReLU
Formula: f(x) = x if x > 0 else 0.01x

Range: (–∞, ∞)

Fixes "dying ReLU" problem

5. Softmax
Formula: f(xᵢ) = e^(xᵢ) / ∑ⱼ e^(xⱼ)

Converts scores into probabilities

Used in: Multiclass classification
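
A sketch of tanh, ReLU, Leaky ReLU and softmax from the formulas above. Subtracting the maximum score inside `softmax` is a common numerical-stability step added here. It does not change the result.

```python
import math


def tanh(x):
    """(e^x - e^-x) / (e^x + e^-x), range (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x):
    """max(0, x), range [0, inf)."""
    return max(0, x)


def leaky_relu(x, slope=0.01):
    """x for positive inputs, a small slope times x otherwise."""
    return x if x > 0 else slope * x


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max so exp never overflows
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probabilities])
```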

Conclusion:
Each activation function has its own use case depending on the task.

Diagram:
Graphs of sigmoid, tanh, ReLU, leaky ReLU, and softmax

Q20. Explain the Expectation-Maximization (EM) algorithm for clustering
What is EM Algorithm?
EM is an iterative algorithm used for unsupervised learning.

It is often used to find clusters in data using Gaussian Mixture Models.


Two Main Steps:
1. E-Step (Expectation):

Calculate the probability of each data point belonging to a cluster using current parameters.

2. M-Step (Maximization):

Update the parameters (mean, variance, etc.) to maximize the likelihood using the probabilities.

Repeat E & M steps until convergence (no major changes).
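
The two steps can be sketched for a one-dimensional Gaussian Mixture Model. This is a minimal illustration under assumptions: a fixed number of iterations instead of a convergence test, initial variances of 1, equal initial weights, and a small floor on the variance. The data and starting means are invented.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, initial_means, n_iter=20):
    """Fit a 1D Gaussian mixture; return (weights, means, variances)."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: probability of each point belonging to each cluster.
        resp = []
        for x in data:
            probs = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weight, mean and variance of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a zero variance
    return weights, means, variances


data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]  # two hypothetical, well-separated groups
weights, means, variances = em_gmm_1d(data, initial_means=[0.0, 6.0])
print([round(m, 3) for m in means])
```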

Applications:
Clustering

Image segmentation

Missing data handling

Diagram:
Flowchart showing E-step and M-step iteratively updating cluster parameters

Conclusion:
EM helps find hidden structures (clusters) in data by alternating between estimating probabilities and optimizing
parameters.

Q21. Diagonalize a given matrix


What is Diagonalization?
A matrix A is diagonalizable if there exists an invertible matrix P and a diagonal matrix D such that: A = P D P⁻¹

Steps:
1. Find Eigenvalues (λ) by solving: det(A – λI) = 0

2. Find Eigenvectors for each λ by solving: (A – λI)x = 0

3. Form matrix P using eigenvectors as columns.

4. Matrix D will have eigenvalues on its diagonal.

Example:
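Matrix A =
| 4 1 |
| 2 3 |

Eigenvalues: det(A – λI) = λ² – 7λ + 10 = 0 → λ = 5, 2

Eigenvectors: [1, 1] for λ = 5 and [1, –2] for λ = 2

P =
| 1  1 |
| 1 –2 |

D =
| 5 0 |
| 0 2 |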
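
The steps can be checked numerically for this 2×2 matrix. The sketch assumes real, distinct eigenvalues and a non-zero top-right entry. The helper names are hypothetical.

```python
import math


def eig_2x2(a, b, c, d):
    """Eigenvalues and eigenvectors of [[a, b], [c, d]] (real, distinct, b != 0)."""
    half_trace = (a + d) / 2
    root = math.sqrt(half_trace ** 2 - (a * d - b * c))
    lam1, lam2 = half_trace + root, half_trace - root
    # First row of (A - lam*I) v = 0 gives v = (b, lam - a).
    return (lam1, lam2), ((b, lam1 - a), (b, lam2 - a))


def matmul2(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def inverse2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]


(lam1, lam2), (v1, v2) = eig_2x2(4.0, 1.0, 2.0, 3.0)
P = [[v1[0], v2[0]], [v1[1], v2[1]]]  # eigenvectors as columns
D = [[lam1, 0.0], [0.0, lam2]]        # eigenvalues on the diagonal
reconstructed = matmul2(matmul2(P, D), inverse2(P))
print(lam1, lam2, v1, v2)
```

Multiplying P D P⁻¹ back together should reproduce A, which is a quick way to confirm the decomposition.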

Conclusion:
Diagonalization simplifies matrix operations like powers and exponentials.

Q22. Explain Eigenvalues and Eigenvectors
Definitions:
For a square matrix A, an eigenvector v satisfies: A·v = λ·v

Here, λ is the eigenvalue and v is the eigenvector.

Steps to Find:
1. Solve det(A – λI) = 0 → gives eigenvalues

2. Substitute λ in (A – λI)v = 0 → gives eigenvectors

Example:
Matrix A =
| 2 0 |
| 0 3 |

Eigenvalues = 2, 3

Eigenvectors = [1, 0] and [0, 1]
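
The defining equation A·v = λ·v gives a direct way to test a candidate pair. A small sketch with hypothetical helper names:

```python
def mat_vec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def is_eigenpair(A, lam, v, tol=1e-9):
    """True if A·v equals lam·v within a tolerance."""
    Av = mat_vec(A, v)
    return all(abs(av - lam * x) <= tol for av, x in zip(Av, v))


A = [[2, 0], [0, 3]]
print(is_eigenpair(A, 2, [1, 0]), is_eigenpair(A, 3, [0, 1]), is_eigenpair(A, 2, [1, 1]))
# True True False
```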

Applications:
PCA (Dimensionality Reduction)

Stability analysis

Vibration modes

Diagram:
Shows how eigenvectors don't change direction after transformation
Q23. What is the Trace of a matrix? Mention its properties
Definition:
The trace of a matrix is the sum of diagonal elements.

For matrix A: Trace(A) = a₁₁ + a₂₂ + … + aₙₙ

Properties:
1. Trace(A + B) = Trace(A) + Trace(B)

2. Trace(cA) = c × Trace(A) (c is scalar)

3. Trace(AB) = Trace(BA)

4. Trace(Aᵀ) = Trace(A)

5. Trace is linear: follows addition and scalar multiplication

Example:
Matrix A =
| 1 2 |
| 3 4 |

Trace(A) = 1 + 4 = 5
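
The definition and some of the properties can be verified on small matrices. In this sketch, matrix B is a hypothetical second matrix chosen only to exercise the properties.

```python
def trace(M):
    """Sum of the diagonal elements of a square matrix."""
    return sum(M[i][i] for i in range(len(M)))


def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def transpose(M):
    return [list(row) for row in zip(*M)]


A = [[1, 2], [3, 4]]
B = [[0, 1], [5, 2]]  # hypothetical second matrix

print(trace(A))                                  # 5
print(trace(matmul(A, B)), trace(matmul(B, A)))  # 21 21
```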

Applications:
Used in eigenvalue sum

Characteristic equation

Optimization

Q24. Calculate Accuracy, Precision, Recall, and F1-Score from confusion matrix data
Confusion Matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |

Formulas:
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

Example:
TP = 70, TN = 50, FP = 10, FN = 20

Accuracy = (70+50)/150 = 80%

Precision = 70/(70+10) = 87.5%

Recall = 70/(70+20) = 77.78%

F1-Score ≈ 82.35%
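
The four formulas as one function, applied to the counts from the example. The function name is chosen for this sketch.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


accuracy, precision, recall, f1 = classification_metrics(tp=70, tn=50, fp=10, fn=20)
print(f"{accuracy:.2%} {precision:.2%} {recall:.2%} {f1:.2%}")
# 80.00% 87.50% 77.78% 82.35%
```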

Conclusion:
These metrics help evaluate classification model performance.

Q25. How do you measure the quality/performance of a classification model?
Metrics to Evaluate:
1. Accuracy:

% of correct predictions

Best when classes are balanced

2. Precision:

How many predicted positives are truly positive

3. Recall:
How many actual positives are correctly predicted

4. F1-Score:

Harmonic mean of precision and recall

Best for imbalanced data

5. ROC-AUC:

Area under ROC curve

Shows performance at different thresholds

6. Confusion Matrix:

Full view of TP, TN, FP, FN

7. Log Loss / Cross-Entropy:

Measures prediction confidence

Conclusion:
Multiple metrics should be considered to evaluate a model's true performance.

Q26. Implement XOR function using McCulloch-Pitts Model
XOR Truth Table:
x1 x2 Output
0 0 0
0 1 1
1 0 1
1 1 0

McCulloch-Pitts Model:
A single-layer MCP network cannot implement XOR directly.

XOR is not linearly separable.

It needs multi-layer logic:

Logic Construction:
XOR = (x1 AND NOT x2) OR (NOT x1 AND x2)

Use multiple MCP neurons to implement this logic.
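
One possible two-layer arrangement is sketched below. The weights (+1, −1), (−1, +1), (+1, +1) and the threshold of 1 for every neuron are assumptions chosen for this illustration. Other choices also work.

```python
def mcp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts neuron: fire (1) when the weighted sum reaches the threshold."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net >= threshold else 0


def xor(x1, x2):
    z1 = mcp_neuron([x1, x2], [1, -1], threshold=1)  # x1 AND NOT x2
    z2 = mcp_neuron([x1, x2], [-1, 1], threshold=1)  # NOT x1 AND x2
    return mcp_neuron([z1, z2], [1, 1], threshold=1)  # z1 OR z2


print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```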
Diagram:
Multi-layer MCP network showing logic gate connections for XOR

Conclusion:
XOR needs multi-layer McCulloch-Pitts model due to non-linear separability.

Q27. Implement ANDNOT function using McCulloch-Pitts Model
ANDNOT Truth Table:
x1 x2 Output
0 0 0
0 1 0
1 0 1
1 1 0

Logic:
ANDNOT(x1, x2) = x1 AND (NOT x2)

Weights & Threshold:


Weight for x1 = +1

Weight for x2 = –1

Threshold = 1 (the neuron outputs 1 when Net ≥ 1)

Neuron Output:
When x1 = 1 and x2 = 0 → Net = 1 → Output = 1

All other combinations → Net < 1 → Output = 0
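
The neuron can be checked against the full truth table. This is a minimal sketch using the weights and threshold given above. The function names are chosen for this example.

```python
def mcp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts neuron: fire (1) when the weighted sum reaches the threshold."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net >= threshold else 0


def andnot(x1, x2):
    # Excitatory weight +1 on x1, inhibitory weight -1 on x2, threshold 1.
    return mcp_neuron([x1, x2], [1, -1], threshold=1)


print([andnot(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 1, 0]
```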

Diagram:
MCP neuron with weights (+1, –1) and threshold 1

Conclusion:
ANDNOT can be implemented using a single MCP neuron with suitable weights and threshold.
