QBANK_ML
2. (a) Describe the Perceptron Learning Algorithm (PLA) and briefly explain the
working principle of the algorithm.
(b) The classes attended by 10 students in machine learning and the marks they
obtained in the examination are provided in the following table. Using linear
regression, estimate the marks a student who attended 20 classes may obtain in
the examination.
Student  Classes  Marks     Student  Classes  Marks
1        28       43        6        28       39
2        27       39        7        26       36
3        23       27        8        21       36
4        27       36        9        22       31
5        24       34        10       28       37
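A minimal sketch of the computation this part expects, in Python (data taken from the table above; the closed-form slope/intercept formulas are the standard simple-regression estimates):

# Sketch for Q2(b): ordinary least-squares fit; data from the table above.
import numpy as np

x = np.array([28, 27, 23, 27, 24, 28, 26, 21, 22, 28])  # classes attended
y = np.array([43, 39, 27, 36, 34, 39, 36, 36, 31, 37])  # marks obtained

# Closed-form estimates: slope = cov(x, y) / var(x),
# intercept = mean(y) - slope * mean(x).
slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
intercept = y.mean() - slope * x.mean()

print(f"fitted line: marks = {slope:.3f} * classes + {intercept:.3f}")
print(f"predicted marks for 20 classes: {slope * 20 + intercept:.1f}")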
3. (a) Derive the linear regression formula for a single dependent variable.
(b) Consider the perceptron in two dimensions: h(x) = sign(wTx) where w = [w0,
w1, w2]T and x = [1, x1, x2]T. Technically, x has three coordinates, but we call
this perceptron two-dimensional because the first coordinate is fixed at 1.
(i) Show that the regions on the plane where h(x) = +1 and h(x) = -1 are
separated by a line. If we express this line by the equation x2 = ax1 + b, what are
the slope a and intercept b in terms of w0, w1, w2?
(ii) Draw a picture for the cases w = [3, 2, 1]T and w = -[3, 2, 1]T.
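One way to check the answer to (i) numerically, as a small Python sketch (not part of the question): on the boundary wTx = 0, so w0 + w1*x1 + w2*x2 = 0 and, assuming w2 ≠ 0, x2 = -(w1/w2)*x1 - w0/w2.

# Sketch for Q3(b)(i): the boundary is the line w0 + w1*x1 + w2*x2 = 0,
# i.e. x2 = -(w1/w2)*x1 - w0/w2 when w2 != 0, so a = -w1/w2, b = -w0/w2.
import numpy as np

def boundary_line(w):
    w0, w1, w2 = w
    return -w1 / w2, -w0 / w2  # slope a, intercept b

for w in (np.array([3.0, 2.0, 1.0]), -np.array([3.0, 2.0, 1.0])):
    a, b = boundary_line(w)
    print(f"w = {w}: x2 = {a:+.1f} * x1 + ({b:+.1f})")
# Note for (ii): w and -w define the same line; only the +1/-1 regions swap.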
7. (a) Explain the contexts where linear regression is used. Write the linear regression
algorithm in detail.
(b) Illustrate a simple learning model using the concept of input, output, learning
algorithm and hypothesis set.
(c) Briefly explain the difference between input space, feature space, and output
space.
9. (a) Why does the Perceptron Learning Algorithm (PLA) converge for linearly
separable data?
(b) Why does PLA fail for noisy or non-separable data?
(c) How does the Pocket Algorithm improve on PLA?
(d) Compare the Pocket Algorithm with PLA with respect to in-sample error.
10. (a) Derive the analytical solution for obtaining optimal values of parameters in
linear regression.
(b) State the differences between in-sample and out-of-sample errors in the context
of Hoeffding’s inequality.
(c) Discuss the impact of the number of observations and the number of hypotheses
in Hoeffding’s inequality.
12. (a) Calculate the growth function and break point for positive intervals (h(x) = +1
for a ≤ x ≤ b, h(x) = -1 otherwise) on N points.
(b) You are given 4 points N1, N2, N3 and N4. Calculate the maximum number of
dichotomies when the break point is 3.
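A brute-force sketch in Python for part (a) (the illustrative point values and the N(N+1)/2 + 1 formula are standard for positive intervals, not given in the question):

# Sketch for Q12(a): enumerate all positive-interval dichotomies on N sorted
# points. The count should match the growth function N*(N+1)/2 + 1; the first
# N where it falls below 2^N is the break point (here N = 3, since 7 < 8).
def positive_interval_dichotomies(n):
    dichotomies = set()
    for i in range(n + 1):
        for j in range(i, n + 1):
            # the interval [a, b] covers exactly the points with index in [i, j)
            dichotomies.add(tuple(+1 if i <= p < j else -1 for p in range(n)))
    return dichotomies

for n in range(1, 6):
    m = len(positive_interval_dichotomies(n))
    print(n, m, n * (n + 1) // 2 + 1, 2 ** n)  # count, growth fn, 2^N

For part (b), the Sauer bound gives B(4, 3) = C(4,0) + C(4,1) + C(4,2) = 11; see also the sketch under question 16(a).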
(c) Explain the bias-variance trade-off in the context of learning.
13. (a) Let B(N,k) represent the maximum number of dichotomies on N data points
with break point k. Derive the recursive bound on B(N,k).
(b) State and prove the analytical solution for the recursive bound on B(N,k).
(c) Discuss the differences between hypotheses and dichotomies with a suitable
example.
14. (a) Discuss the relation between VC dimension and growth function.
(b) How does the VC dimension affect overfitting and underfitting in machine
learning models?
(c) Find the VC dimension for the following hypothesis sets:
(i) Perceptron in R².
(ii) Positive intervals: h(x) = +1 for a ≤ x ≤ b, -1 otherwise.
(iii) Convex sets.
15. (a) Calculate the growth function and break point for (i) positive rays: h(x) = +1
for x ≥ a, h(x) = -1 for x < a, and (ii) convex sets, for N points.
(b) Give the mathematical definition of training and testing in the context of
Hoeffding’s inequality.
(c) Decompose the out-of-sample error into bias and variance using the squared
error measure.
16. (a) Suppose 4 input data points X1, X2, X3, and X4 need to be classified into two
classes, +1 and -1. Determine the maximum possible number of dichotomies for
these 4 data points under each of the following three conditions: (i) the
break point is 2, (ii) the break point is 3, (iii) the break point is 4.
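A short Python sketch that produces the three counts part (a) asks for (the Sauer bound formula is standard and not stated in the question):

# Sketch for Q16(a): Sauer's bound B(N, k) = sum_{i=0}^{k-1} C(N, i) is the
# maximum number of dichotomies on N points when k is a break point.
from math import comb

def B(n, k):
    return sum(comb(n, i) for i in range(k))

for k in (2, 3, 4):
    print(f"break point {k}: at most B(4, {k}) = {B(4, k)} dichotomies")
# Expected: 5, 11 and 15 respectively, out of 2^4 = 16 unconstrained labelings.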
(b) Prove that the VC dimension of the perceptron in d dimensions is d + 1.
17. (a) State and explain the Vapnik–Chervonenkis (VC) inequality in the context of
machine learning.
(b) Define the VC dimension and discuss the significance of VC dimension in
developing a machine learning model.
(c) Describe the relation between the number of observed data points and the VC
dimension of a model.
19. (a) After learning with a particular data set D of size N, the final hypothesis g(D) has
in-sample error Ein(g(D)) and out-of-sample error Eout(g(D)). Illustrate, using
learning curves, for a simple learning model and a complex one.
(b) Discuss the concept of generalization bound with reference to VC Dimension.
(c) Explain bias-variance analysis in the context of generalization in machine
learning.
20. (a) Derive the mathematical definition of bias and variance for a final hypothesis
g(D) applied on a dataset D.
(b) Explain the bias-variance trade-off in the context of generalization in machine learning.
(c) Describe the generalization bound with reference to the Vapnik–Chervonenkis
inequality.
21. (a) Derive the weight update equations of a feed-forward multi-layer perceptron
network using the back-propagation algorithm.
(b) Explain the main reasons why the back-propagation training algorithm might not
find a set of weights that minimizes the training error for a given feed-forward
neural network.
(c) Describe briefly the purpose of the momentum term in the back-propagation
algorithm.
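A minimal sketch of the update rule part (c) refers to (illustrative Python; the coefficient values eta = 0.1 and mu = 0.9 are assumptions, not from the question):

# Sketch for Q21(c): back-propagation update with a momentum term.
# delta_w(t) = -eta * grad_E + mu * delta_w(t-1); reusing the previous step
# damps oscillations and speeds progress where successive gradients agree.
def momentum_step(grad, prev_delta, eta=0.1, mu=0.9):
    delta = -eta * grad + mu * prev_delta
    return delta  # caller applies w += delta and stores delta for the next step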
22. (a) Explain how a multi-layer perceptron can be used as an estimator. Write all the
required steps in detail.
(b) Describe the significance of the learning rate and the bias in a neural network.
(c) Describe what is likely to happen when the learning rate used is too large,
and when it is too small. How can one optimize the learning rate?
25. (a) Assume we have a set of data from patients who visited Heritage Hospital
during the year 2017. A set of features (e.g., temperature, height) has also been
extracted for each patient. Our goal is to decide whether a new visiting patient
has diabetes, heart disease, or Alzheimer's (a patient can have one or more of
these diseases).
We have decided to use a neural network to solve this problem. We have two
choices: (i) train a separate neural network for each of the diseases, or
(ii) train a single neural network with one output neuron for each disease, but
with a shared hidden layer. Which method do you prefer? Justify your answer.
(b) Briefly explain momentum and how it is incorporated in the back-propagation
learning technique.
31. (a) Suppose a neural network with one hidden layer and a single weight w is used to
implement the function y = 2x + 3 + c, where x and y are the input and output
parameters respectively, and c is Gaussian noise (a random number). Derive
the update equation using the gradient descent approach to minimize the mean
squared error.
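A sketch of the update this part asks to derive, in Python (the data generation, learning rate and iteration count are illustrative assumptions):

# Sketch for Q31(a): with model y_hat = w*x and E = (1/N) * sum((y - w*x)^2),
# dE/dw = -(2/N) * sum((y - w*x) * x), so the gradient-descent update is
# w <- w + eta * (2/N) * sum((y - w*x) * x).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 100)
y = 2 * x + 3 + rng.normal(0.0, 0.1, 100)  # target with Gaussian noise c

w, eta = 0.0, 0.1  # initial weight and learning rate (illustrative)
for _ in range(200):
    grad = -2.0 * np.mean((y - w * x) * x)
    w -= eta * grad
print("learned w:", w)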
(b) You are asked to simulate the Boolean function x1 XOR x2 using a multi-layer
perceptron. Construct the network and explain how your network is able to
model the said function.
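One standard construction, as a Python sketch with hand-picked weights (the specific thresholds are one of many valid choices):

# Sketch for Q31(b): a 2-2-1 network computing x1 XOR x2 as
# (x1 OR x2) AND NOT (x1 AND x2), using step activations.
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h_or = step(x1 + x2 - 0.5)    # hidden unit 1: fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # hidden unit 2: fires only if both inputs are 1
    return step(h_or - h_and - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))  # prints the XOR truth table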
32. (a) Briefly explain the convolution, pooling and fully connected layers in a
convolutional neural network.
(b) An input volume of 48 × 48 × 3 is fed to a convolutional neural network. What
would be the output volume of a convolution layer when you apply 8 (eight) 5 ×
5 × 3 filters with stride 2 and zero-padding of size 1? Also calculate the
number of parameters introduced by this layer.
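The standard size arithmetic, as a sketch (the floor convention is assumed here, since 48 - 5 + 2 is not divisible by the stride):

# Sketch for Q32(b): conv-layer output size = floor((W - F + 2P) / S) + 1,
# parameters = (F * F * C_in + 1) * K, the +1 counting each filter's bias.
W, F, C_in, P, S, K = 48, 5, 3, 1, 2, 8

out = (W - F + 2 * P) // S + 1
params = (F * F * C_in + 1) * K

print(f"output volume: {out} x {out} x {K}")  # 23 x 23 x 8
print(f"parameters: {params}")                # (5*5*3 + 1) * 8 = 608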
38. (a) In the case of non-linearly separable data or data with outliers, a better solution
is achieved by using a soft margin. Derive the Lagrangian for the optimization
problem as defined by linear SVM soft-margin classification.
(b) State Mercer's condition for selecting a kernel function for a non-linear SVM.
39. (a) Describe the concept of regularization and over-fitting in machine learning.
(b) A linearly separable dataset is given in the following Table. Predict the class of
(0.6, 0.8) using a support vector machine classifier.
X1    X2    Y    Lagrange Multiplier
0.3   0.4   +1   5
0.7   0.6   -1   8
0.9   0.5   -1   0
0.7   0.9   -1   0
0.1   0.05  +1   0
0.4   0.3   +1   0
0.9   0.8   -1   0
0.2   0.01  +1   0
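A sketch of the prediction recipe this question (and question 42 below) relies on, in Python: only rows with a nonzero multiplier (the support vectors) contribute, w = sum over i of alpha_i * y_i * x_i, and b is recovered from any support vector.

# Sketch for Q39(b): predict sign(w . x + b) from the given multipliers.
import numpy as np

X = np.array([[0.3, 0.4], [0.7, 0.6]])   # support vectors from the table above
y = np.array([+1.0, -1.0])
alpha = np.array([5.0, 8.0])

w = (alpha * y) @ X                       # w = sum_i alpha_i * y_i * x_i
b = y[0] - w @ X[0]                       # b = y_s - w . x_s at a support vector
print("class of (0.6, 0.8):", int(np.sign(w @ np.array([0.6, 0.8]) + b)))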
41. Construct the primal problem and then derive the Lagrangian and its dual for the
optimization problem as defined by linear support vector machine (SVM)
classification.
42. A linearly separable dataset is given in the table below. Predict the class of (0.6, 0.8)
using a support vector machine classifier. Show all the relevant computations.

X1      X2      Y    Lagrange Multiplier
0.3858  0.4687  +1   65.5261
0.9218  0.4103  -1   0
0.7382  0.8936  -1   0
0.1763  0.0579  +1   0
0.4057  0.3529  +1   0
0.9355  0.8132  -1   0
0.2146  0.0099  +1   0
0.4871  0.611   -1   65.5261
43. (a) Suppose we have five 1-D data points x1 = 1, x2 = 2, x3 = 4, x4 = 5, x5 = 6, with
y1 = 1, y2 = 1, y3 = -1, y4 = -1, y5 = 1. When a polynomial kernel of degree two,
K(x, y) = (xy + 1)², is used and C is set to 100, we get the Lagrange multipliers as
follows:
α1 = 0, α2 = 2.5, α3 = 0, α4 = 7.333, α5 = 4.833
Identify the support vectors and derive the discrimination function.
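A sketch of the computation in Python (recovering b by enforcing f(x_s) = y_s at a support vector with alpha < C is the standard step; the evaluation point 3 is illustrative):

# Sketch for Q43: discriminant f(x) = sum_i alpha_i * y_i * K(x_i, x) + b
# with K(x, z) = (x*z + 1)^2; support vectors are the points with alpha_i > 0.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
alpha = np.array([0.0, 2.5, 0.0, 7.333, 4.833])

K = lambda a, b: (a * b + 1.0) ** 2
f_wo_b = lambda z: float(np.sum(alpha * y * K(x, z)))  # f(z) without the bias

print("support vectors:", x[alpha > 0])
b = y[1] - f_wo_b(x[1])           # enforce f(x2) = y2 at a support vector
print("b =", b)
print("f(3) =", f_wo_b(3.0) + b)  # example evaluation of the discriminant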
44. (a) Explain how a kernel function is used in non-linear support vector machines. Also
justify the statement that 'one can use infinite-dimensional spaces with the
kernel trick' from the perspective of non-linear SVM classification.
(b) Discuss the concepts of over-fitting and regularization in machine learning.
45. (a) Describe the concept of regularization in machine learning. State and explain
various regularization techniques considered to overcome the problem of
overfitting in a model.
(b) Discuss the differences between ridge and lasso regularization. Describe a
situation where ridge regularization should be chosen over lasso to achieve
better generalization with proper justification.