Assignment 5 (Sol.)
Introduction to Machine Learning
Prof. B. Ravindran
1. You are given the following neural network, which takes two binary-valued inputs x1, x2 ∈ {0, 1} and uses the threshold activation function (h(x) = 1 if x > 0; 0 otherwise). Which of the following logical functions does it compute?
Figure 1: Q1
(a) OR
(b) AND
(c) NAND
(d) None of the above.
Solution: B
You can construct the truth table and check which gate the network mimics:
x1  x2  output
0   0   0
0   1   0
1   0   0
1   1   1
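Since Figure 1 is not reproduced here, the following is a minimal sketch assuming illustrative weights w1 = w2 = 1 and bias −1.5 (one choice that realizes AND); it simply enumerates the truth table above for a single threshold unit.

# Truth table of a single threshold unit (weights/bias are assumed, not taken from Figure 1)
def threshold(z):
    return 1 if z > 0 else 0

w1, w2, b = 1.0, 1.0, -1.5   # assumed parameters that implement AND

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, threshold(w1 * x1 + w2 * x2 + b))   # prints the AND truth table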
2. We have a function which takes a two-dimensional input x = (x1, x2) and has two parameters w = (w1, w2), given by f(x, w) = σ(σ(x1 w1)w2 + x2) where σ(x) = 1/(1 + e^(−x)). We use backpropagation to estimate the right parameter values. We start by setting both the parameters to 0. Assume that we are given a training point x1 = 1, x2 = 0, y = 5. Given this information, answer the next two questions. What is the value of ∂f/∂w2?
(a) 0.5
(b) -0.25
(c) 0.125
(d) -0.5
Solution: C
Write σ(x1 w1)w2 + x2 as o2 and x1 w1 as o1.

∂f/∂w2 = (∂f/∂o2)(∂o2/∂w2)
∂f/∂w2 = σ(o2)(1 − σ(o2)) × σ(o1)
∂f/∂w2 = 0.5 × 0.5 × 0.5 = 0.125
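As a quick numerical cross-check (not part of the original solution), the same value can be recovered by finite differences; the function and data below follow the question's definitions.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f(x1, x2, w1, w2):
    # f(x, w) = sigma(sigma(x1*w1)*w2 + x2), as defined in the question
    return sigmoid(sigmoid(x1 * w1) * w2 + x2)

x1, x2 = 1.0, 0.0            # training point from the question
w1, w2 = 0.0, 0.0            # both parameters initialised to 0

# Analytic gradient: sigma(o2)*(1 - sigma(o2))*sigma(o1)
o1 = x1 * w1
o2 = sigmoid(o1) * w2 + x2
analytic = sigmoid(o2) * (1 - sigmoid(o2)) * sigmoid(o1)

# Finite-difference estimate of df/dw2
eps = 1e-6
numeric = (f(x1, x2, w1, w2 + eps) - f(x1, x2, w1, w2 - eps)) / (2 * eps)

print(analytic, numeric)     # both are approximately 0.125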
3. If the learning rate is 0.5, what will be the value of w2 after one update using the backpropagation algorithm?
(a) 0.0625
(b) -0.0625
(c) 0.5625
(d) - 0.5625
Solution: C
The update equation would be

w2 = w2 − λ ∂L/∂w2

where L is the loss function, here L = (y − f)².

w2 = w2 − λ × 2(y − f) × (−1) × ∂f/∂w2

Substituting f = σ(o2) = 0.5, y = 5, λ = 0.5 and ∂f/∂w2 = 0.125 from the previous question:

w2 = 0 − 0.5 × 2 × (5 − 0.5) × (−1) × 0.125 = 0.5625
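A minimal sketch of this single update, using the question's squared-error loss and learning rate 0.5 (variable names are mine):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2, y = 1.0, 0.0, 5.0
w1, w2 = 0.0, 0.0
lam = 0.5                                   # learning rate

# Forward pass
o1 = x1 * w1
o2 = sigmoid(o1) * w2 + x2
f = sigmoid(o2)

# dL/dw2 for L = (y - f)^2
df_dw2 = sigmoid(o2) * (1 - sigmoid(o2)) * sigmoid(o1)
dL_dw2 = 2 * (y - f) * (-1) * df_dw2

w2 = w2 - lam * dL_dw2
print(w2)                                   # 0.5625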
Solution: C
We will write the log likelihood as the following,

L = Σ_i log( (1/(σ√(2π))) exp(−(xi − µ)²/(2σ²)) )

L = K − Σ_i (xi − µ)²/(2σ²)

where K is a constant that does not depend on µ.
Now we need to maximize this L, which we do by setting ∂L/∂µ to 0; this gives µ = (1/N) Σ_i xi, the sample mean, which is option C.
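A small numerical illustration (with made-up data) that the sample mean indeed maximizes this log likelihood:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)   # synthetic sample (illustrative only)
sigma = 2.0                                     # variance assumed known

def log_likelihood(mu):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

# The sample mean should score at least as well as any nearby candidate.
candidates = np.linspace(x.mean() - 1, x.mean() + 1, 201)
best = candidates[np.argmax([log_likelihood(m) for m in candidates])]
print(x.mean(), best)                           # the two values agree (up to grid resolution)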
5. Continuing with the above question, assume that the prior distribution of the mean is also a Gaussian distribution, but with parameters mean µp and variance σp². Find the MAP estimate of the mean.

(a) µ_MAP = (σ²µp + σp² Σ_{i=1}^{N} xi) / (σ² + Nσp²)
(b) µ_MAP = (σ² + σp² Σ_{i=1}^{N} xi) / (σ² + σp²)
(c) µ_MAP = (σ² + σp² Σ_{i=1}^{N} xi) / (σ² + Nσp²)
(d) µ_MAP = (σ²µp + σp² Σ_{i=1}^{N} xi) / (N(σ² + σp²))
Solution: C

For a MAP estimate, we try to maximize f(µ)f(X|µ):

f(µ)f(X|µ) = (1/(σp√(2π))) exp(−(µ − µp)²/(2σp²)) × Π_i (1/(σ√(2π))) exp(−(xi − µ)²/(2σ²))

We will maximize this with respect to µ after taking a logarithm. This will yield the following equation:

Σ_i xi/σ² + µp/σp² − µ(N/σ² + 1/σp²) = 0

Thus the solution will be C.
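As a numerical cross-check (with synthetic data of my own choosing), the value of µ obtained by solving the equation above can be compared against a direct maximization of the log posterior:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=50)   # made-up observations
sigma, mu_p, sigma_p = 1.5, 0.0, 1.0          # assumed known variance and prior parameters
N = len(x)

# mu solving the derived equation: (sum(x)/sigma^2 + mu_p/sigma_p^2) / (N/sigma^2 + 1/sigma_p^2)
mu_map = (x.sum() / sigma**2 + mu_p / sigma_p**2) / (N / sigma**2 + 1 / sigma_p**2)

# Brute-force maximization of the (unnormalised) log posterior
def log_posterior(mu):
    return (-(mu - mu_p) ** 2 / (2 * sigma_p**2)
            - np.sum((x - mu) ** 2) / (2 * sigma**2))

grid = np.linspace(-5, 5, 100001)
mu_grid = grid[np.argmax([log_posterior(m) for m in grid])]

print(mu_map, mu_grid)                        # the two estimates agree to grid resolution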
6. Which of the following statement(s) about parameter estimation is/are true?

(a) MAP estimates suffer more from overfitting than maximum likelihood estimates.
(b) MAP estimates are equivalent to the ML estimates when the prior used in the MAP is a
uniform prior over the parameter space.
(c) One drawback of maximum likelihood estimation is that in some scenarios (hint: multinomial distribution), it may return probability estimates of zero.
(d) The parameter which minimizes the expected Bayesian L1 loss is the median of the posterior distribution.
Solution: B, C, D
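To make statement (d) concrete, here is a small simulation (my own illustrative setup) showing that the value minimizing the expected L1 loss over a stand-in posterior is its median:

import numpy as np

rng = np.random.default_rng(2)
theta = rng.gamma(shape=2.0, scale=1.0, size=100000)   # stand-in for samples from a posterior

def expected_l1(a):
    return np.mean(np.abs(theta - a))

candidates = np.linspace(0, 10, 2001)
best = candidates[np.argmin([expected_l1(a) for a in candidates])]

print(np.median(theta), best)   # the minimizer coincides with the median (up to grid resolution)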
7. Using the notation used in class and the tutorial document, evaluate the output of the neural network with a 3-3-1 architecture (2-dimensional input with 1 node for the bias term in both layers). The parameters are as follows:

α = [  1   0.2   0.4
      −1   0.3   0.5 ]

β = [ 0.3   0.4   0.5 ]

Using the sigmoid function as the activation function at both layers, the output of the network for an input of (0.8, 0.7) will be
(a) 0.6710
(b) 0.6617
(c) 0.6948
(d) 0.3369
Solution: C

This is a straightforward computation task. First pad x with 1 and make it the X vector,

X = [1, 0.8, 0.7]ᵀ

o1 = αX
a1 = [1, σ(o1)]ᵀ (the hidden-layer activations, padded with the bias term)
o2 = βa1
a2 = 1/(1 + e^(−o2))
a2 = 0.6948
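A minimal sketch of this forward pass (array shapes follow the α and β given above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha = np.array([[ 1.0, 0.2, 0.4],
                  [-1.0, 0.3, 0.5]])
beta = np.array([0.3, 0.4, 0.5])

X = np.array([1.0, 0.8, 0.7])              # input (0.8, 0.7) padded with the bias 1

o1 = alpha @ X                             # hidden-layer pre-activations
a1 = np.concatenate(([1.0], sigmoid(o1)))  # hidden activations, padded with the bias
o2 = beta @ a1
a2 = sigmoid(o2)

print(round(float(a2), 4))                 # 0.6948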
Solution: A, B, C

A - Neural networks are also called universal approximators because of their ability to learn complex functions by varying the number of layers and nodes.

B - The decision from any SVM is given by ŷ = sgn(Σ_i αi K(xi, x) + b), where the xi are the support vectors and K is the Gaussian kernel. This can be implemented using an RBF neural network. The first layer would be the input layer. The second layer would be the radial basis nodes, with as many nodes as there are support vectors in the SVM, and a single node in the final layer. The centers of the Gaussian basis functions would be the support vectors of the SVM, and their σ would be the same as that of the kernel. The weights connecting the hidden layer to the last layer would be given by the αi and a bias b. The activation function for the last layer would be the sgn function.

C - This is true because bad initializations might hinder the learning of the neural network; for example, if you use all zeros, the network might not be able to learn anything because of zero gradients.
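A sketch of the construction described in B, using hypothetical support vectors, coefficients αi, bias b, and kernel width γ (none of these come from the question); it simply shows that the RBF-network forward pass reproduces the SVM decision function:

import numpy as np

# Hypothetical SVM pieces (illustrative values only)
support_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
alphas = np.array([0.7, -0.4, 1.1])    # signed coefficients alpha_i
b = -0.2
gamma = 0.5                            # Gaussian (RBF) kernel width parameter

def rbf_kernel(u, v):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def svm_decision(x):
    s = sum(a * rbf_kernel(sv, x) for a, sv in zip(alphas, support_vectors))
    return np.sign(s + b)

def rbf_network(x):
    # Hidden layer: one Gaussian basis node per support vector (centers = support vectors)
    hidden = np.array([rbf_kernel(c, x) for c in support_vectors])
    # Output layer: weights alpha_i, bias b, sgn activation
    return np.sign(hidden @ alphas + b)

x = np.array([0.3, 0.8])
print(svm_decision(x), rbf_network(x))   # identical by construction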