2019-20-I Q4_key

This document is a quiz for CS 771A: Introduction to Machine Learning, dated November 1, 2019, consisting of multiple-choice questions and mathematical derivations. It includes instructions for completing the quiz, true/false questions about machine learning concepts, and problems related to Gram matrices and derivatives. The quiz is designed for evaluation, with a total of 30 marks available.

CS 771A: Introduction to Machine Learning          Quiz 4 (01 Nov 2019)

Name: SAMPLE SOLUTIONS                             30 marks

Roll No:            Dept.:                         Page 1 of 2

Instructions:
1. This question paper contains 1 page (2 sides of paper). Please verify.
2. Write your name, roll number, department above in block letters neatly with ink.
3. Write your final answers neatly with a blue/black pen. Pencil marks may get smudged.
4. Don’t overwrite or scratch out answers, especially in the MCQs. We will entertain no requests for leniency.
5. Do not rush to fill in answers. You have enough time to solve this quiz.

Q1. Write T or F for True/False (write only in the box on the right-hand side) (8 × 2 = 16 marks)

1. The Adagrad method is a technique for choosing an appropriate batch size when training a deep network. Answer: F
2. The largest value the Gaussian kernel can take on any two points depends on the value of the bandwidth parameter used within the kernel. Answer: F
3. k-means++ initialization is one of the algorithms that cannot be kernelized easily since it involves probabilities and sampling. Answer: F
4. Suppose 𝐺 is the Gram matrix of 𝑛 data points 𝐱₁, …, 𝐱ₙ ∈ ℝ² with respect to the homogeneous polynomial kernel of degree 𝑝 = 2. Then 𝐺 must be positive semi-definite. Answer: T
5. If for some 𝐰* we have 𝑦ᵢ = ⟨𝐰*, 𝐱ᵢ⟩ for all 𝑖 ∈ [𝑛], then kernel regression with 𝐾(𝐱, 𝐲) = (⟨𝐱, 𝐲⟩ + 1)² cannot get zero training error w.r.t. the least-squares loss on this data. Answer: F
6. Kernel k-means clustering with the quadratic kernel results in a larger model size than what is possible if we had done linear k-means (i.e. with the linear kernel). Answer: T
7. A neural network with a single hidden layer and a single output node, with all nodes except the input-layer nodes using ReLU activation, will always learn a differentiable function. Answer: F
8. Dropout is a technique that takes a training set and randomly drops training points to reduce the training set size so that training can be done faster. Answer: F
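Two of the statements above can be checked numerically. This sketch (not part of the original quiz; the points and bandwidth values are made up) verifies that the Gaussian kernel's maximum value is exp(0) = 1 for every bandwidth γ (statement 2 is False) and that the Gram matrix of the homogeneous quadratic kernel ⟨𝐱, 𝐲⟩² is positive semi-definite (statement 4 is True):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))          # 5 points in R^2

# Statement 2: the Gaussian kernel attains its maximum K(x, x) = exp(0) = 1
# for every bandwidth, so the max does NOT depend on gamma.
for gamma in (0.1, 1.0, 10.0):
    K = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    assert np.isclose(K.max(), 1.0)

# Statement 4: the Gram matrix of the homogeneous quadratic kernel
# K(x, y) = <x, y>^2 is positive semi-definite (eigenvalues >= 0 up to
# floating-point noise).
G = (X @ X.T) ** 2
assert np.linalg.eigvalsh(G).min() >= -1e-9
```

The PSD check works because ⟨𝐱, 𝐲⟩² is itself a valid kernel (the feature map contains all degree-2 monomials), so every Gram matrix it produces is positive semi-definite.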
Q2. Suppose we have 𝑛 distinct data points 𝐱₁, …, 𝐱ₙ ∈ ℝ². Consider the Gram matrix 𝐺 w.r.t. the Gaussian kernel 𝐾(𝐱, 𝐲) = exp(−𝛾 ⋅ ‖𝐱 − 𝐲‖₂²). Answer in the boxes only. (6 marks)

2.1 Write down the value of trace(𝐺) as 𝛾 → 0. Answer: 𝑛
2.2 Write down the value of trace(𝐺) as 𝛾 → ∞. Answer: 𝑛
2.3 Write down the value of rank(𝐺) as 𝛾 → 0. Answer: 1
2.4 Write down the value of rank(𝐺) as 𝛾 → ∞. Answer: 𝑛
2.5 If instead of being distinct, all the points had been the same, i.e. 𝐱₁ = 𝐱₂ = ⋯ = 𝐱ₙ, write down the value of rank(𝐺) as 𝛾 → 0. Answer: 1
2.6 If instead of being distinct, all the points had been the same, i.e. 𝐱₁ = 𝐱₂ = ⋯ = 𝐱ₙ, write down the value of rank(𝐺) as 𝛾 → ∞. Answer: 1
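The intuition behind these answers: the diagonal of 𝐺 is always exp(0) = 1, so trace(𝐺) = 𝑛 for every 𝛾; as 𝛾 → 0 the off-diagonal entries also tend to 1, so 𝐺 tends to the rank-1 all-ones matrix, while as 𝛾 → ∞ (with distinct points) the off-diagonal entries vanish and 𝐺 tends to the identity. A minimal numerical sketch (not part of the quiz; point values and the γ extremes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 2))          # n = 4 distinct points in R^2
n = len(X)

def gram(gamma):
    """Gaussian-kernel Gram matrix K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

G_small = gram(1e-10)                    # gamma -> 0: G ~ all-ones matrix
G_large = gram(1e8)                      # gamma -> inf: G ~ identity matrix

# trace(G) = n in both limits, since the diagonal is exactly 1.
assert np.isclose(np.trace(G_small), n)
assert np.isclose(np.trace(G_large), n)

# rank -> 1 as gamma -> 0, rank -> n as gamma -> infinity.
assert np.linalg.matrix_rank(G_small, tol=1e-6) == 1
assert np.linalg.matrix_rank(G_large, tol=1e-6) == n
```

With all points identical (2.5 and 2.6), every pairwise distance is zero, so 𝐺 is the all-ones matrix for every 𝛾 and its rank is 1 in both limits.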
Page 2 of 2
Q3. Let 𝐱 = [1, −1]⊤, 𝐲 = [−1, 1]⊤ ∈ ℝ². Define the function 𝑓: ℝ² → ℝ² as 𝑓(𝐳) = 𝑧₁ ⋅ 𝐱 + 𝑧₂ ⋅ 𝐲 for any 𝐳 = [𝑧₁, 𝑧₂]⊤ ∈ ℝ². Define another function 𝑔: ℝ → ℝ² as 𝑔(𝑟) = [𝑟, 𝑟²]⊤ where 𝑟 ∈ ℝ. Let ℎ: ℝ → ℝ² be defined as ℎ(𝑟) = 𝑓(𝑔(𝑟)). Derive a general expression for 𝑑ℎ/𝑑𝑟 using the chain rule, giving the major steps of the derivation, and then evaluate 𝑑ℎ/𝑑𝑟 at 𝑟 = 3. (6 + 2 = 8 marks)

Let 𝐴 ∈ ℝ²ˣ² be the matrix with columns 𝐱 and 𝐲, i.e.

    𝐴 = [  1  −1
          −1   1 ],

so that 𝑓(𝐳) = 𝐴𝐳, which gives 𝐽_𝑓 = ∇𝑓 = 𝐴. (To see that the answer is indeed 𝐴 and not 𝐴⊤, think of a hypothetical example where 𝐳 = [𝑧₁, 𝑧₂, 𝑧₃]⊤ ∈ ℝ³ and 𝑓(𝐳) = 𝑧₁ ⋅ 𝐱 + 𝑧₂ ⋅ 𝐲 + 𝑧₃ ⋅ 𝐩 for 𝐩 = [1, 1]⊤: the Jacobian must then be a 2 × 3 matrix.) Next, we calculate 𝐽_𝑔 = [1, 2𝑟]⊤ (notice that this is a column vector, since it is not the gradient of a real-valued function but the Jacobian of a vector-valued function). Thus, we have

    𝑑ℎ/𝑑𝑟 = 𝐽_ℎ = 𝐽_𝑓 ⋅ 𝐽_𝑔 = [  1  −1 ] [  1 ]  =  [ 1 − 2𝑟 ]
                               [ −1   1 ] [ 2𝑟 ]     [ 2𝑟 − 1 ].

Note the dimensionality of 𝐽_ℎ, which fits our convention of Jacobians having dimensionality (output dims) × (input dims), since ℎ: ℝ → ℝ². At 𝑟 = 3 we have 𝐽_ℎ = [−5, 5]⊤.
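The chain-rule result can be sanity-checked with a central finite difference. This sketch (not part of the quiz; the function names merely mirror the notation above) compares the derived Jacobian 𝐽_ℎ = [1 − 2𝑟, 2𝑟 − 1]⊤ against a numerical derivative of ℎ:

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # columns are x and y

def h(r):
    """h(r) = f(g(r)) with g(r) = [r, r^2] and f(z) = Az."""
    g = np.array([r, r * r])
    return A @ g

def jacobian_h(r):
    """The chain-rule answer: J_h = A @ [1, 2r] = [1 - 2r, 2r - 1]."""
    return np.array([1 - 2 * r, 2 * r - 1])

r, eps = 3.0, 1e-6
numeric = (h(r + eps) - h(r - eps)) / (2 * eps)   # central difference
assert np.allclose(numeric, jacobian_h(r))
assert np.allclose(jacobian_h(r), [-5.0, 5.0])    # value at r = 3
```

Expanding ℎ directly gives the same thing: ℎ(𝑟) = [𝑟 − 𝑟², 𝑟² − 𝑟]⊤, whose componentwise derivative is exactly [1 − 2𝑟, 2𝑟 − 1]⊤.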

---------------------------------- END OF QUIZ ----------------------------------