Homework 0: Mathematical Background For Machine Learning
Instructions
• Submit your homework to Autolab by 11:59 p.m. Friday, January 25, 2019.
• Late homework policy: Homework is worth full credit if submitted before the due date, half
credit during the next 48 hours, and zero credit after that. Additionally, you are permitted
to drop 1 homework.
• Collaboration policy: For this homework only, you are welcome to collaborate on any of the
questions with anybody you like. However, you must write up your own final solution, and you
must list the names of anybody you collaborated with on this assignment. The point of this
homework is not really for us to evaluate you, but instead for you to refresh the background
needed for this class, and to fill in any gaps you may have.
Homework 0: Background Test 10-315 Introduction to Machine Learning
3. Is X invertible? If so, give the inverse; if not, explain why not.
Solution: Yes.
$$X^{-1} = \frac{1}{(2 \times 3) - (1 \times 4)} \begin{pmatrix} 3 & -4 \\ -1 & 2 \end{pmatrix} = \frac{1}{2} \begin{pmatrix} 3 & -4 \\ -1 & 2 \end{pmatrix} = \begin{pmatrix} 3/2 & -2 \\ -1/2 & 1 \end{pmatrix} \quad (3)$$
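As a quick numeric sanity check, NumPy can verify the inverse. The matrix X = [[2, 4], [1, 3]] is an assumption recovered from the determinant computation (2 × 3) − (1 × 4), since the problem statement itself is not reproduced here:

```python
import numpy as np

# X = [[2, 4], [1, 3]] is inferred from the determinant (2*3) - (1*4)
# in the solution above; the original problem statement is not shown.
X = np.array([[2.0, 4.0], [1.0, 3.0]])

X_inv = np.linalg.inv(X)
print(X_inv)        # [[ 1.5 -2. ], [-0.5  1. ]], up to floating-point error
print(X @ X_inv)    # identity matrix, up to floating-point error
```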
Solution:
$$\frac{dy}{dx} = 3x^2 + 1 \quad (4)$$
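The derivative can be checked with a central finite difference. The function y = x³ + x is an assumption consistent with the stated derivative (the original question is not reproduced here):

```python
# Numerical check that d/dx (x^3 + x) = 3x^2 + 1; the function y is an
# assumption consistent with the derivative stated above.
def y(x):
    return x**3 + x

def dydx(x):
    return 3 * x**2 + 1

h = 1e-6
for x in [-2.0, 0.0, 1.5]:
    numeric = (y(x + h) - y(x - h)) / (2 * h)   # central difference
    assert abs(numeric - dydx(x)) < 1e-4
```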
2. If $f(x_1, x_2) = x_1 \sin(x_2) e^{-x_1}$, what is the gradient $\nabla f(x)$ of $f$? Recall that $\nabla f(x) = \begin{pmatrix} \partial_{x_1} f \\ \partial_{x_2} f \end{pmatrix}$.
Solution:
$$\nabla f(x) = \begin{pmatrix} \partial_{x_1} f \\ \partial_{x_2} f \end{pmatrix} = \begin{pmatrix} \sin(x_2) e^{-x_1} - x_1 \sin(x_2) e^{-x_1} \\ x_1 \cos(x_2) e^{-x_1} \end{pmatrix} \quad (5)$$
$$= \begin{pmatrix} (1 - x_1) \sin(x_2) e^{-x_1} \\ x_1 \cos(x_2) e^{-x_1} \end{pmatrix} \quad (6)$$
Solution:
$$\text{Sample mean} = \frac{1 + 1 + 0 + 1 + 0}{5} = \frac{3}{5} \quad (7)$$
Solution:
$$\text{Sample variance} = \frac{1}{5} \left[ \left(\frac{2}{5}\right)^2 + \left(\frac{2}{5}\right)^2 + \left(-\frac{3}{5}\right)^2 + \left(\frac{2}{5}\right)^2 + \left(-\frac{3}{5}\right)^2 \right] \quad (8)$$
$$= \frac{1}{5} \left[ \frac{3 \times 4}{25} + \frac{2 \times 9}{25} \right] = \frac{6}{25} \quad (9)$$
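The mean and variance computations can be checked in a few lines of Python, using the biased variance (dividing by n) as in the solution:

```python
# Sample S = (1, 1, 0, 1, 0), matching the computations above.
S = [1, 1, 0, 1, 0]

mean = sum(S) / len(S)                              # 3/5
var = sum((x - mean) ** 2 for x in S) / len(S)      # 6/25, dividing by n as above

print(mean, var)   # 0.6 and (approximately) 0.24
```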
3. What is the probability of observing this data, assuming it was generated by flipping a coin with an equal probability of heads and tails (i.e. the probability distribution is $p(x = 1) = 0.5$, $p(x = 0) = 0.5$)?
Solution:
$$\text{Probability of } S = 0.5^5 = \frac{1}{32} \quad (10)$$
4. Note that the probability of this data sample would be greater if the value of $p(x = 1)$ were not 0.5 but instead some other value. What is the value that maximizes the probability of the sample $S$? Please justify your answer.
Solution: Let $p$ be the probability of 1 (i.e. $p(x = 1)$). We want to find the value of $p$ that maximizes the probability of the sample $S$. Note that the probability of the sample $S$ can be written
$$\prod_{i=1}^{5} p^{x_i} (1 - p)^{1 - x_i} = p^{\sum_{i=1}^{5} x_i} (1 - p)^{n - \sum_{i=1}^{5} x_i}. \quad (11)$$
We want to maximize the above as a function of $p$. We first take the log of the above and call this function $\ell(p)$, which we can write as
$$\ell(p) = \left( \sum_{i=1}^{5} x_i \right) \log(p) + \left( n - \sum_{i=1}^{5} x_i \right) \log(1 - p) \quad (12)$$
To find the $p$ that maximizes the probability of the sample, we can find the $p$ that maximizes $\ell(p)$. To do this, we take the derivative of $\ell(p)$ with respect to $p$, set it to zero, and solve for $p$:
$$\frac{d\ell(p)}{dp} = \frac{1}{p} \sum_{i=1}^{5} x_i - \frac{1}{1 - p} \left( n - \sum_{i=1}^{5} x_i \right) = 0 \quad (13)$$
$$\implies 0 = \frac{\sum_{i=1}^{5} x_i - pn}{p(1 - p)} \quad (14)$$
$$\implies pn = \sum_{i=1}^{5} x_i \quad (15)$$
$$\implies p = \frac{1}{n} \sum_{i=1}^{5} x_i \quad (16)$$
Plugging our values for $x_1, \ldots, x_5$ into the above formula, we find that the best $p = \frac{3}{5}$.
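As a numeric sanity check (not part of the original solution), we can confirm on a grid that the sample mean maximizes the log-likelihood ℓ(p):

```python
import math

# Sample from the problem; the MLE derivation above says p_hat = mean(S).
S = [1, 1, 0, 1, 0]
n = len(S)

def log_likelihood(p):
    k = sum(S)
    return k * math.log(p) + (n - k) * math.log(1 - p)

p_hat = sum(S) / n   # closed-form maximizer: the sample mean, 0.6

# Grid check: no p in (0, 1) beats p_hat.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=log_likelihood)
assert abs(best - p_hat) < 1e-3
```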
5. Consider the following joint probability table over variables $y$ and $z$, where $y$ takes a value from the set $\{a, b, c\}$ and $z$ takes a value from the set $\{T, F\}$:

           y = a   y = b   y = c
   z = T    0.2     0.1     0.2
   z = F    0.05    0.15    0.3
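The sub-questions for this table are not reproduced here, but a short sketch (with variable names of my choosing) shows how marginals and conditionals would be read off the joint table:

```python
# Joint table p(y, z) from the problem, stored as a dict keyed by (y, z).
joint = {
    ('a', 'T'): 0.2,  ('b', 'T'): 0.1,  ('c', 'T'): 0.2,
    ('a', 'F'): 0.05, ('b', 'F'): 0.15, ('c', 'F'): 0.3,
}

# Marginals: sum out the other variable.
p_y = {y: sum(v for (yy, _), v in joint.items() if yy == y) for y in 'abc'}
p_z = {z: sum(v for (_, zz), v in joint.items() if zz == z) for z in 'TF'}

# Conditional: p(y | z = T) = p(y, T) / p(z = T).
p_y_given_T = {y: joint[(y, 'T')] / p_z['T'] for y in 'abc'}

print({k: round(v, 2) for k, v in p_y.items()})   # {'a': 0.25, 'b': 0.25, 'c': 0.5}
```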
1. $f(n) = \ln(n)$, $g(n) = \lg(n)$. Note that $\ln$ denotes log to the base $e$ and $\lg$ denotes log to the base 2.
Solution: Both $f(n) = O(g(n))$ and $g(n) = O(f(n))$, since the functions are equivalent up to a multiplicative constant ($\ln n = \ln 2 \cdot \lg n$).
3. $f(n) = 3^n$, $g(n) = 2^n$
Solution: $g(n) = O(f(n))$ but not vice versa, since $f(n)/g(n) = (3/2)^n$ grows without bound as $n$ becomes large.
Algorithms [5 Points]
Divide and Conquer: Assume that you are given an array with n elements, each equal to 0 or +1, such that all 0 entries appear before all +1 entries. You need to find the index where the transition happens, i.e. you need to report the index of the last occurrence of 0.
Give an algorithm that runs in time O(log n). Explain your algorithm in words, describe why
the algorithm is correct, and justify its running time.
Solution:
We give an algorithm for this problem here:

find-transition(i, j):
    mid = (i + j) / 2
    a = array[mid]; b = array[mid + 1]
    if a == 1:                     /* transition is to the left of mid */
        return find-transition(i, mid)
    else if b == 0:                /* transition is to the right of mid */
        return find-transition(mid + 1, j)
    else                           /* a == 0 and b == 1 */
        return mid

This algorithm is correct. Note that for each recursive call we maintain the invariant that array[i] == 0 and array[j] == 1. When we stop, we know that a == 0 and b == 1, so array[mid] is the last entry equal to 0.
The running time can be analyzed via the recurrence $T(n) = T(n/2) + O(1)$, with $T(n) = c$ for $n \le 4$, which solves to $O(\log n)$. The algorithm is essentially a binary search, so this running time is as expected.
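The recursive pseudocode above can also be written iteratively; here is one possible Python rendering (a sketch, not the graded solution):

```python
def find_transition(array):
    """Return the index of the last 0 in a sorted 0/1 array.

    Assumes array[0] == 0 and array[-1] == 1, matching the invariant
    maintained by the recursive solution above.
    """
    i, j = 0, len(array) - 1
    while j - i > 1:
        mid = (i + j) // 2
        if array[mid] == 0:
            i = mid          # last 0 is at mid or to its right
        else:
            j = mid          # last 0 is strictly left of mid
    return i                 # array[i] == 0 and array[i + 1] == 1

print(find_transition([0, 0, 0, 1, 1]))   # 2
```

Each iteration halves the interval [i, j], so the loop runs O(log n) times, matching the recurrence above.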
(c) Uniform
(d) Binomial
(g) $\binom{n}{x} p^x (1 - p)^{n - x}$
(h) $\frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$
Solutions: (a) with (h), (b) with (e), (c) with (f), (d) with (g)
(b) What is the mean, variance, and entropy of a Bernoulli(p) random variable?
Solutions: The mean is p, the variance is p(1−p), and the entropy is −(1−p) log(1−p)−p log(p).
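These quantities are easy to verify numerically; entropy here is in nats, matching the natural log, which is an assumption since the solution does not fix a log base:

```python
import math

def bernoulli_stats(p):
    """Mean, variance, and entropy (in nats) of a Bernoulli(p) variable."""
    mean = p
    var = p * (1 - p)
    entropy = -(1 - p) * math.log(1 - p) - p * math.log(p)
    return mean, var, entropy

m, v, h = bernoulli_stats(0.5)
print(m, v, h)   # 0.5, 0.25, and ln 2 (the maximum possible entropy)
```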
Solutions: Assuming a fair die, the number of times 3 shows up should be close to 1000, by the Law of Large Numbers.
(b) If a fair coin is tossed $n$ times and $\bar{X}$ denotes the average number of heads, then the distribution of $\bar{X}$ satisfies
$$\sqrt{n}\,(\bar{X} - 1/2) \xrightarrow{\;n \to \infty\;} N(0, 1/4)$$
Solutions: The expression on the left-hand side tends to the expression on the right-hand side as $n \to \infty$ by the Central Limit Theorem.
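The convergence can be illustrated by a small simulation; the number of flips per trial and the number of trials below are arbitrary choices of mine:

```python
import random
import statistics

# For each trial, flip a fair coin n times and record sqrt(n) * (X_bar - 1/2).
random.seed(0)
n, trials = 1000, 500

vals = []
for _ in range(trials):
    heads = sum(random.getrandbits(1) for _ in range(n))
    vals.append(n ** 0.5 * (heads / n - 0.5))

# By the CLT these values approximate N(0, 1/4): mean ~ 0, stdev ~ 1/2.
print(statistics.mean(vals), statistics.stdev(vals))
```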
Geometry
(a) Show that the vector $w$ is orthogonal to the line $w^\top x + b = 0$. (Hint: Consider two points $x_1, x_2$ that lie on the line. What is the inner product $w^\top (x_1 - x_2)$?)
Solution: The line is the set of all $x$ such that $w^\top x + b = 0$. Consider two such points, $x_1$ and $x_2$. Note that $x_1 - x_2$ is a vector parallel to the line. Also note that $w^\top (x_1 - x_2) = (w^\top x_1 + b) - (w^\top x_2 + b) = 0 - 0 = 0$, so $w$ is orthogonal to every vector parallel to the line, i.e. $w$ is orthogonal to the line.
(b) Argue that the distance from the origin to the line $w^\top x + b = 0$ is $\frac{|b|}{\|w\|}$.

Solution: Let $x$ be any point on the hyperplane $w^\top x + b = 0$. The Euclidean distance from the origin to the hyperplane can be computed by projecting $x$ onto the normal vector of the hyperplane, which is given by $w$. Hence the distance is
$$\frac{|w^\top x|}{\|w\|_2} = \frac{|-b|}{\|w\|_2} = \frac{|b|}{\|w\|_2},$$
which completes the proof.
Here is another method for solving (b) using Lagrange multipliers. We first find the closest point to the origin that lies on the line, and then find the distance to this point. Let $a^*$ be the closest point to the origin that lies on the line. We can write $a^*$ as
$$a^* = \arg\min_{a \,:\, w^\top a + b = 0} a^\top a.$$
So we first solve this constrained optimization problem and find $a^*$. We start by taking the derivative of the Lagrangian $a^\top a - \lambda (w^\top a + b)$ with respect to $a$ and setting it to zero. We can write
$$2a^* = \lambda w \implies a^* = \frac{\lambda}{2} w \quad (27)$$
Hence, plugging this value for $a$ into the constraint, we can write:
$$w^\top a + b = 0 \quad (28)$$
$$\implies w^\top \frac{\lambda}{2} w + b = 0 \quad (29)$$
$$\implies \frac{\lambda}{2} w^\top w + b = 0 \quad (30)$$
$$\implies \lambda = \frac{-2b}{w^\top w} \quad (31)$$
$$\implies a^* = \frac{-b}{w^\top w} w \quad (32)$$
Once we have $a^*$, we can compute the distance between $a^*$ and the origin to get
$$\text{distance} = \sqrt{(a^*)^\top a^*} = \sqrt{\left( \frac{-b}{w^\top w} \right)^2 w^\top w} \quad (33)$$
$$= \frac{|b|}{w^\top w} \sqrt{w^\top w} = \frac{|b|}{\sqrt{w^\top w}} = \frac{|b|}{\|w\|} \quad (34)$$
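Both derivations can be checked numerically; the particular w and b below are an arbitrary example, not values from the problem:

```python
import numpy as np

# Arbitrary example line w^T x + b = 0, chosen purely for illustration.
w = np.array([3.0, 4.0])
b = -10.0

closed_form = abs(b) / np.linalg.norm(w)   # |b| / ||w|| = 10 / 5 = 2

# The Lagrange-multiplier solution: a* = (-b / (w^T w)) * w.
a_star = (-b / (w @ w)) * w
assert np.isclose(w @ a_star + b, 0.0)     # a* lies on the line
assert np.isclose(np.linalg.norm(a_star), closed_form)
```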
(c) How does the scatter plot change if you double the variance of each component (x1 & x2 )?
(d) How does the scatter plot change if the covariance matrix is changed to the following?
$$\begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}$$
(e) How does the scatter plot change if the covariance matrix is changed to the following?
$$\begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix}$$
Solutions:
(a) See attached code.
(b) See attached code. The data moves up and to the left. Namely, the center of the data moves
from roughly [0, 0] to roughly [−1, 1].
(c) See attached code. The data become more “spread out”.
(d) See attached code. The data become skewed so that they stretch from the lower left to the
upper right.
(e) See attached code. The data become skewed so that they stretch from the upper left to the lower right.
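Since the attached code is not included here, the following NumPy sketch shows one way the scatter-plot data might be generated; the empirical correlations make the answers to (d) and (e) quantitative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Baseline: standard 2-D Gaussian (zero mean, identity covariance).
base = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=n)

# (d): positive off-diagonal covariance tilts the cloud lower-left to upper-right.
pos = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.5], [0.5, 1]], size=n)

# (e): negative off-diagonal covariance tilts it upper-left to lower-right.
neg = rng.multivariate_normal(mean=[0, 0], cov=[[1, -0.5], [-0.5, 1]], size=n)

# Empirical correlations reflect the sign of the off-diagonal entry.
print(np.corrcoef(pos.T)[0, 1])   # roughly +0.5
print(np.corrcoef(neg.T)[0, 1])   # roughly -0.5
```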