

Homework 0: Mathematical Background for Machine Learning
10-315 Introduction to Machine Learning
Due 11:59 p.m. Friday, January 25, 2019
The goal of this homework is to help you refresh the mathematical background needed to take
this class. Although most students find the machine learning class to be very rewarding, it does
assume that you have a basic familiarity with several types of math: calculus, matrix and vector
algebra, and basic probability. You don’t need to be an expert in all these areas, but you will need
to be conversant in each, and to understand:
• Basic probability and statistics (at the level of a first undergraduate course). For example, we
assume you know how to find the mean and variance of a set of data, and that you understand
basic notions such as conditional probabilities and Bayes rule. During the class, you might be
asked to calculate the probability of a data set with respect to a given probability distribution.
• Basic tools concerning analysis and design of algorithms, including the big-O notation for the
asymptotic analysis of algorithms.
• Basic calculus (at the level of a first undergraduate course). For example, we rely on you being
able to take derivatives. During the class we will sometimes calculate derivatives (gradients)
of functions with several variables.
• Linear algebra (at the level of a first undergraduate course). For example, we assume you
know how to multiply vectors and matrices, and that you understand matrix inversion.
For each of these mathematical topics, this homework provides (1) a minimum background test,
and (2) a medium background test. If you pass the medium background tests, you are in good
shape to take the class. If you pass the minimum background, but not the medium background
test, then you can still successfully take and pass the class but you should expect to devote some
extra time to fill in necessary math background as the course introduces it.
Please see the Piazza post for useful resources for brushing up on, and filling in, this background.

Instructions
• Submit your homework to Autolab by 11:59 p.m. Friday, January 25, 2019.
• Late homework policy: Homework is worth full credit if submitted before the due date, half
credit during the next 48 hours, and zero credit after that. Additionally, you are permitted
to drop 1 homework.
• Collaboration policy: For this homework only, you are welcome to collaborate on any of the
questions with anybody you like. However, you must write up your own final solution, and you
must list the names of anybody you collaborated with on this assignment. The point of this
homework is not really for us to evaluate you, but instead for you to refresh the background
needed for this class, and to fill in any gaps you may have.


Minimum Background Test [80 Points]

Vectors and Matrices [20 Points]


Consider the matrix X and the vectors y and z below:

X = \begin{pmatrix} 2 & 4 \\ 1 & 3 \end{pmatrix},  y = \begin{pmatrix} 1 \\ 3 \end{pmatrix},  z = \begin{pmatrix} 2 \\ 3 \end{pmatrix}
1. What is the inner product of the vectors y and z? (This is also sometimes called the dot
product, and is sometimes written y^T z.)

Solution:

y^T z = (1 × 2) + (3 × 3) = 11   (1)

2. What is the product Xy?

Solution:

Xy = \begin{pmatrix} (2 × 1) + (4 × 3) \\ (1 × 1) + (3 × 3) \end{pmatrix} = \begin{pmatrix} 14 \\ 10 \end{pmatrix}   (2)

3. Is X invertible? If so, give the inverse; if not, explain why not.

Solution: Yes.

X^{-1} = \frac{1}{(2 × 3) − (1 × 4)} \begin{pmatrix} 3 & −4 \\ −1 & 2 \end{pmatrix} = \begin{pmatrix} 3/2 & −2 \\ −1/2 & 1 \end{pmatrix}   (3)

4. What is the rank of X? Explain your answer.

Solution: The rank of X is 2, since the column rank is 2:
\begin{pmatrix} 4 \\ 3 \end{pmatrix} ≠ c \begin{pmatrix} 2 \\ 1 \end{pmatrix} for all c ∈ ℝ.
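These answers are easy to double-check numerically with numpy (a quick sanity check, not part of the original solution):

    import numpy as np

    X = np.array([[2, 4], [1, 3]])
    y = np.array([1, 3])
    z = np.array([2, 3])

    print(y @ z)                      # 11, the inner product y^T z
    print(X @ y)                      # [14 10]
    print(np.linalg.inv(X))           # [[ 1.5 -2. ], [-0.5  1. ]]
    print(np.linalg.matrix_rank(X))   # 2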

Calculus [20 Points]


1. If y = x^3 + x − 5, then what is the derivative of y with respect to x?

Solution:

dy/dx = 3x^2 + 1   (4)


 
2. If f(x_1, x_2) = x_1 sin(x_2) e^{−x_1}, what is the gradient ∇f(x) of f? Recall that
∇f(x) = \begin{pmatrix} ∂f/∂x_1 \\ ∂f/∂x_2 \end{pmatrix}.

Solution:

∇f(x) = \begin{pmatrix} ∂f/∂x_1 \\ ∂f/∂x_2 \end{pmatrix} = \begin{pmatrix} sin(x_2) e^{−x_1} − x_1 sin(x_2) e^{−x_1} \\ x_1 cos(x_2) e^{−x_1} \end{pmatrix}   (5)
       = \begin{pmatrix} (1 − x_1) sin(x_2) e^{−x_1} \\ x_1 cos(x_2) e^{−x_1} \end{pmatrix}   (6)
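As a quick check (not part of the original solution), the gradient can be verified symbolically with the sympy library, which is not otherwise required by this assignment:

    import sympy as sp

    x1, x2 = sp.symbols('x1 x2')
    f = x1 * sp.sin(x2) * sp.exp(-x1)

    # Partial derivatives with respect to x1 and x2; these should match
    # (1 - x1) sin(x2) e^{-x1} and x1 cos(x2) e^{-x1}, up to how sympy groups terms.
    print(sp.simplify(sp.diff(f, x1)))
    print(sp.simplify(sp.diff(f, x2)))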

Probability and Statistics [20 Points]


Consider a sample of data S = {1, 1, 0, 1, 0} created by flipping a coin x five times, where 0 denotes
that the coin turned up heads and 1 denotes that it turned up tails.

1. What is the sample mean for this data?

Solution:

Sample mean = (1 + 1 + 0 + 1 + 0) / 5 = 3/5   (7)

2. What is the sample variance for this data?

Solution:

Sample variance = (1/5) [ (2/5)² + (2/5)² + (−3/5)² + (2/5)² + (−3/5)² ]   (8)
                = (1/5) [ (3 × 4)/25 + (2 × 9)/25 ] = 6/25   (9)

3. What is the probability of observing this data, assuming it was generated by flipping a coin
with an equal probability of heads and tails (i.e. the probability distribution is p(x = 1) = 0.5,
p(x = 0) = 0.5).

Solution:

Probability of S = 0.5^5 = 1/32   (10)
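These three answers can be verified with a few lines of numpy (a sketch, not part of the original solution; note that np.var uses the 1/n convention by default, matching the calculation above):

    import numpy as np

    S = np.array([1, 1, 0, 1, 0])
    print(S.mean())        # 0.6  = 3/5
    print(S.var())         # 0.24 = 6/25
    print(0.5 ** len(S))   # 0.03125 = 1/32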


4. Note that the probability of this data sample would be greater if the value of p(x = 1) was
not 0.5, but instead some other value. What is the value that maximizes the probability of
the sample S? Please justify your answer.

Solution: Let p be the probability of 1 (i.e. p(x = 1)). We want to find the value of p that
maximizes the probability of the sample S. Note that the probability of the sample S can be
written

∏_{i=1}^{5} p^{x_i} (1 − p)^{1 − x_i} = p^{Σ_{i=1}^{5} x_i} (1 − p)^{n − Σ_{i=1}^{5} x_i}   (11)

We want to maximize the above as a function of p. We first take the log of the above and call
this function ℓ(p), which we can write as

ℓ(p) = (Σ_{i=1}^{5} x_i) log(p) + (n − Σ_{i=1}^{5} x_i) log(1 − p)   (12)

Since the log is monotonically increasing, the p that maximizes the probability of S is the same
p that maximizes ℓ(p). To find it, we take the derivative of ℓ(p) with respect to p, set it to
zero, and solve for p:

dℓ(p)/dp = (1/p) Σ_{i=1}^{5} x_i − (1/(1 − p)) (n − Σ_{i=1}^{5} x_i) = 0   (13)
⟹  0 = (Σ_{i=1}^{5} x_i − pn) / (p(1 − p))   (14)
⟹  pn = Σ_{i=1}^{5} x_i   (15)
⟹  p = (1/n) Σ_{i=1}^{5} x_i   (16)

Plugging our values for x_1, ..., x_5 into the above formula (with n = 5), we find that the best p = 3/5.
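As a numerical check of the closed-form answer (not part of the original solution), a simple grid search over p confirms the maximizer:

    import numpy as np

    S = np.array([1, 1, 0, 1, 0])
    p_grid = np.linspace(0.01, 0.99, 99)
    likelihood = p_grid ** S.sum() * (1 - p_grid) ** (len(S) - S.sum())
    print(p_grid[np.argmax(likelihood)])   # 0.6 = 3/5, matching the MLE derived above
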
5. Consider the following joint probability table over variables y and z, where y takes a value
from the set {a,b,c}, and z takes a value from the set {T,F}:

                    y
                a      b      c
    z    T      0.2    0.1    0.2
         F      0.05   0.15   0.3

• What is p(z = T AND y = b)?


Solution: The answer is 0.1.

• What is p(z = T|y = b)?


Solution: Using the definition of conditional probability, we see that

p(z = T | y = b) = p(z = T AND y = b) / p(y = b) = 0.1 / (0.1 + 0.15) = 0.4   (17)
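Both answers can also be read off the table programmatically (a small illustration, not part of the original solution):

    joint = {('T', 'a'): 0.2, ('T', 'b'): 0.1, ('T', 'c'): 0.2,
             ('F', 'a'): 0.05, ('F', 'b'): 0.15, ('F', 'c'): 0.3}

    p_T_and_b = joint[('T', 'b')]                   # 0.1
    p_b = joint[('T', 'b')] + joint[('F', 'b')]     # 0.25, the marginal p(y = b)
    print(p_T_and_b, p_T_and_b / p_b)               # 0.1 0.4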


Big-O Notation [20 Points]


For each pair (f, g) of functions below, state whether either, neither, or both of the following are
true: f (n) = O(g(n)), g(n) = O(f (n)). Briefly justify your answers.

1. f (n) = ln(n), g(n) = lg(n). Note that ln denotes log to the base e and lg denotes log to the
base 2.
Solution: Both, since the functions are equivalent up to a multiplicative constant.

2. f(n) = 3^n, g(n) = n^100


Solution: g(n) = O(f (n)), since f (n) grows much more rapidly as n becomes large.

3. f(n) = 3^n, g(n) = 2^n
Solution: g(n) = O(f (n)), since f (n) grows much more rapidly as n becomes large.

4. f(n) = 1000n^2 + 2000n + 4000, g(n) = 3n^3 + 1


Solution: f (n) = O(g(n)), since g(n) grows much more rapidly as n becomes large.
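A quick numeric spot check of these claims (a sketch, not part of the graded solutions; the choice n = 1000 is arbitrary but large enough for the exponential terms to dominate):

    from math import log, log2

    n = 1000
    print(log(n) / log2(n))      # ln(2) ≈ 0.693 for every n, so ln(n) and lg(n) differ by a constant factor
    print(3**n > n**100)         # True: 3^n eventually dominates n^100
    print(3**n > 2**n)           # True: the ratio (3/2)^n grows without bound
    print(1000*n**2 + 2000*n + 4000 < 3*n**3 + 1)   # True for large n: the cubic dominates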


Medium Background Test [20 Points]

Algorithms [5 Points]
Divide and Conquer: Assume that you are given an array with n elements, all entries equal
either to 0 or +1, such that all 0 entries appear before +1 entries. You need to find the index
where the transition happens, i.e. you need to report the index with the last occurrence of 0.
Give an algorithm that runs in time O(log n). Explain your algorithm in words, describe why
the algorithm is correct, and justify its running time.
Solution:
We give an algorithm for this problem here, written as a Python function (the initial call is
find_transition(array, 0, n - 1), assuming the array contains at least one 0 and one 1):

    def find_transition(array, i, j):
        # Invariant: array[i] == 0 and array[j] == 1.
        mid = i + (j - i) // 2
        a, b = array[mid], array[mid + 1]
        if a == b == 1:
            return find_transition(array, i, mid)
        elif a == b == 0:
            return find_transition(array, mid + 1, j)
        else:  # a == 0 and b == 1
            return mid  # the index of the last occurrence of 0

This algorithm is correct. Note that for each recursive call we maintain the invariant that
array[i] == 0 and array[j] == 1. When we stop, we know that a == 0 and b == 1, so array[mid] is
the last entry equal to 0.
The running time can be analyzed via the recurrence T(n) = T(n/2) + O(1), with T(n) = c for
n ≤ 4, which solves to O(log n). This algorithm is based on the idea of binary search and hence
the running time is as expected.
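For example, on a small array (a quick check, not part of the original solution):

    array = [0, 0, 0, 1, 1, 1, 1]
    print(find_transition(array, 0, len(array) - 1))   # prints 2, the index of the last 0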

Probability and Random Variables [5 Points]


Probability
State true or false. Here A^c denotes the complement of the event A.

(a) P(A ∪ B) = P(A ∩ (B ∩ A^c))
(b) P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
(c) P(A) = P(A ∩ B) + P(A^c ∩ B)
(d) P(A|B) = P(B|A)
(e) P(A_1 ∩ A_2 ∩ A_3) = P(A_3 | A_2 ∩ A_1) P(A_2 | A_1) P(A_1)

Solutions: (a) False, (b) True, (c) False, (d) False, (e) True

Discrete and Continuous Distributions


Match the distribution name to the formula of its probability density function.

(a) Multivariate Gaussian
(b) Bernoulli
(c) Uniform
(d) Binomial

(e) p^x (1 − p)^{1−x}
(f) 1/(b − a) when a ≤ x ≤ b; 0 otherwise
(g) \binom{n}{x} p^x (1 − p)^{n−x}
(h) \frac{1}{\sqrt{(2π)^d |Σ|}} exp(−½ (x − µ)^T Σ^{−1} (x − µ))

Solutions: (a) with (h), (b) with (e), (c) with (f), (d) with (g)
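For reference (not part of the original solution), each of these densities/mass functions is also available in scipy.stats; the parameter values below are arbitrary examples:

    from scipy import stats
    import numpy as np

    p, n, a, b = 0.3, 10, 0.0, 2.0
    print(stats.bernoulli.pmf(1, p))                    # p^x (1 - p)^(1 - x) at x = 1
    print(stats.binom.pmf(4, n, p))                     # (n choose x) p^x (1 - p)^(n - x) at x = 4
    print(stats.uniform.pdf(1.0, loc=a, scale=b - a))   # 1 / (b - a) on [a, b]
    print(stats.multivariate_normal.pdf([0, 0], mean=[0, 0], cov=np.eye(2)))   # multivariate Gaussian density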

Mean, Variance and Entropy


(a) Let X be a random variable with finite expectation E[X] < ∞. Recall that the variance of a
random variable is defined as Var(X) = E[(X − E[X])²]. Prove that Var(X) = E[X²] − (E[X])².

Solutions:

Var(X) = E[(X − E[X])²]   (18)
       = E[X² − 2X E[X] + (E[X])²]   (19)
       = E[X²] − E[2X E[X]] + E[(E[X])²]   (20)
       = E[X²] − 2 E[X] E[X] + (E[X])²   (21)
       = E[X²] − 2 (E[X])² + (E[X])²   (22)
       = E[X²] − (E[X])²   (23)

(In lines (20)–(21) we used linearity of expectation and the fact that E[X] is a constant, so
E[X E[X]] = E[X] E[X] and E[(E[X])²] = (E[X])².)

(b) What is the mean, variance, and entropy of a Bernoulli(p) random variable?

Solutions: The mean is p, the variance is p(1−p), and the entropy is −(1−p) log(1−p)−p log(p).
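As a quick illustration (not part of the original solution), these formulas are easy to evaluate for a particular p, say p = 0.3; the entropy here is in nats, so use log base 2 instead for bits:

    import numpy as np

    p = 0.3
    print(p)                                            # mean: 0.3
    print(p * (1 - p))                                  # variance: 0.21
    print(-(1 - p) * np.log(1 - p) - p * np.log(p))     # entropy ≈ 0.611 nats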

Law of Large Numbers and Central Limit Theorem


Provide one line justifications.
(a) If a die is rolled 6000 times, the number of times 3 shows up is close to 1000.

Solutions: Assuming a fair die, each roll shows a 3 with probability 1/6, so by the Law of Large
Numbers the number of times 3 shows up should be close to 6000 × (1/6) = 1000.

(b) If a fair coin is tossed n times and X̄ denotes the average number of heads, then the
distribution of X̄ satisfies

√n (X̄ − 1/2)  →  N(0, 1/4)   as n → ∞


Solutions: The expression on the left-hand-side should tend to the expression on the right-hand-
side as n → ∞ due to the Central Limit Theorem.
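A small Monte Carlo illustration of this statement (a sketch, not part of the original solution):

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 1000, 10000
    Xbar = rng.integers(0, 2, size=(trials, n)).mean(axis=1)   # average of n fair-coin tosses, per trial
    scaled = np.sqrt(n) * (Xbar - 0.5)
    print(scaled.mean(), scaled.var())   # approximately 0 and 0.25, consistent with N(0, 1/4)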

Some useful background reading material:


http://www.cs.cmu.edu/~aarti/Class/10701/recitation/prob_review.pdf

Linear Algebra [5 Points]


Vector norms

Draw the regions corresponding to vectors x ∈ ℝ² with the following norms:
(a) ‖x‖_2 ≤ 1 (Recall ‖x‖_2 = √(Σ_i x_i²))
(b) ‖x‖_1 ≤ 1 (Recall ‖x‖_1 = Σ_i |x_i|)
(c) ‖x‖_∞ ≤ 1 (Recall ‖x‖_∞ = max_i |x_i|)
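One way to draw the three regions with matplotlib (a sketch, not the official solution): the ℓ2 ball is the unit disk, the ℓ1 ball is a diamond with vertices on the axes, and the ℓ∞ ball is the unit square.

    import numpy as np
    import matplotlib.pyplot as plt

    theta = np.linspace(0, 2 * np.pi, 400)
    circle = np.stack([np.cos(theta), np.sin(theta)])            # boundary of the l2 ball
    diamond = np.array([[1, 0, -1, 0, 1], [0, 1, 0, -1, 0]])     # boundary of the l1 ball
    square = np.array([[1, -1, -1, 1, 1], [1, 1, -1, -1, 1]])    # boundary of the l-infinity ball

    for pts, label in [(circle, "||x||_2 <= 1"), (diamond, "||x||_1 <= 1"), (square, "||x||_inf <= 1")]:
        plt.plot(pts[0], pts[1], label=label)
    plt.gca().set_aspect("equal")
    plt.legend()
    plt.show()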

Geometry
(a) Show that the vector w is orthogonal to the line w^T x + b = 0. (Hint: Consider two points
x_1, x_2 that lie on the line. What is the inner product w^T (x_1 − x_2)?)

Solution: The line is all x such that w^T x + b = 0. Consider two such x, called x_1 and x_2.
Note that x_1 − x_2 is a vector parallel to our line. Also note that

w^T x_1 + b = 0 = w^T x_2 + b  ⟹  w^T x_1 = w^T x_2  ⟹  w^T (x_1 − x_2) = 0   (24)

which shows that the vector w is orthogonal to our line.

(b) Argue that the distance from the origin to the line w^T x + b = 0 is |b| / ‖w‖.

Solution: Let x be any point on the hyperplane w^T x + b = 0. The Euclidean distance from the
origin to the hyperplane can be computed by projecting x onto the normal vector of the hyperplane,
which is given by w. Hence the distance is given by

|w^T x| / ‖w‖_2 = |−b| / ‖w‖_2 = |b| / ‖w‖_2,

which completes the proof.
Here is another method for solving (b) using Lagrange multipliers. We can show this by first
finding the closest point to the origin that lies on this line, and then finding the distance to
this point. Let a* be the closest point to the origin that lies on the line. We can write a* as

a* = argmin_a  a^T a   (25)
     s.t.  w^T a + b = 0   (26)

So we first solve this constrained optimization problem and find a*. We do this with Lagrange
multipliers, i.e. by setting the derivative (with respect to a) of the Lagrangian
a^T a − λ(w^T a + b) to zero:

2a* = λw  ⟹  a* = (λ/2) w   (27)

Hence, plugging this value for a into the constraint, we can write:

w^T a + b = 0   (28)
⟹  w^T ((λ/2) w) + b = 0   (29)
⟹  (λ/2) w^T w + b = 0   (30)
⟹  λ = −2b / (w^T w)   (31)
⟹  a* = (−b / (w^T w)) w   (32)

Once we have a*, we can compute the distance between a* and the origin to get

distance = √((a*)^T a*) = √((−b / (w^T w))² · w^T w)   (33)
         = (|b| / (w^T w)) √(w^T w) = |b| / √(w^T w) = |b| / ‖w‖   (34)
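A quick numeric check of the result (not part of the original solution; the values of w and b below are arbitrary examples):

    import numpy as np

    w = np.array([3.0, 4.0])
    b = 2.0

    a_star = (-b / (w @ w)) * w            # closest point on the line w^T x + b = 0 to the origin
    print(w @ a_star + b)                  # ~0, so a* lies on the line
    print(np.linalg.norm(a_star))          # 0.4
    print(abs(b) / np.linalg.norm(w))      # 0.4 = |b| / ||w||, matching the derivation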

Some useful background reading material:


http://www.cs.cmu.edu/~aarti/Class/10701/recitation/LinearAlgebra_Matlab_Review.ppt
http://www.cs.cmu.edu/~zkolter/course/15-884/linalg-review.pdf
Wikipedia: http://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors
Gilbert Strang. Linear Algebra and its Applications, Ch 5. HBJ Publishers.

Programming skills [5 Points]


Sampling from a distribution. Use the Python libraries numpy and matplotlib.
(a) Draw 100 samples x = [x_1 x_2] from a 2-dimensional Gaussian distribution with mean [0, 0]
and identity covariance matrix, i.e. p(x) = (1/√((2π)^d)) exp(−‖x‖²/2). Plot them on a scatter
plot (x_1 vs. x_2).
(b) How does the scatter plot change if the mean is [−1, 1]? (for the questions below, change
the mean back to [0, 0])


(c) How does the scatter plot change if you double the variance of each component (x_1 and x_2)?
(d) How does the scatter plot change if the covariance matrix is changed to the following?

\begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}

(e) How does the scatter plot change if the covariance matrix is changed to the following?

\begin{pmatrix} 1 & −0.5 \\ −0.5 & 1 \end{pmatrix}

Solutions:
(a) See attached code.
(b) See attached code. The data moves up and to the left. Namely, the center of the data moves
from roughly [0, 0] to roughly [−1, 1].
(c) See attached code. The data become more "spread out".
(d) See attached code. The data become skewed so that they stretch from the lower left to the
upper right.
(e) See attached code. The data become skewed so that they stretch from the upper left to the
lower right.
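The attached code is not reproduced here, but a minimal sketch of one way to generate these plots with numpy and matplotlib is:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)

    # (a) 100 samples from a 2-D Gaussian with mean [0, 0] and identity covariance
    samples = rng.multivariate_normal(mean=[0, 0], cov=np.eye(2), size=100)
    plt.scatter(samples[:, 0], samples[:, 1])
    plt.xlabel("x1")
    plt.ylabel("x2")
    plt.show()

    # (b)-(e): rerun the sampling with a shifted mean or a modified covariance matrix
    shifted  = rng.multivariate_normal(mean=[-1, 1], cov=np.eye(2), size=100)              # (b)
    doubled  = rng.multivariate_normal(mean=[0, 0], cov=2 * np.eye(2), size=100)           # (c)
    pos_corr = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.5], [0.5, 1]], size=100)    # (d)
    neg_corr = rng.multivariate_normal(mean=[0, 0], cov=[[1, -0.5], [-0.5, 1]], size=100)  # (e)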

Some useful background reading material:


Matlab tutorial - http://www.math.mtu.edu/~msgocken/intro/intro.pdf
R tutorial - http://math.illinoisstate.edu/dhkim/rstuff/rtutor.html
