Midterm Sp16 Solutions
Name:
Andrew ID:
• You are allowed to use one page of notes, front and back.
• Note that the questions vary in difficulty. Make sure to look over the entire exam
before you start and answer the easier questions first.
Question        Points
1               20
2               20
3               20
4               20
5               20
Extra Credit    14
Total           114
Solution:

ℓ(θ; X_1, …, X_n) = log L(θ; X_1, …, X_n)
                  = log[ θ^(Σ_{i=1}^n x_i) (1 − θ)^(n − Σ_{i=1}^n x_i) ]
                  = (Σ_{i=1}^n x_i) log(θ) + (n − Σ_{i=1}^n x_i) log(1 − θ)
(c) Extra Credit: [2 pts.] Derive the following formula for the MLE: θ̂ = (1/n) Σ_{i=1}^n X_i.
Solution: To find the MLE we solve dℓ(θ; X_1, …, X_n)/dθ = 0. The derivative is given by

d/dθ ℓ(θ; X_1, …, X_n) = d/dθ [ (Σ_{i=1}^n x_i) log(θ) + (n − Σ_{i=1}^n x_i) log(1 − θ) ]
                       = (Σ_{i=1}^n x_i)/θ − (n − Σ_{i=1}^n x_i)/(1 − θ)

Next, we solve dℓ(θ; X_1, …, X_n)/dθ = 0:

(Σ_{i=1}^n x_i)/θ − (n − Σ_{i=1}^n x_i)/(1 − θ) = 0
⇐⇒ (Σ_{i=1}^n x_i)(1 − θ) − (n − Σ_{i=1}^n x_i) θ = 0
⇐⇒ Σ_{i=1}^n x_i − nθ = 0
⇐⇒ θ̂ = (1/n) Σ_{i=1}^n X_i.
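A quick numerical sketch of this result (assuming NumPy and simulated Bernoulli data, used only for illustration): the θ that maximizes ℓ over a fine grid coincides with the closed-form MLE, the sample mean.

import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)   # simulated Bernoulli(0.3) samples (assumed data)
s, n = x.sum(), len(x)

def log_likelihood(theta):
    # l(theta) = (sum_i x_i) log(theta) + (n - sum_i x_i) log(1 - theta)
    return s * np.log(theta) + (n - s) * np.log(1 - theta)

grid = np.linspace(0.001, 0.999, 999)
theta_grid = grid[np.argmax(log_likelihood(grid))]
theta_mle = x.mean()                  # closed form: (1/n) sum_i X_i
print(theta_grid, theta_mle)          # both close to the true value 0.3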
(a) [2 pts.] T or F: In the limit, as n (the number of samples) increases, the MAP and
MLE estimates become the same.
Solution: True. As the number of examples increases, the data log-likelihood term grows in magnitude roughly linearly with n, while the log prior stays fixed. In the limit, the prior plays an insignificant role in the MAP estimate and the two estimates converge.
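For example, under an assumed Beta(2, 2) prior on a Bernoulli parameter θ, the MAP estimate (s + α − 1)/(n + α + β − 2) and the MLE s/n converge as n grows; a minimal sketch (assuming NumPy and simulated data):

import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 2.0            # assumed Beta(2, 2) prior
theta_true = 0.7                  # assumed true parameter for the simulation

for n in [10, 100, 1000, 10000]:
    s = rng.binomial(1, theta_true, size=n).sum()
    mle = s / n
    map_est = (s + alpha - 1) / (n + alpha + beta - 2)   # Beta-Bernoulli MAP estimate
    print(n, round(mle, 4), round(map_est, 4))           # the gap between the two shrinks with n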
(b) [2 pts.] T or F: Naive Bayes can only be used with MAP estimates, and not MLE
estimates.
Solution: False. In Naive Bayes we need to estimate the distribution of each feature Xi
given the label Y . Any technique for estimating the distribution can be used, including
both the MLE and the MAP estimate.
1.4 Probability
Assume we have a sample space Ω. Answer each question with T or F. No justification
is required.
(a) [1 pts.] T or F: If events A, B, and C are disjoint then they are independent.
Solution: False. If they are disjoint, i.e. mutually exclusive, they are highly dependent: knowing that A occurred tells us that B and C did not occur.
(b) [1 pts.] T or F: P(A|B) ∝ P(A)P(B|A) / P(A|B). (The sign '∝' means 'is proportional to')
Solution: False. P(A|B) = P(A)P(B|A) / P(B).
(c) [1 pts.] T or F: P(A ∪ B) ≤ P(A).
Solution: False. P(A ∪ B) ≥ P(A), since A ⊆ A ∪ B.
(a) [4 pts.] T or F: The true error of a hypothesis h can be lower than its training error on
the sample S.
Solution: True. If the sample S happens to favor part of the space where h makes
mistakes then the sample error will overestimate the true error. An extreme example is
when the hypothesis h has true error 0.5, but the training sample S contains a single
sample that h incorrectly classifies.
(b) [4 pts.] T or F: Learning theory allows us to determine with 100% certainty the true error of a hypothesis to within any ε > 0.
Solution: False. There is always a small chance that the sample S is not representative
of the underlying distribution D, in which case the sample S may have no relationship
to the true error. The sample complexity bounds discussed in class show that this is
rare, but not impossible.
[Figure: two error curves, labeled curve (i) and curve (ii), plotted as error versus the number of training examples.]
(a) [8 pts.] Which curve represents the training error? Please provide 1–2 sentences of
justification.
Solution: It is easier to correctly classify small training datasets. For example, if the
data contains just a single point, then logistic regression will always have zero training
error. On the other hand, we don’t expect a classifier learned from few examples to
generalize well, so for small training sets the true error is large. Therefore, curve (ii)
shows the general trend of the training error.
(b) [4 pt.] In one word, what does the gap between the two curves represent?
Solution: The gap between the two curves represents the amount of overfitting.
Consider the dataset S plotted in Fig. 1 along with its associated regression line. For
each of the altered data sets S new plotted in Fig. 3, indicate which regression line (relative
to the original one) in Fig. 2 corresponds to the regression line for the new data set. Write
your answers in the table below.
(a) Adding one outlier to the original data set.
(b) Adding two outliers to the original data set.
(c) Adding three outliers to the original data set. Two on one side and one on the other side.
(d) Duplicating the original data set.
(a) [5 pts.] Is it possible to get a closed form for the parameters ŵ that maximize the
conditional log likelihood? How would you compute ŵ in practice?
Solution: There is no closed form expression for maximizing the conditional log likeli-
hood. One has to consider iterative optimization methods, such as gradient descent, to
compute ŵ.
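A minimal sketch of this in practice (assuming NumPy, a tiny synthetic dataset with the bias folded into X as a constant column, and a fixed step size): gradient ascent on the conditional log likelihood, since no closed form exists.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=1000):
    # Gradient ascent on the conditional log likelihood
    #   l(w) = sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ],  p_i = sigmoid(w . x_i),
    # whose gradient is X^T (y - p); there is no closed-form maximizer.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        w += lr * X.T @ (y - p)
    return w

# Tiny synthetic example (assumed data); first column is a constant bias feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w_hat = fit_logistic(X, y)
y_hat = (X @ w_hat >= 0).astype(int)   # predict 1 iff w . x >= 0 (a linear classifier)
print(w_hat, y_hat)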
(b) [5 pts.] What is the form of the classifier output by logistic regression?
Solution: Given x, we predict ŷ = 1 if p(y = 1|x) ≥ p(y = 0|x). This reduces to ŷ = 1 if wᵀx ≥ 0, which is a linear classifier.
(c) [2 pts.] Extra Credit: Consider the case with binary features, i.e., x ∈ {0, 1}^d ⊂ R^d, where feature x1 is rare and happens to appear in the training set with only label 1. What is ŵ1? Is the gradient ever zero for any finite w? Why is it important to include a regularization term to control the norm of ŵ?
Solution: If a binary feature is active only for label 1 in the training set then, by maximizing the conditional log likelihood, we will drive the weight associated with that feature to infinity: whenever this feature is observed, we will want to predict 1 irrespective of everything else. The gradient is therefore never zero for any finite w, since increasing w1 always increases the conditional likelihood of those training examples. This is undesirable behaviour from the point of view of generalization performance, as most likely we do not believe this rare feature carries that much information about class 1; most likely it is a spurious co-occurrence. Controlling the norm of the weight vector prevents these pathological cases.
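A sketch of this failure mode and of the regularization fix (assuming NumPy and a toy dataset in which the first binary feature appears only with label 1): without a penalty the corresponding weight keeps growing as training continues, while an L2 penalty keeps it bounded.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, lam=0.0, lr=0.1, iters=5000):
    # Gradient ascent on the L2-penalized conditional log likelihood
    #   l(w) - (lam / 2) * ||w||^2, whose gradient is X^T (y - p) - lam * w.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        w += lr * (X.T @ (y - p) - lam * w)
    return w

# Toy data (assumed): feature 0 is rare and appears only in a label-1 example.
X = np.array([[1.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 0.0, 1.0, 0.0])

print(fit(X, y, lam=0.0)[0])   # unpenalized: this weight keeps growing as iters increases
print(fit(X, y, lam=1.0)[0])   # L2-regularized: the weight stays bounded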
(b) [2 pts.] Suppose φ(x) is an arbitrary feature mapping from input x ∈ X to φ(x) ∈ R^N and let K(x, z) = φ(x) · φ(z). Then K(x, z) will always be a valid kernel function.
Solution: True. Any function expressible as an inner product of feature maps is, by definition, a valid (positive semidefinite) kernel.
(c) [2 pts.] Given the same training data, in which the points are linearly separable, the
margin of the decision boundary produced by SVM will always be greater than or equal
to the margin of the decision boundary produced by Perceptron.
Solution: True. SVM explicitly maximizes the margin; Perceptron does not differentiate between separating decision boundaries, stopping at any separator that correctly classifies the training data regardless of its margin.
(b) [3 pt.] Recall that when the data are not linearly separable, SVM minimizes ‖w‖² + C Σ_i ξ_i subject to the constraint that ∀i, y_i w · x_i ≥ 1 − ξ_i and ξ_i ≥ 0. Which of the following may happen to the size of the margin if the tradeoff parameter C is increased? Circle all that apply.
• Increases — No
• Decreases — Yes
Next, we show that if w₀* = argmin_w ‖w‖² + c₀ Σ_i ξ_i and w₁* = argmin_w ‖w‖² + c₁ Σ_i ξ_i, then ‖w₁*‖² ≥ ‖w₀*‖².

By contradiction, assume ‖w₁*‖² < ‖w₀*‖². Since w₀* = argmin_w ‖w‖² + c₀ Σ_i ξ_i:

‖w₀*‖² + c₀ Σ_i ξ_i^(0) ≤ ‖w₁*‖² + c₀ Σ_i ξ_i^(1)

Combined with the assumption ‖w₁*‖² < ‖w₀*‖², this inequality forces Σ_i ξ_i^(0) < Σ_i ξ_i^(1). Since c₁ > c₀ and Σ_i ξ_i^(1) > Σ_i ξ_i^(0), it follows that

‖w₀*‖² + c₁ Σ_i ξ_i^(0) < ‖w₁*‖² + c₁ Σ_i ξ_i^(1)

But w₁* = argmin_w ‖w‖² + c₁ Σ_i ξ_i, so no feasible solution can achieve a strictly smaller objective value under c₁. This contradiction shows ‖w₁*‖² ≥ ‖w₀*‖².
4.3 Analysis
(a) [4 pts.] In one or two sentences, describe the benefit of using the Kernel trick.
Solution: The kernel trick lets us implicitly map features into a higher-dimensional space (enabling non-linear decision boundaries) while avoiding the extra computational cost of constructing that higher-dimensional feature map explicitly; we only ever evaluate kernel values K(x, z).
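A small sketch of that saving (assuming a degree-2 polynomial kernel on 2-D inputs, chosen purely for illustration): the kernel value equals an inner product in a 6-dimensional feature space, but it is computed entirely in the original 2-D space.

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for a 2-D input (6 dimensions).
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    # Degree-2 polynomial kernel, computed in the original 2-D space.
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))  # explicit high-dimensional mapping
print(poly_kernel(x, z))       # kernel trick: same value, no explicit mapping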
(b) [4 pt.] The concept of margin is essential in both SVM and Perceptron. Describe why a
large margin separator is desirable for classification.
Solution: Separators with large margin will have low generalization errors with high
probability.
(c) [4 pts.] Extra Credit: Consider the dataset in Fig. 4. Under the SVM formulation in
section 4.2(a),
(a) [3 pts.] T or F: It is possible to label 4 points in R² in all possible 2⁴ ways via linear separators in R².
Solution: F. The VC dimension of linear separators in R² is 3, hence a set of size 4 cannot be shattered, i.e., labeled in all possible ways.
(d) [3 pts.] T or F: The VC dimension of a concept class with infinite size is also infinite.
Solution: F. Consider all the half-spaces in R², which have infinite cardinality but VC dimension 3.
(f) [3 pts.] T or F: Given a realizable concept class and a set of training instances, a
consistent learner will output a concept that achieves 0 error on the training instances.
Solution: T. Since the concept class is realizable, a hypothesis with zero training error (e.g., the target concept itself) exists in the class, and a consistent learner by definition outputs such a hypothesis.
5.2 VC dimension
Briefly explain in 2–3 sentences the importance of sample complexity and VC dimension
in learning with generalization guarantees.
Solution: Sample complexity guarantees quantify how many training samples we need from the underlying data distribution D so that, uniformly over all hypotheses in the class of functions under consideration, the empirical error rates are close to the true errors. This is important because we care about finding a hypothesis of small true error, but we can only optimize over a fixed training sample. VC bounds are one kind of sample complexity guarantee, where the bound depends on the VC-dimension of the hypothesis class; they are particularly useful when the class of functions is infinite.
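For reference, a sketch of one common form of such a guarantee (the exact constants and logarithmic factors vary across textbooks): for a hypothesis class H of VC-dimension d, a sample of size

\[
m \;=\; O\!\left(\frac{1}{\epsilon^{2}}\left(d\,\ln\frac{1}{\epsilon} \;+\; \ln\frac{1}{\delta}\right)\right)
\quad\Longrightarrow\quad
\Pr\Big[\,\forall h \in \mathcal{H}:\ \big|\widehat{\mathrm{err}}_S(h) - \mathrm{err}_D(h)\big| \le \epsilon\,\Big] \;\ge\; 1-\delta ,
\]

i.e., with that many i.i.d. samples from D, with probability at least 1 − δ every hypothesis in the class has empirical error within ε of its true error.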
[Figure 5: (a) The dataset with groups S1, S2, and S3, plotted in the (x1, x2) plane with both axes ranging from 0 to 5. (b) The neural network architecture: inputs x1 and x2 feed hidden units h1 and h2, which connect to the output unit through weights w31 and w32.]

[Figure 6: NN classification. Three copies of the (x1, x2) plane for drawing decision boundaries: (a) sets S2 and S3, (b) sets S1 and S2, (c) sets S1, S2 and S3.]
(a) First we will set the parameters w11 , w12 and b1 of the neuron labeled h1 so that its
output h1 (x) = φ(w11 x1 + w12 x2 + b1 ) forms a linear separator between the sets S2 and
S3 .
(1) [1 pt.] On Fig. 6a, draw a linear decision boundary that separates S2 and S3 .
(2) [1 pt.] Write down the corresponding weights w11 , w12 , and b1 so that h1 (x) = 0 for
all points in S3 and h1 (x) = 1 for all points in S2 .
Solution: w11 = −1, w12 = 0, b1 = 3. With these parameters, we have w11 x1 + w12 x2 + b1 > 0 if and only if −x1 > −3, which is equivalent to x1 < 3.
(b) Next we set the parameters w21 , w22 and b2 of the neuron labeled h2 so that its output
h2 (x) = φ(w21 x1 + w22 x2 + b2 ) forms a linear separator between the sets S1 and S2 .
(1) [1 pt.] On Fig. 6b, draw a linear decision boundary that separates S1 and S2 .
(2) [1 pt.] Write down the corresponding weights w21 , w22 , and b2 so that h2 (x) = 0 for
all points in S1 and h2 (x) = 1 for all points in S2 .
Solution: The provided line has a slope of −2 and crosses the x2 axis at the value 5. From this, the equation for the region above the line (those points for which h2(x) = 1) is given by x2 ≥ −2x1 + 5 or, equivalently, x2 + 2x1 − 5 ≥ 0. Therefore, w21 = 2, w22 = 1, b2 = −5.
(c) Now we have two classifiers h1 (to classify S2 from S3 ) and h2 (to classify S1 from S2 ). We
will set the weights of the final neuron of the neural network based on the results from h1
and h2 to classify the crosses from the circles. Let h3(x) = φ(w31 h1(x) + w32 h2(x) + b3).
(1) [1 pt.] Compute w31 , w32 , b3 such that h3 (x) correctly classifies the entire dataset.
Solution: Consider the weights w31 = w32 = 1 and b3 = −1.5. With these weights, h3(x) = 1 if h1(x) + h2(x) ≥ 1.5. For points in S1 and S3, either h1 or h2 is zero, so they will be classified as 0, as required. For points in S2, both h1 and h2 output 1, so the point is classified as 1. This rule has zero training error; see the sketch after part (2) below.
(2) [1 pt.] Draw your decision boundary in Fig. 6c.
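A minimal sketch of this construction (assumptions: a unit-step activation φ(z) = 1 if z ≥ 0 and 0 otherwise, plus a few hypothetical representative points for S1, S2, and S3, since the exact coordinates come from Fig. 5a):

def phi(z):
    # Unit-step activation (assumed): outputs 1 when z >= 0, else 0.
    return 1.0 if z >= 0 else 0.0

def h1(x):
    # Fires (outputs 1) iff x1 < 3: separates S2 (output 1) from S3 (output 0).
    return phi(-1.0 * x[0] + 0.0 * x[1] + 3.0)

def h2(x):
    # Fires iff x2 + 2*x1 - 5 >= 0: separates S2 (output 1) from S1 (output 0).
    return phi(2.0 * x[0] + 1.0 * x[1] - 5.0)

def h3(x):
    # Output unit: effectively an AND of h1 and h2 via the 1.5 threshold.
    return phi(1.0 * h1(x) + 1.0 * h2(x) - 1.5)

# Hypothetical points standing in for the groups in Fig. 5a.
groups = {"S1": [(1.0, 1.0), (0.5, 2.0)],   # expect output 0
          "S2": [(2.0, 4.0), (1.5, 3.5)],   # expect output 1
          "S3": [(4.0, 2.0), (4.5, 4.0)]}   # expect output 0
for name, pts in groups.items():
    print(name, [h3(p) for p in pts])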