
CSE 575: Statistical Machine Learning Assignment #2

Instructor: Prof. Hanghang Tong


Out: Feb. 19th, 2016; Due: Mar. 18th, 2016
Submit electronically, using the submission link on Blackboard for Assignment #1, a file named
yourFirstName-yourLastName.pdf containing your solution to this assignment (a .doc
or .docx file is also acceptable, but .pdf is preferred).

1 Logistic Regression [20 points]

Suppose we have two positive examples x1 = (1, 1) and x2 = (1, −1); and two negative examples
x3 = (−1, 1) and x4 = (−1, −1). We use the standard gradient ascent method (without any
additional regularization terms) to train a logistic regression classifier. What is the final weight
vector w? Justify your answer. You can assume that the weight vector starts at the origin, i.e.,
w_0 = (0, 0, 0)^T. How would you explain the final weight vector w you get?
Solutions: The final weight vector is w = (0, ∞, 0)^T, i.e., the second component diverges to +∞. We can prove this by induction.
Suppose at the t-th iteration the current weight vector is w_t = (0, c, 0) with c ≥ 0; we will show that after the update, w_{t+1} = (0, c + d, 0) with d > 0.
We can verify that P(y_1 = 1 | x_1) = P(y_2 = 1 | x_2) = e^c / (1 + e^c) = a, and P(y_3 = 1 | x_3) = P(y_4 = 1 | x_4) = 1 / (1 + e^c) = b, with a + b = 1.
Thus, by gradient ascent (with each x_i augmented by a constant 1 for the bias term),
\[
w_{t+1} = w_t + \eta \sum_{i=1}^{4} x_i \bigl(y_i - P(y_i = 1 \mid x_i, w_t)\bigr)
        = w_t + \eta \bigl(2(1 - a - b),\ 2(1 + b - a),\ 0\bigr)^T
        = \bigl(0,\ c + 2\eta(1 + b - a),\ 0\bigr)^T,
\]
where η > 0 is the learning rate and 0 < a, b < 1. Therefore d = 2η(1 + b − a) > 0.
The intuition is that the data are linearly separable along the first input feature, so the training-set likelihood keeps increasing as the corresponding weight grows; gradient ascent therefore drives that weight toward infinity, which is the weight vector that maximizes the likelihood of the training set.
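As a sanity check, the following is a minimal sketch (an addition, not part of the original solution) that runs plain gradient ascent on these four points; the second weight grows with every pass while the other two components stay at zero.

import numpy as np

# Augmented inputs (bias, x_1, x_2) and labels for the four training points.
X = np.array([[1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1]], dtype=float)
y = np.array([1, 1, 0, 0], dtype=float)

w = np.zeros(3)        # start at the origin, w_0 = (0, 0, 0)^T
eta = 0.1              # learning rate (illustrative choice)

for t in range(10000):
    p = 1.0 / (1.0 + np.exp(-X @ w))    # P(y_i = 1 | x_i, w)
    w += eta * X.T @ (y - p)            # gradient ascent on the log-likelihood

print(w)   # the first and third components stay 0; the second keeps growing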

2 2-Class Logistic Regression [20 points]

Prove that 2-class logistic regression is always a linear classifier.


Solution: See page 47 of the lecture slides.
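For completeness, a brief sketch of the argument (the slides give the full derivation): the prediction depends on x only through the sign of a linear function of x,
\[
P(Y = 1 \mid x) = \frac{1}{1 + e^{-(w^{T} x + b)}} \ \ge\ \frac{1}{2}
\quad\Longleftrightarrow\quad
w^{T} x + b \ \ge\ 0,
\]
so the decision boundary {x : w^T x + b = 0} is a hyperplane and the resulting classifier is linear.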

3 SVM [20 points]

1 (Kernel, 10 pts) Given the following dataset in 1-d space (Figure 1), which consists of 3
positive data points {−1, 0, 1} and 3 negative data points {−3, −2, 2}.

Figure 1: Training Data Set for SVM Classifiers

Figure 2: Feature Map

(1) Find a feature map (R^1 → R^2) that maps the original 1-d data points to 2-d space so that the positive set and the negative set are linearly separable. Plot the dataset after mapping in 2-d space.
Solutions: x → (x, x^2). See Figure 2.
(2) In your plot, draw the decision boundary given by hard-margin linear SVM. Mark the
corresponding support vector(s).
Solutions: See Figure 2 (the support vectors are the shaded points).
(3) For the feature map you choose, what is the corresponding kernel K(x1 , x2 )?
Solutions: K(x_1, x_2) = x_1 x_2 + (x_1 x_2)^2.
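A quick numerical check of this kernel (a sketch added here, not part of the original solution): for the map φ(x) = (x, x^2), the inner product φ(a)·φ(b) should equal ab + (ab)^2 for every pair of training points.

import numpy as np

phi = lambda x: np.array([x, x**2])      # feature map R^1 -> R^2
K = lambda a, b: a * b + (a * b) ** 2    # claimed kernel

points = [-1, 0, 1, -3, -2, 2]           # the six 1-d training points
for a in points:
    for b in points:
        assert np.isclose(phi(a) @ phi(b), K(a, b))
print("kernel matches the feature map on all pairs")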

2 (Hinge Loss, 5 pts) Given m training data points {(x_i, y_i)}_{i=1}^{m}, remember that the soft-margin linear SVM can be formalized as the following constrained quadratic optimization problem:
\[
\operatorname*{argmin}_{w, b, \xi}\ \frac{1}{2} w^{T} w + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to} \quad
y_i (w^{T} x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \ \ \forall i. \tag{1}
\]

(1) Prove that the above formulation is equivalent to the following unconstrained quadratic optimization problem:
\[
\operatorname*{argmin}_{w, b}\ w^{T} w + \lambda \sum_{i=1}^{m} \max\bigl(1 - y_i (w^{T} x_i + b),\ 0\bigr). \tag{2}
\]

Solutions: By eq. (1), we have ξ_i ≥ max(1 − y_i(w^T x_i + b), 0) for all i. Since the objective is minimized over the ξ_i, the optimum must have ξ_i = max(1 − y_i(w^T x_i + b), 0) for all i. Plugging this into eq. (1) yields an unconstrained problem of the form of eq. (2), up to an overall scaling of the objective that does not change the minimizer.
(2) What is your intuition for this new optimization formulation (1 or 2 sentences)?
Solutions: Soft-margin SVM balances the simplicity of the classifier (the first term) against good prediction accuracy on the training data (the second term).
(3) What is the value for λ (as a function of C)?

Solutions: λ = 2C (multiplying the objective in eq. (1) by 2 does not change the minimizer and turns (1/2) w^T w + C Σ_i ξ_i into w^T w + 2C Σ_i ξ_i).
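The equivalence can also be checked numerically. The sketch below (an addition, using an arbitrary toy dataset) evaluates both objectives at random (w, b), substituting the optimal slack ξ_i = max(1 − y_i(w^T x_i + b), 0) into eq. (1); with λ = 2C the unconstrained objective equals exactly twice the constrained one at every point, so the two problems share the same minimizer.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))        # arbitrary toy data
y = np.sign(rng.normal(size=10))
C = 100.0
lam = 2 * C

for _ in range(5):
    w, b = rng.normal(size=2), rng.normal()
    margins = y * (X @ w + b)
    xi = np.maximum(1 - margins, 0)                 # optimal slack for eq. (1)
    obj1 = 0.5 * w @ w + C * xi.sum()               # constrained objective, eq. (1)
    obj2 = w @ w + lam * np.maximum(1 - margins, 0).sum()   # objective of eq. (2)
    assert np.isclose(obj2, 2 * obj1)
print("eq. (2) equals 2 x eq. (1) at every (w, b), so the minimizers coincide")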

3 (SMO, 5 pts) Suppose we are given 4 data points in 2-d space: x1 = (0, 1), y1 = −1;
x2 = (2, 0), y2 = +1; x3 = (1, 0), y3 = +1; and x4 = (0, 2), y4 = −1. We will use these 4
data points to train a soft-margin linear SVM. Let α1 , α2 , α3 , α4 be the Lagrange multipliers
for x1 , x2 , x3 , x4 respectively. And also let the regularization parameter C be 100.

(1) Write down the dual optimization formulation for this problem
Solutions:

\[
\operatorname*{argmax}_{\alpha_1, \alpha_2, \alpha_3, \alpha_4}\ \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \frac{1}{2}\bigl(\alpha_1^2 + 4\alpha_2^2 + \alpha_3^2 + 4\alpha_4^2 + 4\alpha_2\alpha_3 + 4\alpha_1\alpha_4\bigr)
\]
\[
\text{subject to:} \quad 0 \le \alpha_1, \alpha_2, \alpha_3, \alpha_4 \le 100, \qquad -\alpha_1 + \alpha_2 + \alpha_3 - \alpha_4 = 0. \tag{3}
\]

(2) Suppose we initialize α_1 = 5, α_2 = 4, α_3 = 8, α_4 = 7, and we want to update α_1 and α_4 (keeping α_2 and α_3 fixed) in the 1st iteration. Derive the update equations for α_1 and α_4 (in terms of α_2 and α_3). What are the values for α_1 and α_4 after the update?
Solutions: The equality constraint forces α_1 + α_4 = α_2 + α_3 = 12; maximizing the dual objective along this line and clipping to the box constraints gives α_1 = α_2 + α_3 = 12 and α_4 = 0.
(3) Now fix α_1 and α_4, and derive the update equations for α_2 and α_3 (in terms of α_1 and α_4). What are the values for α_2 and α_3 after the update?
Solutions: By the same argument, α_2 + α_3 = α_1 + α_4 = 12, and maximizing the dual objective along this line gives α_3 = α_1 + α_4 = 12 and α_2 = 0.
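These two pair updates can be reproduced numerically. The sketch below (an addition, not part of the original solution) maximizes the dual objective of eq. (3) over one pair of multipliers at a time; because both multipliers in each pair share the same label sign, their sum is fixed by the equality constraint, and a simple grid search over the feasible interval recovers the values above.

import numpy as np

X = np.array([[0, 1], [2, 0], [1, 0], [0, 2]], dtype=float)
y = np.array([-1, 1, 1, -1], dtype=float)
C = 100.0
Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j x_i . x_j

def dual(alpha):
    return alpha.sum() - 0.5 * alpha @ Q @ alpha  # dual objective of eq. (3)

def update_same_sign_pair(alpha, i, j):
    # y_i = y_j here, so alpha_i + alpha_j is conserved by the equality constraint.
    s = alpha[i] + alpha[j]
    grid = np.linspace(max(0.0, s - C), min(C, s), 1201)
    def value(a):
        trial = alpha.copy()
        trial[i], trial[j] = a, s - a
        return dual(trial)
    best = grid[np.argmax([value(a) for a in grid])]
    alpha = alpha.copy()
    alpha[i], alpha[j] = best, s - best
    return alpha

alpha = np.array([5.0, 4.0, 8.0, 7.0])
alpha = update_same_sign_pair(alpha, 0, 3)   # update alpha_1, alpha_4
alpha = update_same_sign_pair(alpha, 1, 2)   # update alpha_2, alpha_3
print(alpha)                                 # approximately [12, 0, 12, 0]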

4 More on SVM [20 points]

1. [4 points] Suppose we are using a linear SVM (i.e., no kernel), with some large C value, and
are given the following data set.

Figure 3: Training examples in the (X1, X2) plane

Draw the decision boundary of linear SVM. Give a brief explanation.


Solution. Because of the large C value, the decision boundary will classify all of the ex-
amples correctly. Furthermore, among separators that classify the examples correctly, it will
have the largest margin (distance to closest point).
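One way to see this behavior concretely (an illustrative addition; the coordinates below are hypothetical, since the exact points of Figure 3 are not reproduced here) is to fit a linear SVM with a very large C and inspect the resulting separator and support vectors.

import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable points standing in for Figure 3.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 2], [5, 2], [4, 3]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("separating hyperplane: w =", w, " b =", b)
print("support vectors:\n", clf.support_vectors_)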

2. [4 points] In the following image, circle the points such that after removing that point (exam-
ple) from the training set and retraining SVM, we would get a different decision boundary
than training on the full sample.

Figure 4: Training examples in the (X1, X2) plane

Justify your answer.


Solution. These examples are the support vectors; all of the other examples are such that
their corresponding constraints are not tight in the optimization problem, so removing them
will not create a solution with smaller objective function value (norm of w). These three

examples are positioned such that removing any one of them introduces slack in the con-
straints, allowing for a solution with a smaller objective function value and with a different
third support vector; in this case, because each of these new (replacement) support vectors
is not close to the old separator, the decision boundary shifts to make its distance to that
example equal to the others.

3. [4 points] Suppose instead of SVM, we use regularized logistic regression to learn the clas-
sifier. That is,

\[
(w, b) = \operatorname*{arg\,min}_{w \in \mathbb{R}^2,\ b \in \mathbb{R}}\ \frac{\|w\|^2}{2} - \sum_i \left[ 1[y_i = 0] \ln \frac{1}{1 + e^{w \cdot x_i + b}} + 1[y_i = 1] \ln \frac{e^{w \cdot x_i + b}}{1 + e^{w \cdot x_i + b}} \right].
\]

In the following image, circle the points such that after removing that point (example) from
the training set and running regularized logistic regression, we would get a different decision
boundary than training with regularized logistic regression on the full sample.

Figure 5: Training examples in the (X1, X2) plane

Justify your answer.


Solution. Because of the regularization, the weights do not diverge to infinity, so the fitted probabilities at the solution are not exactly 0 or 1. Consequently, every example contributes a nonzero term to the loss function and therefore has some influence on the solution.
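A small numerical illustration (an addition; the points below are hypothetical, since the exact coordinates in Figure 5 are not reproduced here): after fitting L2-regularized logistic regression, every fitted probability lies strictly between 0 and 1, so every example has a nonzero gradient contribution (p_i − y_i)(x_i, 1) and thus shifts the solution.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, linearly separable points standing in for the figure.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 2], [5, 2], [4, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(C=1.0).fit(X, y)   # L2-regularized logistic regression
p = clf.predict_proba(X)[:, 1]              # fitted P(Y = 1 | x_i)

# Each example's gradient contribution is (p_i - y_i) * (x_i, 1); with
# regularization no p_i is exactly 0 or 1, so none of these vanish.
print(np.abs(p - y))                        # all entries strictly positive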

4. [4 points] Suppose we have a kernel K(·, ·), such that there is an implicit high-dimensional feature map φ : R^d → R^D that satisfies, for all x_i, x_j ∈ R^d, K(x_i, x_j) = φ(x_i) · φ(x_j), where φ(x_i) · φ(x_j) = Σ_{l=1}^{D} φ(x_i)^l φ(x_j)^l is the dot product in the D-dimensional space, and φ(x_i)^l is the l-th element/feature in the D-dimensional space.
Show how to calculate the Euclidean distance in the D-dimensional space
\[
\|\phi(x_i) - \phi(x_j)\| = \sqrt{\sum_{l=1}^{D} \bigl(\phi(x_i)^l - \phi(x_j)^l\bigr)^2}
\]

without explicitly calculating the values in the D-dimensional space. For this question,
please provide a formal proof.
Hint: Try converting the Euclidean distance into a set of inner products.
Solution.
\[
\begin{aligned}
\|\phi(x_i) - \phi(x_j)\| &= \sqrt{\sum_{l=1}^{D} \bigl(\phi(x_i)^l - \phi(x_j)^l\bigr)^2} \\
&= \sqrt{\sum_{l=1}^{D} \Bigl[\bigl(\phi(x_i)^l\bigr)^2 + \bigl(\phi(x_j)^l\bigr)^2 - 2\,\phi(x_i)^l \phi(x_j)^l\Bigr]} \\
&= \sqrt{\Bigl(\sum_{l=1}^{D} \bigl(\phi(x_i)^l\bigr)^2\Bigr) + \Bigl(\sum_{l=1}^{D} \bigl(\phi(x_j)^l\bigr)^2\Bigr) - 2\Bigl(\sum_{l=1}^{D} \phi(x_i)^l \phi(x_j)^l\Bigr)} \\
&= \sqrt{\phi(x_i) \cdot \phi(x_i) + \phi(x_j) \cdot \phi(x_j) - 2\,\phi(x_i) \cdot \phi(x_j)} \\
&= \sqrt{K(x_i, x_i) + K(x_j, x_j) - 2 K(x_i, x_j)}.
\end{aligned}
\]
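A sketch of this identity in code (an addition), using the explicit feature map φ(x) = (x, x^2) from Problem 3 so that both sides can be computed and compared:

import numpy as np

phi = lambda x: np.array([x, x**2])      # explicit map from Problem 3
K = lambda a, b: a * b + (a * b) ** 2    # its kernel

def kernel_distance(a, b):
    # Feature-space Euclidean distance computed from kernel values only.
    return np.sqrt(K(a, a) + K(b, b) - 2 * K(a, b))

for a in [-3.0, -1.0, 0.5, 2.0]:
    for b in [-2.0, 0.0, 1.0, 2.5]:
        assert np.isclose(kernel_distance(a, b), np.linalg.norm(phi(a) - phi(b)))
print("kernel distance matches the explicit feature-space distance")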

5. [2 points] Assume that we use the RBF kernel function K(x_i, x_j) = exp(−(1/2)‖x_i − x_j‖^2). Also assume the same notation as in the last question. Prove that for any two input examples x_i and x_j, the squared Euclidean distance of their corresponding points in the high-dimensional space R^D is less than 2, i.e., prove that ‖φ(x_i) − φ(x_j)‖^2 < 2.
Solution. This follows directly from the previous result: ‖φ(x_i) − φ(x_j)‖^2 = K(x_i, x_i) + K(x_j, x_j) − 2K(x_i, x_j) = 2 − 2 exp(−(1/2)‖x_i − x_j‖^2) < 2, since the exponential term is strictly positive.

6. [2 points] Assume that we use the RBF kernel function, and the same notation as before. Consider running One Nearest Neighbor with Euclidean distance in both the input space R^d and the high-dimensional space R^D. Is it possible that the One Nearest Neighbor classifier achieves better classification performance in the high-dimensional space than in the original input space? Why?
Solution. No. By the previous result, the feature-space distance ‖φ(x_i) − φ(x_j)‖ = sqrt(2 − 2 exp(−(1/2)‖x_i − x_j‖^2)) is a strictly increasing function of the input-space distance ‖x_i − x_j‖, so every test point has the same nearest neighbor in both spaces and the two classifiers make identical predictions.
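A small check of this monotonicity argument (an addition, with randomly generated data): the nearest neighbor under the RBF-induced feature-space distance is always the same point as under the input-space Euclidean distance.

import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(20, 3))
test = rng.normal(size=(5, 3))

def euclid(a, b):
    return np.linalg.norm(a - b)

def rbf_feature_distance(a, b):
    # ||phi(a) - phi(b)|| = sqrt(2 - 2 K(a, b)), since K(x, x) = 1 for the RBF kernel
    return np.sqrt(2 - 2 * np.exp(-0.5 * euclid(a, b) ** 2))

for t in test:
    nn_input = min(range(len(train)), key=lambda i: euclid(t, train[i]))
    nn_feature = min(range(len(train)), key=lambda i: rbf_feature_distance(t, train[i]))
    assert nn_input == nn_feature
print("1-NN picks the same neighbor in input space and in the RBF feature space")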

5 Naïve Bayes Classifier and Logistic Regression [20 points]

1. Gaussian Naïve Bayes and Logistic Regression (10 points). Suppose we want to train Gaussian Naïve Bayes to learn a boolean/binary classifier f : X → Y, where X is a vector of n real-valued features, X = <X_1, ..., X_n>, and Y is a boolean class label (i.e., Y = 1 or Y = 0). Recall that in Gaussian Naïve Bayes, we assume all X_i (i = 1, ..., n) are conditionally independent given the class label Y, i.e., P(X_i | Y = k) ∼ N(μ_ik, σ_i) (k = 0, 1; i = 1, ..., n). We also assume that P(Y) follows Bernoulli(θ, 1 − θ) (i.e., P(Y = 1) = θ).

– How many independent model parameters are there in this Gaussian Naïve Bayes classifier?

Solutions: 3n + 1 (for each X_i there are two class-conditional means μ_i0, μ_i1 and one variance σ_i, which is shared between the two class labels, plus the class prior θ).
– Prove that the Gaussian Naïve Bayes assumptions imply that P(Y | X) follows the form
\[
P(Y = 1 \mid X = \langle X_1, \dots, X_n \rangle) = \frac{1}{1 + \exp\bigl(w_0 + \sum_{i=1}^{n} w_i X_i\bigr)}.
\]
In particular, you need to express w_i (i = 0, ..., n) in terms of the model parameters (i.e., θ, μ_ik, σ_i (k = 0, 1; i = 1, ..., n)).
Solutions:
\[
\begin{aligned}
P(Y = 1 \mid X) &= \frac{P(Y = 1) P(X \mid Y = 1)}{P(Y = 1) P(X \mid Y = 1) + P(Y = 0) P(X \mid Y = 0)} \qquad \text{(Bayes rule)} \\
&= \frac{1}{1 + \frac{P(Y = 0) P(X \mid Y = 0)}{P(Y = 1) P(X \mid Y = 1)}} \\
&= \frac{1}{1 + \exp\Bigl(\ln \frac{P(Y = 0) P(X \mid Y = 0)}{P(Y = 1) P(X \mid Y = 1)}\Bigr)} \\
&= \frac{1}{1 + \exp\Bigl(\ln \frac{1 - \theta}{\theta} + \sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)}\Bigr)} \qquad \text{(Naïve Bayes assumption)}
\end{aligned}
\]

Since P(X_i | Y = k) ∼ N(μ_ik, σ_i) (k = 0, 1; i = 1, ..., n), we have
\[
P(X_i = x \mid Y = k) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\Bigl(-\frac{(x - \mu_{ik})^2}{2\sigma_i^2}\Bigr),
\]
which implies
\[
\sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)} = \sum_i \Bigl(\frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}\Bigr).
\]
This completes the proof, with
\[
w_0 = \ln\frac{1 - \theta}{\theta} + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}, \qquad
w_i = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}.
\]
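A numerical sanity check of these expressions (an addition, not part of the original solution): draw random Gaussian Naïve Bayes parameters, compute w_0 and w_i from the formulas above, and compare the resulting sigmoid to the posterior obtained directly from Bayes rule.

import numpy as np

rng = np.random.default_rng(0)
n = 3
theta = 0.6                                   # P(Y = 1)
mu0, mu1 = rng.normal(size=n), rng.normal(size=n)
sigma = rng.uniform(0.5, 2.0, size=n)         # shared per-feature standard deviation

def gaussian(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

# Weights implied by the derivation above.
w0 = np.log((1 - theta) / theta) + np.sum((mu1**2 - mu0**2) / (2 * sigma**2))
w = (mu0 - mu1) / sigma**2

for _ in range(5):
    x = rng.normal(size=n)
    # Direct posterior via Bayes rule with the Naive Bayes factorization.
    p1 = theta * np.prod(gaussian(x, mu1, sigma))
    p0 = (1 - theta) * np.prod(gaussian(x, mu0, sigma))
    posterior = p1 / (p0 + p1)
    logistic = 1.0 / (1.0 + np.exp(w0 + w @ x))
    assert np.isclose(posterior, logistic)
print("GNB posterior matches the logistic form with the derived weights")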

2. Boolean Naïve Bayes and Logistic Regression (10 points). Now consider X being a vector of boolean variables (Y still being the boolean class label). We still want to train a Naïve Bayes classifier.

– How many independent model parameters are there in this boolean Naïve Bayes classifier?
Solutions: 2n + 1 (one parameter θ_ik for each X_i and each class label k = 0, 1, plus the class prior θ).
– Let P(X_i = 1 | Y = k) = θ_ik (k = 0, 1) and P(Y = 1) = θ. Prove that the boolean Naïve Bayes assumptions imply that P(Y | X) follows the form
\[
P(Y = 1 \mid X = \langle X_1, \dots, X_n \rangle) = \frac{1}{1 + \exp\bigl(w_0 + \sum_{i=1}^{n} w_i X_i\bigr)}.
\]
In particular, you need to express w_i (i = 0, ..., n) in terms of the model parameters (i.e., θ, θ_ik (k = 0, 1; i = 1, ..., n)).

Solutions:
Follow the same procedure as above, except that now P(X_i | Y = k) = θ_ik^{X_i} (1 − θ_ik)^{1 − X_i} (k = 0, 1). This leads to
\[
\sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)} = \sum_i \Bigl[\Bigl(\ln\frac{\theta_{i0}}{\theta_{i1}} - \ln\frac{1 - \theta_{i0}}{1 - \theta_{i1}}\Bigr) X_i + \ln\frac{1 - \theta_{i0}}{1 - \theta_{i1}}\Bigr].
\]
This completes the proof, with
\[
w_0 = \ln\frac{1 - \theta}{\theta} + \sum_i \ln\frac{1 - \theta_{i0}}{1 - \theta_{i1}}, \qquad
w_i = \ln\frac{\theta_{i0}}{\theta_{i1}} - \ln\frac{1 - \theta_{i0}}{1 - \theta_{i1}} = \ln\frac{\theta_{i0}(1 - \theta_{i1})}{\theta_{i1}(1 - \theta_{i0})}.
\]
