ASU Assignment 2 Solutions
Suppose we have two positive examples x_1 = (1, 1) and x_2 = (1, −1), and two negative examples x_3 = (−1, 1) and x_4 = (−1, −1). We use the standard gradient ascent method (without any additional regularization terms) to train a logistic regression classifier. What is the final weight vector w? Justify your answer. You can assume that the weight vector starts at the origin, i.e., w_0 = (0, 0, 0)^T. How would you explain the final weight vector w you get?
Solutions: The final weight vector is w = (0, ∞, 0)^T. We can prove this by induction.
Suppose that at the t-th iteration the current weight vector is w_t = (0, c, 0)^T with c ≥ 0; we will show that after the update, w_{t+1} = (0, c + d, 0)^T with d > 0.
Writing each example in bias-augmented form x_i = (1, x_{i1}, x_{i2})^T, we can verify that P(y_1 = 1 | x_1) = P(y_2 = 1 | x_2) = e^c / (1 + e^c) = a and P(y_3 = 1 | x_3) = P(y_4 = 1 | x_4) = 1 / (1 + e^c) = b, with a + b = 1.
Thus, by gradient ascent,
w_{t+1} = w_t + η Σ_{i=1}^{4} x_i (y_i − P(y_i = 1 | x_i, w_t)) = w_t + η (2(1 − a − b), 2(1 + b − a), 0)^T = (0, c + 2η(1 + b − a), 0)^T,
where η > 0 is the learning rate and 0 < a, b < 1. Therefore d = 2η(1 + b − a) > 0.
The intuition is that this final weight vector maximizes the likelihood of the training set: the data are perfectly separated by the sign of the first coordinate, so the likelihood keeps increasing as the corresponding weight grows and is maximized only in the limit, while the bias and the second-coordinate weight carry no information and stay at zero.
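This behavior can be reproduced with a short numerical sketch (my own illustration, not part of the original solution); it assumes the bias-augmented representation (1, x_{i1}, x_{i2}) used above and an arbitrary learning rate η = 0.5.

```python
# Sketch (illustration only): run the gradient-ascent update above on the four
# points and watch the weight on the first coordinate grow without bound.
import numpy as np

# Bias-augmented inputs (1, x_i1, x_i2) and {0, 1} labels for logistic regression.
X = np.array([[1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1]], dtype=float)
y = np.array([1, 1, 0, 0], dtype=float)

eta = 0.5           # assumed learning rate
w = np.zeros(3)     # w_0 = (0, 0, 0)^T
for t in range(10000):
    p = 1.0 / (1.0 + np.exp(-X @ w))    # P(y_i = 1 | x_i, w_t)
    w = w + eta * X.T @ (y - p)         # w_{t+1} = w_t + eta * sum_i x_i (y_i - p_i)

print(w)  # bias and second-coordinate weight stay 0; the middle entry keeps growing
```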
1 (Kernel, 10 pts) Given the following dataset in 1-d space (Figure ??), which consists of 3
positive data points {−1, 0, 1} and 3 negative data points {−3, −2, 2}.
Figure 2: Feature Map
(1) Find a feature map (R^1 → R^2) which will map the original 1-d data points to 2-d space so that the positive set and the negative set are linearly separable. Plot the dataset after mapping in 2-d space.
Solutions: x → (x, x^2). See Figure 2.
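As a quick check (my addition), the mapped points are separable by a horizontal line; the threshold x^2 = 2.5 below is an illustrative choice, not the SVM boundary from part (2).

```python
# Check: under x -> (x, x^2) the classes are separated by a horizontal line.
pos = [-1, 0, 1]
neg = [-3, -2, 2]
phi = lambda x: (x, x * x)

print([phi(x) for x in pos])   # second coordinates: 1, 0, 1
print([phi(x) for x in neg])   # second coordinates: 9, 4, 4

threshold = 2.5                # any value strictly between 1 and 4 works
assert all(phi(x)[1] < threshold for x in pos)
assert all(phi(x)[1] > threshold for x in neg)
```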
(2) In your plot, draw the decision boundary given by the hard-margin linear SVM. Mark the corresponding support vector(s).
Solutions: See Figure 2 (the support vectors are the shaded ones).
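One way to reproduce the boundary and support vectors numerically (my sketch, assuming scikit-learn is available) is to fit a linear SVC with a very large C on the mapped points, which approximates the hard-margin SVM:

```python
# Sketch: approximate the hard-margin SVM on the mapped points with a linear
# SVC and a very large C, then print the boundary and the support vectors.
import numpy as np
from sklearn.svm import SVC

points = np.array([-1, 0, 1, -3, -2, 2], dtype=float)
labels = np.array([+1, +1, +1, -1, -1, -1])
Z = np.column_stack([points, points ** 2])   # feature map x -> (x, x^2)

clf = SVC(kernel='linear', C=1e6).fit(Z, labels)
print(clf.coef_, clf.intercept_)   # w and b of the separating line in 2-d
print(clf.support_vectors_)        # compare with the shaded points in Figure 2
```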
(3) For the feature map you choose, what is the corresponding kernel K(x_1, x_2)?
Solutions: K(x_1, x_2) = x_1 x_2 + (x_1 x_2)^2.
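A brief verification (my addition) that this kernel is exactly the inner product induced by the feature map x → (x, x^2):

```python
# Verify phi(x1) . phi(x2) = x1*x2 + (x1*x2)^2 on the six data points.
import itertools

phi = lambda x: (x, x * x)
K = lambda x1, x2: x1 * x2 + (x1 * x2) ** 2

data = [-1, 0, 1, -3, -2, 2]
for x1, x2 in itertools.product(data, repeat=2):
    dot = phi(x1)[0] * phi(x2)[0] + phi(x1)[1] * phi(x2)[1]
    assert dot == K(x1, x2)
```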
(1) Prove that the above formulation is equivalent to the following unconstrained quadratic optimization problem:

argmin_{w,b}  w^T w + λ Σ_{i=1}^{m} max(1 − y_i (w^T x_i + b), 0)     (2)
Solutions: By eq. (1), i.e., the constrained soft-margin SVM formulation min_{w,b,ξ} (1/2) w^T w + C Σ_i ξ_i subject to y_i (w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, we have ξ_i ≥ max(1 − y_i (w^T x_i + b), 0) for all i. Since we want to minimize Σ_i ξ_i, at the optimum ξ_i = max(1 − y_i (w^T x_i + b), 0) for all i. Plugging this back into eq. (1) yields eq. (2) (up to the overall scaling that determines λ; see part (3)).
(2) What is your intuition for this new optimization formulation (1 or 2 sentences)?
Solutions: The soft-margin SVM trades off the simplicity of the classifier (the first term, which favors a large margin) against good predictions on the training dataset (the second term, the total hinge loss).
(3) What is the value for λ (as a function of C)?
Solutions: λ = 2C.
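A small numerical sanity check of the λ = 2C correspondence (my addition, assuming eq. (1) is the standard form recalled above): for any fixed (w, b), setting each ξ_i to the hinge loss and doubling the constrained objective reproduces the unconstrained objective with λ = 2C.

```python
# Sanity check: with xi_i set to the hinge loss, twice the constrained
# objective (1/2) w^T w + C * sum(xi) equals w^T w + 2C * sum(hinge).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

w, b, C = rng.normal(size=2), 0.3, 100.0
hinge = np.maximum(1 - y * (X @ w + b), 0)      # optimal xi_i for this (w, b)

constrained = 0.5 * w @ w + C * hinge.sum()     # eq. (1) objective at xi = hinge
unconstrained = w @ w + 2 * C * hinge.sum()     # eq. (2) objective with lambda = 2C
assert np.isclose(2 * constrained, unconstrained)
```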
3 (SMO, 5 pts) Suppose we are given 4 data points in 2-d space: x1 = (0, 1), y1 = −1;
x2 = (2, 0), y2 = +1; x3 = (1, 0), y3 = +1; and x4 = (0, 2), y4 = −1. We will use these 4
data points to train a soft-margin linear SVM. Let α1 , α2 , α3 , α4 be the Lagrange multipliers
for x1, x2, x3, x4, respectively, and let the regularization parameter C be 100.
(1) Write down the dual optimization formulation for this problem
Solutions:
argmax_{α_1, α_2, α_3, α_4}  α_1 + α_2 + α_3 + α_4 − (1/2)(α_1^2 + 4α_2^2 + α_3^2 + 4α_4^2 + 4α_2 α_3 + 4α_1 α_4)
subject to:  0 ≤ α_1, α_2, α_3, α_4 ≤ 100,
−α_1 + α_2 + α_3 − α_4 = 0.     (3)
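The quadratic coefficients can be read off the matrix Q with Q_ij = y_i y_j (x_i · x_j); a short check of this matrix (my addition):

```python
# Check the dual's quadratic coefficients: Q_ij = y_i * y_j * (x_i . x_j).
import numpy as np

X = np.array([[0, 1], [2, 0], [1, 0], [0, 2]], dtype=float)
y = np.array([-1, +1, +1, -1], dtype=float)

Q = np.outer(y, y) * (X @ X.T)
print(Q)
# The diagonal (1, 4, 1, 4) gives alpha_1^2 + 4 alpha_2^2 + alpha_3^2 + 4 alpha_4^2,
# and Q_23 = Q_32 = 2, Q_14 = Q_41 = 2 give the cross terms 4 alpha_2 alpha_3 and
# 4 alpha_1 alpha_4 in alpha^T Q alpha, matching the objective above.
```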
1. [4 points] Suppose we are using a linear SVM (i.e., no kernel), with some large C value, and
are given the following data set.
Figure 3
2. [4 points] In the following image, circle the points such that after removing that point (exam-
ple) from the training set and retraining SVM, we would get a different decision boundary
than training on the full sample.
Figure 4
Solution: Circle the support vectors. These examples are positioned such that removing any one of them introduces slack in the constraints, allowing for a solution with a smaller objective function value and with a different third support vector; in this case, because each of these new (replacement) support vectors is not close to the old separator, the decision boundary shifts to make its distance to that example equal to the others.
3. [4 points] Suppose that instead of an SVM, we use regularized logistic regression to learn the classifier.
In the following image, circle the points such that after removing that point (example) from
the training set and running regularized logistic regression, we would get a different decision
boundary than training with regularized logistic regression on the full sample.
Figure 5
4. [4 points] Suppose we have a kernel K(·, ·), such that there is an implicit high-dimensional
feature map φ : Rd → RD that satisfies ∀xi , xj ∈ Rd , K(xi , xj ) = φ(xi ) · φ(xj ), where
φ(x_i) · φ(x_j) = Σ_{l=1}^{D} φ(x_i)^l φ(x_j)^l is the dot product in the D-dimensional space, and φ(x_i)^l is the l-th element/feature in the D-dimensional space.
Show how to calculate the Euclidean distance in the D-dimensional space
‖φ(x_i) − φ(x_j)‖ = √( Σ_{l=1}^{D} (φ(x_i)^l − φ(x_j)^l)^2 )
without explicitly calculating the values in the D-dimensional space. For this question,
please provide a formal proof.
Hint: Try converting the Euclidean distance into a set of inner products.
Solution.
‖φ(x_i) − φ(x_j)‖ = √( Σ_{l=1}^{D} (φ(x_i)^l − φ(x_j)^l)^2 )
= √( Σ_{l=1}^{D} [ (φ(x_i)^l)^2 + (φ(x_j)^l)^2 − 2 φ(x_i)^l φ(x_j)^l ] )
= √( Σ_{l=1}^{D} (φ(x_i)^l)^2 + Σ_{l=1}^{D} (φ(x_j)^l)^2 − 2 Σ_{l=1}^{D} φ(x_i)^l φ(x_j)^l )
= √( φ(x_i) · φ(x_i) + φ(x_j) · φ(x_j) − 2 φ(x_i) · φ(x_j) )
= √( K(x_i, x_i) + K(x_j, x_j) − 2 K(x_i, x_j) ).
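To illustrate the identity numerically (my addition), the sketch below uses the quadratic kernel K(x, y) = (x · y)^2, whose feature map is explicit in 2-d, and checks that the kernel-based distance matches the explicit one; the kernel choice is only an example.

```python
# Illustration: with K(x, y) = (x . y)^2 and its explicit map
# phi(x) = (x1^2, x2^2, sqrt(2) x1 x2), the kernel-based distance equals the
# distance computed directly in the feature space.
import numpy as np

def K(x, y):
    return float(np.dot(x, y)) ** 2

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

xi, xj = np.array([1.0, 2.0]), np.array([-0.5, 3.0])

explicit = np.linalg.norm(phi(xi) - phi(xj))
via_kernel = np.sqrt(K(xi, xi) + K(xj, xj) - 2 * K(xi, xj))
assert np.isclose(explicit, via_kernel)
```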
5. [2 points] Assume that we use the RBF kernel function K(x_i, x_j) = exp(−(1/2) ‖x_i − x_j‖^2). Also assume the same notation as in the last question. Prove that for any two input examples x_i and x_j, the squared Euclidean distance of their corresponding points in the high-dimensional space R^D is less than 2, i.e., prove that ‖φ(x_i) − φ(x_j)‖^2 < 2.
Solution. This follows directly from the result of the last question: ‖φ(x_i) − φ(x_j)‖^2 = K(x_i, x_i) + K(x_j, x_j) − 2 K(x_i, x_j) = 1 + 1 − 2 exp(−(1/2) ‖x_i − x_j‖^2) < 2, since the exponential term is strictly positive.
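A quick numerical illustration of the bound (my addition, on arbitrary random inputs):

```python
# Illustration: the squared feature-space distance 2 - 2*exp(-0.5*||xi - xj||^2)
# induced by the RBF kernel stays strictly below 2.
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    xi, xj = rng.normal(size=3), rng.normal(size=3)
    d2 = 2 - 2 * np.exp(-0.5 * np.linalg.norm(xi - xj) ** 2)
    assert d2 < 2
```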
6. [2 points] Assume that we use the RBF kernel function, and the same notation as before.
Consider running One Nearest Neighbor with Euclidean distance in both the input space
Rd and the high-dimensional space RD . Is it possible that One Nearest Neighbor classifier
achieves better classification performance in the high-dimensional space than in the original
input space? Why?
Solution. No: since ‖φ(x_i) − φ(x_j)‖^2 = 2 − 2 exp(−(1/2) ‖x_i − x_j‖^2) is a strictly increasing function of ‖x_i − x_j‖, every query point has the same nearest neighbor in both spaces, so the two One Nearest Neighbor classifiers make identical predictions and achieve identical performance.
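The sketch below (my addition) checks this on random data: the index of the nearest neighbor is the same under the input-space distance and under the kernel-induced distance.

```python
# Check: the RBF-induced distance is a monotone function of the input-space
# distance, so 1-NN picks the same neighbor in both spaces.
import numpy as np

rng = np.random.default_rng(2)
train = rng.normal(size=(50, 4))
queries = rng.normal(size=(10, 4))

for q in queries:
    d_in = np.linalg.norm(train - q, axis=1)              # input-space distances
    d_feat = np.sqrt(2 - 2 * np.exp(-0.5 * d_in ** 2))    # feature-space distances
    assert d_in.argmin() == d_feat.argmin()               # same nearest neighbor
```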
1. Gaussian Naïve Bayes and Logistic Regression (10 points). Suppose we want to train Gaussian Naïve Bayes to learn a boolean/binary classifier f : X → Y, where X is an n-dimensional vector of real-valued features, X = ⟨X_1, ..., X_n⟩, and Y is a boolean class label (i.e., Y = 1 or Y = 0). Recall that in Gaussian Naïve Bayes, we assume all X_i (i = 1, ..., n) are conditionally independent given the class label Y, i.e., P(X_i | Y = k) ∼ N(µ_{ik}, σ_i) (k = 0, 1; i = 1, ..., n). We also assume that P(Y) follows Bernoulli(θ, 1 − θ) (i.e., P(Y = 1) = θ).
– How many independent model parameters are there in this Gaussian Naïve Bayes classifier?
Solutions: 3n + 1. Each X_i contributes two means µ_{i0}, µ_{i1} and one variance σ_i (the same X_i shares the same variance between the two class labels), i.e., 3 parameters per feature, plus 1 for the prior θ.
– Prove that the Gaussian Naïve Bayes assumptions imply that P(Y | X) follows the form P(Y = 1 | X = ⟨X_1, ..., X_n⟩) = 1 / (1 + exp(w_0 + Σ_{i=1}^{n} w_i X_i)). In particular, you need to express w_i (i = 0, ..., n) in terms of the model parameters (i.e., θ, µ_{ik}, σ_i (k = 0, 1; i = 1, ..., n)).
Solutions:
P(Y = 1 | X) = P(Y = 1) P(X | Y = 1) / [ P(Y = 1) P(X | Y = 1) + P(Y = 0) P(X | Y = 0) ]   (Bayes rule)
= 1 / ( 1 + P(Y = 0) P(X | Y = 0) / ( P(Y = 1) P(X | Y = 1) ) )
= 1 / ( 1 + exp( ln( P(Y = 0) P(X | Y = 0) / ( P(Y = 1) P(X | Y = 1) ) ) ) )
= 1 / ( 1 + exp( ln((1 − θ)/θ) + Σ_i ln( P(X_i | Y = 0) / P(X_i | Y = 1) ) ) )   (Naïve Bayes assumption).
Since P(X_i | Y = k) = N(µ_{ik}, σ_i), each log-ratio is linear in X_i:
ln( P(X_i | Y = 0) / P(X_i | Y = 1) ) = [ (X_i − µ_{i1})^2 − (X_i − µ_{i0})^2 ] / (2σ_i^2) = ( (µ_{i0} − µ_{i1}) / σ_i^2 ) X_i + ( µ_{i1}^2 − µ_{i0}^2 ) / (2σ_i^2).
Therefore P(Y = 1 | X) has the required form with
w_i = (µ_{i0} − µ_{i1}) / σ_i^2 (i = 1, ..., n)   and   w_0 = ln((1 − θ)/θ) + Σ_{i=1}^{n} ( µ_{i1}^2 − µ_{i0}^2 ) / (2σ_i^2).
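A numerical cross-check of these weights (my addition, with arbitrary made-up parameters for n = 2):

```python
# Cross-check: the Gaussian Naive Bayes posterior equals
# 1 / (1 + exp(w_0 + sum_i w_i X_i)) with the weights derived above.
import numpy as np

def gauss(x, mu, sigma):
    # Univariate normal density N(x; mu, sigma), applied elementwise.
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

theta = 0.3                                               # P(Y = 1)
mu0, mu1 = np.array([1.0, -2.0]), np.array([0.5, 1.5])    # mu_{i0}, mu_{i1}
sigma = np.array([0.8, 1.7])                              # shared per-feature sigma_i

w = (mu0 - mu1) / sigma ** 2                              # w_i
w0 = np.log((1 - theta) / theta) + np.sum((mu1 ** 2 - mu0 ** 2) / (2 * sigma ** 2))

x = np.array([0.2, -0.7])                                 # an arbitrary test input
num = theta * np.prod(gauss(x, mu1, sigma))
den = num + (1 - theta) * np.prod(gauss(x, mu0, sigma))
assert np.isclose(num / den, 1 / (1 + np.exp(w0 + w @ x)))
```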
2. Boolean Naïve Bayes and Logistic Regression (10 points). Now consider X being a vector of boolean variables (Y still being the boolean class label). We still want to train a Naïve Bayes classifier.
– How many independent model parameters are there in this boolean Naïve Bayes classifier?
Solutions: 2n + 1. Each X_i contributes one parameter θ_{ik} = P(X_i = 1 | Y = k) per class label (2 per feature), plus 1 for the prior θ.
– Let P(X_i = 1 | Y = k) = θ_{ik} (k = 0, 1) and P(Y = 1) = θ. Prove that the boolean Naïve Bayes assumptions imply that P(Y | X) follows the form P(Y = 1 | X = ⟨X_1, ..., X_n⟩) = 1 / (1 + exp(w_0 + Σ_{i=1}^{n} w_i X_i)). In particular, you need to express w_i (i = 0, ..., n) in terms of the model parameters (i.e., θ and θ_{ik} (k = 0, 1; i = 1, ..., n)).
Solutions:
Follow the same procedure as above, except that now P(X_i | Y = k) = θ_{ik}^{X_i} (1 − θ_{ik})^{1 − X_i} (k = 0, 1). This leads to
Σ_i ln( P(X_i | Y = 0) / P(X_i | Y = 1) ) = Σ_i [ ( ln(θ_{i0}/θ_{i1}) − ln((1 − θ_{i0})/(1 − θ_{i1})) ) X_i + ln((1 − θ_{i0})/(1 − θ_{i1})) ].
This completes the proof, with w_0 = ln((1 − θ)/θ) + Σ_i ln((1 − θ_{i0})/(1 − θ_{i1})) and w_i = ln(θ_{i0}/θ_{i1}) − ln((1 − θ_{i0})/(1 − θ_{i1})) = ln( θ_{i0}(1 − θ_{i1}) / ( θ_{i1}(1 − θ_{i0}) ) ).
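As with the Gaussian case, a quick numerical cross-check (my addition, with arbitrary made-up parameters):

```python
# Cross-check: the boolean Naive Bayes posterior equals
# 1 / (1 + exp(w_0 + sum_i w_i X_i)) with the weights derived above.
import numpy as np

theta = 0.4                              # P(Y = 1)
theta0 = np.array([0.2, 0.7, 0.5])       # theta_{i0} = P(X_i = 1 | Y = 0)
theta1 = np.array([0.6, 0.1, 0.9])       # theta_{i1} = P(X_i = 1 | Y = 1)

w = np.log(theta0 * (1 - theta1) / (theta1 * (1 - theta0)))                # w_i
w0 = np.log((1 - theta) / theta) + np.sum(np.log((1 - theta0) / (1 - theta1)))

x = np.array([1, 0, 1])                  # an arbitrary boolean input
like1 = np.prod(theta1 ** x * (1 - theta1) ** (1 - x))
like0 = np.prod(theta0 ** x * (1 - theta0) ** (1 - x))
posterior = theta * like1 / (theta * like1 + (1 - theta) * like0)
assert np.isclose(posterior, 1 / (1 + np.exp(w0 + w @ x)))
```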