
CSE 575: Statistical Machine Learning Assignment #2

Instructor: Prof. Hanghang Tong


Out: Feb. 19th, 2016; Due: Mar. 18th, 2016
Submit electronically, using the submission link on Blackboard for Assignment #1, a file named
yourFirstName-yourLastName.pdf containing your solution to this assignment (a .doc
or .docx file is also acceptable, but .pdf is preferred).

1 Logistic Regression [20 points]

Suppose we have two positive examples x1 = (1, 1) and x2 = (1, −1); and two negative examples
x3 = (−1, 1) and x4 = (−1, −1). We use the standard gradient ascent method (without any
additional regularization terms) to train a logistic regression classifier. What is the final weight
vector w? Justify your answer. You can assume that the weight vector starts at the origin, i.e.,
w_0 = (0, 0, 0)^T. How would you explain the final weight vector w you get?
Solutions: The final weight vector is w = (0, ∞, 0)^T, i.e., the second component diverges to +∞. We can prove this by induction.
Suppose at the t-th iteration the current weight vector is w_t = (0, c, 0) with c ≥ 0; we will show that after the update, w_{t+1} = (0, c + d, 0) with d > 0.
We can verify that P(y_1 = 1 | x_1) = P(y_2 = 1 | x_2) = e^c / (1 + e^c) = a, and P(y_3 = 1 | x_3) = P(y_4 = 1 | x_4) = 1 / (1 + e^c) = b, with a + b = 1.
Thus, by gradient ascent (with each x_i augmented by a constant 1 for the bias term),
\[
w_{t+1} = w_t + \eta \sum_{i=1}^{4} x_i \bigl(y_i - P(y_i = 1 \mid x_i, w_t)\bigr)
        = w_t + \eta \bigl(2(1 - a - b),\ 2(1 + b - a),\ 0\bigr)^T
        = \bigl(0,\ c + 2\eta(1 + b - a),\ 0\bigr)^T,
\]
where η > 0 is the learning rate and 0 < a, b < 1. Therefore d = 2η(1 + b − a) > 0.
The intuition is that the data are linearly separable along the first input feature, so the training-set likelihood keeps increasing as the corresponding weight grows; gradient ascent therefore drives that weight toward infinity, which is the weight vector that maximizes the likelihood of the training set.
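As a sanity check, the following is a minimal sketch (an addition, not part of the original solution) that runs plain gradient ascent on these four points; the second weight grows with every pass while the other two components stay at zero.

import numpy as np

# Augmented inputs (bias, x_1, x_2) and labels for the four training points.
X = np.array([[1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1]], dtype=float)
y = np.array([1, 1, 0, 0], dtype=float)

w = np.zeros(3)        # start at the origin, w_0 = (0, 0, 0)^T
eta = 0.1              # learning rate (illustrative choice)

for t in range(10000):
    p = 1.0 / (1.0 + np.exp(-X @ w))    # P(y_i = 1 | x_i, w)
    w += eta * X.T @ (y - p)            # gradient ascent on the log-likelihood

print(w)   # the first and third components stay 0; the second keeps growing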

2 2-Class Logistic Regression [20 points]

Prove that 2-class logistic regression is always a linear classifier.


Solution: See page 47 of the lecture slides.
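For completeness, a brief sketch of the argument (the slides give the full derivation): the prediction depends on x only through the sign of a linear function of x,
\[
P(Y = 1 \mid x) = \frac{1}{1 + e^{-(w^{T} x + b)}} \ \ge\ \frac{1}{2}
\quad\Longleftrightarrow\quad
w^{T} x + b \ \ge\ 0,
\]
so the decision boundary {x : w^T x + b = 0} is a hyperplane and the resulting classifier is linear.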

3 SVM [20 points]

1 (Kernel, 10 pts) Given the following dataset in 1-d space (Figure 1), which consists of 3
positive data points {−1, 0, 1} and 3 negative data points {−3, −2, 2}.

Figure 1: Training Data Set for SVM Classifiers

Figure 2: Feature Map

(1) Find a feature map (R^1 → R^2) that maps the original 1-d data points to 2-d space so that the positive set and the negative set are linearly separable. Plot the dataset after mapping in 2-d space.
Solutions: x → (x, x^2). See Figure 2.
(2) In your plot, draw the decision boundary given by hard-margin linear SVM. Mark the
corresponding support vector(s).
Solutions: See Figure 2 (the support vectors are the shaded points).
(3) For the feature map you choose, what is the corresponding kernel K(x1 , x2 )?
Solutions: K(x_1, x_2) = x_1 x_2 + (x_1 x_2)^2.
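A quick numerical check of this kernel (a sketch added here, not part of the original solution): for the map φ(x) = (x, x^2), the inner product φ(a)·φ(b) should equal ab + (ab)^2 for every pair of training points.

import numpy as np

phi = lambda x: np.array([x, x**2])      # feature map R^1 -> R^2
K = lambda a, b: a * b + (a * b) ** 2    # claimed kernel

points = [-1, 0, 1, -3, -2, 2]           # the six 1-d training points
for a in points:
    for b in points:
        assert np.isclose(phi(a) @ phi(b), K(a, b))
print("kernel matches the feature map on all pairs")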

2 (Hinge Loss, 5 pts) Given m training data points {(x_i, y_i)}_{i=1}^{m}, remember that the soft-margin linear SVM can be formalized as the following constrained quadratic optimization problem:
\[
\operatorname*{argmin}_{w, b, \xi}\ \frac{1}{2} w^{T} w + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to} \quad
y_i (w^{T} x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \ \ \forall i. \tag{1}
\]

(1) Prove that the above formulation is equivalent to the following unconstrained quadratic optimization problem:
\[
\operatorname*{argmin}_{w, b}\ w^{T} w + \lambda \sum_{i=1}^{m} \max\bigl(1 - y_i (w^{T} x_i + b),\ 0\bigr). \tag{2}
\]

Solutions: By eq. (1), we have ξ_i ≥ max(1 − y_i(w^T x_i + b), 0) for all i. Since the objective is minimized over the ξ_i, the optimum must have ξ_i = max(1 − y_i(w^T x_i + b), 0) for all i. Plugging this into eq. (1) yields an unconstrained problem of the form of eq. (2), up to an overall scaling of the objective that does not change the minimizer.
(2) What is your intuition for this new optimization formulation (1 or 2 sentences)?
Solutions: Soft-margin SVM balances the simplicity of the classifier (the first term) against good prediction accuracy on the training data (the second term).
(3) What is the value for λ (as a function of C)?

Solutions: λ = 2C (multiplying the objective in eq. (1) by 2 does not change the minimizer and turns (1/2) w^T w + C Σ_i ξ_i into w^T w + 2C Σ_i ξ_i).
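The equivalence can also be checked numerically. The sketch below (an addition, using an arbitrary toy dataset) evaluates both objectives at random (w, b), substituting the optimal slack ξ_i = max(1 − y_i(w^T x_i + b), 0) into eq. (1); with λ = 2C the unconstrained objective equals exactly twice the constrained one at every point, so the two problems share the same minimizer.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))        # arbitrary toy data
y = np.sign(rng.normal(size=10))
C = 100.0
lam = 2 * C

for _ in range(5):
    w, b = rng.normal(size=2), rng.normal()
    margins = y * (X @ w + b)
    xi = np.maximum(1 - margins, 0)                 # optimal slack for eq. (1)
    obj1 = 0.5 * w @ w + C * xi.sum()               # constrained objective, eq. (1)
    obj2 = w @ w + lam * np.maximum(1 - margins, 0).sum()   # objective of eq. (2)
    assert np.isclose(obj2, 2 * obj1)
print("eq. (2) equals 2 x eq. (1) at every (w, b), so the minimizers coincide")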

3 (SMO, 5 pts) Suppose we are given 4 data points in 2-d space: x1 = (0, 1), y1 = −1;
x2 = (2, 0), y2 = +1; x3 = (1, 0), y3 = +1; and x4 = (0, 2), y4 = −1. We will use these 4
data points to train a soft-margin linear SVM. Let α1 , α2 , α3 , α4 be the Lagrange multipliers
for x1 , x2 , x3 , x4 respectively. And also let the regularization parameter C be 100.

(1) Write down the dual optimization formulation for this problem
Solutions:

\[
\operatorname*{argmax}_{\alpha_1, \alpha_2, \alpha_3, \alpha_4}\ \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \frac{1}{2}\bigl(\alpha_1^2 + 4\alpha_2^2 + \alpha_3^2 + 4\alpha_4^2 + 4\alpha_2\alpha_3 + 4\alpha_1\alpha_4\bigr)
\]
\[
\text{subject to:} \quad 0 \le \alpha_1, \alpha_2, \alpha_3, \alpha_4 \le 100, \qquad -\alpha_1 + \alpha_2 + \alpha_3 - \alpha_4 = 0. \tag{3}
\]

(2) Suppose we initialize α_1 = 5, α_2 = 4, α_3 = 8, α_4 = 7, and we want to update α_1 and α_4 (keeping α_2 and α_3 fixed) in the 1st iteration. Derive the update equations for α_1 and α_4 (in terms of α_2 and α_3). What are the values for α_1 and α_4 after the update?
Solutions: The equality constraint forces α_1 + α_4 = α_2 + α_3 = 12; maximizing the dual objective along this line and clipping to the box constraints gives α_1 = α_2 + α_3 = 12 and α_4 = 0.
(3) Now fix α_1 and α_4, and derive the update equations for α_2 and α_3 (in terms of α_1 and α_4). What are the values for α_2 and α_3 after the update?
Solutions: By the same argument, α_2 + α_3 = α_1 + α_4 = 12, and maximizing the dual objective along this line gives α_3 = α_1 + α_4 = 12 and α_2 = 0.
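These two pair updates can be reproduced numerically. The sketch below (an addition, not part of the original solution) maximizes the dual objective of eq. (3) over one pair of multipliers at a time; because both multipliers in each pair share the same label sign, their sum is fixed by the equality constraint, and a simple grid search over the feasible interval recovers the values above.

import numpy as np

X = np.array([[0, 1], [2, 0], [1, 0], [0, 2]], dtype=float)
y = np.array([-1, 1, 1, -1], dtype=float)
C = 100.0
Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j x_i . x_j

def dual(alpha):
    return alpha.sum() - 0.5 * alpha @ Q @ alpha  # dual objective of eq. (3)

def update_same_sign_pair(alpha, i, j):
    # y_i = y_j here, so alpha_i + alpha_j is conserved by the equality constraint.
    s = alpha[i] + alpha[j]
    grid = np.linspace(max(0.0, s - C), min(C, s), 1201)
    def value(a):
        trial = alpha.copy()
        trial[i], trial[j] = a, s - a
        return dual(trial)
    best = grid[np.argmax([value(a) for a in grid])]
    alpha = alpha.copy()
    alpha[i], alpha[j] = best, s - best
    return alpha

alpha = np.array([5.0, 4.0, 8.0, 7.0])
alpha = update_same_sign_pair(alpha, 0, 3)   # update alpha_1, alpha_4
alpha = update_same_sign_pair(alpha, 1, 2)   # update alpha_2, alpha_3
print(alpha)                                 # approximately [12, 0, 12, 0]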

4 More on SVM [20 points]

1. [4 points] Suppose we are using a linear SVM (i.e., no kernel), with some large C value, and
are given the following data set.

Figure 3: Training examples in the (X1, X2) plane

Draw the decision boundary of linear SVM. Give a brief explanation.


Solution. Because of the large C value, the decision boundary will classify all of the ex-
amples correctly. Furthermore, among separators that classify the examples correctly, it will
have the largest margin (distance to closest point).
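One way to see this behavior concretely (an illustrative addition; the coordinates below are hypothetical, since the exact points of Figure 3 are not reproduced here) is to fit a linear SVM with a very large C and inspect the resulting separator and support vectors.

import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable points standing in for Figure 3.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 2], [5, 2], [4, 3]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("separating hyperplane: w =", w, " b =", b)
print("support vectors:\n", clf.support_vectors_)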

2. [4 points] In the following image, circle the points such that after removing that point (exam-
ple) from the training set and retraining SVM, we would get a different decision boundary
than training on the full sample.

Figure 4: Training examples in the (X1, X2) plane

Justify your answer.


Solution. These examples are the support vectors; all of the other examples are such that
their corresponding constraints are not tight in the optimization problem, so removing them
will not create a solution with smaller objective function value (norm of w). These three

examples are positioned such that removing any one of them introduces slack in the con-
straints, allowing for a solution with a smaller objective function value and with a different
third support vector; in this case, because each of these new (replacement) support vectors
is not close to the old separator, the decision boundary shifts to make its distance to that
example equal to the others.

3. [4 points] Suppose instead of SVM, we use regularized logistic regression to learn the clas-
sifier. That is,

\[
(w, b) = \operatorname*{arg\,min}_{w \in \mathbb{R}^2,\ b \in \mathbb{R}}\ \frac{\|w\|^2}{2} - \sum_i \left[ 1[y_i = 0] \ln \frac{1}{1 + e^{w \cdot x_i + b}} + 1[y_i = 1] \ln \frac{e^{w \cdot x_i + b}}{1 + e^{w \cdot x_i + b}} \right].
\]

In the following image, circle the points such that after removing that point (example) from
the training set and running regularized logistic regression, we would get a different decision
boundary than training with regularized logistic regression on the full sample.

Figure 5: Training examples in the (X1, X2) plane

Justify your answer.


Solution. Because of the regularization, the weights do not diverge to infinity, so the fitted probabilities at the solution are not exactly 0 or 1. Consequently, every example contributes a nonzero term to the loss function and therefore has some influence on the solution.
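A small numerical illustration (an addition; the points below are hypothetical, since the exact coordinates in Figure 5 are not reproduced here): after fitting L2-regularized logistic regression, every fitted probability lies strictly between 0 and 1, so every example has a nonzero gradient contribution (p_i − y_i)(x_i, 1) and thus shifts the solution.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, linearly separable points standing in for the figure.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 2], [5, 2], [4, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(C=1.0).fit(X, y)   # L2-regularized logistic regression
p = clf.predict_proba(X)[:, 1]              # fitted P(Y = 1 | x_i)

# Each example's gradient contribution is (p_i - y_i) * (x_i, 1); with
# regularization no p_i is exactly 0 or 1, so none of these vanish.
print(np.abs(p - y))                        # all entries strictly positive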

4. [4 points] Suppose we have a kernel K(·, ·), such that there is an implicit high-dimensional feature map φ : R^d → R^D that satisfies, for all x_i, x_j ∈ R^d, K(x_i, x_j) = φ(x_i) · φ(x_j), where φ(x_i) · φ(x_j) = Σ_{l=1}^{D} φ(x_i)^l φ(x_j)^l is the dot product in the D-dimensional space, and φ(x_i)^l is the l-th element/feature in the D-dimensional space.
Show how to calculate the Euclidean distance in the D-dimensional space
\[
\|\phi(x_i) - \phi(x_j)\| = \sqrt{\sum_{l=1}^{D} \bigl(\phi(x_i)^l - \phi(x_j)^l\bigr)^2}
\]

without explicitly calculating the values in the D-dimensional space. For this question,
please provide a formal proof.
Hint: Try converting the Euclidean distance into a set of inner products.
Solution.
\[
\begin{aligned}
\|\phi(x_i) - \phi(x_j)\| &= \sqrt{\sum_{l=1}^{D} \bigl(\phi(x_i)^l - \phi(x_j)^l\bigr)^2} \\
&= \sqrt{\sum_{l=1}^{D} \Bigl[\bigl(\phi(x_i)^l\bigr)^2 + \bigl(\phi(x_j)^l\bigr)^2 - 2\,\phi(x_i)^l \phi(x_j)^l\Bigr]} \\
&= \sqrt{\Bigl(\sum_{l=1}^{D} \bigl(\phi(x_i)^l\bigr)^2\Bigr) + \Bigl(\sum_{l=1}^{D} \bigl(\phi(x_j)^l\bigr)^2\Bigr) - 2\Bigl(\sum_{l=1}^{D} \phi(x_i)^l \phi(x_j)^l\Bigr)} \\
&= \sqrt{\phi(x_i) \cdot \phi(x_i) + \phi(x_j) \cdot \phi(x_j) - 2\,\phi(x_i) \cdot \phi(x_j)} \\
&= \sqrt{K(x_i, x_i) + K(x_j, x_j) - 2 K(x_i, x_j)}.
\end{aligned}
\]
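A sketch of this identity in code (an addition), using the explicit feature map φ(x) = (x, x^2) from Problem 3 so that both sides can be computed and compared:

import numpy as np

phi = lambda x: np.array([x, x**2])      # explicit map from Problem 3
K = lambda a, b: a * b + (a * b) ** 2    # its kernel

def kernel_distance(a, b):
    # Feature-space Euclidean distance computed from kernel values only.
    return np.sqrt(K(a, a) + K(b, b) - 2 * K(a, b))

for a in [-3.0, -1.0, 0.5, 2.0]:
    for b in [-2.0, 0.0, 1.0, 2.5]:
        assert np.isclose(kernel_distance(a, b), np.linalg.norm(phi(a) - phi(b)))
print("kernel distance matches the explicit feature-space distance")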

5. [2 points] Assume that we use the RBF kernel function K(x_i, x_j) = exp(−(1/2)‖x_i − x_j‖^2). Also assume the same notation as in the last question. Prove that for any two input examples x_i and x_j, the squared Euclidean distance of their corresponding points in the high-dimensional space R^D is less than 2, i.e., prove that ‖φ(x_i) − φ(x_j)‖^2 < 2.
Solution. This follows directly from the previous result: ‖φ(x_i) − φ(x_j)‖^2 = K(x_i, x_i) + K(x_j, x_j) − 2K(x_i, x_j) = 2 − 2 exp(−(1/2)‖x_i − x_j‖^2) < 2, since the exponential term is strictly positive.

6. [2 points] Assume that we use the RBF kernel function, and the same notation as before. Consider running One Nearest Neighbor with Euclidean distance in both the input space R^d and the high-dimensional space R^D. Is it possible that the One Nearest Neighbor classifier achieves better classification performance in the high-dimensional space than in the original input space? Why?
Solution. No. By the previous result, the feature-space distance ‖φ(x_i) − φ(x_j)‖ = sqrt(2 − 2 exp(−(1/2)‖x_i − x_j‖^2)) is a strictly increasing function of the input-space distance ‖x_i − x_j‖, so every test point has the same nearest neighbor in both spaces and the two classifiers make identical predictions.
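A small check of this monotonicity argument (an addition, with randomly generated data): the nearest neighbor under the RBF-induced feature-space distance is always the same point as under the input-space Euclidean distance.

import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(20, 3))
test = rng.normal(size=(5, 3))

def euclid(a, b):
    return np.linalg.norm(a - b)

def rbf_feature_distance(a, b):
    # ||phi(a) - phi(b)|| = sqrt(2 - 2 K(a, b)), since K(x, x) = 1 for the RBF kernel
    return np.sqrt(2 - 2 * np.exp(-0.5 * euclid(a, b) ** 2))

for t in test:
    nn_input = min(range(len(train)), key=lambda i: euclid(t, train[i]))
    nn_feature = min(range(len(train)), key=lambda i: rbf_feature_distance(t, train[i]))
    assert nn_input == nn_feature
print("1-NN picks the same neighbor in input space and in the RBF feature space")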

5 Naïve Bayes Classifier and Logistic Regression [20 points]

1. Gaussian Naïve Bayes and Logistic Regression (10 points). Suppose we want to train Gaussian Naïve Bayes to learn a boolean/binary classifier f : X → Y, where X is a vector of n real-valued features, X = <X_1, ..., X_n>, and Y is a boolean class label (i.e., Y = 1 or Y = 0). Recall that in Gaussian Naïve Bayes, we assume all X_i (i = 1, ..., n) are conditionally independent given the class label Y, i.e., P(X_i | Y = k) ∼ N(μ_ik, σ_i) (k = 0, 1; i = 1, ..., n). We also assume that P(Y) follows Bernoulli(θ, 1 − θ) (i.e., P(Y = 1) = θ).

– How many independent model parameters are there in this Gaussian Naïve Bayes classifier?

Solutions: 3n + 1 (for each X_i there are two class-conditional means μ_i0, μ_i1 and one variance σ_i, which is shared between the two class labels, plus the class prior θ).
– Prove that the Gaussian Naïve Bayes assumptions imply that P(Y | X) follows the form
\[
P(Y = 1 \mid X = \langle X_1, \dots, X_n \rangle) = \frac{1}{1 + \exp\bigl(w_0 + \sum_{i=1}^{n} w_i X_i\bigr)}.
\]
In particular, you need to express w_i (i = 0, ..., n) in terms of the model parameters (i.e., θ, μ_ik, σ_i (k = 0, 1; i = 1, ..., n)).
Solutions:
\[
\begin{aligned}
P(Y = 1 \mid X) &= \frac{P(Y = 1) P(X \mid Y = 1)}{P(Y = 1) P(X \mid Y = 1) + P(Y = 0) P(X \mid Y = 0)} \qquad \text{(Bayes rule)} \\
&= \frac{1}{1 + \frac{P(Y = 0) P(X \mid Y = 0)}{P(Y = 1) P(X \mid Y = 1)}} \\
&= \frac{1}{1 + \exp\Bigl(\ln \frac{P(Y = 0) P(X \mid Y = 0)}{P(Y = 1) P(X \mid Y = 1)}\Bigr)} \\
&= \frac{1}{1 + \exp\Bigl(\ln \frac{1 - \theta}{\theta} + \sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)}\Bigr)} \qquad \text{(Naïve Bayes assumption)}
\end{aligned}
\]

Since P(X_i | Y = k) ∼ N(μ_ik, σ_i) (k = 0, 1; i = 1, ..., n), we have
\[
P(X_i = x \mid Y = k) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\Bigl(-\frac{(x - \mu_{ik})^2}{2\sigma_i^2}\Bigr),
\]
which implies
\[
\sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)} = \sum_i \Bigl(\frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}\Bigr).
\]
This completes the proof, with
\[
w_0 = \ln\frac{1 - \theta}{\theta} + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}, \qquad
w_i = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}.
\]
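A numerical sanity check of these expressions (an addition, not part of the original solution): draw random Gaussian Naïve Bayes parameters, compute w_0 and w_i from the formulas above, and compare the resulting sigmoid to the posterior obtained directly from Bayes rule.

import numpy as np

rng = np.random.default_rng(0)
n = 3
theta = 0.6                                   # P(Y = 1)
mu0, mu1 = rng.normal(size=n), rng.normal(size=n)
sigma = rng.uniform(0.5, 2.0, size=n)         # shared per-feature standard deviation

def gaussian(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

# Weights implied by the derivation above.
w0 = np.log((1 - theta) / theta) + np.sum((mu1**2 - mu0**2) / (2 * sigma**2))
w = (mu0 - mu1) / sigma**2

for _ in range(5):
    x = rng.normal(size=n)
    # Direct posterior via Bayes rule with the Naive Bayes factorization.
    p1 = theta * np.prod(gaussian(x, mu1, sigma))
    p0 = (1 - theta) * np.prod(gaussian(x, mu0, sigma))
    posterior = p1 / (p0 + p1)
    logistic = 1.0 / (1.0 + np.exp(w0 + w @ x))
    assert np.isclose(posterior, logistic)
print("GNB posterior matches the logistic form with the derived weights")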

2. Boolean Naïve Bayes and Logistic Regression (10 points). Now consider X being a vector of boolean variables (Y still being the boolean class label). We still want to train a Naïve Bayes classifier.

– How many independent model parameters are there in this boolean Naïve Bayes classifier?
Solutions: 2n + 1 (one parameter θ_ik for each X_i and each class label k = 0, 1, plus the class prior θ).
– Let P(X_i = 1 | Y = k) = θ_ik (k = 0, 1) and P(Y = 1) = θ. Prove that the boolean Naïve Bayes assumptions imply that P(Y | X) follows the form
\[
P(Y = 1 \mid X = \langle X_1, \dots, X_n \rangle) = \frac{1}{1 + \exp\bigl(w_0 + \sum_{i=1}^{n} w_i X_i\bigr)}.
\]
In particular, you need to express w_i (i = 0, ..., n) in terms of the model parameters (i.e., θ, θ_ik (k = 0, 1; i = 1, ..., n)).

Solutions:
Follow the same procedure as above, except that now P(X_i | Y = k) = θ_ik^{X_i} (1 − θ_ik)^{1 − X_i} (k = 0, 1). This leads to
\[
\sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)} = \sum_i \Bigl[\Bigl(\ln\frac{\theta_{i0}}{\theta_{i1}} - \ln\frac{1 - \theta_{i0}}{1 - \theta_{i1}}\Bigr) X_i + \ln\frac{1 - \theta_{i0}}{1 - \theta_{i1}}\Bigr].
\]
This completes the proof, with
\[
w_0 = \ln\frac{1 - \theta}{\theta} + \sum_i \ln\frac{1 - \theta_{i0}}{1 - \theta_{i1}}, \qquad
w_i = \ln\frac{\theta_{i0}}{\theta_{i1}} - \ln\frac{1 - \theta_{i0}}{1 - \theta_{i1}} = \ln\frac{\theta_{i0}(1 - \theta_{i1})}{\theta_{i1}(1 - \theta_{i0})}.
\]
