
Machine Learning (NTU, Fall 2024) instructor: Hsuan-Tien Lin

Homework #1
RELEASE DATE: 09/09/2024
RED CORRECTION: 09/16/2024 06:00
DUE DATE: 10/07/2024, BEFORE 13:00 on GRADESCOPE
QUESTIONS ARE WELCOMED ON DISCORD (INFORMALLY) OR VIA EMAILS (FORMALLY).

You will use Gradescope to upload your scanned/printed solutions. For problems marked with (*), please
follow the guidelines on the course website and upload your source code to Gradescope as well. Any
programming language/platform is allowed.
Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail
the class and/or be kicked out of school and/or receive other punishments for such misconduct.
Discussions on course materials and homework solutions are encouraged. But you should write the final
solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but
not copied from.
Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework
solutions and/or source code to your classmates at any time. In order to maximize the level of fairness
in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will
be punished according to the honesty policy.
You should write your solutions in English or Chinese with the common math notations introduced in
class or in the problems. We do not accept solutions written in any other languages.

This homework set comes with 200 points and 20 bonus points. In general, every homework
set comes with a full credit of 200 points, plus some possible bonus points.

1. (10 points, auto-graded) Which of the following tasks is best suited for machine learning? Choose
the best answer.
[a] generate an image of Hercules that matches his actual facial look
[b] search for the shortest road path from Taipei to Taichung
[c] summarize any news article to 10 lines
[d] predict whether Schrödinger’s cat is alive or dead inside the box
[e] none of the other choices
2. (10 points, auto-graded) Assume that a data set of an even size N contains N/2 positive examples
and N/2 negative examples. Suppose that each example is used to update wt in PLA exactly once,
and that the initial weight vector w0 is 0. What is the resulting w0 (i.e., the zeroth component) in
wPLA? Choose the correct answer.

[a] N
[b] N/2
[c] 0
[d] −N/2
[e] none of the other choices

3. (10 points, auto-graded) Dr. Norman thinks PLA will be highly influenced by very long examples,
as wt changes drastically if ∥xn(t)∥ is large. Hence, ze decides to preprocess the training data
by scaling down each input vector by 2, i.e., zn ← xn/2. How does PLA's upper bound on Page
19 of Lecture 2 change with this preprocessing procedure, with respect to the R and ρ that were
calculated before scaling? Choose the correct answer.

[a] 2R²/ρ²
[b] R²/ρ²
[c] R²/(2ρ²)
[d] R²/(4ρ²)
[e] none of the other choices
4. (10 points, auto-graded) Dr. Norman has another idea of scaling. Instead of scaling by a constant,
ze decides to preprocess the training data by normalizing each input vector, i.e., zn ← xn/∥xn∥. How
does PLA's upper bound on Page 19 of Lecture 2 change with this preprocessing procedure, in
terms of ρz = minn (yn wfT zn)/∥wf∥? Choose the correct answer.

[a] ∞ (i.e., PLA might never terminate)
[b] 1/ρz²
[c] 1/(2ρz)
[d] 1/√ρz
[e] none of the other choices

5. (20 points, human-graded) Go ask any chatGPT-like agent the following question, “what is a
possible application of active learning?”, list the answer that you get, and argue with 10-20 English
sentences on whether you agree with the agent or not, as if you are the “boss” of the agent.
The TAs will grade based on the persuasiveness of your arguments—please note that our TAs are
more used to being persuaded by humans than machines. So if your arguments do not look very
human-written, the TAs may not be persuaded.

6. (20 points, human-graded) Go ask any chatGPT-like agent the following question, “can machine
learning be used to predict earthquakes?”, list the answer that you get, and argue with 10-20
English sentences on whether you agree with the agent or not, as if you are the “boss” of the agent.
The TAs will grade based on the persuasiveness of your arguments—please note that our TAs are
more used to being persuaded by humans than machines. So if your arguments do not look very
human-written, the TAs may not be persuaded.
7. (20 points, human-graded) Before running PLA, our class convention adds x0 = 1 to every xn
vector, forming xn = (1, xn^orig). Suppose that x0 = 2 is added instead, to form x′n = (2, xn^orig).
Consider running PLA on {(xn, yn)}, n = 1, . . . , N, in a cyclic manner with the naïve cycle. That is,
the algorithm keeps finding the next mistake in the order of 1, 2, . . . , N, 1, 2, . . .. Assume that such a
PLA with w0 = 0 returns wPLA, and running PLA on {(x′n, yn)}, n = 1, . . . , N, in the same cyclic
manner with w′0 = 0 returns w′PLA. Prove or disprove that wPLA and w′PLA are equivalent. We define
two weight vectors to be equivalent if they return the same binary classification output on every possible
example in R^d, the space that every x^orig belongs to. Please take any deterministic convention for
sign(0), for example setting sign(0) = 1.

8. (20 points, human-graded) Before running PLA, our class convention adds x0 = 1 to every xn
vector, forming xn = (1, xn^orig). Suppose that we scale every xn by 3, and x0 = 3 is added instead,
to form x′n = (3, 3 xn^orig). Consider running PLA on {(xn, yn)}, n = 1, . . . , N, in a cyclic manner
with the naïve cycle. That is, the algorithm keeps finding the next mistake in the order of
1, 2, . . . , N, 1, 2, . . .. Assume that such a PLA with w0 = 0 returns wPLA, and running PLA on
{(x′n, yn)}, n = 1, . . . , N, in the same cyclic manner with w′0 = 0 returns w′PLA. Prove or disprove
that wPLA and w′PLA are equivalent. Similar to Problem 7, please take any deterministic convention
for sign(0), for example setting sign(0) = 1.
9. (20 points, human-graded) Consider online hatred-article detection with machine learning. We will
represent each article x by the distinct words that it contains. In particular, assume that there are
at most m distinct words in each article, and each word belongs to a big dictionary of size d ≥ m.
The i-th component xi is defined as ⟦word i is in article x⟧ for i = 1, 2, . . . , d, and x0 = 1 as always.
We will assume that d+ of the words in the dictionary are more hatred-like, and d− = d − d+ of
the words are less hatred-like. A simple function that classifies whether an article is a hatred article
counts z+(x), the number of more hatred-like words in the article (ignoring duplicates), and z−(x),
the number of less hatred-like words in the article, and classifies by

f(x) = sign(z+(x) − z−(x) − 3.5).

That is, an article x is classified as a hatred article iff the integer z+(x) is more than the integer
z−(x) by at least 4.
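
(For concreteness, the representation above can be sketched as follows in Python with NumPy; the
encode name and the word_index dictionary, which maps each dictionary word to its index, are
hypothetical and only for illustration.)

    import numpy as np

    def encode(article_words, word_index, d):
        """Binary representation sketched above: xi = 1 iff word i is in the article."""
        x = np.zeros(d + 1)
        x[0] = 1.0                           # x0 = 1 as always
        for word in set(article_words):      # set() ignores duplicate words
            i = word_index.get(word)         # hypothetical map: word -> dictionary index
            if i is not None:
                x[i] = 1.0
        return x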
Assume that f can perfectly classify any article into hatred/non-hatred, but is unknown to us. We
now run an online version of Perceptron Learning Algorithm (PLA) to try to approximate f . That
is, we maintain a weight vector wt in the online PLA, initialized with w0 = 0. Then for every
article xt encountered at time t, the algorithm makes a prediction sign(wtT xt ), and receives a true
label yt . If the prediction is not the same as the true label (i.e. a mistake), the algorithm updates
wt by
wt+1 ← wt + yt xt .
Otherwise, the algorithm keeps wt without updating:

wt+1 ← wt .
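
(As a minimal illustration, assuming NumPy arrays for wt and xt, the mistake-driven step described
above can be sketched as follows; the function name is only for illustration.)

    import numpy as np

    def online_pla_step(w, x, y):
        """One online PLA step: predict sign(w^T x); update w <- w + y x only on a mistake."""
        prediction = 1.0 if w.dot(x) > 0 else -1.0   # any deterministic sign(0) convention
        if prediction != y:                          # a mistake
            w = w + y * x
        return w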

Derive an upper bound on the maximum number of mistakes that the online PLA can make for
this hatred article classification problem. The tightness of your upper bound will be taken into
account during grading.
Note: For those who know the bag-of-words representation for documents, the representation we
use is a simplification that ignores duplicates of the same word.

10. (20 points, human-graded) Next, we use a real-world data set to study PLA. Please download the
RCV1.binary (training) data set at
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2

and take the first N = 200 lines as our data set. Each line of the data set contains one (xn, yn) in
the LIBSVM format, with xn ∈ R^47205. The first number of the line is yn, and the rest of the line
is xn represented in the sparse format that LIBSVM uses:
https://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#/Q03:_Data_preparation

Please initialize your algorithm with w = 0 and take sign(0) as −1.


Please first follow page 4 of Lecture 2, and add x0 = 1 to every xn . Implement a version of PLA that
randomly picks an example (xn , yn ) in every iteration, and updates wt if and only if wt is incorrect
on the example. Note that the random picking can be simply implemented with replacement—that
is, the same example can be picked multiple times, even consecutively. Stop updating and return
wt as wPLA once wt has been correct on 5N consecutively checked, randomly-picked examples.
Hint: You can simply follow the algorithm above to solve this problem. But if you are interested
in knowing why the algorithm above is somewhat equivalent to the PLA algorithm that you learned
in class, here is some more information. (1) The update procedure described above is equivalent to
the procedure of gathering all the incorrect examples first and then randomly picking an example
among the incorrect ones. But the description above is usually much easier to implement. (2) The
stopping criterion above is a randomized, more efficient implementation of checking whether wt
makes no mistakes on the data set. Passing 5N consecutive correctness checks means that wt is
mistake-free with probability greater than 99%.
Repeat your experiment 1000 times, each with a different random seed. Plot a histogram to
visualize the distribution of the number of updates needed before returning wPLA . Describe your
findings. Then, provide the first page of the snapshot of your code as a proof that you have written
the code.
(Note: As a general principle, you can use any plotting software outside your data processing and
machine learning code.)
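
(The following is a minimal sketch of this problem's procedure in Python with NumPy. It assumes the
downloaded file has been decompressed to a plain-text file, uses dense vectors with the dimension
47205 stated above, and the function names are only for illustration.)

    import numpy as np

    def load_first_lines(path, n_lines=200, dim=47205):
        """Parse the first n_lines of a LIBSVM-format file, adding x0 = 1 to every xn."""
        X, y = [], []
        with open(path) as f:
            for _ in range(n_lines):
                parts = f.readline().split()
                y.append(float(parts[0]))            # the first number of the line is yn
                x = np.zeros(dim + 1)
                x[0] = 1.0                           # the added x0 = 1
                for item in parts[1:]:               # the rest is xn in sparse index:value form
                    idx, val = item.split(":")
                    x[int(idx)] = float(val)
                X.append(x)
        return np.array(X), np.array(y)

    def random_pick_pla(X, y, rng):
        """Random-pick PLA: update on a mistake; stop after 5N consecutive correct checks."""
        N = len(y)
        w = np.zeros(X.shape[1])
        updates, streak = 0, 0
        while streak < 5 * N:
            n = rng.integers(N)
            s = 1.0 if X[n].dot(w) > 0 else -1.0     # sign(0) taken as -1, as required
            if s != y[n]:
                w = w + y[n] * X[n]
                updates += 1
                streak = 0
            else:
                streak += 1
        return w, updates

    # one of the 1000 experiments, with its own seed:
    # X, y = load_first_lines("rcv1_train.binary")
    # w, updates = random_pick_pla(X, y, np.random.default_rng(seed=0))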
11. (20 points, human-graded) When running the 1000 experiments above, record ∥wt ∥ as a function
of t. Plot the ∥wt ∥ in each experiment as a function of t for t = 1, 2, . . . , Tmin , where Tmin is the
smallest number of updates in the previous problem. Superpose the 1000 functions on the same
figure. Describe your findings. Then, provide the first page of the snapshot of your code as a proof
that you have written the code.
12. (20 points, code needed, human-graded) Modify your PLA above to a variant that keeps correcting
the same example until it is perfectly classified. That is, when selecting an incorrect example
(xn(t) , yn(t) ) for updating, the algorithm keeps using that example (that is, n(t + 1) = n(t)) to
update until the weight vector perfectly classifies the example (and each update counts!). Repeat
the 1000 experiments above. Plot a histogram to visualize the distribution of the number of updates
needed before returning wPLA . What is the median number of updates? Compare your result to
that of Problem 10. Describe your findings. Then, provide the first page of the snapshot of your
code as a proof that you have written the code.
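
(A minimal sketch of this variant, assuming the same data loading as in the Problem 10 sketch; only
the inner update changes, and every repeated update on the same example is counted.)

    import numpy as np

    def random_pick_pla_repeat(X, y, rng):
        """Problem 12 variant: keep updating on the picked example until it is classified correctly."""
        N = len(y)
        w = np.zeros(X.shape[1])
        updates, streak = 0, 0
        while streak < 5 * N:
            n = rng.integers(N)
            s = 1.0 if X[n].dot(w) > 0 else -1.0      # sign(0) taken as -1
            if s != y[n]:
                streak = 0
                while True:                           # n(t+1) = n(t): stay on example n
                    w = w + y[n] * X[n]
                    updates += 1
                    if (1.0 if X[n].dot(w) > 0 else -1.0) == y[n]:
                        break
            else:
                streak += 1
        return w, updates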
13. (Bonus 20 points, human-graded) When PLA makes an update on a misclassified example (xn(t) , yn(t) ),
the new weight vector wt+1 does not always classify (xn(t) , yn(t) ) correctly. Consider a variant of
PLA that makes an update by
wt+1 ← wt + yn(t) xn(t) · (1/10) ⌊ (−10 yn(t) wtT xn(t)) / ∥xn(t)∥² + 1 ⌋ .

First, prove that wt+1 always correctly classifies (xn(t) , yn(t) ) after the update. Second, prove
that such a PLA halts with a perfect hyperplane if the data is linearly separable. (Hint: Check
Problem 12.)
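
(The following is not a proof, only a quick numerical sanity check of the first claim, written against
the reconstructed update rule above.)

    import numpy as np

    rng = np.random.default_rng(0)
    for _ in range(10000):
        w = rng.normal(size=5)
        x = rng.normal(size=5)
        y = -1.0 if rng.random() < 0.5 else 1.0
        if y * w.dot(x) > 0:                     # only misclassified examples trigger an update
            continue
        scale = np.floor(-10 * y * w.dot(x) / x.dot(x) + 1) / 10
        w_next = w + y * scale * x
        assert y * w_next.dot(x) > 0             # the new weights classify (x, y) correctly
    print("all sampled updates classified the triggering example correctly")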
