Homework #1
RELEASE DATE: 09/09/2024
RED CORRECTION: 09/16/2024 06:00
DUE DATE: 10/07/2024, BEFORE 13:00 on GRADESCOPE
QUESTIONS ARE WELCOME ON DISCORD (INFORMALLY) OR VIA EMAIL (FORMALLY).
You will use Gradescope to upload your scanned/printed solutions. For problems marked with (*), please
follow the guidelines on the course website and upload your source code to Gradescope as well. Any
programming language/platform is allowed.
Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail
the class and/or be kicked out of school and/or receive other punishments for such misconduct.
Discussions on course materials and homework solutions are encouraged. But you should write the final
solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but
not copied from.
Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework
solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness
in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will
be punished according to the honesty policy.
You should write your solutions in English or Chinese with the common math notations introduced in
class or in the problems. We do not accept solutions written in any other languages.
This homework set comes with 200 points and 20 bonus points. In general, every homework set comes with a full credit of 200 points, with some possible bonus points.
1. (10 points, auto-graded) Which of the following tasks is best suited for machine learning? Choose
the best answer.
[a] generate an image of Hercules that matches his actual facial look
[b] search for the shortest road path from Taipei to Taichung
[c] summarize any news article to 10 lines
[d] predict whether Schrödinger’s cat is alive or dead inside the box
[e] none of the other choices
2. (10 points, auto-graded) Assume a data set of an even size $N$, with $\frac{N}{2}$ being positive examples and $\frac{N}{2}$ being negative. If each example is used to update $w_t$ in PLA exactly once, what is the resulting $w_0$ in $w_{\mathrm{PLA}}$? Please assume that the initial weight vector $w_0$ is $\mathbf{0}$. Choose the correct answer.
[a] $N$
[b] $\frac{N}{2}$
[c] $0$
[d] $-\frac{N}{2}$
[e] none of the other choices
3. (10 points, auto-graded) Dr. Norman thinks PLA will be highly influenced by very long examples, as $w_t$ changes drastically if $\|x_{n(t)}\|$ is large. Hence, ze decides to preprocess the training data by scaling down each input vector by 2, i.e., $z_n \leftarrow \frac{x_n}{2}$. How does PLA's upper bound on Page 19 of Lecture 2 change with this preprocessing procedure, with respect to the $R$ and $\rho$ that were calculated before scaling? Choose the correct answer.
[a] $\frac{2R^2}{\rho^2}$
[b] $\frac{R^2}{\rho^2}$
[c] $\frac{R^2}{2\rho^2}$
[d] $\frac{R^2}{4\rho^2}$
[e] none of the other choices
4. (10 points, auto-graded) Dr. Norman has another idea of scaling. Instead of scaling by a constant, ze decides to preprocess the training data by normalizing each input vector, i.e., $z_n \leftarrow \frac{x_n}{\|x_n\|}$. How does PLA's upper bound on Page 19 of Lecture 2 change with this preprocessing procedure in terms of $\rho_z = \min_n \frac{y_n w_f^T z_n}{\|w_f\|}$? Choose the correct answer.
[d] $\sqrt{\frac{1}{\rho_z}}$
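For readers who want to experiment with these quantities, here is a minimal Python sketch that computes $R = \max_n \|z_n\|$ and $\rho = \min_n \frac{y_n w_f^T z_n}{\|w_f\|}$ before and after the two preprocessing schemes of Problems 3 and 4. The toy inputs X and the perfect weight vector w_f below are made-up placeholders, not part of the problems, and the sketch is for experimentation only, not the derivation the problems ask for.

    # A toy experiment (not from the homework): inspect how R and rho change
    # under the preprocessing of Problems 3-4. X and w_f are made-up values.
    import numpy as np

    rng = np.random.default_rng(0)
    X = 5.0 * rng.normal(size=(20, 3))      # hypothetical inputs x_n
    w_f = np.array([1.0, -2.0, 0.5])        # hypothetical perfect weights
    y = np.sign(X @ w_f)                    # labels consistent with w_f

    def R_and_rho(Z, y, w_f):
        # R = max_n ||z_n||, rho = min_n y_n w_f^T z_n / ||w_f||
        R = np.max(np.linalg.norm(Z, axis=1))
        rho = np.min(y * (Z @ w_f)) / np.linalg.norm(w_f)
        return R, rho

    print("original:   R=%.3f rho=%.3f" % R_and_rho(X, y, w_f))
    print("halved:     R=%.3f rho=%.3f" % R_and_rho(X / 2.0, y, w_f))   # Problem 3
    Z = X / np.linalg.norm(X, axis=1, keepdims=True)                    # Problem 4
    print("normalized: R=%.3f rho=%.3f" % R_and_rho(Z, y, w_f))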
5. (20 points, human-graded) Go ask any chatGPT-like agent the following question, “what is a
possible application of active learning?”, list the answer that you get, and argue with 10-20 English
sentences on whether you agree with the agent or not, as if you are the “boss” of the agent.
The TAs will grade based on the persuasiveness of your arguments—please note that our TAs are
more used to being persuaded by humans than machines. So if your arguments do not look very
human-written, the TAs may not be persuaded.
6. (20 points, human-graded) Go ask any chatGPT-like agent the following question, “can machine
learning be used to predict earthquakes?”, list the answer that you get, and argue with 10-20
English sentences on whether you agree with the agent or not, as if you are the “boss” of the agent.
The TAs will grade based on the persuasiveness of your arguments—please note that our TAs are
more used to being persuaded by humans than machines. So if your arguments do not look very
human-written, the TAs may not be persuaded.
7. (20 points, human-graded) Before running PLA, our class convention adds $x_0 = 1$ to every $x_n$ vector, forming $x_n = (1, x_n^{\mathrm{orig}})$. Suppose that $x_0 = 2$ is added instead to form $x'_n = (2, x_n^{\mathrm{orig}})$. Consider running PLA on $\{(x_n, y_n)\}_{n=1}^{N}$ in a cyclic manner with the naïve cycle. That is, the algorithm keeps finding the next mistake in the order of $1, 2, \ldots, N, 1, 2, \ldots$. Assume that such a PLA with $w_0 = \mathbf{0}$ returns $w_{\mathrm{PLA}}$, and running PLA on $\{(x'_n, y_n)\}_{n=1}^{N}$ in the same cyclic manner with $w'_0 = \mathbf{0}$ returns $w'_{\mathrm{PLA}}$. Prove or disprove that $w_{\mathrm{PLA}}$ and $w'_{\mathrm{PLA}}$ are equivalent. We define two weight vectors to be equivalent if they return the same binary classification output on every possible example in $\mathbb{R}^d$, the space that every $x^{\mathrm{orig}}$ belongs to. Please take any deterministic convention for $\mathrm{sign}(0)$, for example setting $\mathrm{sign}(0) = 1$.
8. (20 points, human-graded) Before running PLA, our class convention adds $x_0 = 1$ to every $x_n$ vector, forming $x_n = (1, x_n^{\mathrm{orig}})$. Suppose that we scale every $x_n^{\mathrm{orig}}$ by 3, and $x_0 = 3$ is added instead to form $x'_n = (3, 3x_n^{\mathrm{orig}})$. Consider running PLA on $\{(x_n, y_n)\}_{n=1}^{N}$ in a cyclic manner with the naïve cycle. That is, the algorithm keeps finding the next mistake in the order of $1, 2, \ldots, N, 1, 2, \ldots$. Assume that such a PLA with $w_0 = \mathbf{0}$ returns $w_{\mathrm{PLA}}$, and running PLA on $\{(x'_n, y_n)\}_{n=1}^{N}$ in the same cyclic manner with $w'_0 = \mathbf{0}$ returns $w'_{\mathrm{PLA}}$. Prove or disprove that $w_{\mathrm{PLA}}$ and $w'_{\mathrm{PLA}}$ are equivalent. Similar to Problem 7, please take any deterministic convention for $\mathrm{sign}(0)$, for example setting $\mathrm{sign}(0) = 1$.
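The naïve-cycle PLA described in Problems 7 and 8 can also be explored empirically. Below is a minimal sketch, assuming a small hand-made linearly separable toy set; the parameters x0 and scale are hypothetical knobs that reproduce the class convention (x0=1, scale=1), the Problem 7 variant (x0=2), and the Problem 8 variant (x0=3, scale=3). An empirical check on one toy set is of course not a substitute for the requested proof or disproof.

    # Naive-cycle PLA on a made-up toy set; x0 and scale are hypothetical
    # knobs for the Problem 7/8 variants. Empirical check only, not a proof.
    import numpy as np

    def cyclic_pla(X_orig, y, x0=1.0, scale=1.0):
        # prepend the constant coordinate x0 and scale the original coordinates
        X = np.hstack([np.full((len(X_orig), 1), x0), scale * X_orig])
        w = np.zeros(X.shape[1])                  # initial weight vector w_0 = 0
        sign = lambda s: 1.0 if s >= 0 else -1.0  # convention: sign(0) = +1
        while True:
            mistaken = False
            for xn, yn in zip(X, y):              # visit examples in order 1..N, cyclically
                if sign(w @ xn) != yn:
                    w = w + yn * xn               # PLA update on the next mistake
                    mistaken = True
            if not mistaken:                      # a full pass without mistakes: halt
                return w

    # hypothetical linearly separable toy data
    X_orig = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0], [-2.0, 1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    print(cyclic_pla(X_orig, y, x0=1.0))                # class convention
    print(cyclic_pla(X_orig, y, x0=2.0))                # Problem 7 variant
    print(cyclic_pla(X_orig, y, x0=3.0, scale=3.0))     # Problem 8 variant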
9. (20 points, human-graded) Consider online hatred article detection with machine learning. We will represent each article $x$ by the distinct words that it contains. In particular, assume that there are at most $m$ distinct words in each article, and each word belongs to a big dictionary of size $d \geq m$. The $i$-th component $x_i$ is defined as $[\![\text{word } i \text{ is in article } x]\!]$ for $i = 1, 2, \ldots, d$, and $x_0 = 1$ as always. We will assume that $d_+$ of the words in the dictionary are more hatred-like, and $d_- = d - d_+$ of the words are less hatred-like. A simple function that classifies whether an article is a hatred article is to count $z_+(x)$, the number of more hatred-like words in the article (ignoring duplicates), and $z_-(x)$, the number of less hatred-like words in the article, and classify by
$$f(x) = \mathrm{sign}\big(z_+(x) - z_-(x) - 4\big).$$
That is, an article $x$ is classified as a hatred article iff the integer $z_+(x)$ is more than the integer $z_-(x)$ by 4.
Assume that $f$ can perfectly classify any article into hatred/non-hatred, but is unknown to us. We now run an online version of the Perceptron Learning Algorithm (PLA) to try to approximate $f$. That is, we maintain a weight vector $w_t$ in the online PLA, initialized with $w_0 = \mathbf{0}$. Then for every article $x_t$ encountered at time $t$, the algorithm makes a prediction $\mathrm{sign}(w_t^T x_t)$, and receives a true label $y_t$. If the prediction is not the same as the true label (i.e., a mistake), the algorithm updates $w_t$ by
$$w_{t+1} \leftarrow w_t + y_t x_t.$$
Otherwise, the algorithm keeps $w_t$ without updating:
$$w_{t+1} \leftarrow w_t.$$
Derive an upper bound on the maximum number of mistakes that the online PLA can make for
this hatred article classification problem. The tightness of your upper bound will be taken into
account during grading.
Note: For those who know the bag-of-words representation for documents, the representation we
use is a simplification that ignores duplicates of the same word.
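To make the online protocol above concrete, here is a minimal sketch that simulates it on hypothetical binary bag-of-words articles. The sizes d, d_plus, m, T and the random article generator are made-up toy values, and the target f below uses one concrete reading of the "more than ... by 4" threshold. Counting mistakes this way may help you sanity-check the tightness of your derived bound, but it is not part of the requested derivation.

    # Simulate the online PLA above on hypothetical binary bag-of-words
    # articles; all sizes are made-up toy values.
    import numpy as np

    rng = np.random.default_rng(1)
    d, d_plus, m, T = 20, 12, 10, 2000      # toy dictionary/article sizes

    def make_article():
        # random article with m distinct words, plus the constant coordinate
        x = np.zeros(d + 1)
        x[0] = 1.0                          # x_0 = 1 as always
        x[1 + rng.choice(d, size=m, replace=False)] = 1.0
        return x

    def f(x):
        # one concrete reading of the target: hatred iff z+ exceeds z- by more than 4
        z_plus = x[1:1 + d_plus].sum()      # count of more hatred-like words
        z_minus = x[1 + d_plus:].sum()      # count of less hatred-like words
        return 1.0 if z_plus - z_minus - 4 > 0 else -1.0

    w = np.zeros(d + 1)                     # w_0 = 0 (the zero vector)
    mistakes = 0
    for t in range(T):
        x = make_article()
        y = f(x)
        if np.sign(w @ x) != y:             # prediction differs from true label
            w = w + y * x                   # PLA update on a mistake
            mistakes += 1
    print("mistakes made:", mistakes)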
10. (20 points, human-graded) Next, we use a real-world data set to study PLA. Please download the RCV1.binary (training) data set at
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2
and take the first $N = 200$ lines as our data set. Each line of the data set contains one $(x_n, y_n)$ in the LIBSVM format, with $x_n \in \mathbb{R}^{47205}$. The first number of the line is $y_n$, and the rest of the line is $x_n$ represented in the sparse format that LIBSVM uses:
https://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#/Q03:_Data_preparation
First, prove that $w_{t+1}$ always correctly classifies $(x_{n(t)}, y_{n(t)})$ after the update. Second, prove that such a PLA halts with a perfect hyperplane if the data is linearly separable. (Hint: Check Problem 12.)