Assignment 1
Suppose that in your coin flip experiment, you observed $\alpha_H$ heads and $\alpha_T$ tails. Let $\theta$ denote the probability of observing heads, whose prior distribution follows Beta$(\beta_H, \beta_T)$, where $\beta_H$ and $\beta_T$ are two positive parameters. Prove that the posterior distribution $P(\theta \mid D)$ ($D$ denotes the observed coin flips) follows Beta$(\beta_H + \alpha_H, \beta_T + \alpha_T)$. What is the mean of $P(\theta \mid D)$? What is the MAP estimator $\hat{\theta}_{MAP}$ of $\theta$?
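As a starting point, recall the Beta density and Bayes' rule (standard definitions, restated here for convenience):

$$P(\theta) = \frac{\theta^{\beta_H - 1} (1 - \theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)}, \qquad P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta).$$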
For this question, assume that $x_1, \ldots, x_N \in \mathbb{R}$ are i.i.d. samples drawn from the same underlying distribution, which is Gaussian $\mathcal{N}(\mu, \sigma^2)$.
1. (5 points) Let $\hat{\mu}_{MLE}$ denote the MLE estimator of $\mu$. Please prove that $\hat{\mu}_{MLE}$ is unbiased.
Hint: The bias of an estimator of the parameter $\mu$ is defined to be the difference between the expected value of the estimator and $\mu$.
2. (10 points) If the true value of $\mu$ is unknown, then the MLE estimator of $\sigma^2$ is as follows.

$$\hat{\sigma}^2_{MLE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu}_{MLE})^2$$

Please prove that $\hat{\sigma}^2_{MLE}$ is biased.
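Though not a substitute for the proof, a quick simulation makes the bias visible. This is only an illustrative sketch; the settings $\mu = 0$, $\sigma^2 = 1$, $N = 5$, and the trial count are arbitrary choices, not part of the assignment:

```python
import numpy as np

# Empirically estimate E[sigma^2_MLE] for Gaussian samples.
# Illustrative settings (assumed, not from the assignment): mu=0, sigma=1, N=5.
rng = np.random.default_rng(0)
mu, sigma, N, trials = 0.0, 1.0, 5, 100_000

x = rng.normal(mu, sigma, size=(trials, N))
mu_mle = x.mean(axis=1, keepdims=True)          # MLE of mu in each trial
sigma2_mle = ((x - mu_mle) ** 2).mean(axis=1)   # MLE of sigma^2 in each trial

# Averages to about 0.8 = (N-1)/N * sigma^2 rather than 1.0: the estimator is biased.
print(sigma2_mle.mean())
```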
Given the training data set shown in Figure 1, we train a Naïve Bayes classifier with it. Each row refers to a person, where the categorical features (age, income, etc.) and the class label (whether he/she buys a computer) are shown.
1. (5 points) How many independent parameters would there be for the Naïve Bayes classifier trained with this data? What are they? Justify your answers.
2. (10 points) Using standard MLE, what are the estimated values for these parameters?
3. (5 points) Given a new person with features $x = (\text{youth}, \text{medium}, \text{yes}, \text{fair})$, what is $P(y = \text{yes} \mid x)$? Would the Naïve Bayes classifier predict $y = \text{yes}$ or $y = \text{no}$ for this person?
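For part 3, recall the standard Naïve Bayes prediction rule (stated here for reference), where the product runs over the four features:

$$P(y \mid x) \propto P(y) \prod_{j} P(x_j \mid y), \qquad \hat{y} = \arg\max_{y \in \{\text{yes},\, \text{no}\}} P(y) \prod_{j} P(x_j \mid y).$$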
Figure 1: Training Data for Naïve Bayes Classifier
Suppose we have two positive examples $x_1 = (1, 0)$ and $x_2 = (0, -1)$ and two negative examples $x_3 = (0, 1)$ and $x_4 = (-1, 0)$. Apply the standard gradient ascent method to train a logistic regression classifier (without any regularization terms). Initialize the weight vector with two different values, setting $w_0^{(0)} = 0$ (e.g. $w^{(0)} = (0, 0, 0)'$, $w^{(0)} = (0, 1, 0)'$). Would the final weight vector $w^*$ be the same for the two different initial values? What are the values? Please explain your answer. You may assume the learning rate to be a positive real constant $\eta$.
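A minimal sketch of the update, assuming the usual conditional log-likelihood objective and a bias feature prepended to each example; the values of $\eta$ and the step count below are illustrative choices, not prescribed by the question:

```python
import numpy as np

# Each row is (1, x1, x2); the leading 1 carries the bias weight w0.
X = np.array([[1, 1, 0], [1, 0, -1],    # positive examples (y = 1)
              [1, 0, 1], [1, -1, 0]])   # negative examples (y = 0)
y = np.array([1, 1, 0, 0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(w, eta=0.1, steps=10_000):
    # Gradient of the conditional log-likelihood: X^T (y - P(y=1|x)).
    for _ in range(steps):
        w = w + eta * X.T @ (y - sigmoid(X @ w))
    return w

print(gradient_ascent(np.zeros(3)))             # w^(0) = (0, 0, 0)'
print(gradient_ascent(np.array([0., 1., 0.])))  # w^(0) = (0, 1, 0)'
```

Watch how the weight vectors behave as `steps` grows; your written answer should explain what you observe.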
1. (5 points) Gaussian Naïve Bayes and Logistic Regression. Suppose a logistic regression model and a Gaussian Naïve Bayes classifier are trained for a binary classification task $f: X \to Y$, where $X = \langle X_1, \ldots, X_d \rangle \in \mathbb{R}^d$ is a vector of real-valued features and $Y = \{0, 1\}$ is the binary label. After training, we get the weight vector $w = \langle w_0, w_1, \ldots, w_d \rangle$ for the logistic regression model.
Recall that in Gaussian Naïve Bayes, each feature $X_i$ ($i = 1, \ldots, d$) is assumed to be conditionally independent given the label $Y$, so that $P(X_i \mid Y = k) = \mathcal{N}(\mu_{ik}, \sigma_{ik})$ ($k = 0, 1$; $i = 1, \ldots, d$). We assume that the marginal distribution of class labels $P(Y)$ follows a Bernoulli distribution with parameter $\theta$, i.e. $P(Y = 1) = \theta$ and $P(Y = 0) = 1 - \theta$.
– How many independent parameters are there in this Gaussian Naïve Bayes classifier? What are they?
– Can we translate $w$ into the parameters of an equivalent Gaussian Naïve Bayes classifier without any extra assumption? If so, justify your answer. Otherwise, please specify what extra assumption(s) you need to complete the translation and explain why.
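For the translation question, it may help to recall the posterior form that logistic regression assumes, which any candidate translation has to match:

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\!\left(-\left(w_0 + \sum_{i=1}^{d} w_i X_i\right)\right)}.$$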
2. (25 points) Implementation of Gaussian Naïve Bayes and Logistic Regression. Compare the two approaches on the bank note authentication dataset, which can be downloaded from http://archive.ics.uci.edu/ml/datasets/banknote+authentication. A complete description of the dataset can also be found on this webpage. In short, for each row the first four columns are the feature values and the last column is the class label (0 or 1). You will observe learning curves similar to those Dr. He mentioned in class. Implement a Gaussian Naïve Bayes classifier (recall the conditional independence assumption mentioned before) and a logistic regression classifier. Please write your own code from scratch and do NOT use existing functions or packages which can provide you the Naïve Bayes classifier/logistic regression class or fit/predict function (e.g. sklearn). But you can use some basic linear algebra/probability functions (e.g. numpy.sqrt(), numpy.random.normal()). For the Naïve Bayes classifier, assume that $P(x_i \mid y) \sim \mathcal{N}(\mu_{i,k}, \sigma_{i,k})$, where $x_i$ is a feature in the bank note data and $y$ is the class label. Use three-fold cross-validation to split the data and train/test your models. A minimal illustrative sketch of the Gaussian Naïve Bayes steps appears after the requirements below.
– (5 points) For each algorithm: briefly describe how you implement it by giving the pseudocode. The pseudocode must include equations for estimating the model parameters and for classifying a new example. Remember, this should not be a printout of your code, but a high-level outline description. Include the pseudocode in your pdf file (or .doc/.docx file). Submit the actual code as a single zip file named yourFirstName-yourLastName.zip IN ADDITION TO the pdf file (or .doc/.docx file).
– (10 points) Plot a learning curve: the accuracy vs. the size of the training set. Plot 6 points for the curve, using [.01 .02 .05 .1 .625 1] RANDOM fractions of your training set and testing on the full test set each time. Average your results over 5 runs for each random fraction (e.g. 0.05) of the training set. Plot both the Naïve Bayes and logistic regression learning curves on the same figure. For logistic regression, do not use any regularization term.
– (10 points) Show the power of the generative model: use your trained Naïve Bayes classifier (with the complete training set) to generate 400 examples from class $y = 1$. Report the mean and variance of the generated examples (for each fold, over 1 run) and compare them with the mean and variance of the corresponding training data (examples in the training set with $y = 1$). Try to explain what you observe in this comparison.
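As referenced above, here is a minimal, illustrative sketch of the Gaussian Naïve Bayes estimation and prediction steps under the assumptions in this question. It is not a template for your submission, and all function and variable names are hypothetical:

```python
import numpy as np

def fit_gnb(X, y):
    # MLE of the class prior and of the per-class, per-feature
    # Gaussian parameters mu_{i,k} and sigma_{i,k}^2.
    params = {}
    for k in (0, 1):
        Xk = X[y == k]
        params[k] = {
            "prior": len(Xk) / len(X),   # P(y = k)
            "mu": Xk.mean(axis=0),       # mean of each feature given y = k
            "var": Xk.var(axis=0),       # MLE variance of each feature given y = k
        }
    return params

def predict_gnb(params, X):
    # Pick the class maximizing log P(y) + sum_i log N(x_i; mu_{i,k}, sigma_{i,k}^2).
    scores = []
    for k in (0, 1):
        p = params[k]
        log_lik = -0.5 * (np.log(2 * np.pi * p["var"])
                          + (X - p["mu"]) ** 2 / p["var"]).sum(axis=1)
        scores.append(np.log(p["prior"]) + log_lik)
    return np.argmax(np.stack(scores, axis=1), axis=1)
```

Generating examples from class $y = 1$ for the last part then amounts to drawing numpy.random.normal(mu, sqrt(var)) for each feature, which uses only the basic functions permitted above.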