Final 2018
Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e., 4 sides); no other
materials.
Time: 2 hours.
Be sure to write your name and Penn student ID (the 8 bigger digits on your ID card) on the
bubble form and fill in the associated bubbles in pencil.
If you are taking this as a WPE, then enter only your WPE number and fill in the associated
bubbles, and do not write your name.
If you think a question is ambiguous, mark what you think is the single best answer. The questions
seek to test your general understanding; they are not intentionally “trick questions.” As always,
we will consider written regrade requests if your interpretation of a question differed from what we
intended. We will only grade the bubbled answer key.
For the “TRUE or FALSE” questions, note that “TRUE” is (a) and “FALSE” is (b). For the
multiple choice questions, select exactly one answer.
Name:
CIS520 Final, Fall 2018 2
1. [1 point] True or False? Both PCA and linear regression can be thought of as algorithms for
minimizing a sum of squared errors.
2. [1 point] True or False? The eigenvector of the covariance matrix with the largest eigenvalue
is the direction of minimum variance in the data.
3. [1 point] True or False? The non-zero eigenvalues of AA^T and A^T A are the same.
(a) Eigenvectors of A
(b) Eigenvectors of (A^T A)^{-1} A^T A
(c) Eigenvectors of AA^T
(d) Eigenvectors of A^T A
[Figure: network over the variables A, B, C, D, E, F, G, H, I, J, K, L; the edges are not recoverable from this copy.]
9. [1 point] True or False? C ⊥⊥ D | F
10. [1 point] True or False? D ⊥⊥ I | E, F, K
11. [1 point] True or False? C ⊥⊥ J | A, F, L
12. [1 point] True or False? F ⊥⊥ L | G
13. [1 point] True or False? ¬(G ⊥⊥ E | D, K)
14. [1 point] True or False? I d-separates E and L
15. [2 points] What is the minimum number of parameters needed to represent the full joint
probability P (A, B, C, D, E, F, G, H, I, J, K, L) in the above network if all the variables are
binary?
(a) 4095
(b) 20
(c) 23
(d) 24
(e) 29
16. [2 points] Consider the following objective function for a GAN, where G(·) represents a generator
that generates a p-dimensional example given a latent variable z drawn from p(z), and
D(·) is a discriminator that outputs a prediction for the probability that a p-dimensional example
has been drawn from the true dataset, which has density function p_data(x).
(a) The D is chosen to maximize V (G, D), and G is chosen to minimize V (G, D).
(b) The G is chosen to maximize V (G, D), and D is chosen to minimize V (G, D).
(c) The objective function is incorrect since the GAN formulation assumes z is p-dimensional.
(d) The objective function is incorrect since the GAN formulation assumes z is completely
random, rather than being drawn from some distribution.
(e) None of the above.
17. [2 points] A real estate tycoon has employed you to assist with home sale negotiations. Your
employer would like you to build a model to predict the counter-offer the opposing party
will make in each round of bargaining given some features about the home and the values
of counter-offers at all earlier rounds of bargaining. Suppose there are at most 4 rounds of
bargaining. Which deep learning architecture best matches the structure of the problem?
18. [2 points] You are processing the data of a survey where people have the option to report their
income. We know that people with extremely high or low income are less likely to report
their incomes. What is the best way to deal with the missing data?
(a) Impute (replace) the missing data with the mode of the reported income
(b) Impute (replace) the missing data with the mean of the reported income
(c) Replace the missing income with "0" and add an extra column indicating whether or
not the data is missing
(d) Fill in the missing data with values randomly drawn from the reported values
19. [1 point] True or False? Consider an MDP (Markov decision process), M = {S, A, p, r, γ}. If
there are |S| states and |A| possible actions in total, then at each iteration policy evaluation
takes O(|S|^2) time, while value iteration takes O(|S|^2 |A|) time.
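The per-iteration costs named in this question can be illustrated by counting the multiply-add steps in one sweep of each method. The following is only a sketch: the two-state, two-action MDP (`P`, `R`), the fixed deterministic policy, and the function names are hypothetical and are not part of the exam.

```python
def policy_evaluation_sweep(P, R, V, gamma, policy):
    """One synchronous backup of V under a fixed deterministic policy."""
    ops = 0
    new_V = []
    for s in range(len(V)):
        a = policy[s]                      # one action per state
        v = R[s][a]
        for s2 in range(len(V)):           # |S| successor states
            v += gamma * P[s][a][s2] * V[s2]
            ops += 1
        new_V.append(v)
    return new_V, ops


def value_iteration_sweep(P, R, V, gamma):
    """One synchronous Bellman optimality backup over all actions."""
    ops = 0
    new_V = []
    for s in range(len(V)):
        best = float("-inf")
        for a in range(len(P[s])):         # |A| actions
            q = R[s][a]
            for s2 in range(len(V)):       # |S| successor states
                q += gamma * P[s][a][s2] * V[s2]
                ops += 1
            best = max(best, q)
        new_V.append(best)
    return new_V, ops


# Hypothetical toy MDP: P[s][a][s2] is a transition probability, R[s][a] a reward.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.5, 0.5], [1.0, 0.0]]]
R = [[0, 1],
     [2, 0]]
V0 = [0.0, 0.0]

pe_V, pe_ops = policy_evaluation_sweep(P, R, V0, 0.9, policy=[0, 1])
vi_V, vi_ops = value_iteration_sweep(P, R, V0, 0.9)
print(pe_ops, vi_ops)
```

With |S| = 2 and |A| = 2, the first count grows as |S|^2 and the second as |S|^2 |A|, assuming a dense transition table.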
20. [2 points] Suppose you are given a (fully specified) Markov decision process with state space
S = {1, 2, 3}, and action space A = {a, b, c, d}. You calculate the optimal state-action value
Q∗(s, a) for each state-action pair (s, a) to be as follows:
        a     b     c     d
  1   3.2   4.7   2.5   4.2
  2   2.8   5.0   3.6   5.0
  3   6.2   5.7   5.4   5.9
If we denote π∗ as the optimal deterministic policy, which of the following cannot be true?
(a) π∗(1) = a
(b) π∗(2) = b
(c) π∗(2) = d
(d) π∗(3) = a
(e) None or more than one of the above
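One way to reason about this question is to read off, for each state, the set of actions that attain the maximum of the optimal state-action values. The sketch below does this for the table above; the names `Q_STAR` and `optimal_actions` are illustrative only.

```python
# Optimal state-action values copied from the table in the question.
Q_STAR = {
    1: {"a": 3.2, "b": 4.7, "c": 2.5, "d": 4.2},
    2: {"a": 2.8, "b": 5.0, "c": 3.6, "d": 5.0},
    3: {"a": 6.2, "b": 5.7, "c": 5.4, "d": 5.9},
}


def optimal_actions(q_row, tol=1e-12):
    """Return every action whose value ties for the row maximum."""
    best = max(q_row.values())
    return {a for a, q in q_row.items() if abs(q - best) <= tol}


greedy = {s: optimal_actions(row) for s, row in Q_STAR.items()}
for s in sorted(greedy):
    print(s, sorted(greedy[s]))
```

A deterministic optimal policy may pick any action in a state's set, so ties (as in state 2) leave more than one valid choice.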
21. [2 points] Which of the following statements about Q-learning and Monte Carlo methods is
true?
(a) Q-learning has higher bias and lower variance than Monte Carlo methods
(b) Q-learning has lower bias and higher variance than Monte Carlo methods
(c) Both Q-learning and Monte Carlo methods are on-policy algorithms
(d) Both Q-learning and Monte Carlo methods are off-policy algorithms
22. [1 point] True or False? On a given data set X which is mean centered, you divide each
feature by its standard deviation so that the variance of each feature is 1. If you do PCA on
the new standardized data set and obtain scores (i.e. the transformed output of PCA), then
each of the scores will also have variance equal to 1.
The next two questions are about the following piece of pseudocode:
24. [2 points] Which of the following categories does the algorithm given above fall into?
(a) Off-policy
(b) On-policy
(c) Multi-armed bandit
(d) None of the above
(a) AlphaGo uses three policy networks: a fast-rollout network, a network trained via
supervised learning, and a network trained via self-play
(b) In the final policy, AlphaGo selects actions which have been taken most often in the
Monte Carlo tree search, rather than those with the highest value estimations.
(c) In the Monte Carlo tree search, the SL (supervised learning) policy network promotes
exploitation and the value network promotes exploration.
(d) During tree search, the fast-policy network traces out a path to the end of the game at
each turn.
26. [2 points] You are given two-dimensional training data for PCR. The mean of the training
data is ⟨0, 3⟩, and the first principal component (loadings) is ⟨1, 1⟩ (after subtracting off the
mean, but not standardizing the data). You learn a model ŷ = f(z) = 3z, where z are the
scores w.r.t. the first PC. Given a test point x = ⟨2, 3⟩, what is the prediction ŷ for this point?
(a) 6
(b) 3
(c) 9
(d) -3
(e) None of the above
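The computation this question asks for can be sketched as: center the test point, project it onto the loading vector to get the score z, then apply the learned model. The function below is a hypothetical helper, not part of the exam. By default it uses the loadings exactly as stated in the question; the `normalize` flag is an added assumption showing how the result would change under a unit-length loading convention.

```python
import math


def pcr_predict(x, mean, loading, slope, normalize=False):
    """Predict with single-component PCR: slope * score on the first PC."""
    if normalize:
        # Assumption: rescale the loading vector to unit length first.
        norm = math.sqrt(sum(l * l for l in loading))
        loading = [l / norm for l in loading]
    centered = [xi - mi for xi, mi in zip(x, mean)]
    z = sum(c * l for c, l in zip(centered, loading))  # score on the PC
    return slope * z


y_as_given = pcr_predict([2, 3], [0, 3], [1, 1], 3)
y_unit_norm = pcr_predict([2, 3], [0, 3], [1, 1], 3, normalize=True)
print(y_as_given, y_unit_norm)
```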
27. [2 points] Which of these models gives a globally optimal solution to the loss function it is
minimizing?
1) Logistic Regression
2) Neural Networks
3) K-means clustering
(a) 1
(b) 1 and 2
(c) 3
(d) All of these methods
(e) None of these methods
28. [1 point] True or False? For a given t, we have X_s ⊥ Z_t for all s < t.
29. [1 point] True or False? The following statement about hidden Markov models holds for all
1 ≤ t ≤ T and k:
$$P(X_{t+1} = x_{t+1}, \ldots, X_T = x_T \mid X_1 = x_1, \ldots, X_t = x_t, Z_t = k) = P(X_{t+1} = x_{t+1}, \ldots, X_T = x_T \mid Z_t = k)$$
30. [1 point] True or False? The quantity γ_j = P(Z_j = 1 | X_j = x_j) for an unlabelled instance
x_j ∈ X_U is a parameter of this EM model.
31. [2 points] What is the smallest number of parameters needed to specify a model for this
classification using the EM algorithm?
(a) 15
(b) 63
(c) 115
(d) 129
(e) None of the above
32. [2 points] You are hired by Cambridge Analytica as a Machine Learning consultant. Your task
is to use Facebook data of 100 million (10^8) people as training data to learn a classification
model to predict the binary election vote for each person, represented by y = ±1. You decide
to use regularized Logistic regression, which has the following penalized loss:
$$\min_{w} \; \frac{1}{10^8} \sum_{i=1}^{10^8} \log\left(1 + \exp(-y_i w^T X_i)\right) + \lambda \|w\|_2^2$$
Using cross-validation you find the best penalty hyperparameter λ1 . Later you learn that only
10 million of these people consented to this experiment, so as an ethical programmer, you
decide to learn a model using only 10 million people, and discard the rest. Using cross-validation
again on this smaller data set you find the best penalty hyperparameter λ2 . Which of the
following statements is true?
33. [2 points] Unfortunately, you got fired for your heroic stance, and your replacement, Mark,
decides to use linear and degree 2 (quadratic) polynomial kernel SVM models trained on all
of the 100 million people, instead of your Logistic regression model trained on 100 million
people. Once these three models have been trained, Mark tests them by giving them a new
voter to classify. Which of the following classifiers would be computationally most expensive
to run?
34. [2 points] You have just trained a logistic regression classifier which, given an instance x,
estimates the probability of a positive label to be
$$\hat{\eta}(x) = \frac{1}{1 + e^{-\hat{w}^T x}}$$
(For simplicity, we ignore bias/threshold terms.) You are now told that the cost of a false
positive (incorrectly predicting a negative example as positive) will be 3/5, and that of a false
negative will be 2/5. In order to classify a new instance as positive or negative, what decision
rule should be used?
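A general way to approach cost-sensitive decision rules is to compare the expected cost of each prediction. The derivation below is a sketch only; the symbols c_FP and c_FN (the false-positive and false-negative costs) are introduced here for illustration and do not appear in the exam.

```latex
\begin{align*}
\mathbb{E}[\text{cost} \mid \hat{y} = +1,\, x] &= c_{FP}\,\bigl(1 - \hat{\eta}(x)\bigr), \\
\mathbb{E}[\text{cost} \mid \hat{y} = -1,\, x] &= c_{FN}\,\hat{\eta}(x), \\
\text{predict } +1
  \;\iff\; c_{FP}\,\bigl(1 - \hat{\eta}(x)\bigr) < c_{FN}\,\hat{\eta}(x)
  \;&\iff\; \hat{\eta}(x) > \frac{c_{FP}}{c_{FP} + c_{FN}}.
\end{align*}
```

When the two costs are equal, this reduces to the familiar threshold of 1/2.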
35. [1 point] True or False? After the i-th iteration of online perceptron learning, you have a model
h_i and you receive a new instance X_{i+1}. Your current model misclassifies the instance as
h_i(X_{i+1}) = +1, when the actual label is Y_{i+1} = −1. You update the model using the
perceptron algorithm and get a classifier h_{i+1}. Then h_{i+1} is guaranteed to classify
X_{i+1} correctly as −1.
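The effect of a single perceptron update on the instance that triggered it can be checked numerically. The sketch below assumes the standard update w ← w + y·x with learning rate 1 and no bias term; the weight vector and instance are hypothetical values chosen for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict(w, x):
    """Sign of the linear score, with ties broken toward +1."""
    return 1 if dot(w, x) >= 0 else -1


def perceptron_update(w, x, y):
    """Standard mistake-driven update: w <- w + y * x."""
    return [wi + y * xi for wi, xi in zip(w, x)]


w = [3.0, 0.0]           # hypothetical current weights
x = [1.0, 0.0]           # hypothetical new instance
y = -1                   # its true label

before = predict(w, x)   # the mistake that triggers the update
w_next = perceptron_update(w, x, y)
after = predict(w_next, x)
print(before, w_next, after)
```

The update changes the score on x by y·||x||^2, which moves it in the right direction but need not flip its sign when the original score was large.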
36. [2 points] You have a corpus of documents on which you want to implement LDA topic
modelling. Which of the following statements is true?
37. [2 points] Which of the following statements about the AdaBoost algorithm for binary
classification is true?
(a) 1 only
(b) 2 only
(c) 1 and 2
(d) 3 only
(e) 1 and 3
38. [2 points] For which of the following models does the complexity increase as the given
hyper-parameter increases? (Assume all other hyper-parameters stay constant.)
(a) 1
(b) 2
(c) 3
(d) 1 and 3
(e) 2 and 3
40. [2 points] You are training a simple neural network for a regression problem on a 2-dimensional
data set. Your Neural Net architecture is as follows: 3 hidden layers with sigmoid units,
trained for 1000 epochs, with L2 penalty for each hidden layer. Using 5-fold cross-validation
you learn that the 1st hidden layer should have 6 neurons, the 2nd hidden layer should have
4 neurons and the 3rd hidden layer should have 3 neurons. However, you find that the test
error is 10 times the training error. Which of the following changes is most likely to bring
the biggest improvement in performance?
[Figures (1) through (4): images not recoverable from this copy.]
41. [2 points] You want to fit an unregularized Logistic regression model to determine the
decision boundary, which is a line in this case. Which of the following figures shows the
decision boundary line produced by the model?
(a) Figure 1
(b) Figure 2
(c) Figure 3
(d) Figure 4
(e) All figures are valid
42. [2 points] Now you want to fit a L2 regularized Logistic regression model to determine the
decision boundary, which is also a line in this case. Which of the following figures cannot be
a decision boundary for this model?
(a) Figure 1
(b) Figure 3
(c) Figure 4
(d) All figures are valid
43. [1 point] The following data set consists of 5 points: each corner of a unit square and its
center. Can this data set be made separable by an SVM with an RBF kernel using only two
support vectors? (There is no restriction on the kernel width or choice of support vectors.)
(a) True
(b) False
44. [2 points] The following training set consists of binary labeled points. You want to train a
Neural Net model on this data. If you use only one hidden layer with ReLU activation units,
what is the smallest number of activation units required to separate this training set?
(a) 1
(b) 2
(c) 3
(d) More than 3
(e) It cannot be separated using only one hidden layer of any number of ReLU units
45. [2 points] Consider an active learning setup for cost-sensitive binary classification with labels
{±1}. The loss matrix is:
            ŷ = +1   ŷ = −1
  y = +1       0        2
  y = −1       6        0
For any instance x, let η(x) = P (Y = +1|X = x) denote the conditional probability that the
true label is +1 given x. You are given a small labeled training set, from which you learn a
logistic regression model. You are also given four more unlabeled data points, x1, x2, x3, x4,
and are allowed to query the label of one of these. Your logistic regression model predicts the
probabilities of each of these instances having label +1 as follows:
If you use an uncertainty sampling approach, which of the above instances would be chosen
to query a label for?
(a) x1
(b) x2
(c) x3
(d) x4
(e) None of the above
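One reading of uncertainty sampling in a cost-sensitive setting is to query the point whose predicted probability lies closest to the cost-sensitive decision threshold rather than to 1/2. The sketch below follows that reading, which is an assumption. The costs come from the loss matrix above; the probabilities and the point names `u1` to `u4` are hypothetical placeholders, since the exam's table of predictions is not reproduced here.

```python
def positive_threshold(cost_fp, cost_fn):
    """Predict +1 when P(Y=+1|x) exceeds this value."""
    return cost_fp / (cost_fp + cost_fn)


def uncertainty_query(probs, threshold):
    """Pick the point whose probability is closest to the threshold."""
    return min(probs, key=lambda name: abs(probs[name] - threshold))


# From the loss matrix: false positive costs 6, false negative costs 2.
threshold = positive_threshold(cost_fp=6, cost_fn=2)

# Hypothetical predicted probabilities, NOT the exam's values.
hypothetical_probs = {"u1": 0.55, "u2": 0.9, "u3": 0.7, "u4": 0.2}

chosen = uncertainty_query(hypothetical_probs, threshold)
chosen_if_half = uncertainty_query(hypothetical_probs, 0.5)
print(threshold, chosen, chosen_if_half)
```

Comparing `chosen` with `chosen_if_half` shows that the queried point can change once the costs shift the decision boundary away from 1/2.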
46. [1 point] Suppose you are given a binary labelled data set that is linearly separable. Using
an SVM, you find a hyperplane H that separates the labels with maximum margin γ. Is it
possible that there is another hyperplane, different from H, that also separates the labels
with the same margin γ?
(a) Yes
(b) No
47. [2 points] When p >> n, which of the following methods can we not use to train the model?
(As usual, data dimension is p and training sample size is n.)
48. [2 points] Suppose you are given an EM algorithm that finds maximum likelihood estimates
for a model with latent variables. You are asked to modify the algorithm so that it finds
MAP estimates instead. Which step or steps do you need to modify:
(a) Expectation
(b) Maximization
(c) Both
49. [2 points] Which of the following best describes the M-step of the EM algorithm?
50. [2 points] Which of the following is not best estimated using the EM algorithm?
(a) HMMs
(b) Gaussian Mixture Models
(c) Belief nets where not all variables are observed.
(d) Model-based reinforcement learning
51. [1 point] True or False? When deciding which points (observations) to get labels for, picking
the point about which one is most uncertain will reliably lead to good coverage of the feature
space.
52. [2 points] We studied a number of active learning methods. Which of the following is not
among them?
53. [2 points] Which of the following is not a valid method of computing the square of the
Frobenius norm of a square, symmetric matrix?
(c) square of the sum of the absolute values of the matrix entries, (Σ_{ij} |x_ij|)^2
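Candidate formulas for the squared Frobenius norm can be compared numerically on a small symmetric matrix. The sketch below checks the sum of squared entries against trace(X^T X) and against the quantity in option (c); the matrix `X` and the helper names are hypothetical.

```python
def transpose(X):
    return [list(col) for col in zip(*X)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def trace(M):
    return sum(M[i][i] for i in range(len(M)))


X = [[1, 2],
     [2, 3]]  # hypothetical square, symmetric matrix

sum_of_squares = sum(v * v for row in X for v in row)
trace_form = trace(matmul(transpose(X), X))
square_of_abs_sum = sum(abs(v) for row in X for v in row) ** 2
print(sum_of_squares, trace_form, square_of_abs_sum)
```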
54. [2 points] Most methods of measuring variable importance (e.g., as we saw for random
forests) are designed
(a) to roughly approximate how large the effect on the output would be if that feature
changes in the real world.
(b) to roughly approximate how large the effect on the output would be if the feature were
removed from the model.
(c) The above two answers are the same, so both of them are correct.
(a) explains which features in a model are most important (across all points)
(b) explains which features in a model are most important for predicting at a particular
point in the training data
(c) explains which features in a model are most important for predicting at any particular
point
56. [2 points] For which of the following situations is mean centering the data before doing PCA
probably a good thing to do? In each case the rows of the matrix are people.
58. [2 points] Consider the error decomposition for a least squares regression model
$$E_{x,y,D}\left[(h(x; D) - y)^2\right] = \underbrace{E_{x,D}\left[(h(x; D) - h(x))^2\right]}_{\text{Variance}} + \underbrace{E_{x}\left[(h(x) - y(x))^2\right]}_{\text{Bias}^2} + \underbrace{E_{x,y}\left[(y(x) - y)^2\right]}_{\text{Noise}}$$
where h(x; D) is a model learned over a training sample D, h(x) = E_{D∼P^n}[h(x; D)] is the
average model, and y(x) = E_{y|x}[y] is the optimal Bayes model. Which of the following best
describes the term labeled Variance?
(a) On average, how much your learned model differs from average model across different
samples D
(b) How far is the average model from optimal Bayes model
(c) The variance between predictions for a fixed sample D
(d) How accurate the model is in predicting y
59. [2 points] Suppose you are learning a CNN on greyscale images of size 105 × 154, so the image
has only one channel. In the first convolutional layer, you use a filter of size 21 × 14 with
stride of size 7 in both x and y dimensions without any padding or bias term. How many
neurons will there be in the next layer?
(a) 12*20
(b) 13*21
(c) 15*22
(d) 16*23
(e) None of the above
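The size of a convolutional layer's output along one dimension, with no padding, follows the usual formula floor((n − k) / s) + 1 for input length n, filter length k, and stride s. The helper below is a minimal sketch of that formula applied to the numbers in this question; the function name is illustrative.

```python
def conv_output_size(n, k, stride):
    """Output length for a 'valid' (no padding) convolution along one axis."""
    return (n - k) // stride + 1


rows = conv_output_size(105, 21, 7)   # image height 105, filter height 21
cols = conv_output_size(154, 14, 7)   # image width 154, filter width 14
print(rows, cols, rows * cols)
```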
60. [2 points] Suppose you have a two dimensional training data set X with real valued labels Y .
The following plot shows training data; each element Xi of the training set is the center of a
circle and the radius of the circle equals its label Yi . (Both axes have the same scale.)
You want to reduce the data to one dimension by projecting onto a suitable direction, and
then learn a linear regression model in the reduced space, i.e. do PCR using only a single
component. In order to be able to accurately predict labels of as many of the training points
as possible, which of the following projections would be best?