Midterm 2023 Redacted
Data 402
Instructions
You have two and a half hours to complete this exam, although it is designed to take much
less time than that.
The exam is open-note and open-internet. That is, you may use any class materials and any
public online materials to help you during the exam. This also means you may use software
to perform calculations or derivations.
Of course, you may not contact any other humans during the course of the exam. This
includes online forums and message boards; that is, you may visit Stack Overflow or Reddit and
read existing posts, but you may not post a new question of any kind. You also may not use
AI/LLMs like ChatGPT.
You may answer these questions using your computer or written on paper, or any combination
thereof. As long as I can easily find all your answers, any format is fine. Don’t forget to upload
any digital answers to Canvas, and to turn in any paper answers to me.
You do not need to turn in any notes, scratch paper, etc. that you use during the course of
the exam.
Part I [100 points]
1. Give an intuitive explanation for this choice of loss function. How does it express the
desire for accurate predictions of a quantitative variable?
2. Do you see any possible issues with this loss function? What assumptions have to be
true about the data for this loss function to be viable?
3. Find an equation for the gradient of the loss function. (A general equation for the partial
derivative at 𝛽𝑗 will suffice.)
4. Give a brief code outline (pseudocode, R, or Python) to show the procedure you would
use to calculate the “best” 𝛽’s according to this loss function. Your code does not need
to actually run; however, it does need to be specific about the inputs as well as the
equations. For example:
Sufficient:
sx = std dev of x
sy = std dev of y
mx = mean of x
my = mean of y
rxy = correlation of x and y
beta_1 = sy/sx*rxy
beta_0 = my - beta_1*mx
Insufficient:
ols:
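To illustrate the level of detail in the "Sufficient" outline above, here is a minimal Python sketch of the same simple-regression formulas. The function name `simple_ols`, the use of sample (n - 1) standard deviations, and the toy data are assumptions made for illustration; they are not part of the exam, and this sketch is unrelated to the loss function asked about in Part I.

```python
from statistics import mean, stdev


def simple_ols(x, y):
    """Slope and intercept of a simple linear regression of y on x.

    Hypothetical helper: follows the outline's formulas
    beta_1 = sy/sx * rxy and beta_0 = my - beta_1 * mx.
    """
    mx, my = mean(x), mean(y)    # means of x and y
    sx, sy = stdev(x), stdev(y)  # sample standard deviations (n - 1 denominator)
    n = len(x)
    # Sample covariance, then the correlation rxy = cov / (sx * sy)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    rxy = cov / (sx * sy)
    beta_1 = sy / sx * rxy
    beta_0 = my - beta_1 * mx
    return beta_0, beta_1


# Toy data lying exactly on the line y = 1 + 2x (assumed for illustration)
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
beta_0, beta_1 = simple_ols(x, y)
print(beta_0, beta_1)
```

Every input (x, y) and every intermediate quantity is named before it is used, which is what separates this from the "Insufficient" example.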
Part II [50 points]
Question A
A researcher is trying to fit a linear regression model using a LASSO penalty. To choose a
good value of the penalty parameter, 𝜆, she decides to take the following approach:
Discuss this strategy: Do you think it is a good idea? Why or why not? Do you have any
suggestions to make it more efficient, more justifiable, or more correct?
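For reference, a linear regression with a LASSO penalty, as mentioned in Question A, is conventionally written as the penalized least-squares problem below. The notation (n observations, p predictors, response y_i, predictor values x_{ij}) is an assumption made for illustration and does not come from the exam; the intercept is left unpenalized, as is usual.

```latex
\hat{\beta} = \arg\min_{\beta_0, \beta_1, \dots, \beta_p}
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Bigr)^2
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert ,
  \qquad \lambda \ge 0
```

Larger values of 𝜆 shrink more coefficients exactly to zero, and 𝜆 = 0 recovers ordinary least squares.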
Question B
• A model specification, and why you think that model would be a good choice for this
task. This can be a model we studied, an existing model we haven’t studied, or you can
“invent” something - but you must include some discussion of why you think that choice
is good/reasonable for this scenario.
• A loss function that you will use to fit the model, and why this loss function correctly
expresses your desires for your “best” model.
• A metric you will use to report your model’s abilities and/or to tune hyperparameters,
and why this metric is a good measure of “model success” in this case. This metric
should not be R-squared, MAE, or MSE. (Those would be reasonable in this case, but I
want you to find/invent a different one and justify it.)
Note: You do not need to concern yourself in this question with the computational feasibility
of your model, loss, or metric. I am only looking for you to translate the “real world” needs
of the scenario into mathematical decisions.