Homework 5
STAT 511 – Regression
AUTHOR: Maxwell Hahn
PUBLISHED: December 14, 2024, 1:47 PM EST
Question 1
Problem 1
Normal dist.

We rewrite the density as
$$\exp\left(-\frac{y^2 - 2y\mu + \mu^2}{2\sigma^2}\right) \times \frac{1}{\sqrt{2\pi\sigma^2}}$$
and then as
$$\exp\left(\frac{y\mu - \frac{1}{2}\mu^2}{\sigma^2}\right) \times \underbrace{\exp\left(-\frac{y^2}{2\sigma^2}\right)\frac{1}{\sqrt{2\pi\sigma^2}}}_{c(y,\phi)}$$
so that $\theta = \mu$, $\phi = \sigma^2$, and
$$A(\theta) = \frac{1}{2}\mu^2 = \frac{1}{2}\theta^2.$$
N.B. dist.

We rewrite the pmf as
$$\exp\big(r\log p + y\log(1-p)\big) \times \underbrace{\binom{y+r-1}{y}}_{c(y,\phi)}$$
so that $\theta = \log(1-p)$, $\phi = 1$, and $A(\theta) = -r\log(1-e^{\theta})$.
Problem 2

Normal dist.
$$\mu = E[Y] = A'(\theta) = \theta$$
$$\mathrm{Var}(Y) = \phi A''(\theta) = \sigma^2$$
$$V(\mu) = \frac{\mathrm{Var}(Y)}{\phi} = \frac{\sigma^2}{\sigma^2} = 1$$
N.B. dist.
$$E[Y] = A'(\theta) = \frac{r\exp(\theta)}{1-\exp(\theta)}$$
$$\mathrm{Var}(Y) = \phi A''(\theta) = \frac{r\exp(\theta)}{(1-\exp(\theta))^2}$$
$$V(\mu) = \frac{\mathrm{Var}(Y)}{\phi} = \frac{r\exp(\theta)}{(1-\exp(\theta))^2}$$
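As a quick numerical sanity check (a sketch beyond the derivation itself; the values of r and p below are arbitrary), we can compare these formulas against simulated negative binomial draws:

set.seed(1)
r <- 5; p <- 0.3
theta <- log(1 - p)                        # canonical parameter
mu <- r * exp(theta) / (1 - exp(theta))    # A'(theta)
v <- r * exp(theta) / (1 - exp(theta))^2   # A''(theta), with phi = 1
y <- rnbinom(1e6, size = r, prob = p)      # failures before the r-th success
c(mean(y), mu)   # both approximately 11.67
c(var(y), v)     # both approximately 38.9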
Problem 3

Normal dist.

N.B. dist.
Problem 4

Normal dist.
$$2\left\{\left[-\frac{1}{2}\log(2\pi\sigma^2) - 0\right] - \left[-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y-\mu)^2}{2\sigma^2}\right]\right\} = 2\left\{\frac{(y-\mu)^2}{2\sigma^2}\right\} = \frac{(y-\mu)^2}{\sigma^2},$$
and multiplying by $\phi = \sigma^2$ gives the unit deviance $(y-\mu)^2$.
N.B. dist.
$$2\left\{\left[\log\binom{y+r-1}{y} + r\log\left(\frac{r}{r+y}\right) + y\log\left(\frac{y}{r+y}\right)\right] - \left[\log\binom{y+r-1}{y} + r\log\left(\frac{r}{r+\mu}\right) + y\log\left(\frac{\mu}{r+\mu}\right)\right]\right\}$$
$$= 2\big\{[r\log(r) - r\log(r+y) + y\log(y) - y\log(r+y)] - [r\log(r) - r\log(r+\mu) + y\log(\mu) - y\log(r+\mu)]\big\}$$
$$= 2\left\{r\log\left(\frac{r+\mu}{r+y}\right) + y\log\left(\frac{y}{\mu}\right) + y\log\left(\frac{r+\mu}{r+y}\right)\right\}$$
$$= 2\left\{r\log\left(\frac{r+\mu}{r+y}\right) + y\log\left(\frac{y(r+\mu)}{\mu(r+y)}\right)\right\}$$
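To make the algebra concrete, here is a small sketch (the function names are our own) implementing both unit deviances and checking that each vanishes at $y = \mu$:

unit_dev_normal <- function(y, mu) (y - mu)^2
unit_dev_nb <- function(y, mu, r) {   # assumes y > 0 so the logs are defined
  2 * (r * log((r + mu) / (r + y)) + y * log(y * (r + mu) / (mu * (r + y))))
}
unit_dev_normal(4, 4)      # 0
unit_dev_nb(4, 4, r = 2)   # 0
unit_dev_nb(4, 7, r = 2)   # positive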
Question 2
Problem 1
We see that $Y_i \sim \mathrm{Bernoulli}(\pi_i)$ and accordingly has pmf
$$\pi_i^{y_i}(1-\pi_i)^{1-y_i},$$
which we rewrite as
$$\exp\big(y_i\log\pi_i + (1-y_i)\log(1-\pi_i)\big) = \exp\big(y_i(\log\pi_i - \log(1-\pi_i)) + \log(1-\pi_i)\big).$$
Accordingly, we have
$$\theta_i = \log\frac{\pi_i}{1-\pi_i}; \qquad \phi_i = 1; \qquad A(\theta_i) = \log(1+e^{\theta_i}).$$
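As a quick check that these pieces are consistent, a numerical derivative of $A(\theta)$ should recover the mean $\pi_i$ (a sketch with an arbitrary value of $\pi_i$):

pi_i <- 0.35
theta <- log(pi_i / (1 - pi_i))    # canonical parameter
A <- function(t) log(1 + exp(t))   # cumulant function
eps <- 1e-6
(A(theta + eps) - A(theta - eps)) / (2 * eps)   # approximately 0.35 = pi_i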
Problem 2
Part (a).

We see that
$$\Pr(Y_i = 1 \mid x_i) = \Pr(Y_i^{\star} \geq \tau \mid x_i)$$
and
$$\Pr(Y_i = 1 \mid x_i) = \Pr(\alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + u_i \geq \tau),$$
where $u_i \sim N(0,1)$. Equivalently,
$$\Pr(u_i \geq \tau - \alpha - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}) = \Phi(\alpha - \tau + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})$$
by the symmetry of the standard normal distribution. Let $\beta_0 = \alpha - \tau$. Then
$$\Pr(Y_i = 1 \mid x_i) = \Phi(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}),$$
as desired.
Part (b).

From part (a),
$$\pi_i = \Phi(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}),$$
which we rewrite as
$$\Phi^{-1}(\pi_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip},$$
so the link function is $g(\pi_i) = \Phi^{-1}(\pi_i)$, the probit link.
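In R, part (b) corresponds exactly to the probit link: binomial(link = "probit") uses $\Phi^{-1}$ (qnorm) as its link function, which we can confirm directly (a quick sketch):

fam <- binomial(link = "probit")
p <- c(0.1, 0.5, 0.9)
all.equal(fam$linkfun(p), qnorm(p))   # TRUE: the link is qnorm, i.e., Phi^{-1}
all.equal(fam$linkinv(qnorm(p)), p)   # TRUE: the inverse link is pnorm, i.e., Phi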
Problem 3
df <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
logistic <- glm(admit ~ ., family = binomial(link = "logit"), data = df)
probit <- glm(admit ~ ., family = binomial(link = "probit"), data = df)
summary(logistic)
Call:
glm(formula = admit ~ ., family = binomial(link = "logit"), data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.449548 1.132846 -3.045 0.00233 **
gre 0.002294 0.001092 2.101 0.03564 *
gpa 0.777014 0.327484 2.373 0.01766 *
rank -0.560031 0.127137 -4.405 1.06e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(probit)
Call:
glm(formula = admit ~ ., family = binomial(link = "probit"),
data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.0915037 0.6718360 -3.113 0.00185 **
gre 0.0013982 0.0006487 2.156 0.03112 *
gpa 0.4643598 0.1950263 2.381 0.01727 *
rank -0.3317117 0.0745524 -4.449 8.61e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Logistic:
$$\pi_i = \frac{\exp(-3.450 + 0.002\times\text{gre} + 0.777\times\text{gpa} - 0.560\times\text{rank})}{1+\exp(-3.450 + 0.002\times\text{gre} + 0.777\times\text{gpa} - 0.560\times\text{rank})}$$

Probit:
$$\pi_i = \Phi(-2.092 + 0.001\times\text{gre} + 0.464\times\text{gpa} - 0.332\times\text{rank})$$
The coefficient estimates from the two models agree in sign and relative magnitude (the logistic coefficients are somewhat larger in absolute value, as expected). All predictors (and the intercept) are significant in each model, and the p-values are likewise similar.
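One way to quantify the similarity: since the standard logistic distribution has standard deviation $\pi/\sqrt{3} \approx 1.81$, logistic coefficients are typically about 1.6 to 1.8 times the probit coefficients. A quick sketch of that check here:

coef(logistic) / coef(probit)   # each ratio is roughly 1.6 to 1.7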
Problem 4
p1 <- predict(logistic, type = "response")
p2 <- predict(probit, type = "response")
plot(p1, p2,
xlab = "Logistic probs",
ylab = "Probit probs")
abline(0, 1, col = "red")
As most points fall on or close to the red 45° line, the predicted probabilities from the two models are very similar to each other.
Problem 5
set.seed(511)
sample(c("L","P"), size=1)
[1] "P"
Since we drew "P", we use the probit model for these predictions.

Case 1:

current_gre650 <- data.frame(
  gpa = 3.5, rank = 2, gre = 650
)
new_gre750 <- data.frame(
  gpa = 3.5, rank = 2, gre = 750
)
# predicted admission probabilities before and after a 100-point GRE improvement
currentprob <- predict(probit, current_gre650, type = "response")
newprob <- predict(probit, new_gre750, type = "response")
currentprob

1
0.4126149

newprob

1
0.4677204

newprob - currentprob

1
0.05510552
Case 2:

current_gre450 <- data.frame(
  gpa = 3.5, rank = 2, gre = 450
)
new_gre550 <- data.frame(
  gpa = 3.5, rank = 2, gre = 550
)
currentprob <- predict(probit, current_gre450, type = "response")
newprob <- predict(probit, new_gre550, type = "response")
currentprob

1
0.3083728

newprob

1
0.3591821

newprob - currentprob

1
0.05080932
In both cases, the expected change in the probability of admission when the student improves their GRE score by 100 points is approximately 0.05.
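To see how this effect depends on the starting score, we can wrap the computation in a small helper (gre_bump_effect is our own construction, built on the same predict() calls as above):

gre_bump_effect <- function(gre, bump = 100, gpa = 3.5, rank = 2) {
  base <- predict(probit, data.frame(gpa = gpa, rank = rank, gre = gre),
                  type = "response")
  new <- predict(probit, data.frame(gpa = gpa, rank = rank, gre = gre + bump),
                 type = "response")
  unname(new - base)
}
sapply(c(450, 550, 650), gre_bump_effect)   # each value is near 0.05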
Question 3
load("homework5.Rdata")
head(lc_danish,4)
Problem 1
We assume $Y_i \sim \mathrm{Pois}(\lambda_i \cdot n_i)$; therefore, the pmf is
$$\frac{(\lambda_i \cdot n_i)^{y_i}\exp(-\lambda_i \cdot n_i)}{y_i!}.$$
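Taking logs of the mean makes the role of the population size $n_i$ explicit (a worked step to connect the pmf to the model below):
$$\log E[Y_i] = \log(\lambda_i \cdot n_i) = \log\lambda_i + \log n_i,$$
so when $\log\lambda_i$ is modeled with a linear predictor, $\log n_i$ enters the model with its coefficient fixed at 1; this is exactly the offset term used in Problem 2.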
Problem 2
lc_danish$logni <- log(lc_danish$Pop)
model <- glm(Cases ~ City * Age + offset(logni), family = "poisson",
             data = lc_danish)
model
Coefficients:
(Intercept) CityHorsens CityKolding
-4.10264 -1.86606 0.09683
CityVejle Age40-54 Age55-59
-0.37955 -1.52530 -0.18407
Age60-64 Age65-69 Age70-74
-0.06473 0.04048 0.26809
CityHorsens:Age40-54 CityKolding:Age40-54 CityVejle:Age40-54
2.09376 -1.13520 -0.21508
CityHorsens:Age55-59 CityKolding:Age55-59 CityVejle:Age55-59
0.95705 -0.68721 -0.16547
CityHorsens:Age60-64 CityKolding:Age60-64 CityVejle:Age60-64
1.91385 -0.78037 0.11730
CityHorsens:Age65-69 CityKolding:Age65-69 CityVejle:Age65-69
1.50458 -0.19070 0.63347
CityHorsens:Age70-74 CityKolding:Age70-74 CityVejle:Age70-74
1.73347 -0.34732 0.00383
summary(model)
Call:
glm(formula = Cases ~ City * Age + offset(logni), family = "poisson",
data = lc_danish)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.10264 0.31623 -12.974 < 2e-16 ***
CityHorsens -1.86606 0.77460 -2.409 0.015993 *
CityKolding 0.09683 0.42817 0.226 0.821094
CityVejle -0.37955 0.49280 -0.770 0.441190
Age40-54 -1.52530 0.43693 -3.491 0.000481 ***
Age55-59 -0.18407 0.43693 -0.421 0.673546
Age60-64 -0.06473 0.43693 -0.148 0.882234
Age65-69 0.04048 0.44721 0.091 0.927881
Age70-74 0.26809 0.43693 0.614 0.539496
CityHorsens:Age40-54 2.09376 0.87626 2.389 0.016874 *
CityKolding:Age40-54 -1.13520 0.72405 -1.568 0.116915
CityVejle:Age40-54 -0.21508 0.73059 -0.294 0.768463
CityHorsens:Age55-59 0.95705 0.92605 1.033 0.301383
CityKolding:Age55-59 -0.68721 0.63186 -1.088 0.276767
CityVejle:Age55-59 -0.16547 0.69038 -0.240 0.810580
CityHorsens:Age60-64 1.91386 0.87039 2.199 0.027888 *
CityKolding:Age60-64 -0.78037 0.64583 -1.208 0.226925
CityVejle:Age60-64 0.11730 0.65861 0.178 0.858647
CityHorsens:Age65-69 1.50458 0.89443 1.682 0.092535 .
CityKolding:Age65-69 -0.19070 0.61175 -0.312 0.755249
CityVejle:Age65-69 0.63347 0.64365 0.984 0.325026
CityHorsens:Age70-74 1.73348 0.87991 1.970 0.048832 *
CityKolding:Age70-74 -0.34732 0.62077 -0.559 0.575825
CityVejle:Age70-74 0.00383 0.67732 0.006 0.995488
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
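The additive (no-interaction) model summarized next would be fit as follows; a sketch of the call, using model_additive as a stand-in name (the original object name is not shown):

model_additive <- glm(Cases ~ City + Age + offset(logni),
                      family = "poisson", data = lc_danish)
summary(model_additive)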
Call:
glm(formula = Cases ~ City + Age + offset(logni), family = "poisson",
data = lc_danish)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.21241 0.20819 -20.233 < 2e-16 ***
CityHorsens -0.33006 0.18150 -1.818 0.0690 .
CityKolding -0.37155 0.18781 -1.978 0.0479 *
CityVejle -0.27232 0.18785 -1.450 0.1472
Age40-54 -1.41965 0.25027 -5.672 1.41e-08 ***
Age55-59 -0.31864 0.25205 -1.264 0.2062
Age60-64 0.09896 0.23566 0.420 0.6745
Age65-69 0.34805 0.23343 1.491 0.1359
Age70-74 0.43721 0.23929 1.827 0.0677 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
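The remaining models use a quantitative age variable age.q and its square age.q2. The original construction is not shown; a plausible sketch using dplyr (whose attach messages appeared in the original output), taking the lower endpoint of each age band as the numeric age, with model_quantAge likewise a stand-in name:

library(dplyr)
lc_danish <- lc_danish %>%
  mutate(age.q = as.numeric(sub("[-+].*", "", Age)),  # e.g., "55-59" -> 55
         age.q2 = age.q^2)
model_quantAge <- glm(Cases ~ City + age.q + offset(logni),
                      family = "poisson", data = lc_danish)
summary(model_quantAge)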
Call:
glm(formula = Cases ~ City + age.q + offset(logni), family = "poisson",
data = lc_danish)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.254622 0.352847 -20.560 <2e-16 ***
CityHorsens -0.309326 0.181428 -1.705 0.0882 .
CityKolding -0.354055 0.187705 -1.886 0.0593 .
CityVejle -0.249588 0.187772 -1.329 0.1838
age.q 0.047263 0.005442 8.685 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
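The quadratic model summarized next is the model_quantAgeQuad object referenced in Problem 5; its call, reconstructed from the summary output:

model_quantAgeQuad <- glm(Cases ~ City + age.q + age.q2 + offset(logni),
                          family = "poisson", data = lc_danish)
summary(model_quantAgeQuad)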
Call:
glm(formula = Cases ~ City + age.q + age.q2 + offset(logni),
family = "poisson", data = lc_danish)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.399e+01 1.723e+00 -8.123 4.57e-16 ***
CityHorsens -3.328e-01 1.815e-01 -1.834 0.0667 .
CityKolding -3.769e-01 1.878e-01 -2.007 0.0448 *
CityVejle -2.742e-01 1.878e-01 -1.460 0.1443
age.q 2.934e-01 6.098e-02 4.811 1.50e-06 ***
age.q2 -2.142e-03 5.264e-04 -4.069 4.73e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Problem 5
best.model <- model_quantAgeQuad
summary(best.model)
Call:
glm(formula = Cases ~ City + age.q + age.q2 + offset(logni),
family = "poisson", data = lc_danish)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.399e+01 1.723e+00 -8.123 4.57e-16 ***
CityHorsens -3.328e-01 1.815e-01 -1.834 0.0667 .
CityKolding -3.769e-01 1.878e-01 -2.007 0.0448 *
CityVejle -2.742e-01 1.878e-01 -1.460 0.1443
age.q 2.934e-01 6.098e-02 4.811 1.50e-06 ***
age.q2 -2.142e-03 5.264e-04 -4.069 4.73e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We choose the last model fit above (quantitative age plus its quadratic term, along with the city variable) as our best model. It has the lowest AIC of the models considered, indicating that it fits best among them. Additionally, we tested whether adding the quadratic term produces a significantly better model than the linear-age model alone, and the evidence favors the model with the quadratic term; see the sketch below.
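That comparison can be run as a deviance (likelihood-ratio) test; a sketch, assuming the linear-age fit was stored as model_quantAge (the stand-in name used above):

anova(model_quantAge, model_quantAgeQuad, test = "Chisq")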
We will now check the diagnostic plots for this best model to make sure that we are adhering to the assumptions of our
model.
Recall our GLM assumptions are:
Correct link function is used
Linearity - each predictor is included in the linear predictor in the correct form
Correct variance function is used
Dispersion parameter is constant and no overdispersion problem exists
Responses are independent of each other and come from the specified EDM
par(mfrow=c(2,2))
plot(residuals(best.model,type = "working")~predict(best.model,type = "link"),las=1,
main="Working responses residuals",
ylab="Working responses - linear predictors",
xlab="Fitted linear predictors" )
plot(resid(best.model) ~ predict(best.model,type = "link"), las=1,
main="Deviance residuals", ylab="Deviance residuals",
xlab="Fitted linear predictors" )
plot( cooks.distance(best.model), type="h", las=1,
ylab="D", main="Cook's distance")
qqnorm( resid(best.model), las=1,
main="Normal Q-Q plot\ndeviance residuals")
qqline(resid(best.model))
best.model$deviance / best.model$df.residual
[1] 1.529386
The residual deviance divided by the residual degrees of freedom is about 1.53, so there might be slight overdispersion; however, we do not find this too concerning, as the value is not excessively greater than 1.
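A Pearson-based estimate of the dispersion provides a cross-check on the deviance-based value above (a quick sketch):

sum(residuals(best.model, type = "pearson")^2) / df.residual(best.model)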
Having reviewed the diagnostics, we regard our assumptions as reasonably satisfied and retain this model as our best model.
Question 4
load("homework5.Rdata")
head(pock,4)
Count Dilution
1 116 1
2 151 1
3 171 1
4 194 1
summary(pock)
Count Dilution
Min. : 5.00 Min. : 1.000
1st Qu.: 18.75 1st Qu.: 2.000
Median : 48.50 Median : 4.000
Mean : 71.04 Mean : 6.417
3rd Qu.:115.25 3rd Qu.: 8.000
Max. :259.00 Max. :16.000
# construct the dilution on the log2 scale (reconstructed step; the original line is not shown)
pock$log2Dilution <- log2(pock$Dilution)
plot(pock$log2Dilution, pock$Count)
Problem 1
Pois.model <- glm(Count ~ log2Dilution, family = "poisson",
data = pock)
Pois.model
Coefficients:
(Intercept) log2Dilution
5.2679 -0.6809
summary(Pois.model)
Call:
glm(formula = Count ~ log2Dilution, family = "poisson", data = pock)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.26793 0.02255 233.60 <2e-16 ***
log2Dilution -0.68094 0.01544 -44.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Problem 2
We first check:
min(pock$Count)
[1] 5
Since the smallest count is 5, all counts are reasonably large, and we can reasonably use the goodness-of-fit test.
(D = deviance(Pois.model))

[1] 290.4387

# chi-squared goodness-of-fit p-value (reconstructed call; its output is shown below)
pchisq(D, df = df.residual(Pois.model), lower.tail = FALSE)

[1] 3.285023e-37

The p-value is essentially zero, so the Poisson model shows a significant lack of fit, consistent with overdispersion; we therefore fit a quasi-Poisson model.
qPois.model <- glm(Count ~ log2Dilution, family = "quasipoisson",
                   data = pock)
qPois.model

Coefficients:
(Intercept) log2Dilution
     5.2679      -0.6809
summary(qPois.model)
Call:
glm(formula = Count ~ log2Dilution, family = "quasipoisson",
data = pock)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.26793 0.05678 92.78 <2e-16 ***
log2Dilution -0.68094 0.03888 -17.51 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The estimated coefficients are identical. The standard errors in the Poisson model are smaller than those of the quasi-Poisson model.

Since there is overdispersion (the estimated dispersion ≈ 6.339 > 1), the standard errors from the Poisson model are underestimated and its p-values are too small. The quasi-Poisson model results are more reliable.
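As a cross-check, the quasi-Poisson standard errors should equal the Poisson standard errors scaled by the square root of the estimated dispersion (a quick sketch):

phi <- summary(qPois.model)$dispersion       # approximately 6.339
sqrt(phi)                                    # approximately 2.52
coef(summary(qPois.model))[, "Std. Error"] /
  coef(summary(Pois.model))[, "Std. Error"]  # both ratios approximately 2.52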