Regression hw3
1. (a) The best model contains PAPER and MACHINE as the predictors;
among the candidate models it gives the smallest AIC (210.8477).
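As a rough illustration of the all-possible-regressions comparison, the sketch below fits every non-empty subset of the four predictors and ranks the fits by AIC. The data frame `toy` is simulated (the homework data file is not reproduced here), so its sample size, coefficients and AIC values are assumptions and will not match 210.8477. Note also that `AIC()` and the criterion printed by `step()` differ by an additive constant for a given data set, so they rank models identically.

```r
# Simulated stand-in for the homework data (hypothetical values).
set.seed(1)
n <- 27
toy <- data.frame(
  PAPER    = runif(n, 500, 1200),
  MACHINE  = runif(n, 200, 500),
  OVERHEAD = runif(n, 50, 200),
  LABOR    = runif(n, 20, 90)
)
toy$COST <- 50 + 0.9 * toy$PAPER + 2.4 * toy$MACHINE + rnorm(n, sd = 10)

predictors <- c("PAPER", "MACHINE", "OVERHEAD", "LABOR")

# Enumerate every non-empty subset of the predictors (2^4 - 1 = 15 subsets).
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one linear model per subset and record its AIC.
aic <- sapply(subsets, function(vars) {
  AIC(lm(reformulate(vars, response = "COST"), data = toy))
})

results <- data.frame(
  model = sapply(subsets, paste, collapse = " + "),
  AIC   = aic
)

# Smallest AIC first.
results <- results[order(results$AIC), ]
head(results, 3)
```

With only four candidates the full enumeration is cheap; the ranking table makes it easy to see how close the runner-up models are.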
(b) First step: we compare COST ~ MACHINE, COST ~ PAPER, COST ~ OVERHEAD
and COST ~ LABOR. We find that COST ~ MACHINE is the best.
Second step: we compare COST ~ MACHINE + PAPER, COST ~ MACHINE +
OVERHEAD and COST ~ MACHINE + LABOR. The best is COST ~
MACHINE + PAPER.
Third step: we compare COST ~ MACHINE + PAPER with COST ~ MACHINE + PAPER
+ OVERHEAD and COST ~ MACHINE + PAPER + LABOR. The best model remains
COST ~ MACHINE + PAPER, so the procedure stops.
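The three steps above can be sketched with R's built-in `step()` function. This is only an approximation of the procedure described: `step()` adds terms by AIC, whereas the write-up does not say which criterion was used at each step. The data frame `toy` is simulated, so its values are assumptions.

```r
# Simulated stand-in for the homework data (hypothetical values).
set.seed(1)
n <- 27
toy <- data.frame(
  PAPER    = runif(n, 500, 1200),
  MACHINE  = runif(n, 200, 500),
  OVERHEAD = runif(n, 50, 200),
  LABOR    = runif(n, 20, 90)
)
toy$COST <- 50 + 0.9 * toy$PAPER + 2.4 * toy$MACHINE + rnorm(n, sd = 10)

# Start from the intercept-only model and add one predictor at a time.
null_model <- lm(COST ~ 1, data = toy)
forward <- step(
  null_model,
  scope     = list(lower = ~ 1,
                   upper = ~ MACHINE + PAPER + OVERHEAD + LABOR),
  direction = "forward",
  trace     = 0   # set to 1 to print each step's comparison table
)

# Terms retained when no further addition lowers the AIC.
selected <- attr(terms(forward), "term.labels")
selected
```

Setting `trace = 1` prints the candidate models considered at each step, which mirrors the step-by-step comparison written out above.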
(c) COST = 59.432 + 0.949(PAPER) + 2.386(MACHINE)
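To show how the fitted equation is used, the small helper below evaluates it for given predictor values. The function name `predict_cost` and the input values (PAPER = 100, MACHINE = 10) are hypothetical and chosen only for illustration.

```r
# Fitted equation from part (c); predict_cost is a hypothetical helper name.
predict_cost <- function(paper, machine) {
  59.432 + 0.949 * paper + 2.386 * machine
}

# Illustrative inputs, not taken from the data set.
example_cost <- predict_cost(paper = 100, machine = 10)
example_cost  # 59.432 + 94.9 + 23.86 = 178.192
```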
2. (a) The all-possible-regressions technique may not be practically feasible when there is a large
number of candidate X variables, because of the computational time it requires.
Therefore, we would choose stepwise regression.
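The computational concern can be made concrete: with k candidate X variables there are 2^k possible subsets (counting the intercept-only model), so the number of regressions to fit doubles with every added candidate. The values of k below are arbitrary examples.

```r
# Number of candidate models for a few illustrative values of k.
k <- c(4, 10, 20, 30)
model_counts <- data.frame(candidates = k, models = 2^k)
model_counts
```

For k = 20 this is already more than a million fits, which is why a stepwise search that examines only a small fraction of the subsets is attractive.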
(b) PROD, FOV and HOUSE are included in the final model because they are statistically
significant predictors of SALES.
(c)
3. (a) SALARY is expected to increase by 579.76 units for a one-unit increase in GENDER,
holding YEARS, POSITION and EDUCAT constant.
(b) The residual degrees of freedom are d.f. = 47 - 5 - 1 = 41
(d)
3
a
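# Load the salary data and fit the full model. POSITION, EDUCAT and GENDER
# are categorical, so they enter the model as factors.
dataset <- read.csv("hwk3q3.csv")
model <- lm(SALARY ~ YEARS + as.factor(POSITION) + as.factor(EDUCAT)
            + as.factor(GENDER), data = dataset)
summary(model)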
##
## Call:
## lm(formula = SALARY ~ YEARS + as.factor(POSITION) + as.factor(EDUCAT) +
##     as.factor(GENDER), data = dataset)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1410.3  -204.5  -103.4   230.3   752.1
##
## Coefficients: (1 not defined because of singularities)
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)           1320.86     411.76   3.208  0.00272 **
## YEARS                   20.38      41.65   0.489  0.62736
## as.factor(POSITION)2   186.91     479.54   0.390  0.69889
## as.factor(POSITION)3  -223.54     409.34  -0.546  0.58820
## as.factor(POSITION)4  1437.47     521.08   2.759  0.00888 **
## as.factor(POSITION)5  2301.07     518.38   4.439 7.52e-05 ***
## as.factor(EDUCAT)2     133.16     321.02   0.415  0.68063
## as.factor(EDUCAT)3    -685.85     477.76  -1.436  0.15932
## as.factor(EDUCAT)4         NA         NA      NA       NA
## as.factor(GENDER)1     231.36     338.49   0.684  0.49842
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 495 on 38 degrees of freedom
## Multiple R-squared: 0.7504, Adjusted R-squared: 0.6979
## F-statistic: 14.28 on 8 and 38 DF, p-value: 2.407e-09
The estimated coefficient of GENDER indicates that the expected monthly salary of
men is 231.36 units higher than that of women with the same YEARS, POSITION and EDUCAT.
b
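Counting every term in the model (intercept, YEARS, four POSITION dummies, three EDUCAT
dummies and one GENDER dummy) gives 10 parameters, so the residual degrees of freedom would be
47 - 10 = 37. This does not match the R output, which reports 38: the coefficient of
as.factor(EDUCAT)4 is not estimated because of a singularity, so only 9 parameters are fitted
and 47 - 9 = 38.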
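The link between an aliased coefficient and the residual degrees of freedom can be checked on a small example. The sketch below uses simulated data (all names and values are hypothetical) in which one predictor is an exact linear combination of two others; R reports `NA` for that coefficient and the residual degrees of freedom equal n minus the rank of the fit, not n minus the number of terms written in the formula.

```r
# Simulated data with an exact linear dependence (hypothetical values).
set.seed(1)
n <- 47
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$x3 <- toy$x1 + toy$x2          # x3 carries no new information
toy$y  <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3, data = toy)

coef(fit)          # the x3 coefficient is NA (aliased)
length(coef(fit))  # 4 coefficients requested
fit$rank           # 3 coefficients actually estimated
df.residual(fit)   # n - rank = 47 - 3 = 44
```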
c
which(dataset$POSITION == 4 | dataset$POSITION == 5)
## [1] 4 7 8 10 15 16 20 21 24 26 30 33 34 35 41 42 43 45 46 47
which(dataset$EDUCAT == 3 | dataset$EDUCAT == 4)
## [1] 4 7 8 10 15 16 20 21 24 26 30 33 34 35 41 42 43 45 46 47
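Because the two index sets are identical, the sum of the POSITION 4 and 5 dummies equals the sum of the EDUCAT 3 and 4 dummies, which is an exact linear dependence among the columns of the design matrix. A minimal sketch of this effect, using a small made-up data frame (the 12 rows below are hypothetical, not the homework data):

```r
# Made-up data in which POSITION is 4 or 5 exactly when EDUCAT is 3 or 4.
toy <- data.frame(
  POSITION = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 4, 5),
  EDUCAT   = c(1, 2, 1, 3, 4, 2, 1, 2, 4, 3, 3, 4)
)

high_pos <- toy$POSITION %in% c(4, 5)
high_edu <- toy$EDUCAT %in% c(3, 4)
same_rows <- identical(which(high_pos), which(high_edu))
same_rows

# Design matrix with dummy columns for both factors.
X <- model.matrix(~ as.factor(POSITION) + as.factor(EDUCAT), data = toy)
n_cols <- ncol(X)      # intercept + 4 POSITION dummies + 3 EDUCAT dummies
x_rank <- qr(X)$rank   # one less than n_cols because of the dependence
c(columns = n_cols, rank = x_rank)
```

A rank one below the column count is what leads `lm()` to drop one dummy and print `NA` for it.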
d
full_model <- lm(SALARY ~ YEARS + as.factor(POSITION) +
as.factor(EDUCAT) + as.factor(GENDER), data = dataset)
reduced_model <- lm(SALARY ~ YEARS + as.factor(POSITION) +
as.factor(EDUCAT), data = dataset)
anova(reduced_model, full_model)
The partial F test comparing the reduced model (excluding gender) with the full model
(including gender) gives a p-value of 0.4984 for the gender variable. This p-value is well
above the common significance level of 0.05, so adding gender to the model does not
significantly improve its ability to explain salary differences among BigTex Services
employees. That is, gender is not a statistically significant predictor of salary in the
provided dataset once YEARS, POSITION and EDUCAT are in the model.
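Since the full model adds a single coefficient, the partial F statistic equals the square of that coefficient's t statistic, and the two p-values coincide. This agrees with the summary output above, where the GENDER row shows t = 0.684 and p = 0.49842. The sketch below checks the identity on simulated data; the data frame `toy` and its values are hypothetical.

```r
# Simulated salary-like data (hypothetical values).
set.seed(42)
n <- 47
toy <- data.frame(
  YEARS  = runif(n, 0, 10),
  GENDER = rbinom(n, 1, 0.5)
)
toy$SALARY <- 1500 + 50 * toy$YEARS + 200 * toy$GENDER + rnorm(n, sd = 500)

reduced <- lm(SALARY ~ YEARS, data = toy)
full    <- lm(SALARY ~ YEARS + as.factor(GENDER), data = toy)

# Partial F test for the single added term.
f_table <- anova(reduced, full)
f_stat  <- f_table[2, "F"]
p_F     <- f_table[2, "Pr(>F)"]

# t test for the same term in the full model.
coefs  <- summary(full)$coefficients
t_stat <- coefs["as.factor(GENDER)1", "t value"]
p_t    <- coefs["as.factor(GENDER)1", "Pr(>|t|)"]

all.equal(f_stat, t_stat^2)  # F equals t squared
all.equal(p_F, p_t)          # identical p-values
```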
4. (a)
Smokers (Smokeyes = 1): LungCap = 1.05157 + (0.55823 - 0.0597)(Age) + 0.22601(Smokeyes)
Non-smokers (Smokeyes = 0): LungCap = 1.05157 + 0.55823(Age)
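Assuming the two lines above are the fitted equations for smokers (Smokeyes = 1) and non-smokers (Smokeyes = 0) from a model with an Age by Smoke interaction, which is our reading of the coefficients and is not stated explicitly, the smoker equation simplifies as follows:

```latex
\begin{align*}
\text{Smokers: } \widehat{\text{LungCap}}
  &= (1.05157 + 0.22601) + (0.55823 - 0.0597)\,\text{Age} \\
  &= 1.27758 + 0.49853\,\text{Age} \\
\text{Non-smokers: } \widehat{\text{LungCap}}
  &= 1.05157 + 0.55823\,\text{Age}
\end{align*}
```

Under this reading, smokers have a higher intercept but a flatter slope in Age than non-smokers.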