Assignment 9
Assignment 9
Raymond Guo
2020-04-13
Exercise 1
i. The other continuous variable is mileage.
cars %>%
gather(Mileage, Liter, key = "catagory", value = "value") %>%
ggplot() +
geom_point(mapping = aes(x = value, y = Price)) +
facet_wrap(~catagory, scales = "free_x") +
labs(title="Relationship Between Price with Liter and Mileage")
60000
Price
40000
20000
Exercise 2
continuous_model %>%
tidy()
continuous_model %>%
glance() %>%
select(r.squared)
1
r.squared
0.3291279
The r.squared is closer to 0 than 1 which means this model is doing a poor job in capturing the
varability of Price. ## Exercise 3
# predict model plane over sensible grid of values
lit <- unique(cars$Liter)
mil <- unique(cars$Mileage)
grid <- with(cars, expand.grid(lit, mil))
d <- setNames(data.frame(grid), c("Liter", "Mileage"))
vals <- predict(continuous_model, newdata = d)
This model accurately fits with the data from excerise 1. I do not even know how am I suppose to
integrate the 3 assumptions with the looks of this 3D model. It is much easier to understand the
2D model compared to the 3D.
Exercise 4
2
ggplot(continuous_df) +
geom_point(mapping = aes(x = pred, y = Price)) +
geom_abline(
slope = 1,
intercept = 0,
color = "red",
size = 1
) +
labs(title="Observed vs Predicted of Price",
x = "Predicted Price",
y = "Observed Price")
60000
Observed Price
40000
20000
This graph barely shows a linear relationship from the explanatory variable and the response
variable.
ggplot(continuous_df) +
geom_point(mapping =aes(pred, resid)) +
geom_ref_line(h = 0) +
labs(title="Residual vs Predicted", x = "Predicted", y = "Predicted")
3
Residual vs Predicted
40000
30000
Predicted 20000
10000
−10000
10000 20000 30000 40000
Predicted
It sort of looks funky because there happens to be a large contingent within the southern border,
but the northern border shows a few points that look like outliers. I say it is roughly yields a
constant variability.
ggplot(data = continuous_df) +
geom_qq(mapping = aes(sample = resid)) +
geom_qq_line(mapping = aes(sample = resid)) +
labs(title="Theoretical Residuals vs Actual Residuals")
Theoretical Residuals vs Actual Residuals
40000
20000
sample
−20000
−2 0 2
theoretical
4
Effect of make of car on price
60000
Price
40000
20000
Based on these box plots, there are instances where outliers only exist on the right side for half of
them. The value for q3 is significantly higher because of the outliers.
i. Cadillac
ii. Cadillac
iii. Chevrolet
Exercise 6
cars %>%
gather(Model:Cylinder, Doors:Leather, key="original_column", value="value") %>%
ggplot() +
geom_boxplot(aes(x = reorder(value, Price, FUN=median), y = Price)) +
facet_wrap(~original_column, scales = "free_x") +
labs(title = "Boxplot of All Categorical Variables") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90, hjust = 1))
5
Boxplot of All Categorical Variables
Cruise Cylinder Doors
60000
40000
20000
2
Leather Model Sound
60000
Price 40000
20000
AVEO
Sunfire
Cavalier
Classic
Ion
Cobalt
Grand Am
Vibe
Century
L Series
Malibu
Grand Prix
Impala
G6
Lesabre
Monte Carlo
Bonneville
Lacrosse
Park Avenue
9−2X AWD
9_5 HO
9_3
GTO
9_3 HO
9_5
CTS
Deville
STS−V6
Corvette
STS−V8
CST−V
XLR−V8
1
Trim Type
60000
40000
20000
SVM Sedan 4D
SVM Hatchback 4D
LS Hatchback 4D
LT Hatchback 4D
Coupe 2D
LS Sport Coupe 2D
LS Coupe 2D
LS Sport Sedan 4D
Quad Coupe 2D
LT Sedan 4D
LS Sedan 4D
GT Sportwagon
AWD Sportwagon 4D
Sportwagon 4D
GT Coupe 2D
L300 Sedan 4D
Sedan 4D
LS MAXX Hback 4D
SE Sedan 4D
MAXX Hback 4D
LT MAXX Hback 4D
Custom Sedan 4D
GT Sedan 4D
CX Sedan 4D
GTP Sedan 4D
LT Coupe 2D
SLE Sedan 4D
Limited Sedan 4D
SS Coupe 2D
CXL Sedan 4D
CXS Sedan 4D
GXP Sedan 4D
SS Sedan 4D
Linear Sedan 4D
Special Ed Ultra 4D
Aero Sedan 4D
Aero Wagon 4D
Linear Wagon 4D
Arc Sedan 4D
Arc Wagon 4D
Aero Conv 2D
Linear Conv 2D
Arc Conv 2D
DHS Sedan 4D
DTS Sedan 4D
Conv 2D
Hardtop Conv 2D
Hatchback
Coupe
Sedan
Wagon
Convertible
reorder(value, Price, FUN = median)
Exercise 7
mixed_model %>%
tidy()
6
select(r.squared)
r.squared
0.9389165
Exercise 8
ggplot(mixed_df) +
geom_point(mapping = aes(x = pred, y = Price)) +
geom_abline(
slope = 1,
intercept = 0,
color = "red",
size = 1
) +
labs(title="Observed vs Predicted of Price",
x = "Predicted Price",
y = "Observed Price")
60000
Observed Price
40000
20000
ggplot(mixed_df) +
geom_point(mapping =aes(pred, resid)) +
geom_ref_line(h = 0) +
labs(title="Residual vs Predicted", x = "Predicted", y = "Predicted")
7
Residual vs Predicted
15000
10000
Predicted
5000
−5000
10000
sample
5000
−5000
−2 0 2
theoretical
Exercise 9
i. The value for r.squared is significantly closer to 1 compared to the 2 variable model. The
observed vs predicted graph perfectly shows a linear relationship. The variability of points
around the line is perfectly constant. The 2 variable model meets these requirements, but it
is a lot weaker. The qqplot clearly shows a bell shape curve compared to the first where it
obviously was not.
ii. The second model is the best because there are 3 conditions that needs to be satisfied to be a
reliable model. The second model does that job more effectively than the first one.