Assignment 9: How much for that car?

Raymond Guo
2020-04-13

Exercise 1
i. The other continuous variable is mileage.
cars %>%
  gather(Mileage, Liter, key = "category", value = "value") %>%
  ggplot() +
  geom_point(mapping = aes(x = value, y = Price)) +
  facet_wrap(~category, scales = "free_x") +
  labs(title = "Relationship Between Price with Liter and Mileage")

[Figure: Relationship Between Price with Liter and Mileage. Scatterplots of Price (y-axis) against value (x-axis), in two panels: Liter and Mileage.]

Exercise 2

continuous_model <- lm(Price ~ Mileage + Liter, data = cars)

continuous_model %>%
  tidy()

term         estimate       std.error     statistic  p.value
(Intercept)  9426.6014688   1095.0777745   8.608157  0.0e+00
Mileage        -0.1600285      0.0349084  -4.584237  5.3e-06
Liter        4968.2781155    258.8011436  19.197280  0.0e+00
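
To make the estimates above concrete, the fitted equation Price = b0 + b1 * Mileage + b2 * Liter can be evaluated by hand. The sketch below plugs in the reported coefficients for a hypothetical car (20,000 miles, 3.0 L engine). The car's values and the helper name `predict_price` are illustrative assumptions, not part of the assignment.

```r
# Coefficients copied from the tidy() output above.
b_intercept <- 9426.6014688
b_mileage   <- -0.1600285
b_liter     <- 4968.2781155

# Hypothetical helper: evaluates the fitted plane at the given inputs.
predict_price <- function(mileage, liter) {
  b_intercept + b_mileage * mileage + b_liter * liter
}

# Hypothetical car: 20,000 miles, 3.0 L engine.
example_price <- predict_price(mileage = 20000, liter = 3.0)
print(round(example_price, 2))  # about 21130.87
```

Read this way, each additional mile lowers the predicted price by about 0.16, and each additional liter of engine size raises it by about 4,968, holding the other variable fixed.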

continuous_model %>%
  glance() %>%
  select(r.squared)

r.squared
0.3291279
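
For reference, r.squared is the share of the variability in the response that the model accounts for: 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand and compares it with the value `lm()` reports. It uses simulated stand-in data with made-up coefficients, since the cars data is not reproduced here, so its numbers will not match the output above.

```r
set.seed(1)

# Simulated stand-in data (NOT the assignment's cars data).
n <- 200
sim <- data.frame(
  Mileage = runif(n, 0, 50000),
  Liter   = runif(n, 1.6, 6)
)
sim$Price <- 9000 - 0.15 * sim$Mileage + 5000 * sim$Liter + rnorm(n, sd = 8000)

fit <- lm(Price ~ Mileage + Liter, data = sim)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((sim$Price - mean(sim$Price))^2)
r2_manual <- 1 - ss_res / ss_tot

# Should agree with the value that summary() reports for the same fit.
r2_reported <- summary(fit)$r.squared
print(c(manual = r2_manual, reported = r2_reported))
```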

The r.squared value (about 0.33) is closer to 0 than to 1, which means this model does a poor job of capturing the variability of Price.

Exercise 3
# predict model plane over sensible grid of values
lit <- unique(cars$Liter)
mil <- unique(cars$Mileage)
grid <- with(cars, expand.grid(lit, mil))
d <- setNames(data.frame(grid), c("Liter", "Mileage"))
vals <- predict(continuous_model, newdata = d)

# form surface matrix and give to plotly

m <- matrix(vals, nrow = length(unique(d$Liter)), ncol = length(unique(d$Mileage)))
p <- plot_ly() %>%
  add_markers(
    x = ~cars$Mileage,
    y = ~cars$Liter,
    z = ~cars$Price,
    marker = list(size = 1)
  ) %>%
  add_trace(
    x = ~mil, y = ~lit, z = ~m, type = "surface",
    colorscale = list(c(0, 1), c("yellow", "yellow")),
    showscale = FALSE
  ) %>%
  layout(
    scene = list(
      xaxis = list(title = "mileage"),
      yaxis = list(title = "liters"),
      zaxis = list(title = "price")
    )
  )
if (!is_pdf) {p}
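
The surface matrix above depends on how `expand.grid()` orders its rows and how `matrix()` fills its cells. The sketch below uses a tiny hypothetical grid and a stand-in formula in place of `predict()` to show why the rows of `m` follow Liter and the columns follow Mileage. As far as I understand plotly's surface convention, rows of `z` are matched to `y` and columns to `x`, which is consistent with passing `y = ~lit` and `x = ~mil`.

```r
# Small illustrative axes (hypothetical values, not from the cars data).
lit <- c(2, 3, 4)     # 3 engine sizes
mil <- c(0, 10000)    # 2 mileages

# expand.grid() varies its FIRST argument fastest.
d <- setNames(expand.grid(lit, mil), c("Liter", "Mileage"))

# Stand-in for predict(): any function of the two columns works here.
vals <- 100 * d$Liter + d$Mileage

# matrix() fills column by column, so rows follow lit and columns follow mil.
m <- matrix(vals, nrow = length(lit), ncol = length(mil))
print(m)
```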

The fitted plane is consistent with the data from Exercise 1. However, I do not know how I am supposed to
check the 3 model assumptions from the look of this 3D plot. The 2D plots are much easier to
understand than the 3D one.

Exercise 4

continuous_df <- cars %>%
  add_predictions(continuous_model) %>%
  add_residuals(continuous_model)
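
By default, modelr's `add_predictions()` and `add_residuals()` append columns named `pred` and `resid`. The base-R sketch below shows what those two columns contain, using simulated stand-in data with made-up coefficients rather than the cars data.

```r
set.seed(2)

# Simulated stand-in data (NOT the assignment's cars data).
sim <- data.frame(Mileage = runif(50, 0, 50000), Liter = runif(50, 1.6, 6))
sim$Price <- 9000 - 0.15 * sim$Mileage + 5000 * sim$Liter + rnorm(50, sd = 5000)

fit <- lm(Price ~ Mileage + Liter, data = sim)

# Base-R equivalent of adding `pred` and `resid` columns.
sim$pred  <- unname(predict(fit, newdata = sim))
sim$resid <- sim$Price - sim$pred   # residual = observed - predicted

print(head(sim, 3))
```

The diagnostic plots that follow all use these two columns: observed against `pred`, `resid` against `pred`, and the quantiles of `resid`.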

ggplot(continuous_df) +
  geom_point(mapping = aes(x = pred, y = Price)) +
  geom_abline(
    slope = 1,
    intercept = 0,
    color = "red",
    size = 1
  ) +
  labs(title = "Observed vs Predicted of Price",
       x = "Predicted Price",
       y = "Observed Price")

[Figure: Observed vs Predicted of Price. Observed Price (y-axis) against Predicted Price (x-axis).]

This plot shows only a weak linear relationship between the predicted values and the response
variable.
ggplot(continuous_df) +
  geom_point(mapping = aes(pred, resid)) +
  geom_ref_line(h = 0) +
  labs(title = "Residual vs Predicted", x = "Predicted", y = "Residual")

[Figure: Residual vs Predicted. Residuals (y-axis) against predicted price (x-axis).]

The plot looks somewhat irregular: a large group of points is clustered along the bottom of the plot,
while the top shows a few points that look like outliers. Overall, I would say it shows roughly
constant variability.
ggplot(data = continuous_df) +
  geom_qq(mapping = aes(sample = resid)) +
  geom_qq_line(mapping = aes(sample = resid)) +
  labs(title = "Theoretical Residuals vs Actual Residuals")
[Figure: Theoretical Residuals vs Actual Residuals. Q-Q plot of the residuals: sample (y-axis) against theoretical (x-axis).]
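
As a rough illustration of what a normal Q-Q plot compares, the sketch below pairs the sorted sample values with the matching standard-normal quantiles and checks that base R's `qqnorm()` computes the same pairing. It uses simulated right-skewed values, not the assignment's residuals. I believe `geom_qq()` uses the same default quantiles, but treat that as an assumption.

```r
set.seed(3)

# Simulated right-skewed "residuals" (NOT the assignment's residuals).
resid_sim <- rexp(100) - 1

n <- length(resid_sim)
sample_q      <- sort(resid_sim)      # ordered sample values
theoretical_q <- qnorm(ppoints(n))    # matching standard-normal quantiles

# The same pairing that base R's qqnorm() computes.
qq <- qqnorm(resid_sim, plot.it = FALSE)
print(head(data.frame(theoretical_q, sample_q), 3))
```

If the values were normally distributed, these pairs would fall close to a straight line; systematic curvature away from the line signals non-normality.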

The residuals clearly do not follow a bell-shaped (normal) distribution.

Exercise 5

cars %>%
  ggplot() +
  geom_boxplot(aes(x = reorder(Make, Price, FUN = median), y = Price)) +
  labs(x = "Make of car", title = "Effect of make of car on price")

[Figure: Effect of make of car on price. Boxplots of Price (y-axis) by Make of car (x-axis), ordered by median price: Saturn, Chevrolet, Pontiac, Buick, SAAB, Cadillac.]

Based on these box plots, about half of the makes have outliers, and those outliers appear only on the
high-price side. For those makes the prices are skewed toward high values, which also pushes Q3 upward.
i. Cadillac
ii. Cadillac
iii. Chevrolet
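
For context on how a box plot decides which points to draw as outliers, the sketch below applies the usual 1.5 * IQR rule to a small set of hypothetical prices (not taken from the cars data). The exact quartile method a plotting function uses may differ slightly, so this is an approximation of the rule rather than a reproduction of the plot above.

```r
# Hypothetical prices for one make (NOT taken from the cars data).
price <- c(18, 19, 20, 21, 22, 23, 24, 25, 26, 60) * 1000

q   <- quantile(price, c(0.25, 0.75))
iqr <- unname(q[2] - q[1])

# Points more than 1.5 * IQR beyond the quartiles are flagged as outliers.
upper_fence <- unname(q[2]) + 1.5 * iqr
lower_fence <- unname(q[1]) - 1.5 * iqr
outliers <- price[price > upper_fence | price < lower_fence]
print(outliers)
```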

Exercise 6

cars %>%
  gather(Model:Cylinder, Doors:Leather, key = "original_column", value = "value") %>%
  ggplot() +
  geom_boxplot(aes(x = reorder(value, Price, FUN = median), y = Price)) +
  facet_wrap(~original_column, scales = "free_x") +
  labs(title = "Boxplot of All Categorical Variables") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90, hjust = 1))

[Figure: Boxplot of All Categorical Variables. Boxplots of Price (y-axis) by level, ordered by median price (x-axis: reorder(value, Price, FUN = median)), in panels for Cruise, Cylinder, Doors, Leather, Model, Sound, Trim, and Type.]

Exercise 7

cars_factor_df <- cars %>%
  mutate(Cylinder = as.factor(Cylinder))

mixed_model <- lm(Price ~ Mileage + Liter + Cylinder + Make + Type,
                  data = cars_factor_df)

mixed_model %>%
  tidy()

term           estimate       std.error     statistic   p.value
(Intercept)    1.885018e+04    892.4119413   21.122738  0.0000000
Mileage       -1.861764e-01      0.0106433  -17.492387  0.0000000
Liter          5.697442e+03    342.7322419   16.623596  0.0000000
Cylinder6     -3.312544e+03    619.9683651   -5.343086  0.0000001
Cylinder8     -3.672597e+03   1246.2162662   -2.946998  0.0033032
MakeCadillac   1.450444e+04    517.9855224   28.001635  0.0000000
MakeChevrolet -2.270807e+03    355.9736337   -6.379145  0.0000000
MakePontiac   -2.355468e+03    363.9063301   -6.472731  0.0000000
MakeSAAB       9.905074e+03    450.2011112   22.001443  0.0000000
MakeSaturn    -2.090266e+03    470.8305609   -4.439529  0.0000103
TypeCoupe     -1.163869e+04    464.7055454  -25.045297  0.0000000
TypeHatchback -1.172638e+04    545.3936364  -21.500769  0.0000000
TypeSedan     -1.178618e+04    411.1021489  -28.669707  0.0000000
TypeWagon     -8.156551e+03    500.6379995  -16.292312  0.0000000

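Yes, there are slopes for all of the categorical variables: one for each level other than the baseline level (for example, Buick for Make and Convertible for Type do not get their own rows).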
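
To see where these per-level slopes come from, the sketch below shows how R expands a factor into indicator (dummy) columns, with the first level absorbed into the intercept. The tiny data frame is hypothetical and only borrows three make names for readability.

```r
# Tiny hypothetical data set (NOT the cars data).
toy <- data.frame(
  Make = factor(c("Buick", "Cadillac", "Chevrolet", "Buick"))
)

# model.matrix() shows the columns lm() would actually fit.
X <- model.matrix(~ Make, data = toy)
print(X)
```

Each indicator column's coefficient is then read as a difference from the baseline level, holding the other predictors fixed.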

mixed_model %>%
  glance() %>%
  select(r.squared)

r.squared
0.9389165

Exercise 8

mixed_df <- cars_factor_df %>%
  add_predictions(mixed_model) %>%
  add_residuals(mixed_model)

ggplot(mixed_df) +
  geom_point(mapping = aes(x = pred, y = Price)) +
  geom_abline(
    slope = 1,
    intercept = 0,
    color = "red",
    size = 1
  ) +
  labs(title = "Observed vs Predicted of Price",
       x = "Predicted Price",
       y = "Observed Price")

[Figure: Observed vs Predicted of Price. Observed Price (y-axis) against Predicted Price (x-axis).]

ggplot(mixed_df) +
  geom_point(mapping = aes(pred, resid)) +
  geom_ref_line(h = 0) +
  labs(title = "Residual vs Predicted", x = "Predicted", y = "Residual")

[Figure: Residual vs Predicted. Residuals (y-axis) against predicted price (x-axis).]
ggplot(data = mixed_df) +
  geom_qq(mapping = aes(sample = resid)) +
  geom_qq_line(mapping = aes(sample = resid)) +
  labs(title = "Theoretical Residuals vs Actual Residuals")
[Figure: Theoretical Residuals vs Actual Residuals. Q-Q plot of the residuals: sample (y-axis) against theoretical (x-axis).]

Exercise 9
i. The r.squared value of the mixed model (about 0.94) is much closer to 1 than that of the
2 variable model (about 0.33). The observed vs predicted graph shows a clear linear relationship,
and the variability of the points around the line is close to constant. The 2 variable model
meets these requirements only weakly. The Q-Q plot for the mixed model is consistent with a
roughly bell-shaped distribution of residuals, whereas for the first model it clearly was not.
ii. The second model is the better one. A reliable model needs to satisfy 3 conditions (linearity,
constant variability, and normally distributed residuals), and the second model satisfies them
more convincingly than the first one.
