Stats 101c Final Project
--Team YYY
Yiklun Kei
Yongkai Zhu
Qianhao Yu
Data Cleaning Procedure
1. Delete Emergency Dispatch Code since it has only one level, and also
remove incident.ID and row.ID when running models.
2. Separate the Time variable into Hour, Minute, and Second, or
3. Divide the Time variable into intervals instead (we tried either two
levels (day: (6 AM, 6 PM]; night: (6 PM, 6 AM]) or four levels (morning: [6
AM, 12 PM); afternoon: [12 PM, 6 PM); evening: [6 PM, 12 AM); late night:
[12 AM, 6 AM)).
4. Divide Dispatch.Sequence into four intervals based on a tree model.
5. Combine levels in factor variables, especially the ones with many levels,
e.g. Unit.Type and Dispatch.Status (we combined all levels with fewer than
10,000 entries, ending up with 7 levels for Unit.Type and 6 levels for
Dispatch.Status; a sketch of steps 3 and 5 appears after this list).
6. Remove the NAs, or replace them with the mean or median, in the training
and testing data.
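As a rough illustration of steps 3 and 5 (a sketch only: it assumes an Hour column produced by step 2 and a training data frame named lafire; the cut points and the 10,000 threshold are the ones described above):

lafire$Time.of.Day <- cut(lafire$Hour, breaks = c(0, 6, 12, 18, 24),
                          labels = c("late night", "morning", "afternoon", "evening"),
                          right = FALSE, include.lowest = TRUE)   # four time-of-day levels (step 3)
rare <- names(which(table(lafire$Unit.Type) < 10000))              # Unit.Type levels with < 10,000 rows
lafire$Unit.Type <- factor(ifelse(lafire$Unit.Type %in% rare, "Other",
                                  as.character(lafire$Unit.Type))) # collapse them into "Other" (step 5)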
Algorithms that we have tried
Random Forest (can handle variables of mixed types and is robust to
outliers)
Lasso (the model.matrix command can encode factor inputs, and the lasso can
perform variable selection)
Artificial Neural Network (can induce hypotheses that generalize better than
other algorithms, but produces a complex model that is hard to interpret and
takes longer to train)
Random Forest
library(tree)
# Fit a regression tree for elapsed_time, cross-validate it, and prune to 4 terminal nodes
tree.lafire <- tree(elapsed_time ~ ., lafire)
cv.X <- cv.tree(tree.lafire)
prune.tree <- prune.tree(tree.lafire, best = 4)
new.predict <- predict(prune.tree, test)
tree.predict <- predict(tree.lafire, test)
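The code above fits, cross-validates, and prunes a single regression tree with the tree package. For the random forest itself, a minimal fit might look like the following (a sketch; it assumes the randomForest package, and ntree = 500 is a placeholder value):

library(randomForest)
# Note: randomForest cannot handle NAs or factors with very many levels,
# which is part of why we removed NAs and combined factor levels above
rf.lafire <- randomForest(elapsed_time ~ ., data = lafire, ntree = 500, importance = TRUE)
rf.predict <- predict(rf.lafire, newdata = test)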
Lasso
library(glmnet)
# x and y are the training model matrix and response; grid is a vector of candidate lambdas
lasso.mod = glmnet(x, y, alpha = 1, lambda = grid)
lafdtest$elapsed_time = NULL
# Test model matrix: drop unused columns, then the intercept column
newx = model.matrix(~., lafdtest[-c(1, 2, 5, 6, 7, 8, 10, 13, 16:22)])[, -1]
lasso.pred = predict(lasso.mod, s = bestlam, newx = newx, type = "response")
lassoprediction = data.frame(lafdtest$row.id, lasso.pred)
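The bestlam used above is the penalty chosen by cross-validation; a sketch of how it can be obtained, assuming x and y are the same training model matrix and response used to fit lasso.mod:

cv.out <- cv.glmnet(x, y, alpha = 1)   # 10-fold CV over the lambda path
bestlam <- cv.out$lambda.min           # lambda with the smallest CV error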
Artificial Neural Network
# Standardize the numeric columns before fitting the network
lafire[, c(2, 3, 8, 9, 10)] <- scale(lafire[, c(2, 3, 8, 9, 10)])
# Split the data into four bins using the tree-based Dispatch.Sequence cut points (step 4)
new1 <- lafire[which(lafire$Dispatch.Sequence < 16.346 & lafire$Dispatch.Sequence < 8.38764), ]
new2 <- lafire[which(lafire$Dispatch.Sequence < 16.346 & lafire$Dispatch.Sequence >= 8.38764), ]
new3 <- lafire[which(lafire$Dispatch.Sequence >= 16.346 & lafire$Dispatch.Sequence < 26.9571), ]
new4 <- lafire[which(lafire$Dispatch.Sequence >= 16.346 & lafire$Dispatch.Sequence >= 26.9571), ]
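A minimal sketch of fitting a network to one of these bins (assuming the nnet package; the size and maxit values are placeholders, not the settings we actually used):

library(nnet)
# One small single-hidden-layer network per bin; new1 shown, likewise new2-new4
ann1 <- nnet(elapsed_time ~ ., data = new1, size = 5, linout = TRUE, maxit = 500)
ann1.pred <- predict(ann1, newdata = new1)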
XGBoost
Dealing With Missing Values: removed all of them using na.omit in the training
data, and replaced them with the column means in the testing data.
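A sketch of that step (assuming lafire is the training data frame and lafdtest the testing data frame; only the numeric test columns get mean imputation):

lafire <- na.omit(lafire)                          # drop incomplete rows from the training data
for (j in which(sapply(lafdtest, is.numeric))) {   # mean-impute the numeric test columns
  lafdtest[is.na(lafdtest[, j]), j] <- mean(lafdtest[, j], na.rm = TRUE)
}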
Parameter Tuning: used cross-validation, starting from a range of candidate
values for each tuning parameter (a sketch appears after the prediction code below):
library(xgboost)
library(Matrix)  # for sparse.model.matrix
sparsemtxtest = sparse.model.matrix(~ . - 1, data = finaltest)  # sparse design matrix for the test set
testdata = xgb.DMatrix(data = sparsemtxtest)
prediction = predict(Training, testdata)  # "Training" is the fitted xgboost model
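A sketch of the cross-validated tuning and training step referenced above, assuming the xgboost and Matrix packages (finaltrain, and the max_depth, eta, and nrounds values, are placeholders rather than the ranges we actually searched):

library(xgboost)
library(Matrix)
sparsemtxtrain <- sparse.model.matrix(elapsed_time ~ . - 1, data = finaltrain)
dtrain <- xgb.DMatrix(data = sparsemtxtrain, label = finaltrain$elapsed_time)
params <- list(objective = "reg:linear", max_depth = 6, eta = 0.1)
cv.out <- xgb.cv(params = params, data = dtrain, nrounds = 500, nfold = 5,
                 early_stopping_rounds = 20, metrics = "rmse")   # cross-validated tuning
Training <- xgb.train(params = params, data = dtrain, nrounds = cv.out$best_iteration)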
Best MSE Using XGBoost: 1394460.86615
XGBoost Importance Plot For This Case
Conclusion--Regarding the Variables
[Table: how each variable (e.g., Year) was treated in each model (as a factor, as numerical, or not included in the model), with columns for Lasso and XGBoost.]
Final Conclusion
In this case, XGBoost yields the lowest MSE among all the algorithms we
tried, which supports its reputation as the go-to algorithm on the Kaggle
competitive data science platform (something we found on one of the XGBoost
introduction webpages).
What's more, it turns out that the data cleaning procedures we tried did not
necessarily improve our results, because our best MSE came from completely
omitting the NAs in the training dataset and using the variables as they
appear in the original dataset. For example, we tried applying XGBoost to the
new data with some levels of Unit.Type and Dispatch.Status combined, but the
results were not as good.
Possible Further Improvements
We should spend more time learning about and trying the
XGBoost algorithm.