Stats 101c Final Project
--Team YYY
Yiklun Kei
Yongkai Zhu
Qianhao Yu
Data Cleaning Procedure
1. Delete Emergency Dispatch Code since it has only one level, and also
remove incident.ID and row.ID when running models.
2. Separate the Time variable into Hour, Minute, and Second, or
3. Divide the Time variable into intervals instead (we tried either two
levels (day: (6 AM, 6 PM]; night: (6 PM, 6 AM]) or four levels (morning: [6
AM, 12 PM); afternoon: [12 PM, 6 PM); evening: [6 PM, 12 AM); late night:
[12 AM, 6 AM)).
4. Divide Dispatch.Sequence into four intervals based on a tree model.
5. Combine levels in factor variables, especially the ones with many levels,
e.g. Unit.Type and Dispatch.Status (we combined all levels with fewer than
10,000 entries, ending up with 7 levels for Unit.Type and 6 levels for
Dispatch.Status; a sketch of steps 3 and 5 appears after this list).
6. Remove the NAs, or replace them with the mean or median, in the training
and testing data.
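As a rough illustration of steps 3 and 5 (a sketch only: it assumes an Hour column produced by step 2 and a training data frame named lafire; the cut points and the 10,000 threshold are the ones described above):

lafire$Time.of.Day <- cut(lafire$Hour, breaks = c(0, 6, 12, 18, 24),
                          labels = c("late night", "morning", "afternoon", "evening"),
                          right = FALSE, include.lowest = TRUE)   # four time-of-day levels (step 3)
rare <- names(which(table(lafire$Unit.Type) < 10000))              # Unit.Type levels with < 10,000 rows
lafire$Unit.Type <- factor(ifelse(lafire$Unit.Type %in% rare, "Other",
                                  as.character(lafire$Unit.Type))) # collapse them into "Other" (step 5)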
Algorithms that we have tried
Random Forest (can handle variables of mixed types and is robust to
outliers)
Lasso (the model.matrix command can encode factor inputs, and the lasso can
perform variable selection)
Artificial Neural Network (can induce hypotheses that generalize better than
other algorithms, but produces a complex model that is hard to interpret and
takes longer to train)
Random Forest
library(tree)
# Fit a regression tree for elapsed_time, cross-validate it, and prune to 4 terminal nodes
tree.lafire <- tree(elapsed_time ~ ., lafire)
cv.X <- cv.tree(tree.lafire)
prune.tree <- prune.tree(tree.lafire, best = 4)
new.predict <- predict(prune.tree, test)
tree.predict <- predict(tree.lafire, test)
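The code above fits, cross-validates, and prunes a single regression tree with the tree package. For the random forest itself, a minimal fit might look like the following (a sketch; it assumes the randomForest package, and ntree = 500 is a placeholder value):

library(randomForest)
# Note: randomForest cannot handle NAs or factors with very many levels,
# which is part of why we removed NAs and combined factor levels above
rf.lafire <- randomForest(elapsed_time ~ ., data = lafire, ntree = 500, importance = TRUE)
rf.predict <- predict(rf.lafire, newdata = test)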
Lasso
library(glmnet)
# x and y are the training model matrix and response; grid is a vector of candidate lambdas
lasso.mod = glmnet(x, y, alpha = 1, lambda = grid)
lafdtest$elapsed_time = NULL
# Test model matrix: drop unused columns, then the intercept column
newx = model.matrix(~., lafdtest[-c(1, 2, 5, 6, 7, 8, 10, 13, 16:22)])[, -1]
lasso.pred = predict(lasso.mod, s = bestlam, newx = newx, type = "response")
lassoprediction = data.frame(lafdtest$row.id, lasso.pred)
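The bestlam used above is the penalty chosen by cross-validation; a sketch of how it can be obtained, assuming x and y are the same training model matrix and response used to fit lasso.mod:

cv.out <- cv.glmnet(x, y, alpha = 1)   # 10-fold CV over the lambda path
bestlam <- cv.out$lambda.min           # lambda with the smallest CV error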
Artificial Neural Network
# Standardize the numeric columns before fitting the network
lafire[, c(2, 3, 8, 9, 10)] <- scale(lafire[, c(2, 3, 8, 9, 10)])
# Split the data into four bins using the tree-based Dispatch.Sequence cut points (step 4)
new1 <- lafire[which(lafire$Dispatch.Sequence < 16.346 & lafire$Dispatch.Sequence < 8.38764), ]
new2 <- lafire[which(lafire$Dispatch.Sequence < 16.346 & lafire$Dispatch.Sequence >= 8.38764), ]
new3 <- lafire[which(lafire$Dispatch.Sequence >= 16.346 & lafire$Dispatch.Sequence < 26.9571), ]
new4 <- lafire[which(lafire$Dispatch.Sequence >= 16.346 & lafire$Dispatch.Sequence >= 26.9571), ]
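A minimal sketch of fitting a network to one of these bins (assuming the nnet package; the size and maxit values are placeholders, not the settings we actually used):

library(nnet)
# One small single-hidden-layer network per bin; new1 shown, likewise new2-new4
ann1 <- nnet(elapsed_time ~ ., data = new1, size = 5, linout = TRUE, maxit = 500)
ann1.pred <- predict(ann1, newdata = new1)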
XGBoost
Dealing With Missing Values: removed all of them using na.omit in the training
data, and replaced them with the column means in the testing data.
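A sketch of that step (assuming lafire is the training data frame and lafdtest the testing data frame; only the numeric test columns get mean imputation):

lafire <- na.omit(lafire)                          # drop incomplete rows from the training data
for (j in which(sapply(lafdtest, is.numeric))) {   # mean-impute the numeric test columns
  lafdtest[is.na(lafdtest[, j]), j] <- mean(lafdtest[, j], na.rm = TRUE)
}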
Parameter Tuning: used cross-validation, starting from a range of candidate
values for each tuning parameter (a sketch appears after the prediction code below):
library(xgboost)
library(Matrix)  # for sparse.model.matrix
sparsemtxtest = sparse.model.matrix(~ . - 1, data = finaltest)  # sparse design matrix for the test set
testdata = xgb.DMatrix(data = sparsemtxtest)
prediction = predict(Training, testdata)  # "Training" is the fitted xgboost model
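A sketch of the cross-validated tuning and training step referenced above, assuming the xgboost and Matrix packages (finaltrain, and the max_depth, eta, and nrounds values, are placeholders rather than the ranges we actually searched):

library(xgboost)
library(Matrix)
sparsemtxtrain <- sparse.model.matrix(elapsed_time ~ . - 1, data = finaltrain)
dtrain <- xgb.DMatrix(data = sparsemtxtrain, label = finaltrain$elapsed_time)
params <- list(objective = "reg:linear", max_depth = 6, eta = 0.1)
cv.out <- xgb.cv(params = params, data = dtrain, nrounds = 500, nfold = 5,
                 early_stopping_rounds = 20, metrics = "rmse")   # cross-validated tuning
Training <- xgb.train(params = params, data = dtrain, nrounds = cv.out$best_iteration)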
Best MSE Using XGBoost: 1394460.86615
XGBoost Importance Plot For This Case
Conclusion--Regarding the Variables
[Table: how each variable (e.g., Year) was treated in each model (as a factor, as numerical, or not included in the model), with columns for Lasso and XGBoost.]
Final Conclusion
In this case, XGBoost yields the lowest MSE among all the algorithms we
tried, which supports its reputation as the go-to algorithm on the Kaggle
competitive data science platform (something we found on one of the XGBoost
introduction webpages).
What's more, it turns out that the data cleaning procedures we tried did not
necessarily improve our results, because our best MSE came from completely
omitting the NAs in the training dataset and using the variables as they
appear in the original dataset. For example, we tried applying XGBoost to the
new data with some levels of Unit.Type and Dispatch.Status combined, but the
results were not as good.
Possible Further Improvements
We should spend more time learning about and trying the
XGBoost algorithm.