Project on nypd accident analysis using hadoop environment

Analysis of NYPD Accident Big Data Using Hadoop Environment
Siddharth Chaudhary
National College of Ireland
MSc in Data Analytics
X16137001
Abstract-Traffic accidents and casualties are major
issues in most cities of the world. Reducing the rate of
accidents and casualties requires precautionary steps,
and a sound way to choose them is to analyse the
accident data generated over the past several years.
Millions of traffic accidents may have happened in
those years, so the volume of data is very large, and
processing it requires a well-suited data processing
environment. This project discusses the processing of
such accident big data and presents analytical results
that can help to tackle or avoid such accidents in
future. New York’s motor vehicle collision dataset is
used, and it is processed in the Hadoop distributed
ecosystem.
Introduction
Traffic accidents are a considerable issue in every
country. They cause many problems, such as traffic
jams and severe injuries, and can even lead to death.
Traffic accidents are especially pervasive in
metropolitan cities because of several factors: the
increasing number of vehicles, road intersections,
narrow streets and high-speed highways, along with
weather, driver distraction and rush hours. Most
accidents arise from these factors and are recorded by
the government. Analysing this accident data is a
necessary step towards avoiding future accidents.
A huge amount of traffic accident data is generated
every day and stored in big data environments. Such
data contains millions of rows, and processing it
requires an effective processing unit. This project uses
the NYC motor vehicle collisions dataset, which is
processed in the Hadoop ecosystem using map reduce
and other techniques for analysis and visualisation.
The following sections give the details: 1.Related
work, 2.Methodology, 3.Visualisation and Result,
4.Conclusion, 5.Future Work and 6.Reference.
1.Related work
For the past few years, traffic and road safety have
been a real challenge across the globe, and much
research has been done on reducing traffic-related
accidents. Kitchin, R. proposed an IoT-based model
that uses a real-time data system to predict traffic
outcomes. Planning of smart cities was carried out on
the basis of this research [1].
D. Marx [2] used the ELK stack (Elasticsearch,
Logstash, Kibana) to find patterns and trends in the
New York City motor collision dataset, which is
published on NYC’s open data portal for the public.
Various interactive visualisations of this dataset were
built using APIs, and they presented some interesting
facts about accidents due to weather conditions. The
dataset was visualised using APIs rather than map
reduce.
Mannering F. and Poch M. [3] proposed an approach
for carrying out correlation analysis on accident big
data. Although advanced data storage systems like
Hadoop did not exist at that time, the data was
processed in small chunks in a manner similar to map
reduce (parallel processing). The processed data was
then used to prevent accidents in Washington city on
the basis of predictions made through correlation
analysis of this data.
Bos, P.I. and Wouters [4] proposed an approach to
decrease the number of accidents based on a data
collector device fitted in the vehicle. This device
generates data every second and sends the location,
weather and speed of the vehicle to a big data
environment (remote system) for analysis. With
analysis at this frequency, accidents were reduced by
20%.
Glenda Ascencio [5] analysed the major factors
responsible for accidents. The outcome of the analysis
states that the majority of accidents happened in
summer; the visualisation was done using Tableau.

2.Methodology
A.Description of dataset
The dataset for this project was obtained from the NYC
open data portal [6] and is available to the public.
Originally there were 30 columns and more than
1,048,576 rows. Of these, 4 columns were deleted and
2 new columns were added. The first new column,
s.id, contains 1 in each row, and the second, Season,
contains one of the four seasons of New York City,
assigned on the basis of the month [7]. The data used
for this project covers the years 2013-2016 and has
854,654 rows and 28 columns, of which 14 columns
and all 854,654 rows are used. The table below (Fig.1)
describes the fields of the dataset and gives the reason
for each selected field.
Name                           Selected/Reason
s.id                           Yes/helpful in finding total number of accidents
Date                           No
Day                            No
Month                          Yes/it is of use to find accidents wrt. month
Season                         Yes/helpful in finding whether season affects events
Year                           Yes/it is of use to find the yearly pattern of events
Time                           No
Time_in_hour                   Yes/helpful in finding the occurrence of an event on hourly basis
Borough                        Yes/helps in borough based analysis
Zip_code                       No
Latitude                       No
Longitude                      No
On_Street                      Yes/helps in finding which street is prone to accidents
Cross_Street                   No
Off_Street                     No
Number_of_Person_injured       Yes/helps in finding persons injured in an accident
Number_of_Person_Killed        Yes/helps in finding persons killed in an accident
Number_of_Pedestrians_Killed   No
Number_of_Pedestrians_injured  No
Number_of_Cyclist_Injured      Yes/helps in finding cyclists injured in an accident
Number_of_Cyclist_killed       Yes/helps in finding cyclists killed in an accident
Number_of_Motorist_injured     Yes/helps in finding motorists injured in an accident
Number_of_Motorist_killed      Yes/helps in finding motorists killed in an accident
Contributing_factor_vehicle1   Yes/which are the most common factors for accidents
Contributing_factor_vehicle2   No
Unique_key                     No
Vehicle_type_1                 No
Vehicle_type_2                 No
Fig.1
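The derivation of the two added columns can be sketched in plain Python, which is not the tooling used in this project. The month-to-season mapping below is an assumption based on meteorological seasons; the mapping actually used follows [7] and may differ. The function name add_derived_columns is hypothetical.

```python
# Assumed month-to-season mapping (meteorological seasons); the mapping
# used in the project is taken from reference [7] and may differ.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}


def add_derived_columns(record):
    """Return a copy of a record with the s.id and Season columns added."""
    enriched = dict(record)
    enriched["s.id"] = 1  # constant 1, so that summing s.id counts accidents
    enriched["Season"] = SEASON_BY_MONTH[int(record["Month"])]
    return enriched


row = add_derived_columns({"Month": "7", "Year": "2015", "Borough": "QUEENS"})
print(row["s.id"], row["Season"])
```

Because s.id is 1 in every row, any later sum over s.id is simply a count of accidents in the group.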
B.Data Processing
(i). The above-mentioned dataset is stored on the local
file system.
(ii). This dataset is then loaded into a MySQL
database after creating the proper schema for it.
(iii). The data from MySQL is then loaded into HDFS
using Sqoop for further map reduce processing.
(iv). Three map reduce jobs are run on this dataset in
the Eclipse/HDFS environment using Java. The output
generated is stored in HDFS.
(v). The generated output is then extracted from HDFS
and stored in an HBase database. These outputs are
then transferred from HBase to the local file system
for visualisation.
(vi). Two Pig scripts are then run on the dataset stored
in HDFS using the Hadoop map reduce environment.
The generated output is stored in HDFS.
(vii). Three Hive scripts are run using the Hadoop map
reduce environment.
(viii). The output of Pig and Hive is then loaded onto
the local file system for visualisation.
The architecture given below (Fig.2) is a flowchart of
the above data processing flow and shows how the
Hadoop ecosystem is used to process the dataset.

Fig.2. Data processing architecture
C. Justification for chosen technologies
MySQL is chosen because it is open source and free to
use, which suits a dataset of this kind. It can store
large amounts of data, including big datasets like the
NYC motor collision dataset. MySQL is fast at both
storing and fetching data, and it is easy to use and
query.
Sqoop is an efficient tool for transferring large
amounts of data from a relational database like MySQL
into Hadoop. It transfers the data into Hadoop with the
same schema it has in MySQL.
The Eclipse environment and Java make the data
processing fast and easy, as Hadoop’s pre-built mapper
and reducer libraries help in creating the mapper and
reducer classes. Output is produced quickly because
the selected data is processed in parallel.
HBase is a distributed, column-oriented NoSQL
database. Its stored output can be accessed randomly
and used directly for visualisation.
Pig and Hive can also process semi-structured
datasets. In this they differ from the raw map reduce
programs written in the Eclipse environment, which
use only structured data. Pig and Hive are similar to
SQL to an extent, which makes them a preferable
choice for processing a dataset like this NYC one.
D. Description of Map Reduce algorithms
(i). Eclipse environment with Java:-For this project,
three map reduce jobs are run using the Eclipse
environment with Java. To carry out map reduce
processing, the Eclipse environment is configured with
Hadoop’s pre-defined map reduce libraries.
(a). Map Reduce 1
The input to the mapper is the attributes s.id and
Season. The mapper passes the key/value pair
(Season, s.id) to the reducer. The reducer outputs the
sum of s.id, i.e. the total number of accidents, grouped
by Season.
MapReduce 1
Mapper 1 -
Input- s.id, Season
Output -
Key - Season
Value – s.id
Reducer 1 - (Season, Accident)
(b). Map Reduce 2
The input to the mapper is the attributes s.id and Year.
The mapper passes the key/value pair (Year, s.id) to
the reducer. The reducer outputs the sum of s.id, i.e.
the total number of accidents, grouped by Year.
MapReduce 2
Mapper 2 -
Input- s.id, Year
Output -
Key - Year
Value – s.id
Reducer 2 - (Year, Accident)
(c). Map Reduce 3
The input to the mapper is the attributes s.id and
Time_in_hour. The mapper passes the key/value pair
(Time_in_hour, s.id) to the reducer. The reducer
outputs the sum of s.id, i.e. the total number of
accidents, grouped by Time_in_hour.
MapReduce 3
Mapper 3 -
Input- s.id, Time_in_hour
Output -
Key – Time_in_hour
Value – s.id
Reducer 3 - (Time_in_hour, Accident)
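All three jobs share one pattern: emit (key, s.id) in the mapper and sum the values per key in the reducer. The following is a minimal single-process sketch of that pattern in plain Python, not the Java code used in this project. The function names and the sample records are hypothetical, and the shuffle step only imitates what Hadoop does between the map and reduce phases.

```python
from collections import defaultdict


def mapper(record, key_field):
    """Emit one (key, s.id) pair per record, as Mappers 1-3 do."""
    yield record[key_field], int(record["s.id"])


def shuffle(pairs):
    """Group values by key, imitating Hadoop's shuffle-and-sort phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Sum the s.id values of one key to get its accident count."""
    return key, sum(values)


def run_job(records, key_field):
    """Run map, shuffle and reduce in sequence for one grouping field."""
    pairs = (pair for record in records for pair in mapper(record, key_field))
    return dict(reducer(k, v) for k, v in sorted(shuffle(pairs).items()))


# Hypothetical sample records, for illustration only.
sample = [
    {"s.id": "1", "Season": "Summer", "Year": "2013", "Time_in_hour": "8"},
    {"s.id": "1", "Season": "Summer", "Year": "2014", "Time_in_hour": "17"},
    {"s.id": "1", "Season": "Winter", "Year": "2014", "Time_in_hour": "17"},
]
print(run_job(sample, "Season"))  # {'Summer': 2, 'Winter': 1}
```

Changing only key_field to "Year" or "Time_in_hour" gives the equivalents of Map Reduce 2 and 3, which is why the three jobs differ only in their key.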
(ii)Pig with map reduce environment:-Two Pig
scripts are used for two different case studies in this
project. An appropriate schema named nypd was
defined, and the data stored in HDFS is loaded into
nypd to populate its attributes.

(a). Pig script 1 (Top 20 rows)
nypd is grouped by the column
“on_street_name”. For every value of
“on_street_name”, the column “accident” of the nypd
schema, which holds the s.id value of the data stored
in HDFS, is summed. The output is then sorted in
descending order, and the limit function is applied to
take the top 20 rows.
Pig script 1
Input-nypd
Group by- on_street_name
Sum-(nypd.accident)
Order by-DESC
Top rows -Limit(function)
Output-Top 20 accident prone streets
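The group, sum, order and limit steps of Pig script 1 can be sketched in plain Python as follows; this is not the Pig script itself. The function name top_streets and the sample rows are hypothetical, and skipping rows with an empty street name is an assumption that the source does not state.

```python
from collections import Counter


def top_streets(records, n=20):
    """Sum 'accident' per street and return the n largest totals."""
    totals = Counter()
    for record in records:
        street = record["on_street_name"]
        if street:  # assumption: rows without a street name are ignored
            totals[street] += record["accident"]
    # most_common sorts by total in descending order and keeps n entries,
    # covering the "Order by-DESC" and "Limit" steps together.
    return totals.most_common(n)


# Hypothetical sample rows, for illustration only.
rows = (
    [{"on_street_name": "BROADWAY", "accident": 1}] * 3
    + [{"on_street_name": "ATLANTIC AVENUE", "accident": 1}] * 2
    + [{"on_street_name": "", "accident": 1}]
)
print(top_streets(rows, n=2))
```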
(b). Pig script 2 (Factors responsible for accident)
nypd is grouped by the column
“factors_for_vehicle_1”. For every value of
“factors_for_vehicle_1”, the column “accident” of the
nypd schema, which holds the s.id value of the data
stored in HDFS, is summed. The output gives the
important factors responsible for accidents.
Pig script 2
Input-nypd
Group by- factors_for_vehicle_1
Sum-(nypd.accident)
Output-factors responsible for accidents.
(iii)Hive with map reduce environment:-Five Hive
queries are used for two case studies in this
project. A table named “data” is created to store the
data present in the nypd dataset.
(a) Hive Case study 1 (1 query used)
The output of the query is the number of accidents
that happened in each borough in the years 2013-2016.
The borough column contains the five boroughs and
some null values, so a where clause on borough is
applied in the query to select only rows that belong to
one of the five boroughs.
Query 1
From the table named data, the columns selected are
borough, year and accident. The where clause is then
applied. The table is grouped by year and borough,
and accident is summed.
Input-table data
Select-year, borough,accident
Where-borough
(Bronx,Brooklyn,Manhattan,Queens,Staten island)
Group by-borough,year
Output:-no. of accidents per year borough wise
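The filter and two-column grouping of Query 1 can be sketched in plain Python; this is not the HiveQL used in this project. The function name and sample rows are hypothetical, and upper-casing the borough before comparison is an assumption about how the values are stored.

```python
from collections import defaultdict

BOROUGHS = {"BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"}


def accidents_by_borough_year(rows):
    """Sum 'accident' per (borough, year), dropping null boroughs."""
    totals = defaultdict(int)
    for row in rows:
        borough = (row.get("borough") or "").upper()
        if borough in BOROUGHS:  # plays the role of the where clause
            totals[(borough, row["year"])] += row["accident"]
    return dict(totals)


# Hypothetical sample rows, for illustration only.
rows = [
    {"borough": "BROOKLYN", "year": 2013, "accident": 1},
    {"borough": "BROOKLYN", "year": 2013, "accident": 1},
    {"borough": "QUEENS", "year": 2014, "accident": 1},
    {"borough": None, "year": 2014, "accident": 1},
]
print(accidents_by_borough_year(rows))
```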
(b). Hive Case study 2 (4 queries used)
The output of this case study is the number of
cyclists/motorists who were injured/killed in each season.
Input-table data
Select-cyclist_killed,cyclist_injured,motorist_killed
,motorist_injured, season.
Sum-cyclist_killed,cyclist_injured,motorist_killed
,motorist_injured
Group by-season
Output:-accidents related to cyclist and motorist
season wise
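Case study 2 sums four columns per season. The project uses four separate Hive queries for this; the plain-Python sketch below computes all four sums in one pass purely for illustration. The function name and sample rows are hypothetical.

```python
COLUMNS = ("cyclist_killed", "cyclist_injured",
           "motorist_killed", "motorist_injured")


def casualties_by_season(rows):
    """Sum each casualty column per season."""
    totals = {}
    for row in rows:
        # Create a zeroed entry the first time a season is seen.
        season_totals = totals.setdefault(row["season"], dict.fromkeys(COLUMNS, 0))
        for column in COLUMNS:
            season_totals[column] += row[column]
    return totals


# Hypothetical sample rows, for illustration only.
rows = [
    {"season": "Summer", "cyclist_killed": 0, "cyclist_injured": 1,
     "motorist_killed": 0, "motorist_injured": 2},
    {"season": "Summer", "cyclist_killed": 1, "cyclist_injured": 0,
     "motorist_killed": 0, "motorist_injured": 1},
    {"season": "Winter", "cyclist_killed": 0, "cyclist_injured": 0,
     "motorist_killed": 1, "motorist_injured": 0},
]
print(casualties_by_season(rows))
```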
3.Visualisation and Result
Tableau and Excel are used to visualise and interpret
the map reduce outputs for the various case studies.
The first three case studies use the output of map
reduce in Java, followed by two case studies using the
Pig script output and two case studies using Hive.
Case Study:1
In this case study, we analyse how many accidents
happened in each season over the years 2013-2016
and ask whether season affects the rate of accidents.

Fig.3
Analysis:-From the above graph (Fig.3) we can
conclude that the highest number of accidents
happened in summer, very closely followed by
autumn. The lowest number of accidents happened in
winter. In spring, approximately 213,000 accidents
occurred. This suggests that season is an important
factor affecting the rate of accidents.
Case study :2
In this case study we analyse the pattern followed by
the rate of accidents in the years 2013-2016.
Fig.4
Analysis:-The above graph (Fig.4) shows that the
number of accidents increased from 2013 to 2016. The
line shows that the rate of accidents increased
gradually from 2013 to 2014 and then increased
sharply from 2014 to 2016. The pattern of the line
graph shows that the incidence of accidents is growing
year by year.
Case study :3
In this case study, accidents are analysed on an hourly
basis to see whether there is any trend in accidents
over the hours of the day.
Fig.5
Analysis:-From the above area graph (Fig.5) we can
conclude that there is a trend in the rate of accidents
over the hours of a day. The values on the x-axis are
the hours of the day: 1 denotes 01:00 and 15 denotes
15:00. The number of accidents is on the y-axis. The
graph shows that the lowest rate of accidents in a day
is between 12:00 am and 05:00 am, as people
generally sleep at this time and traffic on the road is at
its lowest. The rate of accidents then starts increasing
and reaches a morning peak at around 08:00 am, as
these few morning hours are rush hours. The rate then
dips a little before increasing gradually to the highest
peak of the day at 05:00 pm. Most accidents happen
between 16:00 and 19:00 in the evening, so people
should drive carefully during this time.
Case study :4
In this case study we try to understand which factors
are most commonly responsible for accidents.
Fig.6
[Fig.3 chart text: "Accidents in four years in different season"; x-axis: Season (Autumn, Spring, Summer, Winter); y-axis: No. of accidents 2013-2016 (190000 to 225000)]
[Fig.4 chart text: "Yearly accident"; x-axis: years (2013 to 2016); y-axis: Accidents (190000 to 230000)]
[Fig.5 chart text: "Time number of accident"; x-axis: Hourly (1 to 23); y-axis: Accident (0 to 80000)]

Analysis:-The above bubble chart (Fig.6) shows some
common factors responsible for accidents. Because the
chart was reduced in size to fit the IEEE format, some
of the information is lost, but the factors most often
responsible are Driver inattention, Fatigue, Failure to
yield, Other vehicular and Backing Unsafely. The size
of a bubble shows the frequency of the factor: the
bigger the bubble, the more often that factor is
involved in accidents. Driver inattention is one of the
major causes of accidents, followed by driver fatigue.
To reduce the rate of accidents, drivers should be
made aware of these factors and treat them as a high
concern.
Case study:5
In this case study we analyse which ten streets are
most prone to accidents. People should drive carefully
on these streets.
Fig.7
Analysis:-The above clustered bar graph (Fig.7) shows
the top 10 most dangerous streets of New York City.
The y-axis gives the name of the street and the x-axis
gives the number of accidents. Broadway is the most
dangerous street of New York, with more than 8000
accidents, followed by Atlantic Avenue with around
8000 accidents. People should drive with extra caution
on these roads, and the government should take some
precautionary steps to reduce the rate of accidents.
Case study :6
In this case study we analyse the accidents that
happened in the five boroughs of New York and try to
understand certain characteristics of the city.
Fig.8
Analysis:-The above clustered bar chart (Fig.8) shows
that the most unsafe borough roads are those of
Brooklyn, followed by Manhattan and Queens, where
the numbers of accidents are quite similar. Staten
Island had the fewest accidents in the years 2013-2016.
Considering the difference between the accidents in
Brooklyn and Staten Island, we can conclude that
Brooklyn is a highly crowded borough; it had the
highest number of recorded accidents in 2013-2016.
Case study :7
In this case study, we try to find out the effect of
season on the trend of cyclist and motorist accidents.
Fig.9
Analysis:-The above clustered column chart (Fig.9)
shows that the majority of accidents happened in
summer, followed by autumn, spring and winter.
Winter is the season in which people use public
transport more than bicycles and motorbikes, which is
one reason for its having the fewest accidents. Autumn
(fall) is a rainy season, which makes the roads
slippery, and slippery roads are one of the causes of
accidents for cyclists and motorists. Summer is the
season in which people prefer to use personal vehicles
to visit places, so the accident rate is high. In every
season, the graph shows injured motorists as the
largest category. People should therefore be made
aware of this to reduce the rate of accidents.
[Fig.7 chart text: x-axis: Number of accident (0 to 8000); y-axis, streets as listed: BROADWAY, ATLANTIC AVENUE, NORTHERN BOULEVARD, 3 AVENUE, FLATBUSH AVENUE, QUEENS BOULEVARD, LINDEN BOULEVARD, 2 AVENUE, JAMAICA AVENUE, 5 AVENUE]
[Fig.9 chart text: x-axis: Season (Autumn, Spring, Summer, Winter); y-axis: Accident (0 to 50000); series: Cyclist injured, cyclist killed, motorist injured, motorist killed]

4.Conclusion
This project combines different technologies related to
Hadoop that are generally used in the big data
universe to analyse huge datasets like the NYC motor
collision dataset and derive meaningful outcomes from
them. Tools such as HDFS, MapReduce, MySQL,
HBase, Pig and Hive were able to store and process a
huge amount of data in a few seconds. Hence, from our
analysis of the NYC dataset processed in the Hadoop
ecosystem using these technologies, we can conclude
that smart decisions can be made in the traffic system
to improve the transport system, which will eventually
help in minimising both the rate and the risk of
accidents.
5.Future Work
The dataset (NYC motor collision) used for this project
is updated every week, which will eventually increase
its size to a stage where it may no longer be practical
to process it using the map reduce approach. A
well-suited alternative for this kind of dataset is
Apache Spark. Spark processes huge datasets much
faster than map reduce and should eventually meet
the need for processing huge amounts of data in
Hadoop.
6.Reference
[1]. Kitchin, R., 2014. The real-time city? Big data and smart
urbanism. GeoJournal, 79(1), pp.1-14.
[2]. Dimitri Marx, “BYODemos: New York City Traffic
Incidents,” https://www.elastic.co/blog/byodemos-new-york-city-traffic-incidents, 2014.
[3]. Mannering, F. and Poch, M., 1996. Negative binomial
analysis of intersection-accident frequencies. Journal of
Transportation Engineering, 122(2), pp.105-113.
[4]. Bos, P.I. and Wouters, J.M., 2000. Traffic accident
reduction by monitoring driver behaviour with in-car data
recorders. Accident Analysis & Prevention, 32(5), pp.643-650.
[5]. Glenda Ascencio, “NYPD Motor Vehicle Collisions
Research Part1,” https://rstudio-pubs-static.s3.amazonaws.com/217730_0625ca1f20b34fe983efe07f786a73ee.html, 2016.
[6]. https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95##
[7]. http://www.nyc.com/visitor_guide/weather_facts.75835/
