Analysis of NYPD Accident Big Data Using Hadoop Environment
Siddharth Chaudhary
National College of Ireland
Msc in Data Analytics
X16137001
Abstract-Traffic accidents and casualties are major
issues in most cities of the world. Reducing the rate of
accidents and casualties requires precautionary steps,
and a good way to identify them is to analyse the data
generated over the past several years. Millions of
traffic accidents have happened in past years, so the
volume of data is very large, and processing it requires
a well-suited data processing environment. This project
discusses the processing of such accident big data and
presents analytical results that can help to tackle or
avoid such accidents in future. The project uses New
York's motor vehicle collision dataset and processes it
with the Hadoop distributed ecosystem.
Introduction
Traffic accidents are a considerable issue in every
country. They cause many problems such as traffic jams
and severe injuries, and can even lead to death. Traffic
accidents are pervasive, especially in metropolitan
cities, due to several factors: an increasing number of
vehicles, road intersections, narrow streets and
high-speed highways, along with other factors such as
weather, driver distraction and rush hours. Most
accidents arise from these factors and are recorded by
the government. Analysing this accident data is one of
the necessary steps to avoid future accidents. Every day
a huge amount of traffic accident data is generated and
stored in big data environments. Such data contains
millions of rows, and processing it requires an
effective processing environment. This project uses the
NYC motor vehicle collisions dataset, which is processed
in the Hadoop ecosystem using map reduce and other
techniques for analysis and visualisation.
The following sections cover: 1. Related work,
2. Methodology, 3. Result, 4. Conclusion, 5. Future work
and 6. Reference.
1.Related work
Over the past few years, traffic and road safety has
been a real challenge across the globe, and much
research has been done to reduce traffic-related
accidents. Kitchin, R. proposed a model based on IoT
which uses a real-time data system to predict traffic
outcomes. Planning of smart cities was carried out on
the basis of his research [1].
D. Marx [2] used the ELK stack (Elasticsearch, Logstash,
Kibana) to find various patterns and trends in the New
York City motor collision dataset, which is published on
NYC's public open data portal. Various interactive
visualisations of this dataset were made using APIs, and
they presented some interesting facts about accidents
due to weather conditions. The dataset was visualised
through APIs rather than map reduce.
Mannering F. and Poch M. [3] proposed an approach to
carry out correlation analysis on accident big data.
Although advanced data storage systems like Hadoop did
not exist at that time, the data was processed in small
chunks in parallel, in the manner of map reduce. The
processed data was then used to prevent accidents in
Washington on the basis of predictions obtained from
correlation analysis of this data. Bos, P.I. and
Wouters [4] proposed an approach to decrease the number
of accidents based on a data collector device fitted in
the vehicle. This device generates data every second and
sends data on the location, weather and speed of the
vehicle to a big data environment (remote system) for
analysis. Owing to this frequent analysis, accidents
were reduced by 20%.
Glenda Ascencio [5] researched and analysed the major
factors responsible for accidents. The analysis found
that the majority of accidents happened in summer, and
the visualisation was done using Tableau.
2.Methodology
A.Description of dataset
The dataset for this project was obtained from the NYC
open data portal [6] and is publicly available.
Originally there were 30 columns and more than
1,048,576 rows. Four of these columns were deleted and
two new columns were added, giving 28 columns. The first
new column, s.id, contains 1 in each row, and the
second, Season, contains the four seasons of New York
City derived from the month [7]. The data used for this
project covers the years 2013-2016 and has 854,654 rows
and 28 columns, of which 14 columns and all 854,654 rows
are used. The description table below (Fig.1) explains
the important fields and the reasons for their
selection.
Name                            Selected/Reason
s.id                            Yes/helpful in finding total number of accidents
Date                            No
Day                             No
Month                           Yes/it is of use to find accidents wrt. month
Season                          Yes/helpful in finding whether season affects events
Year                            Yes/it is of use to find the pattern of events yearly
Time                            No
Time_in_hour                    Yes/helpful in finding the occurrence of an event on an hourly basis
Borough                         Yes/helps in borough based analysis
Zip_code                        No
Latitude                        No
Longitude                       No
On_Street                       Yes/helps in finding which street is prone to accidents
Cross_Street                    No
Off_Street                      No
Number_of_Person_injured        Yes/helps in finding persons injured in an accident
Number_of_Person_Killed         Yes/helps in finding persons killed in an accident
Number_of_Pedestrians_Killed    No
Number_of_Pedestrians_injured   No
Number_of_Cyclist_Injured       Yes/helps in finding cyclists injured in an accident
Number_of_Cyclist_killed        Yes/helps in finding cyclists killed in an accident
Number_of_Motorist_injured      Yes/helps in finding motorists injured in an accident
Number_of_Motorist_killed       Yes/helps in finding motorists killed in an accident
Contributing_factor_vehicle1    Yes/which are the most common factors for accidents
Contributing_factor_vehicle2    No
Unique_key                      No
Vehicle_type_1                  No
Vehicle_type_2                  No
Fig.1
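The two added columns, s.id and Season, could be derived along the following lines. This is a minimal Python sketch, not the author's preparation step: the month-to-season mapping is an assumption (meteorological seasons), since the paper takes its mapping from reference [7], and the sample record and the function name add_derived_columns are hypothetical.

```python
# Assumed month-to-season mapping (meteorological seasons). The paper derives
# its own mapping from reference [7], which may differ.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}


def add_derived_columns(row):
    """Return a copy of one record with the s.id and Season columns added."""
    enriched = dict(row)
    # Every row is one accident, so summing s.id later counts accidents.
    enriched["s.id"] = 1
    enriched["Season"] = SEASON_BY_MONTH[int(row["Month"])]
    return enriched


# Hypothetical sample record, not taken from the dataset.
sample = {"Month": "7", "Year": "2015", "Borough": "BROOKLYN"}
enriched_sample = add_derived_columns(sample)
print(enriched_sample)
```

Because s.id is always 1, any later sum of s.id over a group is simply the number of accident rows in that group.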
B.Data Processing
(i). The above-mentioned dataset is stored in the local
storage of the system.
(ii). The dataset is then loaded into a MySQL database
after creating the proper schema for it.
(iii). The data from MySQL is then loaded into HDFS
using Sqoop for map reduce processing.
(iv). Three map reduce jobs are run on this dataset in
the Eclipse/HDFS environment using Java. The output
generated is stored in HDFS.
(v). The generated output is then extracted from HDFS
and stored in an HBase database. These outputs are then
transferred from HBase to local storage for
visualisation.
(vi). Two Pig scripts are then run on the dataset stored
in HDFS using the Hadoop map reduce environment. The
generated output is stored in HDFS.
(vii). Three Hive scripts are run using the Hadoop map
reduce environment.
(viii). The output of Pig and Hive is then loaded into
local storage for visualisation.
The architecture given below (Fig.2) is a flowchart of
the above data process flow and gives an insight into
how the Hadoop ecosystem is used to process the dataset.
Fig.2. Data processing architecture
C. Justification for chosen technologies
MySQL is chosen because it is open source and free to
use, and it is well suited to storing this kind of
dataset. As it is capable of storing a huge amount of
data, it can hold big datasets like the NYC motor
collision dataset. MySQL is fast at both storing and
fetching data, and it is easy to use and query.
Sqoop is an efficient tool which can transfer huge
amounts of data from a relational database like MySQL
into Hadoop. It transfers the data into Hadoop with the
same schema as in MySQL.
The Eclipse environment and Java make the data
processing fast and easy, as pre-built Hadoop mapper and
reducer libraries help in creating the mapper and
reducer classes. Output is produced quickly because the
selected data is processed in parallel.
HBase is a distributed, column-based NoSQL database; its
data can be accessed randomly and used directly for
visualisation.
Pig and Hive can also process semi-structured datasets.
This distinguishes them from the raw map reduce jobs
written in the Eclipse environment, which use only the
structured dataset. Pig and Hive are similar to SQL to
an extent, which makes them a preferable choice for
processing a dataset like the NYC one.
D. Description of Map Reduce algorithms
(i). Eclipse environment with Java:-For this project
three map reduce jobs are run using the Eclipse
environment with Java. To carry out map reduce
processing, the Eclipse environment is configured with
Hadoop's pre-defined map reduce libraries.
(a). Map Reduce 1
The inputs to the mapper are the attributes s.id and
Season. The mapper passes these to the reducer as a
key/value pair. The reducer outputs the sum of s.id,
i.e. the total number of accidents, grouped by Season.
MapReduce 1
Mapper 1 -
Input- s.id, Season
Output -
Key - Season
Value – s.id
Reducer 1 - (Season, Accident)
(b). Map Reduce 2
The inputs to the mapper are the attributes s.id and
Year. The mapper passes these to the reducer as a
key/value pair. The reducer outputs the sum of s.id,
i.e. the total number of accidents, grouped by Year.
MapReduce 2
Mapper 2 -
Input- s.id, Year
Output -
Key - Year
Value – s.id
Reducer 2 - (Year, Accident)
(c). Map Reduce 3
The inputs to the mapper are the attributes s.id and
Time_in_hour. The mapper passes these to the reducer as
a key/value pair. The reducer outputs the sum of s.id,
i.e. the total number of accidents, grouped by
Time_in_hour.
MapReduce 3
Mapper 3 -
Input- s.id, Time_in_hour
Output -
Key – Time_in_hour
Value – s.id
Reducer 3 - (Time_in_hourwise, Accident)
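The three jobs above share one pattern: the mapper emits (key, s.id) and the reducer sums the values per key. The following is a single-process Python sketch of that pattern, offered only as an illustration; it is not the author's Java code and does not use the Hadoop libraries. The function names (mapper, shuffle, reducer, run_job) and the sample records are hypothetical.

```python
from collections import defaultdict


def mapper(record, key_field):
    """Emit one (key, s.id) pair per accident record."""
    yield record[key_field], record["s.id"]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Sum the s.id values of one key to get its accident count."""
    return key, sum(values)


def run_job(records, key_field):
    """Run map, shuffle and reduce in sequence for one grouping column."""
    pairs = (pair for record in records for pair in mapper(record, key_field))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


# Hypothetical records; in the project the rows are read from HDFS.
records = [
    {"s.id": 1, "Season": "Summer", "Year": 2015, "Time_in_hour": 17},
    {"s.id": 1, "Season": "Summer", "Year": 2016, "Time_in_hour": 8},
    {"s.id": 1, "Season": "Winter", "Year": 2016, "Time_in_hour": 17},
]

by_season = run_job(records, "Season")      # corresponds to Map Reduce 1
by_year = run_job(records, "Year")          # corresponds to Map Reduce 2
by_hour = run_job(records, "Time_in_hour")  # corresponds to Map Reduce 3
print(by_season, by_year, by_hour)
```

Only the key column changes between the three jobs, which is why one parameterised function is enough in this sketch.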
(ii)Pig with map reduce environment:-Two Pig scripts
have been used for two different case studies in this
project. An appropriate schema named nypd was created,
and the data stored in HDFS is loaded into it.
(a). Pig script 1 (Top 20 rows)
nypd is grouped by the column “on_street_name”. For
every value of “on_street_name”, the column “accident”
of the nypd schema, which holds the s.id values of the
data stored in HDFS, is summed. The output is then
sorted in descending order, and the limit function is
applied to take the top 20 rows.
Pig script 1
Input-nypd
Group by- on_street_name
Sum-(nypd.accident)
Order by-DESC
Top rows -Limit(function)
Output-Top 20 accident prone streets
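The group, sum, order and limit steps of Pig script 1 can be illustrated with a short Python stand-in. This is a sketch of the logic only, not the Pig Latin script used in the project; the function name top_streets and the sample rows are hypothetical.

```python
from collections import Counter


def top_streets(rows, limit=20):
    """Group by on_street_name, sum accident, sort descending, keep `limit` rows."""
    totals = Counter()
    for row in rows:
        totals[row["on_street_name"]] += row["accident"]
    # most_common sorts by total in descending order and truncates the result.
    return totals.most_common(limit)


# Hypothetical rows; "accident" holds the s.id value (always 1).
rows = [
    {"on_street_name": "BROADWAY", "accident": 1},
    {"on_street_name": "ATLANTIC AVENUE", "accident": 1},
    {"on_street_name": "BROADWAY", "accident": 1},
]
ranking = top_streets(rows, limit=2)
print(ranking)
```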
(b). Pig script 2 (Factors responsible for accident)
nypd is grouped by the column “factors_for_vehicle_1”.
For every value of “factors_for_vehicle_1”, the column
“accident” of the nypd schema, which holds the s.id
values of the data stored in HDFS, is summed. The output
gives the important factors responsible for accidents.
Pig script 2
Input-nypd
Group by- factors_for_vehicle_1
Sum-(nypd.accident)
Output-factors responsible for accidents.
(iii)Hive with map reduce environment:-Five Hive
queries have been used for two case studies in this
project. A table named “data” is created to store the
data present in the nypd dataset.
(a) Hive Case study 1 (1 query used)
The output of the query is the number of accidents that
happened in each borough in the years 2013-2016. The
borough column contains five boroughs and some null
values, so a where clause on borough is used to select
only the five boroughs.
Query 1
From the table named data, the columns selected were
borough, year, no_of_person_killed. The where clause is
then applied. The table is grouped by year and borough
and summed by accident.
Input-table data
Select-year, borough,accident
Where-borough
(Bronx,Brooklyn,Manhattan,Queens,Staten island)
Group by-borough,year
Output:-no. of accidents per year borough wise
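The filter, group and sum logic of Query 1 can be sketched in Python as follows. This is an illustration of the steps listed above, not the HiveQL used in the project; the function name, the upper-case borough spelling and the sample rows are assumptions.

```python
from collections import defaultdict

# The five boroughs named in the where clause (upper-case spelling assumed).
BOROUGHS = {"BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"}


def accidents_by_borough_and_year(rows):
    """Drop null or unknown boroughs, then sum accident per (borough, year)."""
    totals = defaultdict(int)
    for row in rows:
        borough = row.get("borough")
        if borough is None or borough.upper() not in BOROUGHS:
            continue  # plays the role of the where clause
        totals[(borough.upper(), row["year"])] += row["accident"]
    return dict(totals)


# Hypothetical rows, including one with a null borough.
rows = [
    {"borough": "BROOKLYN", "year": 2015, "accident": 1},
    {"borough": "BROOKLYN", "year": 2015, "accident": 1},
    {"borough": "QUEENS", "year": 2016, "accident": 1},
    {"borough": None, "year": 2016, "accident": 1},
]
per_borough_year = accidents_by_borough_and_year(rows)
print(per_borough_year)
```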
(b). Hive Case study 2 (4 queries used)
The output of this case study is the number of
cyclists/motorists who were injured/killed in different
seasons.
Input-table data
Select-cyclist_killed,cyclist_injured,motorist_killed,motorist_injured, season.
Sum-cyclist_killed,cyclist_injured,motorist_killed,motorist_injured
Group by-season
Output:-accidents related to cyclist and motorist season wise
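The per-season sums of this case study can be illustrated in Python. The paper runs four separate Hive queries; the sketch below computes all four sums in one pass purely for brevity, so it shows the aggregation rather than the author's queries. The function name casualties_by_season and the sample rows are hypothetical.

```python
from collections import defaultdict

# The four summed columns listed in the case study.
MEASURES = ("cyclist_killed", "cyclist_injured", "motorist_killed", "motorist_injured")


def casualties_by_season(rows):
    """Sum each casualty column per season."""
    totals = defaultdict(lambda: dict.fromkeys(MEASURES, 0))
    for row in rows:
        for measure in MEASURES:
            totals[row["season"]][measure] += row[measure]
    return dict(totals)


# Hypothetical rows, not taken from the dataset.
rows = [
    {"season": "Summer", "cyclist_killed": 0, "cyclist_injured": 1,
     "motorist_killed": 0, "motorist_injured": 2},
    {"season": "Summer", "cyclist_killed": 1, "cyclist_injured": 0,
     "motorist_killed": 0, "motorist_injured": 1},
    {"season": "Winter", "cyclist_killed": 0, "cyclist_injured": 0,
     "motorist_killed": 1, "motorist_injured": 1},
]
season_totals = casualties_by_season(rows)
print(season_totals)
```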
3.Visualisation and Result
Tableau and Excel are used to visualise and interpret
the map reduce outputs for the various case studies. The
first three case studies use the output of map reduce
written in Java, followed by two case studies using the
Pig script output and two case studies using Hive.
Case Study:1
In this case study, we will try to analyse how many
accidents happened in different seasons over the years
2013-2016, and whether season affects the rate of
accidents.
Fig.3
Analysis:-From the above graph (Fig.3) we can conclude
that the highest number of accidents happened in summer,
very closely followed by autumn. The fewest accidents
happened in winter. In spring approximately 213,000
accidents occurred. This suggests that season is an
important factor affecting the rate of accidents.
Case study :2
In this case study we will try to check and analyse the
pattern followed by the rate of accidents in the years
2013-2016.
Fig.4
Analysis:-The above graph (Fig.4) shows that the number
of accidents increased from 2013 to 2016. The line shows
that the rate of accidents increased gradually from 2013
to 2014 and then rose sharply from 2014 to 2016. The
pattern of the line graph shows that the incidence of
accidents is growing year by year.
Case study :3
In this case study accidents are analysed on an hourly
basis, to see whether there is any trend in accidents
during the hours of the day.
Fig.5
Analysis:-From the above area graph (Fig.5) we can
conclude that there is a trend in the rate of accidents
during the hours of a day. The x-axis shows the time in
hours of the day: 1 denotes 01:00 and 15 denotes 15:00.
The number of accidents is on the y-axis. The graph
shows that the lowest rate of accidents in a day is
between 12:00 am and 05:00 am; as people generally sleep
at this time, traffic on the road is at its lowest. The
rate of accidents then starts increasing and reaches a
morning peak at around 08:00 am, as these few hours in
the morning are rush hours. The rate then dips a little
but increases gradually and reaches the highest peak of
the day at 05:00 pm. Most accidents happen between 16:00
and 19:00 in the evening. Therefore people should drive
carefully during this time.
Case study :4
In this case study we will try to understand which
factors are most commonly responsible for accidents.
Fig.6
[Fig.3 chart: "Accidents in four years in different season"; x-axis: Season (Autumn, Spring, Summer, Winter); y-axis: No. of accidents 2013-2016]
[Fig.4 chart: "Yearly accident"; x-axis: years (2013-2016); y-axis: Accidents]
[Fig.5 chart: x-axis: Hourly (1-23); y-axis: Accident; series: Time, number of accident]
Analysis:-The above bubble chart (Fig.6) shows some
common factors responsible for accidents. As the chart
was shrunk to fit the IEEE format, some of the
information is lost. The top responsible factors are
Driver inattention, Fatigue, Failure to yield, Other
vehicular and Backing unsafely. The size of a bubble
shows the frequency of the factor: the bigger the
bubble, the more often that factor is involved in
accidents. Driver inattention is one of the major causes
of accidents, followed by driver fatigue. To reduce the
rate of accidents, drivers should be made aware of these
factors, which deserve serious attention.
Case study:5
In this case study we will try to analyse which ten
streets are most prone to accidents. People should drive
carefully on these streets.
Fig.7
Analysis:-The above clustered bar graph shows the top 10
most dangerous streets of New York City. The y-axis
shows the street name and the x-axis shows the number of
accidents. Broadway is the most dangerous street in New
York, with more than 8000 accidents, followed by
Atlantic Avenue with around 8000 accidents. People
should drive with extra caution on these roads, and the
government should take some precautionary steps to
reduce the rate of accidents.
Case study :6
In this case study we analyse the accidents that
happened in the five boroughs of New York and try to
understand certain characteristics of the city.
Fig.8
Analysis:-The above clustered bar chart shows that the
most unsafe roads are in Brooklyn, followed by Manhattan
and Queens, where the numbers of accidents are quite
similar. Staten Island had the fewest accidents in
2013-2016. Considering the difference between the
accidents in Brooklyn and Staten Island, we can conclude
that Brooklyn is a highly crowded borough; it also had
the highest number of recorded accidents in 2013-2016.
Case study :7
In this case study, we will try to find out the effect
of season on the trend of cyclist and motorist
accidents.
Fig.9
Analysis:-The above clustered column chart (Fig.9) shows
that the majority of accidents happened in summer,
followed by autumn, spring and winter. Winter is the
season in which people use public transport more than
cycles and motorbikes, which contributes to it having
the fewest accidents. Autumn (fall) is the season of
rain, which makes roads slippery, and slippery roads are
one of the causes of accidents for cyclists/motorists.
Summer is the season in which people prefer to use
personal vehicles to visit places, so the accident rate
is high. In every season, injured motorists make up the
largest category in the graph. Therefore, people should
be made aware of this to reduce the rate of accidents.
[Fig.7 chart: x-axis: Number of accident (0-8000); y-axis: streets, listed as BROADWAY, ATLANTIC AVENUE, NORTHERN BOULEVARD, 3 AVENUE, FLATBUSH AVENUE, QUEENS BOULEVARD, LINDEN BOULEVARD, 2 AVENUE, JAMAICA AVENUE, 5 AVENUE]
[Fig.9 chart: x-axis: Season (Autumn, Spring, Summer, Winter); y-axis: Accident; series: Cyclist injured, cyclist killed, motorist injured, motorist killed]
4.Conclusion
This project combines different technologies related to
Hadoop which are generally used in the big data world to
analyse huge datasets like the NYC motor collision
dataset and draw meaningful outcomes from them. Tools
such as HDFS, MapReduce, MySQL, HBase, Pig and Hive were
able to store and process a huge amount of data in a few
seconds. Hence, from our analysis of the NYC dataset,
processed in the Hadoop ecosystem using these
technologies, we can conclude that such analysis
supports smart decisions in the traffic system that
improve the transport system, which will eventually help
in minimising both the rate and the risk of accidents.
5.Future Work
The dataset (NYC motor collision) used for this project
is updated every week, which will eventually increase
its size to a stage where it can no longer be processed
using the map reduce approach. A well-suited alternative
for this kind of dataset is Apache Spark. Spark
processes huge datasets much faster than map reduce and
will eventually meet the need for processing huge
amounts of data in Hadoop.
6.Reference
[1]. Kitchin, R., 2014. The real-time city? Big data and smart
urbanism. GeoJournal, 79(1), pp.1-14.
[2]. Dimitri Marx, “BYODemos: New York City Traffic
Incidents,”
https://www.elastic.co/blog/byodemos-new-york-city-traffic-incidents, 2014.
[3]. Mannering F. and Poch M., 1996. Negative binomial analysis
of intersection-accident frequencies. Journal of
Transportation Engineering, 122(2), pp.105-113.
[4]. Bos, P.I. and Wouters, J.M., 2000. Traffic accident
reduction by monitoring driver behaviour with in-car data
recorders. Accident Analysis & Prevention, 32(5), pp.643-650.
[5]. Glenda Ascencio, “NYPD Motor Vehicle Collisions
Research Part1,”
https://rstudio-pubs-static.s3.amazonaws.com/217730_0625ca1f20b34fe983efe07f786a73ee.html, 2016.
[6]. https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95
[7]. http://www.nyc.com/visitor_guide/weather_facts.75835/
Certificate introduction to r for finance
Siddharth Chaudhary
 
Certificate arima modeling with r
Certificate arima modeling with rCertificate arima modeling with r
Certificate arima modeling with r
Siddharth Chaudhary
 
Certificate introduction to r course
Certificate introduction to r courseCertificate introduction to r course
Certificate introduction to r course
Siddharth Chaudhary
 
Data warehouse project on retail store
Data warehouse project on retail storeData warehouse project on retail store
Data warehouse project on retail store
Siddharth Chaudhary
 
Ad

Recently uploaded (20)

chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Data Science Courses in India iim skills
Data Science Courses in India iim skillsData Science Courses in India iim skills
Data Science Courses in India iim skills
dharnathakur29
 
Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...
Pixellion
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Data Science Courses in India iim skills
Data Science Courses in India iim skillsData Science Courses in India iim skills
Data Science Courses in India iim skills
dharnathakur29
 
Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...
Pixellion
 
Ad

Project on nypd accident analysis using hadoop environment

The following sections cover: 1. Related work, 2. Methodology, 3. Result, 4. Conclusion, 5. Future work and 6. Reference.

1. Related work
For the past few years, traffic and road safety has been a real challenge across the globe, and much research has been done on reducing traffic-related accidents. Kitchin, R. proposed an IoT-based model that uses a real-time data system to predict traffic outcomes; smart-city planning was carried out on the basis of this research [1].

D. Marx [2] used the ELK stack (Elasticsearch, Logstash, Kibana) to find patterns and trends in the New York City motor collision dataset, which is published on NYC's open data portal and is available to the public. Several interactive visualisations of this dataset were built using APIs rather than map reduce, and they presented some interesting facts about accidents related to weather conditions.

Mannering F. and Poch M. [3] proposed an approach for carrying out correlation analysis on accident big data. Although advanced data storage systems such as Hadoop were not available at that time, the data was processed in small chunks using map reduce (parallel processing). The processed data was then used to prevent accidents in Washington, on the basis of predictions obtained from the correlation analysis.

Bos, P.I. and Wouters, J.M. [4] proposed an approach to decrease the number of accidents using a data collector device fitted in the vehicle. The device generates data every second and sends the location, weather and speed of the vehicle to a big data environment (a remote system) for analysis. Thanks to this frequent analysis, accidents were reduced by 20%.

Glenda Ascencio [5] analysed the major factors responsible for accidents. The analysis found that the majority of accidents happened in summer; the visualisation was done using Tableau.
2. Methodology

A. Description of dataset
The dataset for this project was obtained from the NYC open data portal [6] and is publicly available. Originally it had 30 columns and more than 1,048,576 rows. Four columns were deleted and two new columns were added. The first new column, s.id, contains 1 in every row; the second, Season, contains one of the four seasons of New York City, assigned on the basis of the month [7]. The data used for this project covers the years 2013-2016 and has 854,654 rows and 28 columns, of which 14 columns and all 854,654 rows are used. The description table below (Fig.1) lists the fields and, for the selected ones, the reason for their selection.

Name                            Selected  Reason
s.id                            Yes       Helps find the total number of accidents
Date                            No
Day                             No
Month                           Yes       Used to find accidents with respect to month
Season                          Yes       Helps find whether season affects events
Year                            Yes       Used to find the yearly pattern of events
Time                            No
Time_in_hour                    Yes       Helps find the occurrence of events on an hourly basis
Borough                         Yes       Helps in borough-based analysis
Zip_code                        No
Latitude                        No
Longitude                       No
On_Street                       Yes       Helps find which streets are prone to accidents
Cross_Street                    No
Off_Street                      No
Number_of_Person_injured        Yes       Helps find persons injured in an accident
Number_of_Person_Killed         Yes       Helps find persons killed in an accident
Number_of_Pedestrians_Killed    No
Number_of_Pedestrians_injured   No
Number_of_Cyclist_Injured       Yes       Helps find cyclists injured in an accident
Number_of_Cyclist_killed        Yes       Helps find cyclists killed in an accident
Number_of_Motorist_injured      Yes       Helps find motorists injured in an accident
Number_of_Motorist_killed       Yes       Helps find motorists killed in an accident
Contributing_factor_vehicle1    Yes       Shows the most common factors for accidents
Contributing_factor_vehicle2    No
Unique_key                      No
Vehicle_type_1                  No
Vehicle_type_2                  No

Fig.1

B. Data Processing
(i). The dataset described above is stored in the local memory of the system.
(ii). The dataset is then loaded into a MySQL database after creating a proper schema for it.
(iii). The data is loaded from MySQL into HDFS using Sqoop for further map reduce processing.
(iv). Three map reduce jobs are run on this dataset in the Eclipse/HDFS environment using Java. The generated output is stored in HDFS.
(v). The generated output is extracted from HDFS and stored in an HBase database. These outputs are then transferred from HBase into local memory for visualisation.
(vi). Two Pig scripts are run on the dataset stored in HDFS using the Hadoop map reduce environment. The generated output is stored in HDFS.
(vii). Three Hive scripts are run using the Hadoop map reduce environment.
(viii). The output of Pig and Hive is then loaded into local memory for visualisation.

The architecture given below (Fig.2) is a flowchart of the data process flow above and shows how the Hadoop ecosystem is used to process the dataset.
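The two columns added during dataset preparation (s.id and Season) can be illustrated with a short sketch. It is written in plain Python for readability and is not the author's preprocessing code. The month-to-season mapping below is an assumption (meteorological seasons); the report derives seasons from months using reference [7] but does not list the mapping. The sample records and the function name are hypothetical.

```python
# Assumed mapping from month number to season (meteorological seasons).
# The report only says seasons are assigned "on the basis of months" [7].
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}


def add_derived_columns(row):
    """Return a copy of one record with the two added columns."""
    enriched = dict(row)
    # s.id is 1 in every row, so summing it later counts accidents.
    enriched["s.id"] = 1
    enriched["Season"] = SEASON_BY_MONTH[int(row["Month"])]
    return enriched


# Hypothetical sample records; field names follow Fig.1.
sample_rows = [
    {"Month": "1", "Year": "2013", "Borough": "BROOKLYN"},
    {"Month": "7", "Year": "2015", "Borough": "QUEENS"},
]
prepared_rows = [add_derived_columns(row) for row in sample_rows]
```

Because s.id is a constant 1, every later "sum of s.id" in the map reduce, Pig and Hive steps is simply a row count per group.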
Fig.2. Data processing architecture

C. Justification for chosen technologies
MySQL is chosen because it is open source and free to use, which makes it well suited for storing this kind of dataset. It can store large amounts of data, including big datasets such as the NYC motor collision dataset. MySQL is fast at both storing and fetching data, and it is easy to use and query.

Sqoop is an efficient tool for transferring large volumes of data from a relational database such as MySQL into Hadoop. It transfers the data into Hadoop with the same schema it has in MySQL.

The Eclipse environment with Java makes data processing fast and easy, since Hadoop's pre-built mapper and reducer libraries help in creating the mapper and reducer classes. Output is produced very quickly because the selected data is processed in parallel.

HBase is a distributed, column-oriented NoSQL database. Its output can be accessed randomly and used directly for visualisation.

Pig and Hive can also process semi-structured datasets, unlike the raw map reduce jobs written in the Eclipse environment, which in this project use only structured data. Pig and Hive are similar to SQL to an extent, which makes them a preferable choice for processing a dataset of this kind.

D. Description of Map Reduce algorithms
(i). Eclipse environment with Java:- Three map reduce jobs are run in this project using the Eclipse environment with Java. To carry out map reduce processing, the Eclipse environment is configured with Hadoop's pre-defined map reduce libraries.

(a). Map Reduce 1
The inputs to the mapper are the attributes s.id and Season. This key/value pair is passed to the reducer. The reducer outputs the sum of s.id, which is the total number of accidents, grouped by Season.
MapReduce 1
Mapper 1 - Input: s.id, Season; Output: Key - Season, Value - s.id
Reducer 1 - (Season, Accident)

(b). Map Reduce 2
The inputs to the mapper are the attributes s.id and Year. This key/value pair is passed to the reducer. The reducer outputs the sum of s.id, which is the total number of accidents, grouped by Year.
MapReduce 2
Mapper 2 - Input: s.id, Year; Output: Key - Year, Value - s.id
Reducer 2 - (Year, Accident)

(c). Map Reduce 3
The inputs to the mapper are the attributes s.id and Time_in_hour. This key/value pair is passed to the reducer. The reducer outputs the sum of s.id, which is the total number of accidents, grouped by Time_in_hour.
MapReduce 3
Mapper 3 - Input: s.id, Time_in_hour; Output: Key - Time_in_hour, Value - s.id
Reducer 3 - (Time_in_hourwise, Accident)

(ii). Pig with map reduce environment:- Two Pig scripts are used for two different case studies in this project. An appropriate schema named nypd was created, and the data stored in HDFS is extracted to store the attribute values in nypd.
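Before the Pig scripts, it may help to see that the three Java jobs in D(i) share one pattern: the mapper emits (key, s.id) and the reducer sums the values per key. The sketch below imitates that flow in plain Python on a single machine; it is not the author's Java code and does not use Hadoop's libraries. The function names and the sample records are hypothetical.

```python
from collections import defaultdict


def mapper(record, key_field):
    """Map step: emit one (key, s.id) pair per accident record."""
    return record[key_field], record["s.id"]


def shuffle(pairs):
    """Group mapper output by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce step: the sum of s.id is the number of accidents for the key."""
    return key, sum(values)


def run_job(records, key_field):
    """Run map, shuffle and reduce in sequence for one grouping attribute."""
    pairs = [mapper(record, key_field) for record in records]
    grouped = shuffle(pairs)
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical sample records; field names follow Fig.1.
records = [
    {"s.id": 1, "Season": "Summer", "Year": 2013, "Time_in_hour": 8},
    {"s.id": 1, "Season": "Summer", "Year": 2014, "Time_in_hour": 17},
    {"s.id": 1, "Season": "Winter", "Year": 2014, "Time_in_hour": 17},
]

accidents_by_season = run_job(records, "Season")      # Map Reduce 1
accidents_by_year = run_job(records, "Year")          # Map Reduce 2
accidents_by_hour = run_job(records, "Time_in_hour")  # Map Reduce 3
```

Only the grouping attribute changes between the three jobs, which is why a single run_job function with a key_field parameter covers all of them in this sketch.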
(a). Pig script 1 (Top 20 rows)
nypd is grouped by the column "on_street_name". For every value of "on_street_name", a sum is computed over the column "accident" of the nypd schema, which holds the s.id value of the data stored in HDFS. The output is then sorted in descending order, and the limit function is applied to take the top 20 rows.
Pig script 1
Input - nypd
Group by - on_street_name
Sum - (nypd.accident)
Order by - DESC
Top rows - Limit (function)
Output - Top 20 accident-prone streets

(b). Pig script 2 (Factors responsible for accident)
nypd is grouped by the column "factors_for_vehicle_1". For every value of "factors_for_vehicle_1", a sum is computed over the column "accident" of the nypd schema, which holds the s.id value of the data stored in HDFS. The output gives the important factors responsible for accidents.
Pig script 2
Input - nypd
Group by - factors_for_vehicle_1
Sum - (nypd.accident)
Output - Factors responsible for accidents

(iii). Hive with map reduce environment:- Five Hive queries are used for two case studies in this project. A table named "data" is created to store the data present in the nypd dataset.

(a). Hive Case study 1 (1 query used)
The output of the query is the number of accidents that happened in each borough in the years 2013-2016. The borough column contains the five boroughs and some null values, so a where clause on borough is used to select only the five boroughs.
Query 1
From the table named data, the columns selected were borough, year, no_of_person_killed. Then the where clause is applied. The table is grouped by year and borough and summed by accident.
Input - table data
Select - year, borough, accident
Where - borough (Bronx, Brooklyn, Manhattan, Queens, Staten Island)
Group by - borough, year
Output - Number of accidents per year, borough-wise

(b). Hive Case study 2 (4 queries used)
The output of this case study is the number of cyclists and motorists who were injured or killed in the different seasons.
Input - table data
Select - cyclist_killed, cyclist_injured, motorist_killed, motorist_injured, season
Sum - cyclist_killed, cyclist_injured, motorist_killed, motorist_injured
Group by - season
Output - Accidents related to cyclists and motorists, season-wise

3. Visualisation and Result
Tableau and Excel are used for visualisation and interpretation of the map reduce outputs in the various case studies. The first three case studies use the output of map reduce with Java. They are followed by two case studies using Pig script output and two case studies using Hive.

Case study: 1
In this case study, we analyse how many accidents happened in the different seasons over the years 2013-2016. Does season affect the rate of accidents?
Fig.3
Analysis:- From the above graph (Fig.3) we can conclude that the highest number of accidents happened in summer, very closely followed by autumn. The fewest accidents happened in winter. In spring, approximately 213,000 accidents occurred. This indicates that season is an important factor affecting the rate of accidents.

Case study: 2
In this case study we check and analyse the pattern followed by the rate of accidents in the years 2013-2016.
Fig.4
Analysis:- The above graph (Fig.4) shows that the number of accidents increased from 2013 to 2016. The line shows that the rate of accidents increased gradually from 2013 to 2014 and then rose sharply from 2014 to 2016. The pattern of the line graph shows that the incidence of accidents is growing year by year.

Case study: 3
In this case study the analysis of accidents is carried out on an hourly basis. Is there any trend in accidents across the hours of the day?
Fig.5
Analysis:- From the above area graph (Fig.5) we can conclude that there is a trend in the rate of accidents across the hours of a day. The x-axis shows the time in hours of the day: 1 denotes 01:00 and 15 denotes 15:00. The number of accidents is on the y-axis. The graph shows that the lowest rate of accidents in a day is between 12:00 am and 05:00 am; people generally sleep at this time, so traffic on the road is at its lowest. The rate of accidents then starts increasing and reaches a morning peak at around 08:00 am, as these morning hours are rush hours. The rate dips a little, then increases gradually and reaches the highest peak of the day at 05:00 pm. Most accidents happen between 16:00 and 19:00 in the evening, so people should drive carefully during this time.

Case study: 4
In this case study we try to understand which factors are most commonly responsible for accidents.
Fig.6
[Fig.3 chart: "Accidents in four years in different season"; x-axis: Season (Autumn, Spring, Summer, Winter); y-axis: No. of accidents 2013-2016]
[Fig.4 chart: "Yeary accident"; x-axis: years (2013-2016); y-axis: Accidents]
[Fig.5 chart: "Accident Hourly"; x-axis: Time (hour of day); y-axis: number of accident]
Analysis:- The above bubble chart (Fig.6) shows some common factors responsible for accidents. Because the chart was shrunk to fit the IEEE format, some of the information is lost. The top factors are Driver inattention, Fatigue, Failure to yield, Other vehicular and Backing Unsafely. The size of a bubble shows the frequency of the factor: the bigger the bubble, the more often that factor is involved in an accident. Driver inattention is one of the major causes of accidents, followed by driver fatigue. To reduce the rate of accidents, drivers should be made aware of these factors, as they are of high concern.

Case study: 5
In this case study we analyse which top ten streets are prone to accidents. People should drive carefully on these streets.
Fig.7
Analysis:- The above clustered bar graph shows the top 10 dangerous streets of New York City. The y-axis gives the name of the street and the x-axis gives the number of accidents. Broadway is the most dangerous street of New York, with more than 8000 accidents, followed by Atlantic Avenue with around 8000 accidents. People should drive with extra caution on these roads, and the government needs to take some precautionary steps to reduce the rate of accidents.

Case study: 6
In this case study we analyse the accidents that happened in the five boroughs of New York and try to understand certain characteristics of the city.
Fig.8
Analysis:- The above clustered bar chart shows that the most unsafe borough roads are those of Brooklyn, followed by Manhattan and Queens, where the numbers of accidents are quite similar. Staten Island had the fewest accidents in the years 2013-2016. Considering the difference between the accidents in Brooklyn and in Staten Island, we can conclude that Brooklyn is a highly crowded borough and had the highest number of recorded accidents in 2013-2016.
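The street ranking in Case study 5 and the borough comparison in Case study 6 both come from grouped sums: Pig script 1 groups by street, sums, sorts in descending order and limits the result, while Hive query 1 filters out null boroughs and groups by borough and year. The following plain-Python sketch illustrates that logic only; it is not the author's Pig or Hive code. The function names and sample rows are hypothetical, and the column names are borrowed from the nypd schema and the Hive table described in the methodology.

```python
from collections import Counter

# The five boroughs named in the where clause of Hive query 1.
BOROUGHS = {"BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"}


def top_streets(rows, limit):
    """Group by street, sum accidents, order descending, keep `limit` rows."""
    totals = Counter()
    for row in rows:
        totals[row["on_street_name"]] += row["accident"]
    return totals.most_common(limit)


def accidents_by_borough_and_year(rows):
    """Keep rows with a known borough, then sum accidents per (borough, year)."""
    totals = Counter()
    for row in rows:
        if row["borough"] in BOROUGHS:  # drops null or unknown boroughs
            totals[(row["borough"], row["year"])] += row["accident"]
    return dict(totals)


# Hypothetical sample rows; "accident" holds the s.id value (always 1).
rows = [
    {"on_street_name": "BROADWAY", "borough": "MANHATTAN", "year": 2013, "accident": 1},
    {"on_street_name": "BROADWAY", "borough": "MANHATTAN", "year": 2014, "accident": 1},
    {"on_street_name": "BROADWAY", "borough": "BROOKLYN", "year": 2014, "accident": 1},
    {"on_street_name": "ATLANTIC AVENUE", "borough": "BROOKLYN", "year": 2014, "accident": 1},
    {"on_street_name": "ATLANTIC AVENUE", "borough": None, "year": 2014, "accident": 1},
    {"on_street_name": "5 AVENUE", "borough": "MANHATTAN", "year": 2013, "accident": 1},
]

street_ranking = top_streets(rows, 2)
borough_year_totals = accidents_by_borough_and_year(rows)
```

With the real data, a limit of 20 would correspond to Pig script 1, and the chart in Case study 5 shows the first ten of those rows.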
Case study: 7
In this case study, we try to find out the effect of season on the trend of cyclist and motorist accidents.
Fig.9
Analysis:- The above clustered column chart (Fig.9) shows that the majority of accidents happened in summer, followed by autumn, spring and winter. In winter people use public transport more than cycles and motorbikes, which is one reason for the lowest number of accidents. Autumn (fall) is a rainy season; rain makes the roads slippery, and slippery roads are one cause of accidents for cyclists and motorists. In summer people prefer to use personal vehicles to visit places, so the accident rate is high. In every season, injured motorists are the largest category shown in the graph. People should therefore be made aware of this to reduce the rate of accidents.

[Fig.7 chart: x-axis: Number of accident (0 to 8000); streets shown: Broadway, Atlantic Avenue, Northern Boulevard, 3 Avenue, Flatbush Avenue, Queens Boulevard, Linden Boulevard, 2 Avenue, Jamaica Avenue, 5 Avenue]
[Fig.9 chart: x-axis: Season (Autumn, Spring, Summer, Winter); y-axis: Accident; series: Cyclist injured, cyclist killed, motorist injured, motorist killed]
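The seasonal totals behind Fig.9 come from Hive Case study 2, which sums the four cyclist and motorist columns grouped by season. The report uses four separate Hive queries for this; the sketch below computes the same four sums in a single pass in plain Python, purely as an illustration of the aggregation. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict

# The four columns summed in Hive Case study 2.
MEASURES = (
    "cyclist_killed",
    "cyclist_injured",
    "motorist_killed",
    "motorist_injured",
)


def casualties_by_season(rows):
    """Sum each casualty column per season (group by season)."""
    totals = defaultdict(lambda: dict.fromkeys(MEASURES, 0))
    for row in rows:
        for measure in MEASURES:
            totals[row["season"]][measure] += row[measure]
    return {season: dict(counts) for season, counts in totals.items()}


# Hypothetical sample rows from the Hive table named "data".
rows = [
    {"season": "Summer", "cyclist_killed": 0, "cyclist_injured": 1,
     "motorist_killed": 0, "motorist_injured": 2},
    {"season": "Summer", "cyclist_killed": 1, "cyclist_injured": 0,
     "motorist_killed": 0, "motorist_injured": 1},
    {"season": "Winter", "cyclist_killed": 0, "cyclist_injured": 0,
     "motorist_killed": 1, "motorist_injured": 1},
]

season_totals = casualties_by_season(rows)
```

Each season maps to four counters, which matches the four series plotted per season in the clustered column chart.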
4. Conclusion
This project combines different Hadoop-related technologies that are generally used in the big data world to analyse huge datasets such as the NYC motor collision dataset and draw meaningful outcomes from them. Tools such as HDFS, MapReduce, MySQL, HBase, Pig and Hive were able to store and process a huge amount of data in a few seconds. From our analysis of the NYC dataset, processed in the Hadoop ecosystem using these technologies, we conclude that smart decisions can be made in the traffic system to improve the transport system, which will eventually help in minimising both the rate and the risk of accidents.

5. Future Work
The dataset used for this project (NYC motor collision) is updated every week, which will eventually increase its size to a stage where it can no longer be processed using the map reduce approach. A well-suited alternative for this kind of dataset is Apache Spark, which processes huge datasets much faster than map reduce. Spark should therefore meet the need for processing huge amounts of data in Hadoop.

6. Reference
[1]. Kitchin, R., 2014. The real-time city? Big data and smart urbanism. GeoJournal, 79(1), pp.1-14.
[2]. Dimitri Marx, "BYODemos: New York City Traffic Incidents," https://www.elastic.co/blog/byodemos-new-york-city-traffic-incidents, 2014.
[3]. Mannering, F. and Poch, M., 1996. Negative binomial analysis of intersection-accident frequencies. Journal of Transportation Engineering, 122(2), pp.105-113.
[4]. Bos, P.I. and Wouters, J.M., 2000. Traffic accident reduction by monitoring driver behaviour with in-car data recorders. Accident Analysis & Prevention, 32(5), pp.643-650.
[5]. Glenda Ascencio, "NYPD Motor Vehicle Collisions Research Part1," https://rstudio-pubs-static.s3.amazonaws.com/217730_0625ca1f20b34fe983efe07f786a73ee.html, 2016.
[6]. https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95##
[7]. http://www.nyc.com/visitor_guide/weather_facts.75835/