CMPS 396X Advanced Data Science: Fatima K. Abu Salem Exploratory Analysis Driving Visual Analysis With Automobile Data
Advanced Data Science
Fatima K. Abu Salem
Lecture 3
Exploratory Analysis
Driving Visual Analysis with Automobile Data
Part I – Introduction to EDA
Exploratory Data Analysis
• Exploratory data analysis (EDA) is a critical part of the data science
process, and the first step toward building a model.
• John Tukey, a mathematician at Bell Labs, developed exploratory data
analysis in contrast to confirmatory data analysis, which concerns
itself with modeling and hypothesis.
• In EDA, there is no hypothesis and there is no model. The
“exploratory” aspect means that your understanding of the problem
you are solving is changing as you go.
• The basic tools of EDA are plots, graphs and summary statistics.
Exploratory Data Analysis
• going through the data
• plotting distributions of all variables (using box plots)
• plotting time series of data
• transforming variables
• looking at all pairwise relationships between variables using
scatterplot matrices
• generating summary statistics for all of them e.g. mean, minimum,
maximum, the upper and lower quartiles, and identifying outliers.
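As a minimal sketch of the summary-statistics step, the base R calls below compute the mean, minimum, maximum and quartiles of a small vector and flag outliers using the usual 1.5 × IQR box plot rule. The vector x is made up for illustration and is not part of the automobile data.

```r
# Made-up measurements; the last value is deliberately extreme.
x <- c(12, 15, 14, 13, 16, 15, 14, 40)

# Minimum, lower quartile, median, mean, upper quartile, maximum.
summary(x)

# Lower and upper quartiles on their own.
quartiles <- quantile(x, probs = c(0.25, 0.75))

# Points beyond the box plot whiskers (1.5 * IQR rule) are reported in $out.
outliers <- boxplot.stats(x)$out
outliers
```

Calling boxplot(x) on the same vector would draw the corresponding box plot, with the flagged value shown as an isolated point.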
Exploratory Data Analysis
• EDA is also a mindset about your relationship with the data:
You want to understand the data
gain intuition
connect your understanding of the process that generated the data to
the data itself
make sure the data is on the scale you expect and in the format you thought it
should be
• EDA happens between you and the data and isn’t about proving
anything to anyone else yet.
Exploratory Data Analysis
• Although there’s lots of visualization involved in EDA, we distinguish
between EDA and data visualization:
EDA is done toward the beginning of analysis
data visualization is done toward the end to communicate one’s findings.
With EDA, the graphics are solely done for you to understand what’s going on.
• We also distinguish between EDA and ML (Machine Learning):
Plotting data and making comparisons can get you extremely far, and is far
better to do than getting a dataset and immediately running a regression just
because you know how.
Machine learning cannot sail you away from every data storm; without
investing enough time in EDA, you will find yourself struggling to improve
your model's accuracy.
Exploratory Data Analysis
• It’s been a disservice to analysts and data scientists that EDA has not been enforced as a
critical part of the process of working with data. Take this opportunity to make it part of
your process!
• Here are some references to help you understand best practices and historical context:
Exploratory Data Analysis by John Tukey (Pearson)
The Visual Display of Quantitative Information by Edward Tufte (Graphics Press)
The Elements of Graphing Data by William S. Cleveland (Hobart Press)
Statistical Graphics for Visualizing Multivariate Data by William G. Jacoby (Sage)
“Exploratory Data Analysis for Complex Models” by Andrew Gelman (American Statistical
Association)
The Future of Data Analysis by John Tukey. Annals of Mathematical Statistics, Volume 33, Number 1
(1962), 1-67.
Data Analysis, Exploratory by David Brillinger [8-page excerpt from International Encyclopedia of
Political Science (Sage)]
Data Science Cycle
Part II – A Case Study in EDA
Driving Visual Analysis with Automobile Data
In this chapter, we will cover the following:
• Acquiring automobile fuel efficiency data
• Importing automobile fuel efficiency data into R
• Exploring and describing fuel efficiency data
• Analyzing automobile fuel efficiency over time
• Investigating the makes and models of automobiles (HW)
Introduction
5. Select and copy all the text below the vehicle heading under Data Description,
and paste it into a text file. Do not include the emissions heading.
6. Save this file in your working directory as varlabels.txt.
The first five lines of the file are as follows:
atvtype - type of alternative fuel or advanced technology vehicle
barrels08 - annual petroleum consumption in barrels for fuelType1 (1)
barrelsA08 - annual petroleum consumption in barrels for fuelType2 (1)
charge120 - time to charge an electric vehicle in hours at 120 V
charge240 - time to charge an electric vehicle in hours at 240 V
Importing automobile fuel efficiency data into R
1. First, set the working directory to the location where we saved the vehicles.csv.zip file:
setwd("path")
Replace "path" with the actual directory.
2. We can load the data directly from compressed (ZIP) files:
vehicles <- read.csv(unz("vehicles.csv.zip", "vehicles.csv"), stringsAsFactors = F)
3. To see whether this worked, display the first few rows of data using the head command:
head(vehicles)
You should see the first few rows of the dataset printed on your screen.
Note that we could have used the tail command instead to display the last few rows.
Importing automobile fuel efficiency data into R
The read.csv function call included stringsAsFactors = F as its final parameter.
Factor is the name of R's categorical data type, which can be thought of as a label
or tag applied to the data.
When importing data into R, we often run into the situation where a column of
numeric data contains an entry that is non-numeric. In this case, R might
import the whole column as a factor, which is often not what the data scientist
intended.
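The effect can be sketched on a tiny made-up CSV (the rows and the "n/a" entry are hypothetical, not taken from vehicles.csv). Note that since R 4.0.0 the default is already stringsAsFactors = FALSE, so the argument matters mainly on older versions or when it is set to TRUE explicitly.

```r
# Made-up CSV text: the displ column has one non-numeric entry ("n/a").
csv_text <- "make,displ\nFord,2.0\nHonda,n/a\nToyota,1.8\n"

as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_string <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$displ)  # factor: the whole column became categorical
class(as_string$displ)  # character: the strings are kept as they are

# Pitfall: as.numeric() on a factor returns internal level codes, not the values.
codes <- as.numeric(as_factor$displ)

# Converting the character column gives the intended numbers; "n/a" becomes NA.
values <- suppressWarnings(as.numeric(as_string$displ))
values
```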
Importing automobile fuel efficiency data into R
4.The labels command gives the variable labels for the vehicles.csv file.
A quick look at the file shows that the variable names and their
explanations are separated by -. So, we will try to read the file using - as the separator:
?
Importing automobile fuel efficiency data into R
4.The labels command gives the variable labels for the vehicles.csv file.
A quick look at the file shows that the variable names and their
explanations are separated by -. So, we will try to read the file using - as the separator:
labels <- read.table("varlabels.txt", sep = "-", header = FALSE)
## Error: line 11 did not have 2 elements
5.This doesn't work!
Why?
Importing automobile fuel efficiency data into R
4.The labels command gives the variable labels for the vehicles.csv file.
A quick look at the file shows that the variable names and their
explanations are separated by -. So, we will try to read the file using - as the separator:
labels <- read.table("varlabels.txt", sep = "-", header = FALSE)
## Error: line 11 did not have 2 elements
5.This doesn't work! A closer look at the error shows that in line 11 of the data file,
there are two - symbols, and it thus gets broken into three parts rather than two,
unlike the other rows.
We need to change our file-reading approach to ignore hyphenated words:
?
Importing automobile fuel efficiency data into R
4. The varlabels.txt file gives the variable labels for the vehicles.csv file.
A quick look at the file shows that the variable names and their
explanations are separated by -. So, we will try to read the file using - as the separator:
labels <- read.table("varlabels.txt", sep = "-", header = FALSE)
## Error: line 11 did not have 2 elements
5. This doesn't work! A closer look at the error shows that in line 11 of the data file,
there are two - symbols, and it thus gets broken into three parts rather than two,
unlike the other rows.
We need to change our file-reading approach so that hyphenated words are not split,
by separating on " - " (a hyphen surrounded by spaces) instead:
labels <- do.call(rbind, strsplit(readLines("varlabels.txt"), " - "))
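To see why the second approach works, here is a small sketch on two in-memory lines rather than the real varlabels.txt. The second line and the name exampleVar are invented for illustration; the point is only that it contains a hyphenated word.

```r
# Two label lines; the second is made up and contains a hyphenated word.
label_lines <- c(
  "charge240 - time to charge an electric vehicle in hours at 240 V",
  "exampleVar - a made-up label containing a hyphenated word"
)

# Splitting on a bare hyphen breaks the second line into three pieces.
pieces_hyphen <- strsplit(label_lines, "-", fixed = TRUE)
lengths(pieces_hyphen)

# Splitting on " - " keeps hyphenated words intact: two pieces per line.
pieces_spaced <- strsplit(label_lines, " - ", fixed = TRUE)
lengths(pieces_spaced)

# rbind the pieces into a two-column character matrix (name, description).
labels_mat <- do.call(rbind, pieces_spaced)
labels_mat
```

The result is a character matrix rather than a data frame; wrapping it in as.data.frame would be one way to get named columns if needed.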
Exploring and describing fuel efficiency data
Since we might use the year variable a lot, let's make sure that we have each year covered.
The list of years from 1984 to 2014 should contain 31 unique values!
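One possible way to run this check is sketched below. The year vector is a stand-in built for illustration; with the real data it would be vehicles$year.

```r
# Stand-in for vehicles$year: every year from 1984 to 2014, each appearing twice.
year <- rep(1984:2014, times = 2)

# Number of distinct years present in the data.
n_years <- length(unique(year))
n_years

# Any year in the expected range that never appears (empty if all are covered).
missing_years <- setdiff(1984:2014, year)
missing_years

# First and last year present.
range(year)
```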
Next, let's find out what types of fuel are used as the automobiles' primary fuel types:
table(vehicles$fuelType1)
##            Diesel       Electricity Midgrade Gasoline       Natural Gas
##              1025                56                41                57
##  Premium Gasoline  Regular Gasoline
##              8521             24587
Observations?
Most cars in the dataset use regular gasoline,
and the second most common fuel type is premium gasoline.
Exploring and describing fuel efficiency data
Now, the trany column is text, and we only care whether the car's
transmission is automatic or manual.
Use the substr function to extract the first four characters of each trany
column value and determine whether it is equal to Auto.
If so, we set a new variable, trany2, equal to Auto; otherwise, the value is set to
Manual:
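The slide's own code for this step is not shown here, so the following is only a sketch of the described logic using substr and ifelse. The transmission strings are made up in the general style of the trany column; with the real data the same expression would be applied to vehicles$trany.

```r
# Made-up transmission descriptions (hypothetical values).
trany <- c("Automatic 4-spd", "Manual 5-spd", "Automatic (S6)", "Manual 6-spd")

# First four characters equal to "Auto" -> "Auto", anything else -> "Manual".
trany2 <- ifelse(substr(trany, 1, 4) == "Auto", "Auto", "Manual")

# Quick tally of the two groups.
table(trany2)
```

Note that with this rule any value that does not start with Auto, including an empty string, is labelled Manual, so it is worth checking for blank entries first.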
How it works
To provide a cautionary tale about data import, let's look at the sCharger and tCharger
columns more closely.
Note that these columns indicate whether the car contains a super charger or
a turbo charger, respectively.
Starting with sCharger, we look at the class of the variable and the unique values
present in the data frame:
> class(vehicles$sCharger)
[1] "character"
> unique(vehicles$sCharger)
[1] "" "S"
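The output for tCharger is not shown here. As a sketch of one way an import surprise of this kind can arise, the made-up CSV below (not the real file) has a column whose only non-blank value is T. When guessing column types, read.csv treats T as logical TRUE and blank fields as NA.

```r
# Made-up CSV: sCharger holds "" or "S", tCharger holds "" or "T".
csv_text <- "id,sCharger,tCharger\n1,,T\n2,S,\n3,,T\n"

toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(toy$sCharger)   # character, values "" and "S"
unique(toy$sCharger)

class(toy$tCharger)   # logical: "T" was read as TRUE, blanks as NA
unique(toy$tCharger)

# Declaring the column classes up front keeps the original strings.
toy_chr <- read.csv(text = csv_text,
                    colClasses = c("integer", "character", "character"))
unique(toy_chr$tCharger)
```

Checking class and unique for each column right after import, as done above for sCharger, is a cheap way to catch this.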
How it works
However, this can be a little misleading as there have been more hybrid and
non-gasoline vehicles in the later years, which is shown as follows:
> table(vehicles$fuelType1)
##            Diesel       Electricity Midgrade Gasoline       Natural Gas
##              1025                56                41                57
##  Premium Gasoline  Regular Gasoline
##              8521             24587
Analyzing automobile fuel efficiency over time
• Let's look at just gasoline cars, even though there are not many non-gasoline powered cars, and
redraw the preceding plot.
• Use the subset function to create a new data frame, gasCars, which only contains the rows of
vehicles in which the fuelType1 variable is one among a subset of values:
library(plyr)     # provides ddply
library(ggplot2)  # provides ggplot
gasCars <- subset(vehicles,
                  fuelType1 %in% c("Regular Gasoline", "Premium Gasoline", "Midgrade Gasoline") &
                    fuelType2 == "" & atvType != "Hybrid")
mpgByYr_Gas <- ddply(gasCars, ~year, summarise, avgMPG = mean(comb08))
ggplot(mpgByYr_Gas, aes(year, avgMPG)) + geom_point() + geom_smooth() +
  xlab("Year") + ylab("Average MPG") + ggtitle("Gasoline cars")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
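To make the filter and the per-year average concrete, here is a sketch on a five-row made-up data frame. It uses base R's aggregate in place of ddply so that it runs without extra packages; the column names match the ones used above, but all values are invented.

```r
# Made-up rows using the same column names as the vehicles data.
toy <- data.frame(
  year      = c(1984, 1984, 1985, 1985, 1985),
  fuelType1 = c("Regular Gasoline", "Diesel", "Premium Gasoline",
                "Regular Gasoline", "Electricity"),
  fuelType2 = c("", "", "", "E85", ""),
  atvType   = c("", "Diesel", "", "FFV", "EV"),
  comb08    = c(20, 30, 24, 18, 90),
  stringsAsFactors = FALSE
)

gas_types <- c("Regular Gasoline", "Premium Gasoline", "Midgrade Gasoline")

# Keep gasoline-only rows: gasoline primary fuel, no second fuel, not a hybrid.
gas_toy <- subset(toy, fuelType1 %in% gas_types & fuelType2 == "" & atvType != "Hybrid")

# Average combined MPG per year (base R stand-in for the ddply call).
mpg_by_year <- aggregate(comb08 ~ year, data = gas_toy, FUN = mean)
names(mpg_by_year)[2] <- "avgMPG"
mpg_by_year
```

Only the first and third rows survive the filter, so each year's average here is just that single row's comb08 value.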
Analyzing automobile fuel efficiency over time
• Have fewer large engine (gasoline) cars been made recently? If so, this can explain the
increase.
How?
• First, let's verify whether cars with larger engines have worse fuel efficiency. We note that the
displ variable, which represents the displacement of the engine in liters, is currently a string
variable that we need to convert to a numeric variable:
> typeof(gasCars$displ)
## "character"
> gasCars$displ <- as.numeric(gasCars$displ)
> ggplot(gasCars, aes(displ, comb08)) + geom_point() + geom_smooth()
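A small sketch of what the conversion does, on made-up strings rather than the real displ column: entries that cannot be parsed as numbers (here an empty string) become NA, so it is worth counting them after converting.

```r
# Made-up displacement strings; the third entry is blank.
displ_chr <- c("2.0", "3.5", "", "5.7")
typeof(displ_chr)   # "character"

# Convert to numeric; unparseable entries turn into NA.
displ_num <- suppressWarnings(as.numeric(displ_chr))
typeof(displ_num)   # "double"

# How many entries failed to convert?
n_failed <- sum(is.na(displ_num))
n_failed
```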
• Observations?
• There is a negative (inverse) correlation between engine
displacement and fuel efficiency; thus, cars with smaller engines tend to be
more fuel-efficient.
Analyzing automobile fuel efficiency over time
• Now, let's see whether more small cars were made in later years, which can
explain the drastic increase in fuel efficiency.
> avgCarSize <- ddply(gasCars, ~year, summarise, avgDispl = mean(displ))
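One caveat worth sketching: if the numeric conversion of displ left any NA values, mean returns NA for the affected years unless na.rm = TRUE is passed. The example below uses base R's tapply on a made-up data frame to show the difference. Whether the real gasCars data contains such NAs is not stated, so treat this as a precaution rather than a required fix.

```r
# Made-up data: one displacement value is missing in 1984.
toy <- data.frame(
  year  = c(1984, 1984, 1985, 1985),
  displ = c(3.0, NA, 2.0, 2.4)
)

# Without na.rm, a single NA makes the whole year's average NA.
with_na <- tapply(toy$displ, toy$year, mean)
with_na

# With na.rm = TRUE, missing values are skipped.
without_na <- tapply(toy$displ, toy$year, mean, na.rm = TRUE)
without_na
```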
Investigating the makes and models of automobiles
• This recipe will investigate the makes and models of automobiles and
how they have changed over time:
• Let's look at how the makes and models of cars inform fuel efficiency over
time.
• First, let's look at the frequency of the makes and models of cars available
in the US over this time and concentrate on four-cylinder cars:
> carsMake <- ddply(gasCars4, ~year, summarise, numberOfMakes = length(unique(make)))
> ggplot(carsMake, aes(year, numberOfMakes)) + geom_point() +
    labs(x = "Year", y = "Number of available makes") + ggtitle("Four cylinder cars")
Investigating the makes and models of automobiles
• Observations?
Investigating the makes and models of automobiles
• Can we look at the makes that have been available for every year of
this study?
?
Investigating the makes and models of automobiles
• Can we look at the makes that have been available for every year of
this study?
>uniqMakes <- dlply(gasCars4, ~year, function(x) unique(x$make))
>commonMakes <- Reduce(intersect, uniqMakes)
>commonMakes
## [1] "Ford" "Honda" "Toyota" "Volkswagen" "Chevrolet"
## [6] "Chrysler" "Nissan" "Dodge" "Mazda" "Mitsubishi"
## [11] "Subaru" "Jeep"
Investigating the makes and models of automobiles
• Can we look at the makes that have been available for every year of this
study?
> uniqMakes <- dlply(gasCars4, ~year, function(x) unique(x$make))
> commonMakes <- Reduce(intersect, uniqMakes)
> commonMakes
## [1] "Ford" "Honda" "Toyota" "Volkswagen" "Chevrolet"
## [6] "Chrysler" "Nissan" "Dodge" "Mazda" "Mitsubishi"
## [11] "Subaru" "Jeep"
We find that only 12 manufacturers made four-cylinder cars every year
during this period.
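The four-cylinder snippets in this part use a data frame named gasCars4 whose construction is not shown. A plausible sketch, assuming the data has a cylinders column and that gasCars4 is simply the four-cylinder subset of gasCars, is given below on made-up rows. It also counts the distinct makes per year with base R's aggregate as a stand-in for the ddply call.

```r
# Made-up rows; the cylinders column name is an assumption about the data.
gas_toy <- data.frame(
  year      = c(1984, 1984, 1984, 1985, 1985),
  make      = c("Ford", "Ford", "Honda", "Ford", "Toyota"),
  cylinders = c(4, 4, 4, 4, 6),
  stringsAsFactors = FALSE
)

# Assumed definition: keep only four-cylinder cars.
gas4_toy <- subset(gas_toy, cylinders == 4)

# Number of distinct makes offering a four-cylinder car in each year.
makes_per_year <- aggregate(make ~ year, data = gas4_toy,
                            FUN = function(m) length(unique(m)))
names(makes_per_year)[2] <- "numberOfMakes"
makes_per_year
```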
Investigating the makes and models of automobiles
• How have these manufacturers done over time with respect to fuel
efficiency?
> carsCommonMakes4 <- subset(gasCars4, make %in% commonMakes)
> avgMPG_commonMakes <- ddply(carsCommonMakes4, ~year + make, summarise, avgMPG = mean(comb08))
> ggplot(avgMPG_commonMakes, aes(year, avgMPG)) + geom_line() +
    facet_wrap(~make, nrow = 3)
Investigating the makes and models of automobiles
• Observations?
Most manufacturers have shown improvement over this time, though several
manufacturers have demonstrated quite sharp fuel efficiency increases in the
last 5 years.
Investigating the makes and models of automobiles
• We use dlply (not ddply) to take the gasCars4 data frame, split it by year,
and then apply the unique function to the make variable.
• For each year, a vector of the unique available automobile makes is computed,
and then dlply returns a list of these vectors (one element per year).
• We use dlply rather than ddply because dlply takes a data frame (d) as input
and returns a list (l) as output, whereas ddply takes a data frame (d) as input
and returns a data frame (d):
> uniqMakes <- dlply(gasCars4, ~year, function(x) unique(x$make))
> commonMakes <- Reduce(intersect, uniqMakes)
> commonMakes
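The split-then-apply idea behind dlply can be sketched in base R with split and lapply. This is an equivalent for illustration on made-up rows, not the plyr implementation.

```r
# Made-up rows with the same two columns the dlply call uses.
toy <- data.frame(
  year = c(1984, 1984, 1985, 1985, 1985),
  make = c("Ford", "Honda", "Ford", "Ford", "Toyota"),
  stringsAsFactors = FALSE
)

# Split the make column by year, then take the unique makes within each piece.
uniq_makes <- lapply(split(toy$make, toy$year), unique)

# The result is a list with one character vector per year.
str(uniq_makes)
```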
Investigating the makes and models of automobiles
• The next line uses the Reduce higher order function. This is the same
reduce idea found in the MapReduce programming paradigm
introduced by Google that underlies Hadoop.
• R is a functional programming language and offers several higher order
functions as part of its core.
• A higher order function accepts another function as input. In this line, we
pass the intersect function to Reduce, which applies intersect cumulatively
across the list of unique makes per year created previously: first to the
first two elements, then to that result and the third element, and so on.
• Ultimately, this results in a single vector of the automobile makes that are
present every year.
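A short sketch of this folding behaviour on a made-up list of makes per year follows. The accumulate = TRUE argument keeps each intermediate intersection so the steps are visible.

```r
# Made-up list: one character vector of makes per year.
makes_by_year <- list(
  `1984` = c("Ford", "Honda", "Toyota"),
  `1985` = c("Honda", "Ford", "Jeep"),
  `1986` = c("Ford", "Jeep", "Honda")
)

# Fold intersect across the list: intersect(intersect(y1984, y1985), y1986).
common <- Reduce(intersect, makes_by_year)
common

# The same fold written out by hand, for comparison.
by_hand <- intersect(intersect(makes_by_year[[1]], makes_by_year[[2]]),
                     makes_by_year[[3]])

# accumulate = TRUE returns every intermediate result of the fold.
steps <- Reduce(intersect, makes_by_year, accumulate = TRUE)
steps
```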