CMPS 396X Advanced Data Science: Fatima K. Abu Salem. Exploratory Analysis: Driving Visual Analysis with Automobile Data

This document discusses exploratory data analysis (EDA) and provides an example of using EDA to analyze automobile fuel efficiency data. It begins by introducing EDA, its key principles and tools. It then describes acquiring automobile fuel efficiency data from online sources, importing the data into R, and preparing to explore the data. The goal of EDA in this case is to summarize and group the data to understand fuel efficiency trends over time and between vehicle groups.

CMPS 396X

Advanced Data Science
Fatima K. Abu Salem
Lecture 3
Exploratory Analysis
Driving Visual Analysis with Automobile Data
Part I – Introduction to EDA
Exploratory Data Analysis
• Exploratory data analysis (EDA) is a critical part of the data science
process, and the first step toward building a model.
• John Tukey, a mathematician at Bell Labs, developed exploratory data
analysis in contrast to confirmatory data analysis, which concerns
itself with modeling and hypothesis.
• In EDA, there is no hypothesis and there is no model. The
“exploratory” aspect means that your understanding of the problem
you are solving is changing as you go.
• The basic tools of EDA are plots, graphs and summary statistics.
EDA typically involves:
• going through the data
• plotting distributions of all variables (for example, using box plots)
• plotting time series of data
• transforming variables
• looking at all pairwise relationships between variables using scatterplot matrices
• generating summary statistics for all variables (e.g. mean, minimum, maximum, and the upper and lower quartiles) and identifying outliers
• EDA is also a mindset about your relationship with the data:
- You want to understand the data.
- You want to gain intuition.
- You try to connect your understanding of the process that generated the data to the data itself.
- You make sure the data is on the scale you expect and in the format you thought it should be.
• EDA happens between you and the data and isn’t about proving
anything to anyone else yet.
• Although there is a lot of visualization involved in EDA, we distinguish between EDA and data visualization:
- EDA is done toward the beginning of the analysis.
- Data visualization is done toward the end, to communicate one's findings.
- With EDA, the graphics are done solely for you to understand what is going on.
• We also distinguish between EDA and ML (machine learning):
- Plotting data and making comparisons can get you extremely far, and is far better than getting a dataset and immediately running a regression just because you know how.
- Machine learning cannot rescue you from every data problem; without investing enough time in EDA, you will find yourself struggling to improve your model's accuracy.
• It’s been a disservice to analysts and data scientists that EDA has not been enforced as a
critical part of the process of working with data. Take this opportunity to make it part of
your process!
• Here are some references to help you understand best practices and historical context:
Exploratory Data Analysis by John Tukey (Pearson)
The Visual Display of Quantitative Information by Edward Tufte (Graphics Press)
The Elements of Graphing Data by William S. Cleveland (Hobart Press)
Statistical Graphics for Visualizing Multivariate Data by William G. Jacoby (Sage)
“Exploratory Data Analysis for Complex Models” by Andrew Gelman (American Statistical
Association)
The Future of Data Analysis by John Tukey. Annals of Mathematical Statistics, Volume 33, Number 1
(1962), 1-67.
Data Analysis, Exploratory by David Brillinger [8-page excerpt from International Encyclopedia of
Political Science (Sage)]
Data Science Cycle
Part II – A case study in EDA
Driving Visual Analysis with Automobile Data
In this chapter, we will cover the following:
• Acquiring automobile fuel efficiency data
• Importing automobile fuel efficiency data into R
• Exploring and describing fuel efficiency data
• Analyzing automobile fuel efficiency over time
• Investigating the makes and models of automobiles (HW)
Introduction
• The first project we will introduce is an analysis of automobile fuel economy data.
• The recipes in this chapter will roughly follow these five steps in the data science pipeline:
- Acquisition
- Exploration and understanding
- Munging, wrangling, and manipulation
- Analysis and modeling
- Communication and operationalization
• We will only touch on the fringes of statistical inference.
Acquiring automobile fuel efficiency data
• We will dive into a dataset that contains fuel efficiency performance metrics, measured in miles per gallon (MPG) over time, for most makes and models of automobiles available in the U.S. since 1984. The data is courtesy of the U.S. Department of Energy and the U.S. Environmental Protection Agency.
• The dataset also contains several features and attributes of the automobiles listed.
o Goal: summarize and group the data to determine which groups tend to have better fuel efficiency historically and how this has changed over the years.
The latest version of the dataset is available at http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip, and information about the variables in the dataset can be found at http://www.fueleconomy.gov/feg/ws/index.shtml#vehicle. The data was last updated on December 4, 2013 and was downloaded on December 8, 2013.
To acquire the data needed:
1. Download the dataset from http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip.
2. Unzip vehicles.csv and move it to your working code directory.
3. Open the unzipped vehicles.csv file with Microsoft Excel, Google Spreadsheet, or a simple text editor. Comma-separated value (CSV) files are very convenient to work with, as they can be edited and viewed with very basic, freely available tools.
4. Navigate to http://www.fueleconomy.gov/feg/ws/index.shtml#vehicle.
5. Select and copy all the text below the vehicle heading under Data Description, and paste it into a text file. Do not include the emissions heading.
6. Save this file in your working directory as varlabels.txt.
The first five lines of the file are as follows:
atvtype - type of alternative fuel or advanced technology vehicle
barrels08 - annual petroleum consumption in barrels for fuelType1 (1)
barrelsA08 - annual petroleum consumption in barrels for fuelType2 (1)
charge120 - time to charge an electric vehicle in hours at 120 V
charge240 - time to charge an electric vehicle in hours at 240 V
Preparing R for your first project
If you are using RStudio:
1. Launch RStudio on your computer.
2. At the R console prompt, install the three R packages needed for this project:
install.packages("plyr")
install.packages("ggplot2")
install.packages("reshape2")
3. Load the R packages, as follows:
library(plyr)
library(ggplot2)
library(reshape2)
Packages
• plyr: tools for splitting, applying, and combining data. It solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece, and then put all the pieces back together.
- For example, you might want to fit a model to each spatial location or time point in your study, summarise data by panels, or collapse high-dimensional arrays to simpler summary statistics.
- We will use plyr to apply the split-apply-combine data analysis pattern.
• ggplot2: makes complex data visualizations significantly easier.
• reshape2: reshapes data between wide and long formats; we use its melt function later in this chapter.
Importing automobile fuel efficiency data into R


1. First, set the working directory to the location where we saved the vehicles.csv.zip file:
setwd("path")
Substitute path with the actual directory.
2. We can load the data directly from the compressed (ZIP) file:
vehicles <- read.csv(unz("vehicles.csv.zip", "vehicles.csv"), stringsAsFactors = F)
3. To see whether this worked, display the first few rows of data using the head command:
head(vehicles)
You should see the first few rows of the dataset printed on your screen.
Note that we could also have used the tail command, which displays the last few rows.
The read.csv function call included stringsAsFactors = F (shorthand for FALSE) as its final parameter.

In versions of R before 4.0.0, read.csv converts strings to factors by default; since R 4.0.0, the default is stringsAsFactors = FALSE.

Factor is the name of R's categorical datatype, which can be thought of as a label or tag applied to the data.

Internally, R stores factors as integers with a mapping to the appropriate label. This technique allowed older versions of R to store factors in much less memory than the corresponding character vectors.

When importing data into R, we often run into the situation where a column of
numeric data might contain an entry that is non-numeric. In this case, R might
import the column of data as factors, which is often not what was intended by
the data scientist.
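As a small illustration of this pitfall, the sketch below imports a made-up CSV snippet twice. The column names and values are invented for the example and are not taken from vehicles.csv.

```r
# Made-up CSV text: displ should be numeric, but one entry is not a number.
csv_text <- "id,displ\n1,2.0\n2,3.5\n3,unknown\n4,1.8"

# Import the same text twice: once keeping strings, once converting to factors.
as_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_strings$displ)        # character, because of the stray entry
class(as_factors$displ)        # factor
levels(as_factors$displ)       # the distinct labels
as.integer(as_factors$displ)   # internal integer codes, not the displacements

# Convert the character column; the stray entry becomes NA (with a warning,
# which is suppressed here).
displ_num <- suppressWarnings(as.numeric(as_strings$displ))
displ_num
```

Note that calling as.integer or as.numeric directly on a factor returns the internal codes, which is one reason an unintended factor import can silently corrupt a numeric analysis.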
4. Next, we read the variable labels for the vehicles.csv file from varlabels.txt into an object named labels. A quick look at the file shows that the variable names and their explanations are separated by -. So, we will try to read the file using - as the separator:
labels <- read.table("varlabels.txt", sep = "-", header = FALSE)
## Error: line 11 did not have 2 elements
5. This doesn't work! A closer look at the error shows that in line 11 of the data file there are two - symbols, so the line gets broken into three parts rather than two, unlike the other rows.
We need to change our file-reading approach to ignore hyphenated words:
labels <- do.call(rbind, strsplit(readLines("varlabels.txt"), " - "))
6. To check whether it works, we use the head function again:
head(labels)
     [,1]         [,2]
[1,] "atvtype"    "type of alternative fuel or advanced technology vehicle"
[2,] "barrels08"  "annual petroleum consumption in barrels for fuelType1 (1)"
[3,] "barrelsA08" "annual petroleum consumption in barrels for fuelType2 (1)"
[4,] "charge120"  "time to charge an electric vehicle in hours at 120 V"
[5,] "charge240"  "time to charge an electric vehicle in hours at 240 V"
Let's break down the last complex statement in step 5, piece by piece:
First, let's read the file line by line:
x <- readLines("varlabels.txt")
Each line needs to be split at the string " - " (a hyphen with a space on each side).
The spaces are important, so that we don't split hyphenated words (such as in line 11).
This results in each line split into two parts as a vector of strings,
and the vectors stored in a single list:
y <- strsplit(x, " - ")
Now, we stack these vectors together to make a matrix of strings,
where the first column is the variable name and the second column
is the description of the variable:
labels <- do.call(rbind, y)
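The same three steps can be tried without the file. In the sketch below, the first two strings come from the listing of varlabels.txt shown earlier, while the third (demoVar) is an invented line with a hyphenated word, standing in for the problematic line 11. The fixed = TRUE argument is an addition that makes strsplit treat the separator as a literal string rather than a regular expression.

```r
# Two real label lines plus one invented line containing a hyphenated word.
x <- c(
  "atvtype - type of alternative fuel or advanced technology vehicle",
  "charge120 - time to charge an electric vehicle in hours at 120 V",
  "demoVar - an invented description with a hyphenated-word in it"
)

# Splitting on a bare hyphen breaks the third line into three pieces.
pieces_bare <- strsplit(x, "-", fixed = TRUE)
sapply(pieces_bare, length)

# Splitting on " - " keeps the hyphenated word intact: two pieces per line.
y <- strsplit(x, " - ", fixed = TRUE)
sapply(y, length)

# Stack the two-element vectors row by row into a character matrix.
labels <- do.call(rbind, y)
dim(labels)
labels[3, ]
```

If a description itself contained " - ", that line would again split into more than two pieces, so it is worth checking the piece counts before calling rbind.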
Exploring and describing fuel efficiency data
The next step is to do some preliminary analysis of the dataset:
1. First, let's find out how many observations (rows) are in our data:
nrow(vehicles)
## 34287
2. Next, let's find out how many variables (columns) are in our data:
ncol(vehicles)
## 74
Now, let's get a sense of which columns of data are present in the data frame:
names(vehicles)
• Determine the first and last years present in the dataset:
first_year <- min(vehicles[, "year"])
## 1984
last_year <- max(vehicles[, "year"])
## 2014
Since we might use the year variable a lot, let's make sure that we have each year covered.

The list of years from 1984 to 2014 should contain 31 unique values!

To test this, use the following command:


length(unique(vehicles$year))
[1] 31
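If the count had come out lower than expected, we would also want to know which years are absent. A minimal sketch of that check, using a made-up year vector rather than the real vehicles$year column:

```r
# Made-up year column with a gap: 1986 never appears.
year <- c(1984, 1985, 1985, 1987, 1987)

# Every year we expect to see between the first and last year.
expected_years <- min(year):max(year)

# Compare the number of distinct years with the number expected.
length(unique(year)) == length(expected_years)

# setdiff() lists exactly which expected years are absent.
missing_years <- setdiff(expected_years, unique(year))
missing_years
```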
Next, let's find out what types of fuel are used as the automobiles' primary fuel types:
table(vehicles$fuelType1)
##            Diesel       Electricity Midgrade Gasoline       Natural Gas
##              1025                56                41                57
##  Premium Gasoline  Regular Gasoline
##              8521             24587

Observations?
Most cars in the dataset use regular gasoline,
and the second most common fuel type is premium gasoline.
Let's explore the types of transmissions used by these automobiles.


We first need to take care of all missing data by setting it to NA:

vehicles$trany[vehicles$trany == ""] <- NA


Now, the trany column is text, and we only care whether the car's transmission is automatic or manual.
Use the substr function to extract the first four characters of each trany column value and determine whether it is equal to Auto. If so, we set a new variable, trany2, equal to Auto; otherwise, the value is set to Manual (missing values stay NA):
vehicles$trany2 <- ifelse(substr(vehicles$trany, 1, 4) == "Auto", "Auto", "Manual")
Finally, we convert the new variable to a factor and then use the table
function to see the distribution of values:
vehicles$trany2 <- as.factor(vehicles$trany2)
table(vehicles$trany2)
## Auto Manual
## 22451 11825
Observations?
There are roughly twice as many automobile models with automatic
transmission as there are models with manual transmission.
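The transmission steps (blank to NA, substr, ifelse, factor, table) can be traced on a handful of values. The strings below are made up in the style of the trany column and are only an assumption about what the real entries look like.

```r
# Made-up transmission descriptions, including one blank entry.
trany <- c("Automatic 4-spd", "Manual 5-spd", "", "Automatic (S6)", "Manual 6-spd")

# Step 1: mark blank entries as missing.
trany[trany == ""] <- NA

# Step 2: classify by the first four characters; NA stays NA.
trany2 <- ifelse(substr(trany, 1, 4) == "Auto", "Auto", "Manual")

# Step 3: convert to a factor and count the values.
trany2 <- as.factor(trany2)
counts <- table(trany2)
counts

# table() drops NA by default; useNA = "ifany" makes missing values visible.
counts_with_na <- table(trany2, useNA = "ifany")
counts_with_na
```

Counting with useNA = "ifany" is a quick way to see how many rows were left unclassified because the original entry was blank.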
How it works
We used the powerful table function to create a count of the
occurrence of values for the fuelType1 variable. This function is capable
of much more, including cross tabulations, as follows:
with(vehicles, table(sCharger, year))
1. We looked at the number of automobile models with a supercharger, by year.
2. We saw that superchargers have seemingly become more popular in recent years than they were in the past.
3. The with command tells R to use vehicles as the default data when performing the subsequent command, in this case table. Thus, we can omit prefacing the sCharger and year column names with the name of the data frame, vehicles, followed by the dollar sign.
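To see that with only saves typing, the sketch below builds the same cross tabulation both ways on a small made-up data frame (the rows are invented; only the column names follow the dataset).

```r
# Small made-up data frame with the two columns used in the cross tabulation.
cars <- data.frame(
  year = c(2012, 2012, 2013, 2013, 2013),
  sCharger = c("", "S", "S", "", "S"),
  stringsAsFactors = FALSE
)

# With with(): column names are looked up inside cars.
tab1 <- with(cars, table(sCharger, year))

# Without with(): every column is prefixed with the data frame and a dollar sign.
tab2 <- table(cars$sCharger, cars$year)

tab1
all(tab1 == tab2)
```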
To provide a cautionary tale about data import, let's look at the sCharger and tCharger
columns more closely.

Note that these columns indicate whether the car contains a super charger or
a turbo charger, respectively.

Starting with sCharger, we look at the class of the variable and the unique values
present in the data frame:
> class(vehicles$sCharger)
[1] "character"
> unique(vehicles$sCharger)
[1] "" "S"
We next look at tCharger, expecting things to be the same:
> class(vehicles$tCharger)
[1] "logical"
> unique(vehicles$tCharger)
[1] NA TRUE
These two seemingly similar variables are different datatypes completely.
While the tCharger variable is a logical variable, the sCharger variable appears
to be the more general character datatype. Something seems wrong.
If we open the .csv file, we can see that the sCharger and tCharger columns are either blank or contain an S or a T, respectively.
Thus, R has read the T character in the tCharger column as the Boolean value TRUE, as opposed to the character T. This isn't a fatal flaw and might not impact an analysis.
However, undetected bugs such as this can cause problems far down the analytical pipeline and necessitate significant repeated work.
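The behaviour is easy to reproduce on a few made-up rows. One possible remedy, sketched here as a suggestion rather than as part of the original recipe, is to name the column's type explicitly through the colClasses argument of read.csv.

```r
# Made-up rows in which the charger columns are blank or hold S / T.
csv_text <- "model,sCharger,tCharger\nA,,T\nB,S,\nC,,T"

# Default import: R guesses the column types.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(raw$sCharger)   # character
class(raw$tCharger)   # logical: the T was read as TRUE

# Possible fix: state the type of tCharger; the other columns are still guessed.
fixed <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  colClasses = c(tCharger = "character"))
class(fixed$tCharger)
fixed$tCharger
```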
Analyzing automobile fuel efficiency over time
• We continue the exploration by looking at some of the fuel efficiency metrics over time and in relation to other data points.
• Let's start by looking at whether there is an overall trend in how MPG changes over time, on average.
- We use the ddply function from the plyr package to take the vehicles data frame, aggregate rows by year, and then, for each group, compute the mean highway, city, and combined fuel efficiency.
- The result is then assigned to a new data frame, mpgByYr.
• Note that this is our first example of split-apply-combine:
- We split the data frame into groups by year, we apply the mean function to specific variables, and then we combine the results into a new data frame:
mpgByYr <- ddply(vehicles, ~year, summarise, avgMPG = mean(comb08), avgHghy = mean(highway08), avgCity = mean(city08))
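To make the three stages visible, the sketch below spells out split, apply, and combine in base R instead of using ddply. This is a swapped-in illustration of the pattern, not how plyr is implemented, and the five rows are made up.

```r
# Made-up rows with the columns used in the ddply call.
cars <- data.frame(
  year      = c(1984, 1984, 1985, 1985, 1985),
  comb08    = c(20, 24, 21, 24, 27),
  highway08 = c(25, 29, 26, 29, 32),
  city08    = c(17, 21, 18, 21, 24)
)

# Split: one data frame per year.
pieces <- split(cars, cars$year)

# Apply: summarise each piece into a one-row data frame.
summaries <- lapply(pieces, function(d) {
  data.frame(
    year    = d$year[1],
    avgMPG  = mean(d$comb08),
    avgHghy = mean(d$highway08),
    avgCity = mean(d$city08)
  )
})

# Combine: stack the one-row summaries back into a single data frame.
mpgByYr <- do.call(rbind, summaries)
rownames(mpgByYr) <- NULL
mpgByYr
```

The ddply call does all three stages in one line, which is why it reads so compactly.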
 
Descriptions of some of the variables involved:
- combA08 - combined MPG for fuelType2
- fuelType2 - fuel type 2. For dual fuel vehicles, this will be the alternative fuel (e.g. E85, Electricity, CNG, LPG). For single fuel vehicles, this field is not used. For electric and CNG vehicles, the MPG number is MPGe (gasoline equivalent miles per gallon).
- highway08 - highway MPG for fuelType1
- fuelType1 - fuel type 1. For single fuel vehicles, this will be the only fuel. For dual fuel vehicles, this will be the conventional fuel.
- city08 - city MPG for fuelType1
• To gain a better understanding of this new data frame, we pass it to the ggplot
function, telling it to plot the avgMPG variable against the year variable, using
points.
• In addition, we specify that we want axis labels, a title, and even a smoothed
conditional mean (geom_smooth()) represented as a shaded region of the
plot:
ggplot(mpgByYr, aes(year, avgMPG)) + geom_point() + geom_smooth() + xlab("Year") + ylab("Average MPG") + ggtitle("All cars")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
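The message above says that the smoother is a loess fit. To get a feel for what the smoothed line and shaded region represent, the sketch below fits a loess curve in base R on synthetic yearly averages (generated at random, not the real mpgByYr values) and builds an approximate 95% band from the standard errors. The exact band drawn by geom_smooth may differ slightly from this normal approximation.

```r
# Synthetic yearly averages: a gentle upward curve plus noise.
set.seed(1)
d <- data.frame(year = 1984:2014)
d$avgMPG <- 19 + 0.002 * (d$year - 1984)^2 + rnorm(nrow(d), sd = 0.2)

# Fit a local regression (loess) of average MPG on year.
fit <- loess(avgMPG ~ year, data = d)

# Predict the smoothed mean and its standard error at each year.
pred <- predict(fit, newdata = d, se = TRUE)

# Approximate 95% band around the smoothed mean.
lower <- pred$fit - 1.96 * pred$se.fit
upper <- pred$fit + 1.96 * pred$se.fit

head(data.frame(year = d$year, smooth = pred$fit, lower = lower, upper = upper))
```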
Based on this visualization, one might conclude that there has been a tremendous increase in the fuel economy of cars sold in the last few years.

However, this can be a little misleading, as there have been more hybrid and non-gasoline vehicles in the later years. The dataset does contain non-gasoline vehicles, as the counts by primary fuel type show:

table(vehicles$fuelType1)
##            Diesel       Electricity Midgrade Gasoline       Natural Gas
##              1025                56                41                57
##  Premium Gasoline  Regular Gasoline
##              8521             24587
• Let's look at just gasoline cars, even though there are not many non-gasoline powered cars, and redraw the preceding plot.
• Use the subset function to create a new data frame, gasCars, which only contains the rows of vehicles in which the fuelType1 variable is one among a subset of values, there is no second fuel type, and the vehicle is not a hybrid:
gasCars <- subset(vehicles, fuelType1 %in% c("Regular Gasoline", "Premium Gasoline", "Midgrade Gasoline") & fuelType2 == "" & atvType != "Hybrid")
mpgByYr_Gas <- ddply(gasCars, ~year, summarise, avgMPG = mean(comb08))
ggplot(mpgByYr_Gas, aes(year, avgMPG)) + geom_point() + geom_smooth() + xlab("Year") + ylab("Average MPG") + ggtitle("Gasoline cars")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
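The three conditions in the subset call can be checked on a few made-up rows. The fuelType2 and atvType values below are invented for illustration, apart from "Hybrid" and the gasoline grades, which appear in the code above.

```r
# Five made-up vehicles covering the cases the filter has to handle.
cars <- data.frame(
  id        = 1:5,
  fuelType1 = c("Regular Gasoline", "Diesel", "Premium Gasoline",
                "Regular Gasoline", "Electricity"),
  fuelType2 = c("", "", "", "E85", ""),
  atvType   = c("", "Diesel", "Hybrid", "FFV", "EV"),
  stringsAsFactors = FALSE
)

# Keep rows that burn gasoline, have no second fuel, and are not hybrids.
gasCars <- subset(
  cars,
  fuelType1 %in% c("Regular Gasoline", "Premium Gasoline", "Midgrade Gasoline") &
    fuelType2 == "" &
    atvType != "Hybrid"
)
gasCars$id
```

Only the first vehicle survives: the diesel and electric cars fail the %in% test, the hybrid fails the atvType test, and the dual-fuel car fails the fuelType2 test.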
 
• Have fewer large-engine (gasoline) cars been made recently? If so, this could explain the increase.
• First, let's verify whether cars with larger engines have worse fuel efficiency. We note that the displ variable, which represents the displacement of the engine in liters, is currently a string variable that we need to convert to a numeric variable:
typeof(gasCars$displ)
## "character"
gasCars$displ <- as.numeric(gasCars$displ)
ggplot(gasCars, aes(displ, comb08)) + geom_point() + geom_smooth()
• Observations?
• There is a negative correlation between engine displacement and fuel efficiency: cars with smaller engines tend to be more fuel-efficient.
• Now, let's see whether more small cars were made in later years, which could explain the drastic increase in fuel efficiency.
avgCarSize <- ddply(gasCars, ~year, summarise, avgDispl = mean(displ))
ggplot(avgCarSize, aes(year, avgDispl)) + geom_point() + geom_smooth() + xlab("Year") + ylab("Average engine displacement (l)")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
• Observations?
• The average engine displacement has decreased substantially since 2008.
• We need to get a better sense of the impact this might have had on fuel efficiency.
• To do that, we can put both MPG and displacement by year on the same graph.
• Using ddply, we create a new data frame, byYear, which contains both the average fuel efficiency
and the average engine displacement by year:
byYear <- ddply(gasCars, ~year, summarise, avgMPG = mean(comb08), avgDispl = mean(displ))
head(byYear)
  year   avgMPG avgDispl
1 1984 19.12162 3.068449
2 1985 19.39469       NA
3 1986 19.32046 3.126514
4 1987 19.16457 3.096474
5 1988 19.36761 3.113558
6 1989 19.14196 3.133393
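Note the NA for 1985 in the avgDispl column. One likely cause, stated here as an assumption since we have not inspected the 1985 rows, is that at least one displacement entry for that year was blank and became NA when converted with as.numeric; mean then returns NA for the whole group. A minimal sketch of the effect and of the na.rm option:

```r
# Made-up displacement strings for one year, with one blank entry.
displ_text <- c("2.0", "", "3.5")

# A blank string converts to NA.
displ <- as.numeric(displ_text)
displ

# A single NA makes the plain mean NA.
mean(displ)

# na.rm = TRUE averages only the values that are present.
mean(displ, na.rm = TRUE)
```

Whether dropping missing values is appropriate depends on why they are missing, so this is worth checking before using na.rm in the real analysis.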
• The head function shows us that the resulting data frame has three columns: year, avgMPG, and avgDispl.
• To use the faceting capability of ggplot2 to display Average MPG and Avg engine displacement by year on separate but aligned plots, we must melt the data frame, converting it from what is known as a wide format to a long format:
byYear2 <- melt(byYear, id = "year")
levels(byYear2$variable) <- c("Average MPG", "Avg engine displacement")
head(byYear2)
  year    variable    value
1 1984 Average MPG 19.12162
2 1985 Average MPG 19.39469
3 1986 Average MPG 19.32046
4 1987 Average MPG 19.16457
5 1988 Average MPG 19.36761
6 1989 Average MPG 19.14196
• If we use the nrow function, we can see that the byYear2 data frame
has 62 rows and the byYear data frame has only 31.
• The two separate columns from byYear (avgMPG and avgDispl) have
now been melted into one new column (value) in the byYear2 data
frame.
• The variable column in the byYear2 data frame serves to identify the
column that the value represents:
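To see what melting does, the sketch below performs the same wide-to-long conversion by hand in base R on three made-up rows. It is an illustration of the resulting shape, not of how melt from reshape2 is implemented.

```r
# Wide format: one row per year, one column per measure (made-up values).
byYear <- data.frame(
  year     = c(1984, 1985, 1986),
  avgMPG   = c(19.1, 19.4, 19.3),
  avgDispl = c(3.07, 3.10, 3.13)
)

# Long format: one row per (year, measure) pair, stacked measure by measure.
byYear2 <- rbind(
  data.frame(year = byYear$year, variable = "avgMPG",   value = byYear$avgMPG),
  data.frame(year = byYear$year, variable = "avgDispl", value = byYear$avgDispl)
)

# Turn the identifier column into a factor and give it readable labels.
byYear2$variable <- factor(byYear2$variable, levels = c("avgMPG", "avgDispl"))
levels(byYear2$variable) <- c("Average MPG", "Avg engine displacement")

nrow(byYear)
nrow(byYear2)
byYear2
```

The long data frame has twice as many rows as the wide one because two measure columns were stacked into the single value column.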
• Now let us plot:
ggplot(byYear2, aes(year, value)) + geom_point() + geom_smooth() + facet_wrap(~variable, ncol = 1, scales = "free_y") + xlab("Year") + ylab("")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
• Observations?
1- Engine sizes generally increased until 2008, with a sudden increase in large cars between 2006 and 2008.
2- Since 2009, there has been a decrease in the average car size, which partially explains the increase in fuel efficiency.
3- Until 2005, there was an increase in the average car size, but the fuel efficiency remained roughly constant. This seems to indicate that engine efficiency has increased over the years.
4- The years 2006–2008 are interesting. Though the average engine size increased quite suddenly, the MPG remained roughly the same as in previous years. This seeming discrepancy might require more investigation.
Analyzing automobile fuel efficiency over
time
• Given the trend toward smaller displacement engines, let's see
whether automatic or manual transmissions are more efficient for
four cylinder engines, and how the efficiencies have changed over
time.
>gasCars4 <- subset(gasCars, cylinders == "4")
>ggplot(gasCars4, aes(factor(year), comb08)) + geom_boxplot() +
facet_wrap(~trany2, ncol = 1) + theme(axis.text.x =
element_text(angle = 45)) + labs(x = "Year", y = "MPG")
Analyzing automobile fuel efficiency over
time
• This time, ggplot2 was used to create box plots that help visualize the
distribution of values (and not just a single value, such as a mean) for
each year.
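To see why a distribution says more than a mean, the sketch below computes Tukey's five-number summary per year on a small made-up sample. The data frame toy is hypothetical. The five numbers are closely related to what a box plot draws, although whisker and outlier handling in a plotted box plot differ.

```r
# Hypothetical MPG values for two model years (illustrative only).
toy <- data.frame(
  year   = rep(c(1990, 2010), each = 5),
  comb08 = c(20, 22, 24, 26, 30, 24, 27, 30, 33, 45)
)

# Five-number summary (min, lower hinge, median, upper hinge, max) per year.
summaryByYear <- lapply(split(toy$comb08, toy$year), fivenum)
summaryByYear[["1990"]]
summaryByYear[["2010"]]

# The mean alone hides the spread and the high value in 2010.
meanByYear <- tapply(toy$comb08, toy$year, mean)
meanByYear
```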
Analyzing automobile fuel efficiency over
time
• Observations?
Analyzing automobile fuel efficiency over
time
• It appears that manual transmissions are more efficient than
automatic transmissions, and both show the same increase, on
average, since 2008.
Analyzing automobile fuel efficiency over
time
• Next, let's look at the change in proportion of manual cars available
each year:
>ggplot(gasCars4, aes(factor(year), fill = factor(trany2))) +
geom_bar(position = "fill") + labs(x = "Year", y = "Proportion of cars",
fill = "Transmission") + theme(axis.text.x = element_text(angle = 45))
+ geom_hline(yintercept = 0.5, linetype = 2)
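The bar chart with position = "fill" shows, for each year, the share of cars in each transmission category. The same proportions can be tabulated directly, as in this base R sketch on a hypothetical data frame toy (the eight rows are made up for illustration).

```r
# Hypothetical cars: year and transmission type (illustrative only).
toy <- data.frame(
  year   = c(2000, 2000, 2000, 2000, 2010, 2010, 2010, 2010),
  trany2 = c("Manual", "Manual", "Manual", "Auto",
             "Manual", "Auto", "Auto", "Auto")
)

# Count cars per year and transmission, then convert each row (year)
# to proportions that sum to 1.
counts <- table(toy$year, toy$trany2)
props  <- prop.table(counts, margin = 1)
props
```

The dashed line at 0.5 in the plot marks the point where the two categories are equally represented.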
 
Analyzing automobile fuel efficiency over
time
• Observations?
Analyzing automobile fuel efficiency over
time
• Recall from step 9 (the box plots) that manual transmissions
appeared more efficient than automatic transmissions, and that both
show the same increase, on average, since 2008.
Analyzing automobile fuel efficiency over
time
• However, there is something odd here. There appear to be many very
efficient cars (more than 40 MPG) with automatic transmissions in later
years, and almost no manual transmission cars with similar
efficiencies in the same time frame.
Analyzing automobile fuel efficiency over
time
• The pattern is reversed in earlier years.
• Is there a change in the proportion of manual cars available each
year? Yes.
Analyzing automobile fuel efficiency over
time
• What are these very efficient cars?
• In the next section, we look at the makes and models of the cars in
the database.
Investigating the makes and models of automobiles

• This recipe will investigate the makes and models of automobiles and
how they have changed over time:
• Let's look at how the makes and models of cars inform fuel efficiency over
time.
• First, let's look at the frequency of the makes and models of cars available
in the US over this time and concentrate on four-cylinder cars:
>carsMake <- ddply(gasCars4, ~year, summarise, numberOfMakes =
length(unique(make)))
>ggplot(carsMake, aes(year, numberOfMakes)) + geom_point() + labs(x =
"Year", y = "Number of available makes") + ggtitle("Four cylinder cars")
Investigating the makes and models of automobiles

• Observations?
Investigating the makes and models of automobiles

There has been a decline in the number of makes available over this period,
though there has been a small uptick in recent times.
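The per-year count of makes behind this observation is just the number of distinct make values within each year. A minimal base R sketch of that computation follows. It uses tapply instead of ddply, and the data frame toy is hypothetical.

```r
# Hypothetical cars: year and make (illustrative only).
toy <- data.frame(
  year = c(1984, 1984, 1984, 1985, 1985),
  make = c("Ford", "Ford", "Honda", "Ford", "Ford")
)

# For each year, count how many distinct makes appear.
numberOfMakes <- tapply(toy$make, toy$year, function(x) length(unique(x)))

# Arrange the result as a data frame with one row per year.
carsMakeToy <- data.frame(
  year          = as.numeric(names(numberOfMakes)),
  numberOfMakes = as.vector(numberOfMakes)
)
carsMakeToy
```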
Investigating the makes and models of automobiles

• Can we look at the makes that have been available for every year of this
study?
>uniqMakes <- dlply(gasCars4, ~year, function(x) unique(x$make))
>commonMakes <- Reduce(intersect, uniqMakes)
>commonMakes
## [1] "Ford" "Honda" "Toyota" "Volkswagen" "Chevrolet"
## [6] "Chrysler" "Nissan" "Dodge" "Mazda" "Mitsubishi"
## [11] "Subaru" "Jeep"
We find that only 12 manufacturers made four-cylinder cars every year
during this period.
Investigating the makes and models of automobiles

• How have these manufacturers done over time with respect to fuel
efficiency?
>carsCommonMakes4 <- subset(gasCars4, make %in%
commonMakes)
>avgMPG_commonMakes <- ddply(carsCommonMakes4, ~year +
make, summarise, avgMPG = mean(comb08))
>ggplot(avgMPG_commonMakes, aes(year, avgMPG)) + geom_line() +
facet_wrap(~make, nrow = 3)
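The grouping step above averages comb08 within each combination of year and make. As a sketch of that computation on a small scale, the example below uses base R's aggregate instead of ddply. The data frame toy and its values are hypothetical.

```r
# Hypothetical cars: year, make, and combined MPG (illustrative only).
toy <- data.frame(
  year   = c(2000, 2000, 2000, 2010, 2010, 2010),
  make   = c("Ford", "Ford", "Honda", "Ford", "Honda", "Honda"),
  comb08 = c(20, 24, 28, 26, 30, 34)
)

# Mean MPG for every (year, make) pair that occurs in the data.
avgToy <- aggregate(comb08 ~ year + make, data = toy, FUN = mean)
names(avgToy)[3] <- "avgMPG"
avgToy
```

Each row of the result corresponds to one point on one panel of the faceted line plot.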
Investigating the makes and models of
automobiles
• Observations?
Investigating the makes and models of
automobiles
Most manufacturers have improved over this period,
and several have shown quite sharp fuel
efficiency increases in the last 5 years.
Investigating the makes and models of automobiles

• We use dlply (not ddply) to take the gasCars4 data frame, split it by year,
and apply the unique function to the make variable.
• For each year, this computes a vector of the unique available automobile makes,
and dlply returns a list of these vectors (one element per year).
• We use dlply rather than ddply because dlply takes a data frame (d) as input and
returns a list (l) as output, whereas ddply takes a data frame (d) as input
and outputs a data frame (d):
>uniqMakes <- dlply(gasCars4, ~year, function(x) unique(x$make))
>commonMakes <- Reduce(intersect, uniqMakes)
>commonMakes
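The split-apply pattern described here can be sketched in base R with split and lapply, which also return a list with one element per year. This is an analogue of the dlply call and not plyr's implementation. The data frame toy is hypothetical.

```r
# Hypothetical cars: year and make (illustrative only).
toy <- data.frame(
  year = c(1984, 1984, 1985, 1985, 1985),
  make = c("Ford", "Honda", "Ford", "Ford", "Toyota"),
  stringsAsFactors = FALSE
)

# Split the data frame by year, then take the unique makes in each piece.
uniqMakesToy <- lapply(split(toy, toy$year), function(x) unique(x$make))

class(uniqMakesToy)     # "list"
names(uniqMakesToy)     # one element per year
uniqMakesToy[["1985"]]  # unique makes seen in 1985
```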
Investigating the makes and models of automobiles

• The next line uses the higher-order function Reduce. This is the same
idea as the reduce step in the MapReduce programming paradigm
introduced by Google, which underlies Hadoop.
• R is a functional programming language and offers several higher-order
functions as part of its core.
• A higher-order function accepts another function as input. In this line, we
pass the intersect function to Reduce, which applies it cumulatively across
the list of unique makes per year created previously: it intersects the first
two years, then intersects that result with the third year, and so on.
• Ultimately, this results in a single vector of the automobile makes that are
present in every year.
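A small worked example may help show how Reduce folds intersect across the list. The list makesByYear below is hypothetical. The accumulate = TRUE argument of Reduce is used only to expose the intermediate results.

```r
# Hypothetical unique makes for three years (illustrative only).
makesByYear <- list(
  "1984" = c("Ford", "Honda", "Toyota", "Saab"),
  "1985" = c("Ford", "Honda", "Toyota"),
  "1986" = c("Honda", "Ford", "Yugo")
)

# Fold intersect across the list: ((1984 and 1985) and 1986).
common <- Reduce(intersect, makesByYear)

# The same computation written out by hand.
byHand <- intersect(intersect(makesByYear[[1]], makesByYear[[2]]),
                    makesByYear[[3]])

# Keep every intermediate result to see the set shrink step by step.
steps <- Reduce(intersect, makesByYear, accumulate = TRUE)

common
steps
```

Only the makes that survive every intersection remain, which is how commonMakes ends up holding the makes available in all years.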
Investigating the makes and models of automobiles

• In the final graph, adding + facet_wrap(~make, nrow = 3) tells ggplot2
that we want a separate set of axes for each make of automobile, with
these subplots distributed across three rows.
• This is an incredibly powerful data visualization technique as it allows
us to clearly see patterns that might only manifest for a particular
value of a variable.
• We kept things simple in this first data science project. The dataset
itself was small: only 12 megabytes uncompressed, and easily stored and
handled on a basic laptop.
