CMPS 396X Advanced Data Science: Fatima K. Abu Salem, Exploratory Analysis, Driving Visual Analysis With Automobile Data

This document discusses exploratory data analysis (EDA) and provides an example of using EDA to analyze automobile fuel efficiency data. It begins by introducing EDA, its key principles and tools. It then describes acquiring automobile fuel efficiency data from online sources, importing the data into R, and preparing to explore the data. The goal of EDA in this case is to summarize and group the data to understand fuel efficiency trends over time and between vehicle groups.

Uploaded by

Hussein ElGhoul

CMPS 396X

Advanced Data Science
Fatima K. Abu Salem
Lecture 3
Exploratory Analysis
Driving Visual Analysis with Automobile Data
Part I – Introduction to EDA
Exploratory Data Analysis
• Exploratory data analysis (EDA) is a critical part of the data science
process, and the first step toward building a model.
• John Tukey, a mathematician at Bell Labs, developed exploratory data
analysis as a counterpart to confirmatory data analysis, which concerns
itself with modeling and hypothesis testing.
• In EDA, there is no hypothesis and there is no model. The
“exploratory” aspect means that your understanding of the problem
you are solving is changing as you go.
• The basic tools of EDA are plots, graphs and summary statistics.
EDA typically involves:
• going through the data
• plotting distributions of all variables (for example, using box plots)
• plotting time series of data
• transforming variables
• looking at all pairwise relationships between variables using
scatterplot matrices
• generating summary statistics for each variable (e.g. mean, minimum,
maximum, and the upper and lower quartiles) and identifying outliers.
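As a minimal sketch of the summary-statistics and outlier steps listed above, the following base R snippet uses a small made-up vector (the values are illustrative, not taken from the course data). It flags outliers with the 1.5 × IQR box-plot rule, which is one common convention rather than the only possible definition.

```r
# Hypothetical sample: six typical values and one suspiciously large one
mpg <- c(12, 15, 14, 18, 21, 19, 95)

# Minimum, quartiles, median, mean, and maximum in one call
print(summary(mpg))

# boxplot.stats() applies the box-plot rule (points beyond 1.5 * IQR
# from the hinges) and returns the flagged values in $out
stats <- boxplot.stats(mpg)
outliers <- stats$out
print(outliers)
```

Whether a flagged value is an error or a genuinely unusual observation is a judgment call that EDA is meant to inform.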
• EDA is also a mindset about your relationship with the data. You want to:
- understand the data
- gain intuition
- connect your understanding of the process that generated the data to the data itself
- make sure the data is on the scale you expect and in the format you thought it should be
• EDA happens between you and the data and isn’t about proving
anything to anyone else yet.
• Although EDA involves a lot of visualization, we distinguish
between EDA and data visualization:
- EDA is done toward the beginning of the analysis.
- Data visualization is done toward the end, to communicate one's findings.
- With EDA, the graphics are solely for you, to understand what is going on.
• We also distinguish between EDA and ML (Machine Learning):
- Plotting data and making comparisons can get you extremely far, and is far
better than getting a dataset and immediately running a regression just
because you know how.
- Machine learning cannot rescue you from every data storm; without
investing enough time in EDA, you will find yourself struggling to improve
your model's accuracy.
• It has been a disservice to analysts and data scientists that EDA has not been enforced as a
critical part of the process of working with data. Take this opportunity to make it part of
your process!
• Here are some references to help you understand best practices and historical context:
- Exploratory Data Analysis by John Tukey (Pearson)
- The Visual Display of Quantitative Information by Edward Tufte (Graphics Press)
- The Elements of Graphing Data by William S. Cleveland (Hobart Press)
- Statistical Graphics for Visualizing Multivariate Data by William G. Jacoby (Sage)
- "Exploratory Data Analysis for Complex Models" by Andrew Gelman (American Statistical Association)
- "The Future of Data Analysis" by John Tukey. Annals of Mathematical Statistics, Volume 33, Number 1 (1962), 1-67.
- "Data Analysis, Exploratory" by David Brillinger [8-page excerpt from International Encyclopedia of Political Science (Sage)]
Data Science Cycle
Part II – A case study in EDA
Driving Visual Analysis with Automobile Data
In this chapter, we will cover the following:
• Acquiring automobile fuel efficiency data
• Importing automobile fuel efficiency data into R
• Exploring and describing fuel efficiency data
• Analyzing automobile fuel efficiency over time
• Investigating the makes and models of automobiles (HW)
Introduction

• The first project we will introduce is an analysis of automobile fuel economy.
• The recipes in this chapter will roughly follow these five steps in the
data science pipeline:
- Acquisition
- Exploration and understanding
- Munging, wrangling, and manipulation
- Analysis and modeling
- Communication and operationalization
• We will also just touch on the outskirts of statistical inference.
Acquiring automobile fuel efficiency data

• We will dive into a dataset that contains fuel efficiency performance
metrics, measured in miles per gallon (MPG) over time, for most
makes and models of automobiles available in the U.S. since 1984.
The data is courtesy of the U.S. Department of Energy and the U.S. Environmental Protection Agency.
• The dataset also contains several features and attributes of the
automobiles listed.
• Goal: summarize and group the data to determine which groups tend to have
better fuel efficiency historically and how this has changed over the years.
The latest version of the dataset is available at http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip, and information about the variables
in the dataset can be found at http://www.fueleconomy.gov/feg/ws/index.shtml#vehicle. The data was last updated on December 4, 2013 and
was downloaded on December 8, 2013.
To acquire the data needed:
1. Download the dataset from
http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip.
2. Unzip vehicles.csv and move it to your working code directory.
3. Open the unzipped vehicles.csv file with Microsoft Excel, Google Spreadsheet,
or a simple text editor.
Comma-separated value (CSV) files are very convenient to work with, as they can be
edited and viewed with very basic, freely available tools.
4. Navigate to http://www.fueleconomy.gov/feg/ws/index.shtml#vehicle.
5. Select and copy all the text below the vehicle heading under Data Description,
and paste it into a text file. Do not include the emissions heading.
6. Save this file in your working directory as varlabels.txt.
The first five lines of the file are as follows:
atvtype - type of alternative fuel or advanced technology vehicle
barrels08 - annual petroleum consumption in barrels for fuelType1 (1)
barrelsA08 - annual petroleum consumption in barrels for fuelType2 (1)
charge120 - time to charge an electric vehicle in hours at 120 V
charge240 - time to charge an electric vehicle in hours at 240 V
Preparing R for your first project

If you are using RStudio:
1. Launch RStudio on your computer.
2. At the R console prompt, install the three R packages needed for this project:
install.packages("plyr")
install.packages("ggplot2")
install.packages("reshape2")
3. Load the R packages, as follows:
library(plyr)
library(ggplot2)
library(reshape2)
Packages
• plyr: tools for splitting, applying, and combining data.
- It solves a common set of problems: you need to break a big problem down into
manageable pieces, operate on each piece, and then put all the pieces back together.
- For example, you might want to fit a model to each spatial location or time point in your
study, summarise data by panels, or collapse high-dimensional arrays to simpler summary
statistics.
- We will use plyr to apply the split-apply-combine data analysis pattern.
• ggplot2 will make complex data visualizations significantly easier.
• reshape2 will be used to convert data frames between wide and long formats.
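The split-apply-combine pattern can be sketched without any packages. The snippet below is only an illustration in base R on a made-up data frame (the name cars and its values are hypothetical); plyr wraps the same three stages into a single call.

```r
# Hypothetical toy data: five cars from two model years
cars <- data.frame(
  year = c(1984, 1984, 1985, 1985, 1985),
  mpg  = c(18, 22, 20, 24, 28)
)

# Split: one vector of MPG values per year
pieces <- split(cars$mpg, cars$year)

# Apply: compute the mean of each piece
means <- sapply(pieces, mean)

# Combine: put the per-group results back into one data frame
result <- data.frame(
  year   = as.integer(names(means)),
  avgMPG = as.numeric(means)
)
print(result)
```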
Importing automobile fuel efficiency data into R

1.First, set the working directory to the location where we saved the vehicles.csv.zip file:
setwd("path")
Substitute the path for the actual directory.
2.We can load the data directly from compressed (ZIP) files:
vehicles <- read.csv(unz("vehicles.csv.zip", "vehicles.csv"), stringsAsFactors = F)
3.To see whether this worked, display the first few rows of data using the head command:
head(vehicles)
You should see the first few rows of the dataset printed on your screen.
Note that we could also have used the tail command to display the last few rows instead.
The read.csv function call included stringsAsFactors = F as its final parameter.
By default, versions of R before 4.0.0 convert strings to a datatype known as factors in many cases.
Since R 4.0.0 the default is stringsAsFactors = FALSE, so passing it explicitly makes the behavior
the same on every version.
Factor is the name of R's categorical datatype, which can be thought of as a label
or tag applied to the data.
Internally, R stores factors as integers with a mapping to the
appropriate label. This technique allowed older versions of R to store factors in much less
memory than the corresponding character vectors.
When importing data into R, we often run into the situation where a column of
numeric data might contain an entry that is non-numeric. In this case, R might
import the column of data as factors (or, in recent versions, as character strings),
which is often not what was intended by the data scientist.
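To make the integer-plus-label storage concrete, here is a small sketch on made-up values (the vectors fuel and raw are hypothetical). It also shows why a numeric column that was imported as a factor should be converted through character first.

```r
# A factor stores integer codes plus a table of labels (levels)
fuel <- factor(c("Regular", "Premium", "Regular", "Diesel"))
print(levels(fuel))      # labels, sorted alphabetically
print(as.integer(fuel))  # the integer code stored for each element

# Hypothetical numeric column that contains one non-numeric entry
raw <- c("2.5", "3.0", "n/a", "1.8")
displ <- factor(raw)

# Pitfall: converting the factor directly returns the integer codes
codes <- as.numeric(displ)

# Going through character first recovers the numbers; the bad entry becomes NA
values <- suppressWarnings(as.numeric(as.character(displ)))
print(codes)
print(values)
```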
4. The variable labels for the vehicles.csv file are in varlabels.txt.
A quick look at the file shows that the variable names and their
explanations are separated by -. So, we will try to read the file using - as the separator:
labels <- read.table("varlabels.txt", sep = "-", header = FALSE)
## Error: line 11 did not have 2 elements
5. This doesn't work! A closer look at the error shows that in line 11 of the data file,
there are two - symbols, and it thus gets broken into three parts rather than two,
unlike the other rows.
We need to change our file-reading approach to ignore hyphenated words:
labels <- do.call(rbind, strsplit(readLines("varlabels.txt"), " - "))
6. To check whether it works, we use the head function again:
head(labels)
[,1] [,2]
[1,] "atvtype" "type of alternative fuel or advanced technology vehicle"
[2,] "barrels08" "annual petroleum consumption in barrels for fuelType1 (1)"
[3,] "barrelsA08" "annual petroleum consumption in barrels for fuelType2 (1)"
[4,] "charge120" "time to charge an electric vehicle in hours at 120 V"
[5,] "charge240" "time to charge an electric vehicle in hours at 240 V"
Let's break down the last complex statement in step 5, piece by piece:
First, let's read the file line by line:
x <- readLines("varlabels.txt")
Each line needs to be split at the string " - " (a hyphen with a space on either side).
The spaces are important, so that we don't split hyphenated words (such as in line 11).
This results in each line being split into two parts as a vector of strings,
with the vectors stored in a single list:
y <- strsplit(x, " - ")
Now, we stack these vectors together to make a matrix of strings,
where the first column is the variable name and the second column
is the description of the variable:
labels <- do.call(rbind, y)
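The three steps above can be tried without the file. In this sketch the vector x stands in for the result of readLines; its first two entries come from the file excerpt shown earlier, while the third (exampleVar) is a made-up entry added only to include a hyphenated word. The argument fixed = TRUE is an addition here that treats the separator as a literal string rather than a regular expression.

```r
# Stand-in for readLines("varlabels.txt"); the third entry is hypothetical
x <- c(
  "atvtype - type of alternative fuel or advanced technology vehicle",
  "barrels08 - annual petroleum consumption in barrels for fuelType1 (1)",
  "exampleVar - a hypothetical description containing a two-part word"
)

# Splitting on a bare hyphen breaks the third line into three pieces
naive <- strsplit(x, "-", fixed = TRUE)
print(lengths(naive))

# Splitting on " - " (hyphen with surrounding spaces) keeps it at two
y <- strsplit(x, " - ", fixed = TRUE)
print(lengths(y))

# Stack the two-element vectors into a two-column character matrix
labels <- do.call(rbind, y)
print(labels)
```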
Exploring and describing fuel efficiency data
The next step is to do some preliminary analysis of the dataset:
1. First, let's find out how many observations (rows) are in our data:
nrow(vehicles)
## 34287
2. Next, let's find out how many variables (columns) are in our data:
ncol(vehicles)
## 74
Now, let's get a sense of which columns of data are present in the data frame:
names(vehicles)
• Determine the first and last years present in the dataset:
first_year <- min(vehicles[, "year"])
## 1984
last_year <- max(vehicles[, "year"])
## 2014
Since we might use the year variable a lot, let's make sure that we have each year covered.

The list of years from 1984 to 2014 should contain 31 unique values!

To test this, use the following command:

length(unique(vehicles$year))
[1] 31
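Counting unique values confirms coverage, but it does not say which year is absent when the count falls short. A possible follow-up check, sketched here on a made-up year vector rather than the real column, is to compare against the full expected range:

```r
# Hypothetical year column with one model year missing
year <- c(1984, 1985, 1985, 1987, 1988)

# Every year in the covered range
expected <- min(year):max(year)

# Years in the range that have no rows at all
missing_years <- setdiff(expected, year)

print(length(unique(year)) == length(expected))
print(missing_years)
```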
Next, let's find out what types of fuel are used as the automobiles' primary fuel types:
table(vehicles$fuelType1)
##            Diesel       Electricity Midgrade Gasoline       Natural Gas
##              1025                56                41                57
##  Premium Gasoline  Regular Gasoline
##              8521             24587

Observations?
Most cars in the dataset use regular gasoline,
and the second most common fuel type is premium gasoline.

Let's explore the types of transmissions used by these automobiles.

We first need to take care of all missing data by setting it to NA:

vehicles$trany[vehicles$trany == ""] <- NA

Now, the trany column is text, and we only care whether the car's
transmission is automatic or manual.
Use the substr function to extract the first four characters of each trany
column value and determine whether it is equal to Auto.
If so, we set a new variable, trany2, equal to Auto; otherwise, the value is set to
Manual:
vehicles$trany2 <- ifelse(substr(vehicles$trany, 1, 4) == "Auto", "Auto", "Manual")
Finally, we convert the new variable to a factor and then use the table
function to see the distribution of values:
vehicles$trany2 <- as.factor(vehicles$trany2)
table(vehicles$trany2)
##   Auto Manual
##  22451  11825
Observations?
There are roughly twice as many automobile models with automatic
transmission as there are models with manual transmission.
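The two counts above sum to 34,276, while nrow reported 34,287 rows, so 11 rows are presumably absent from the table because their trany value is NA. The sketch below, on made-up transmission strings, shows how NA passes through substr and ifelse and how to make table report it.

```r
# Hypothetical transmission strings, including one blank entry
trany <- c("Automatic 4-spd", "Manual 5-spd", "", "Automatic (S6)")
trany[trany == ""] <- NA

# substr() and ifelse() both propagate NA, so the blank entry stays NA
trany2 <- ifelse(substr(trany, 1, 4) == "Auto", "Auto", "Manual")

# table() drops NA by default; useNA = "ifany" makes it visible
default_counts <- table(trany2)
full_counts <- table(trany2, useNA = "ifany")
print(default_counts)
print(full_counts)
```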
How it works
We used the powerful table function to create a count of the
occurrence of values for the fuelType1 variable. This function is capable
of much more, including cross tabulations, as follows:
with(vehicles, table(sCharger, year))
1. We looked at the number of automobile models with a super charger, by year.
2. We saw that super chargers have seemingly become more popular recently than
they were in the past.
3. The with command tells R to use vehicles as the default data when performing the
subsequent command, in this case, table. Thus, we can omit prefacing the sCharger and
year column names with the name of the data frame, vehicles, followed by the dollar sign.
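A small illustration of the equivalence, using a made-up five-row data frame (vehicles_demo is hypothetical and only mimics the two columns involved):

```r
# Hypothetical miniature version of the vehicles data frame
vehicles_demo <- data.frame(
  year     = c(1984, 1984, 2013, 2013, 2013),
  sCharger = c("", "", "S", "", "S"),
  stringsAsFactors = FALSE
)

# Without with(): every column is prefixed by the data frame name
tab1 <- table(vehicles_demo$sCharger, vehicles_demo$year)

# With with(): column names are looked up inside vehicles_demo
tab2 <- with(vehicles_demo, table(sCharger, year))
print(tab2)
```

Both calls produce the same counts; the with form additionally labels the table dimensions with the column names.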
To provide a cautionary tale about data import, let's look at the sCharger and tCharger
columns more closely.

Note that these columns indicate whether the car contains a super charger or
a turbo charger, respectively.

Starting with sCharger, we look at the class of the variable and the unique values
present in the data frame:
> class(vehicles$sCharger)
[1] "character"
> unique(vehicles$sCharger)
[1] "" "S"
We next look at tCharger, expecting things to be the same:
> class(vehicles$tCharger)
[1] "logical"
> unique(vehicles$tCharger)
[1] NA TRUE
These two seemingly similar variables are completely different datatypes.
While the tCharger variable is a logical variable, the sCharger variable is
the more general character datatype. Something seems wrong.
If we open the .csv file, we can see that the sCharger and tCharger columns are either
blank or contain an S or a T, respectively.
Thus, R has read the T character in the tCharger column as the Boolean value TRUE,
as opposed to the character T. This isn't a fatal flaw and might not impact an analysis.
However, undetected bugs such as this can cause problems far down the analytical
pipeline and necessitate significant repeated work.
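The effect can be reproduced on a two-row CSV held in a string. The snippet below is a sketch, not part of the original recipe: csv_text is made-up data that mimics the layout just described, and colClasses is one possible way to stop read.csv from guessing column types.

```r
# Two-row CSV mimicking the layout described above: blank, "S", or "T"
csv_text <- "sCharger,tCharger\n,T\nS,\n"

# Default import: R guesses each column type, and "T" parses as logical TRUE
guessed <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(sapply(guessed, class))

# Declaring the column classes keeps both columns as plain text
fixed <- read.csv(text = csv_text, colClasses = "character")
print(sapply(fixed, class))
print(fixed$tCharger)
```

Checking class for every column right after import is a cheap way to catch this kind of silent conversion early.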
Analyzing automobile fuel efficiency over time
• We continue the exploration by looking at some of the fuel efficiency metrics over time and in
relation to other data points.
• Let's start by looking at whether there is an overall trend in how MPG changes over time, on
average.
- We use the ddply function from the plyr package to take the vehicles data frame, aggregate rows
by year, and then, for each group, compute the mean highway, city, and combined fuel efficiency.
- The result is then assigned to a new data frame, mpgByYr.
• Note that this is our first example of split-apply-combine:
- We split the data frame into groups by year, apply the mean function to specific variables,
and then combine the results into a new data frame:
> mpgByYr <- ddply(vehicles, ~year, summarise, avgMPG = mean(comb08), avgHghy =
mean(highway08), avgCity = mean(city08))
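For readers without plyr installed, roughly the same grouped means can be obtained with base R's aggregate. This is a swapped-in alternative shown on a made-up data frame (toy and mpgByYr_demo are hypothetical names), not the ddply call itself.

```r
# Hypothetical toy data using the same column names as the dataset
toy <- data.frame(
  year      = c(1984, 1984, 1985, 1985),
  comb08    = c(18, 22, 20, 24),
  highway08 = c(24, 28, 26, 30),
  city08    = c(15, 19, 17, 21)
)

# Group rows by year and take the mean of each MPG column
mpgByYr_demo <- aggregate(cbind(comb08, highway08, city08) ~ year,
                          data = toy, FUN = mean)

# Rename the columns to match the names used with ddply above
names(mpgByYr_demo) <- c("year", "avgMPG", "avgHghy", "avgCity")
print(mpgByYr_demo)
```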

Some relevant variable descriptions:
- combA08 - combined MPG for fuelType2
- fuelType2 - fuel type 2. For dual fuel vehicles, this will be the alternative fuel (e.g. E85, Electricity,
CNG, LPG). For single fuel vehicles, this field is not used. For electric and CNG vehicles,
the MPG number is MPGe (gasoline equivalent miles per gallon).
- highway08 - highway MPG for fuelType1
- fuelType1 - fuel type 1. For single fuel vehicles, this will be the only fuel. For dual fuel
vehicles, this will be the conventional fuel.
- city08 - city MPG for fuelType1
• To gain a better understanding of this new data frame, we pass it to the ggplot
function, telling it to plot the avgMPG variable against the year variable, using
points.
• In addition, we specify that we want axis labels, a title, and even a smoothed
conditional mean (geom_smooth()) represented as a shaded region of the
plot:
>ggplot(mpgByYr, aes(year, avgMPG)) + geom_point() + geom_smooth() +
xlab("Year") + ylab("Average MPG") + ggtitle("All cars")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
Based on this visualization, one might conclude that there has been a tremendous
increase in the fuel economy of cars sold in the last few years.

However, this can be a little misleading as there have been more hybrid and
non-gasoline vehicles in the later years, which is shown as follows:

> table(vehicles$fuelType1)
##            Diesel       Electricity Midgrade Gasoline       Natural Gas
##              1025                56                41                57
##  Premium Gasoline  Regular Gasoline
##              8521             24587
• Let's look at just gasoline cars, even though there are not many non-gasoline powered cars, and
redraw the preceding plot.
• Use the subset function to create a new data frame, gasCars, which only contains the rows of
vehicles in which the fuelType1 variable is one among a subset of values:
gasCars <- subset(vehicles, fuelType1 %in% c("Regular Gasoline", "Premium Gasoline",
"Midgrade Gasoline") & fuelType2 == "" & atvType != "Hybrid")

mpgByYr_Gas <- ddply(gasCars, ~year, summarise, avgMPG = mean(comb08))
ggplot(mpgByYr_Gas, aes(year, avgMPG)) + geom_point() + geom_smooth() + xlab("Year") +
ylab("Average MPG") + ggtitle("Gasoline cars")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.

• Have fewer large engine (gasoline) cars been made recently? If so, this can explain the
increase.
How?
• First, let's verify whether cars with larger engines have worse fuel efficiency. We note that the
displ variable, which represents the displacement of the engine in liters, is currently a string
variable that we need to convert to a numeric variable:
>typeof(gasCars$displ)
## "character"
>gasCars$displ <- as.numeric(gasCars$displ)
>ggplot(gasCars, aes(displ, comb08)) + geom_point() + geom_smooth()

• Observations: there is a negative correlation between engine
displacement and fuel efficiency; thus, cars with smaller engines tend to be
more fuel-efficient.
• Now, let's see whether more small cars were made in later years, which can
explain the drastic increase in fuel efficiency.
> avgCarSize <- ddply(gasCars, ~year, summarise, avgDispl = mean(displ))
> ggplot(avgCarSize, aes(year, avgDispl)) + geom_point() + geom_smooth() +
xlab("Year") + ylab("Average engine displacement (l)")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.

• Observations: the average engine displacement has decreased substantially since
2008.
• We need to get a better sense of the impact this might have had on fuel
efficiency.
• To do that, we can put both MPG and displacement by year on the same graph.
• Using ddply, we create a new data frame, byYear, which contains both the average fuel efficiency
and the average engine displacement by year:
> byYear <- ddply(gasCars, ~year, summarise, avgMPG = mean(comb08), avgDispl = mean(displ))
> head(byYear)
  year   avgMPG avgDispl
1 1984 19.12162 3.068449
2 1985 19.39469       NA
3 1986 19.32046 3.126514
4 1987 19.16457 3.096474
5 1988 19.36761 3.113558
6 1989 19.14196 3.133393
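The NA reported for 1985 in avgDispl is what mean returns when any value in a group is missing. This suggests, as an assumption about the data rather than something the slides confirm, that at least one 1985 car has no displacement recorded. A sketch on made-up values shows the behavior and one possible remedy, na.rm = TRUE:

```r
# Hypothetical displacement strings for two model years; one entry is blank
demo <- data.frame(
  year  = c(1984, 1984, 1985, 1985),
  displ = c("2.0", "3.0", "", "4.0"),
  stringsAsFactors = FALSE
)

# Convert to numeric; the blank string becomes NA
demo$displ <- as.numeric(demo$displ)

# Default mean(): a single NA makes the whole group's average NA
with_na <- tapply(demo$displ, demo$year, mean)

# na.rm = TRUE averages only the values that are present
without_na <- tapply(demo$displ, demo$year, mean, na.rm = TRUE)
print(with_na)
print(without_na)
```

Whether dropping the missing values is appropriate depends on why they are missing, so it is worth inspecting those rows before choosing.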
• The head function shows us that the resulting data frame has three columns: year, avgMPG, and avgDispl.
• To use the faceting capability of ggplot2 to display Average MPG and Avg engine displacement by year on
separate but aligned plots, we must melt the data frame, converting it from what is known as a wide format to
a long format:
> byYear2 = melt(byYear, id = "year")
> levels(byYear2$variable) <- c("Average MPG", "Avg engine displacement")
> head(byYear2)
  year    variable    value
1 1984 Average MPG 19.12162
2 1985 Average MPG 19.39469
3 1986 Average MPG 19.32046
4 1987 Average MPG 19.16457
5 1988 Average MPG 19.36761
6 1989 Average MPG 19.14196
• If we use the nrow function, we can see that the byYear2 data frame
has 62 rows and the byYear data frame has only 31.
• The two separate columns from byYear (avgMPG and avgDispl) have
now been melted into one new column (value) in the byYear2 data
frame.
• The variable column in the byYear2 data frame identifies which
original column each value came from.
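What melting does can be written out by hand. The following base R sketch uses a made-up three-year data frame (wide and long are hypothetical names) and builds the long format explicitly; reshape2's melt produces the same shape in one call.

```r
# Hypothetical wide data frame: one row per year, one column per measure
wide <- data.frame(
  year     = 1984:1986,
  avgMPG   = c(19.1, 19.4, 19.3),
  avgDispl = c(3.07, 3.10, 3.13)
)

# Long format: one row per (year, measure) pair
long <- data.frame(
  year     = rep(wide$year, times = 2),
  variable = rep(c("avgMPG", "avgDispl"), each = nrow(wide)),
  value    = c(wide$avgMPG, wide$avgDispl)
)
print(long)

# The long data frame has twice as many rows as the wide one
print(c(nrow(wide), nrow(long)))
```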
• Now let us plot:
> ggplot(byYear2, aes(year, value)) + geom_point() + geom_smooth() +
facet_wrap(~variable, ncol = 1, scales = "free_y") + xlab("Year") +
ylab("")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
• Observations:
1. Engine sizes generally increased until 2008, with a sudden
increase in large cars between 2006 and 2008.
2. Since 2009, there has been a decrease in the average car size, which
partially explains the increase in fuel efficiency.
3. Until 2005, there was an increase in the average car size, but the fuel
efficiency remained roughly constant. This seems to indicate that
engine efficiency has increased over the years.
4. The years 2006–2008 are interesting. Though the average engine size
increased quite suddenly, the MPG remained roughly the same as in
previous years. This seeming discrepancy might require more
investigation.
• Given the trend toward smaller displacement engines, let's see
whether automatic or manual transmissions are more efficient for
four-cylinder engines, and how the efficiencies have changed over
time.
> gasCars4 <- subset(gasCars, cylinders == "4")
> ggplot(gasCars4, aes(factor(year), comb08)) + geom_boxplot() +
facet_wrap(~trany2, ncol = 1) + theme(axis.text.x =
element_text(angle = 45)) + labs(x = "Year", y = "MPG")
• This time, ggplot2 was used to create box plots that help visualize the
distribution of values (and not just a single value, such as a mean) for
each year.
• Observations: it appears that manual transmissions are more efficient than
automatic transmissions, and they both exhibit the same increase, on
average, since 2008.
• Next, let's look at the change in proportion of manual cars available
each year:
>ggplot(gasCars4, aes(factor(year), fill = factor(trany2))) +
geom_bar(position = "fill") + labs(x = "Year", y = "Proportion of cars",
fill = "Transmission") + theme(axis.text.x = element_text(angle = 45))
+ geom_hline(yintercept = 0.5, linetype = 2)
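The stacked bars drawn by position = "fill" show, for each year, the share of cars with each transmission type. The same numbers can be computed directly; the sketch below does so with base R on a made-up data frame (demo and its values are hypothetical).

```r
# Hypothetical four-cylinder cars: model year and transmission type
demo <- data.frame(
  year   = c(1984, 1984, 1984, 1984, 2013, 2013, 2013, 2013),
  trany2 = c("Manual", "Manual", "Manual", "Auto",
             "Auto", "Auto", "Auto", "Manual")
)

# Count cars per year and transmission, then divide each row by its total
counts <- table(demo$year, demo$trany2)
props <- prop.table(counts, margin = 1)
print(props)
```

Here margin = 1 normalizes within each row (year), so every row of props sums to 1, just as every bar in the plot reaches the top of the axis.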

Analyzing automobile fuel efficiency over
time
• Observations?
Analyzing automobile fuel efficiency over
time
• Recall from the box plots that manual transmissions appear more
efficient than automatic transmissions, and both exhibit the
same increase, on average, since 2008.
Analyzing automobile fuel efficiency over
time
• However, there is something odd here. There appear to be many very
efficient cars (greater than 40 MPG) with automatic transmissions in
later years, and almost no manual transmission cars with similar
efficiencies in the same time frame.
Analyzing automobile fuel efficiency over
time
• The pattern is reversed in earlier years.
• Is there a change in the proportion of manual cars available each
year? Yes.
Analyzing automobile fuel efficiency over
time
• What are these very efficient cars?
• In the next section, we look at the makes and models of the cars in
the database.
Investigating the makes and models of automobiles

• This section investigates the makes and models of automobiles and
how they have changed over time:
• Let's look at how the makes and models of cars inform fuel efficiency over
time.
• First, let's look at the frequency of the makes and models of cars available
in the US over this time and concentrate on four-cylinder cars:
>carsMake <- ddply(gasCars4, ~year, summarise,
  numberOfMakes = length(unique(make)))
>ggplot(carsMake, aes(year, numberOfMakes)) + geom_point() +
  labs(x = "Year", y = "Number of available makes") +
  ggtitle("Four cylinder cars")
Investigating the makes and models of automobiles

• Observations?
Investigating the makes and models of automobiles

There has been a decline in the number of makes available over this
period, though there has been a small uptick in recent times.
Investigating the makes and models of automobiles

• Can we look at the makes that have been available for every year of
this study?
?
>uniqMakes <- dlply(gasCars4, ~year, function(x) unique(x$make))
>commonMakes <- Reduce(intersect, uniqMakes)
>commonMakes
## [1] "Ford" "Honda" "Toyota" "Volkswagen" "Chevrolet"
## [6] "Chrysler" "Nissan" "Dodge" "Mazda" "Mitsubishi"
## [11] "Subaru" "Jeep"
We find there are only 12 manufacturers that made four-cylinder cars every year
during this period.
Investigating the makes and models of automobiles

• How have these manufacturers done over time with respect to fuel
efficiency?
?
>carsCommonMakes4 <- subset(gasCars4, make %in% commonMakes)
>avgMPG_commonMakes <- ddply(carsCommonMakes4, ~year + make, summarise,
  avgMPG = mean(comb08))
>ggplot(avgMPG_commonMakes, aes(year, avgMPG)) + geom_line() +
  facet_wrap(~make, nrow = 3)
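The filter-then-average step can be sketched without plyr. This is a base R stand-in, not the code used in the lecture: aggregate() replaces ddply with summarise, and the `toy` data frame and the `keepMakes` vector (standing in for commonMakes) are hypothetical.

```r
# Hypothetical toy data; column names follow the slides.
toy <- data.frame(
  year   = c(2000, 2000, 2000, 2010, 2010, 2010),
  make   = c("Honda", "Honda", "Saab", "Honda", "Honda", "Saab"),
  comb08 = c(28, 30, 24, 33, 35, 26)
)
keepMakes <- c("Honda")  # stand-in for commonMakes

# Keep only rows whose make is in the chosen set.
kept <- subset(toy, make %in% keepMakes)

# Average MPG for every (year, make) combination that remains.
avgMPG <- aggregate(comb08 ~ year + make, data = kept, FUN = mean)
names(avgMPG)[names(avgMPG) == "comb08"] <- "avgMPG"

avgMPG
```

The result has one row per year and make, which is the shape facet_wrap(~make) needs to draw one line per manufacturer.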
Investigating the makes and models of
automobiles
• Observations?
Investigating the makes and models of
automobiles
Most manufacturers have shown improvement over this time, though
several manufacturers have demonstrated quite sharp fuel efficiency
increases in the last 5 years.
Investigating the makes and models of automobiles

• We use dlply (not ddply) to take the gasCars4 data frame, split it by year,
and apply the unique function to the make variable within each year.
• For each year, this yields the unique automobile makes available, and
dlply returns a list of these results (one element per year).
• The function names encode the input and output types: dlply takes a data
frame (d) as input and returns a list (l), whereas ddply takes a data frame
(d) as input and returns a data frame (d):
>uniqMakes <- dlply(gasCars4, ~year, function(x) unique(x$make))
>commonMakes <- Reduce(intersect, uniqMakes)
>commonMakes
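The difference in output shape can be seen with base R equivalents. This is a sketch under stated assumptions: split() with lapply() stands in for dlply, aggregate() stands in for ddply, and the `toy` data frame is made up.

```r
# Hypothetical toy data; column names follow the slides.
toy <- data.frame(
  year = c(2000, 2000, 2000, 2010, 2010),
  make = c("Ford", "Honda", "Ford", "Honda", "Toyota")
)

# List output (the shape dlply returns): one element per year,
# each holding the unique makes seen in that year.
makesByYear <- lapply(split(toy, toy$year), function(x) unique(x$make))

# Data frame output (the shape ddply returns): one row per year,
# here with the number of unique makes.
nMakes <- aggregate(make ~ year, data = toy,
                    FUN = function(m) length(unique(m)))

class(makesByYear)  # "list"
class(nMakes)       # "data.frame"
```

A list is the right container here because the number of makes differs from year to year, so the per-year results do not fit into fixed columns.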
Investigating the makes and models of automobiles

• The next line uses the higher-order function Reduce. This is the same idea
as the reduce step in the MapReduce programming paradigm introduced by
Google, which underlies Hadoop.
• R supports a functional programming style and offers several higher-order
functions as part of its core.
• A higher-order function accepts another function as input. In this line, we
pass the intersect function to Reduce, which applies it cumulatively across
the list of unique makes per year created previously: it intersects the
first two years, then intersects that result with the third year, and so on.
• Ultimately, this results in a single vector of the automobile makes that are
present in every year.
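What Reduce does with intersect can be spelled out on a small example. The list `uniqMakesToy` below is hypothetical and simply plays the role of uniqMakes.

```r
# Hypothetical per-year vectors of unique makes.
uniqMakesToy <- list(
  "1990" = c("Ford", "Honda", "Saab"),
  "2000" = c("Honda", "Ford", "Toyota"),
  "2010" = c("Toyota", "Ford", "Honda")
)

# Reduce folds the list from left to right with intersect.
folded <- Reduce(intersect, uniqMakesToy)

# The same computation written out by hand for three elements.
unrolled <- intersect(
  intersect(uniqMakesToy[[1]], uniqMakesToy[[2]]),
  uniqMakesToy[[3]]
)

identical(folded, unrolled)  # TRUE
folded
```

Only makes that appear in all three vectors survive, so "Saab" and "Toyota" drop out.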
Investigating the makes and models of automobiles

• In the final graph, adding + facet_wrap(~make, nrow = 3) tells ggplot2
that we want a separate set of axes for each make of automobile and
that these subplots should be distributed across three rows.
• This is an incredibly powerful data visualization technique as it allows
us to clearly see patterns that might only manifest for a particular
value of a variable.
• We kept things simple in this first data science project. The dataset
itself was small: only 12 megabytes uncompressed, easily stored and
handled on a basic laptop.
