Exploratory Data Analysis with R
Roger D. Peng
This book is for sale at http://leanpub.com/exdata
This is a Leanpub book. Leanpub empowers authors and publishers with the Lean
Publishing process. Lean Publishing is the act of publishing an in-progress ebook using
lightweight tools and many iterations to get reader feedback, pivot until you have the
right book and build traction once you do.
Contents

1. Stay in Touch!
2. Preface
7. Exploratory Graphs
   7.1 Characteristics of exploratory graphs
   7.2 Air Pollution in the United States
   7.3 Getting the Data
   7.4 Simple Summaries: One Dimension
   7.5 Five Number Summary
   7.6 Boxplot
   7.7 Histogram
   7.8 Overlaying Features
   7.9 Barplot
   7.10 Simple Summaries: Two Dimensions and Beyond
   7.11 Multiple Boxplots
   7.12 Multiple Histograms
   7.13 Scatterplots
   7.14 Scatterplot - Using Color
   7.15 Multiple Scatterplots
   7.16 Summary
8. Plotting Systems
   8.1 The Base Plotting System
   8.2 The Lattice System
   8.3 The ggplot2 System
   8.4 References
9. Graphics Devices
   9.1 The Process of Making a Plot
   9.2 How Does a Plot Get Created?
   9.3 Graphics File Devices
   9.4 Multiple Open Graphics Devices
   9.5 Copying Plots
   9.6 Summary
17. Data Analysis Case Study: Changes in Fine Particle Air Pollution in the U.S.
   17.1 Synopsis
   17.2 Loading and Processing the Raw Data
   17.3 Results
of EDA. Rather, the goal is to show the data, summarize the evidence and identify
interesting patterns while eliminating ideas that likely won’t pan out.
Throughout the book, we will focus on the R statistical programming language. We
will cover the various plotting systems in R and how to use them effectively. We will
also discuss how to implement dimension reduction techniques like clustering and the
singular value decomposition. All of these techniques will help you to visualize your data
and to make key decisions in any data analysis.
3. Getting Started with R
3.1 Installation
The first thing you need to do to get started with R is to install it on your computer. R
works on pretty much every platform available, including the widely available Windows,
Mac OS X, and Linux systems. If you want to watch a step-by-step tutorial on how to
install R for Mac or Windows, you can watch these videos:
• Installing R on Windows1
• Installing R on the Mac2
• Installing RStudio3
After you install R you will need to launch it and start writing R code. Before we get to
exactly how to write R code, it’s useful to get a sense of how the system is organized. In
these two videos I talk about where to write code and how to set your working directory,
which lets R know where to find all of your files.
1 http://youtu.be/Ohnk9hcxf9M
2 https://youtu.be/uxuuWXU-7UQ
3 https://youtu.be/bM7Sfz-LADM
4 http://rstudio.com
5 https://youtu.be/8xT3hmJQskU
6 https://youtu.be/XBcvH1BpIBo
4. Managing Data Frames with the
dplyr package
Watch a video of this chapter1
The data frame is a key data structure in statistics and in R. The basic structure of a data
frame is that there is one observation per row and each column represents a variable, a
measure, feature, or characteristic of that observation. R has an internal implementation
of data frames that is likely the one you will use most often. However, there are packages
on CRAN that implement data frames via things like relational databases that allow you
to operate on very very large data frames (but we won’t discuss them here).
Given the importance of managing data frames, it’s important that we have good tools for
dealing with them. R obviously has some built-in tools like the subset() function and the
use of [ and $ operators to extract subsets of data frames. However, other operations, like
filtering, re-ordering, and collapsing, can often be tedious operations in R whose syntax
is not very intuitive. The dplyr package is designed to mitigate a lot of these problems and
to provide a highly optimized set of routines specifically for dealing with data frames.
The dplyr package was developed by Hadley Wickham of RStudio and is an optimized
and distilled version of his plyr package. The dplyr package does not provide any “new”
functionality to R per se, in the sense that everything dplyr does could already be done
with base R, but it greatly simplifies existing functionality in R.
One important contribution of the dplyr package is that it provides a “grammar” (in
particular, verbs) for data manipulation and for operating on data frames. With this
grammar, you can sensibly communicate what it is that you are doing to a data frame
that other people can understand (assuming they also know the grammar). This is useful
because it provides an abstraction for data manipulation that previously did not exist.
Another useful contribution is that the dplyr functions are very fast, as many key
operations are coded in C++.
1 https://youtu.be/aywFompr1F4
Some of the key “verbs” provided by the dplyr package are
• select: return a subset of the columns of a data frame, using a flexible notation
• filter: extract a subset of rows from a data frame based on logical conditions
• arrange: reorder rows of a data frame
• rename: rename variables in a data frame
• mutate: add new variables/columns or transform existing variables
• summarise / summarize: generate summary statistics of different variables in the data
frame, possibly within strata
• %>%: the “pipe” operator is used to connect multiple verb actions together into a
pipeline
The dplyr package has a number of its own data types that it takes advantage of. For
example, there is a handy print method that prevents you from printing a lot of data
to the console. Most of the time, these additional data types are transparent to the user
and do not need to be worried about.
All of the functions that we will discuss in this chapter have a few common
characteristics. In particular,

• the first argument is a data frame
• the subsequent arguments describe what to do with the data frame, and you can refer to columns in the data frame directly by name (without using the $ operator)
• the return result is a new data frame
The dplyr package can be installed from CRAN or from GitHub using the devtools
package and the install_github() function. The GitHub repository will usually contain
the latest updates to the package and the development version.
To install from CRAN, just run
2 http://www.jstatsoft.org/v59/i10/paper
> install.packages("dplyr")
To install the development version from GitHub, you can run
> remotes::install_github("tidyverse/dplyr")
After installing the package it is important that you load it into your R session with the
library() function.
> library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
You may get some warnings when the package is loaded because there are functions in
the dplyr package that have the same name as functions in other packages. For now you
can ignore the warnings.
NOTE: If you ever run into a problem where R is getting confused over which function
you mean to call, you can specify the full name of a function using the :: operator. The
full name is simply the package name in which the function is defined followed by ::
and then the function name. For example, the filter function from the dplyr package
has the full name dplyr::filter. Calling functions with their full name will resolve any
confusion over which function was meant to be called.
4.5 select()
For the examples in this chapter we will be using a dataset containing air pollution and
temperature data for the city of Chicago3 in the U.S. The dataset is available from my
web site.
After unzipping the archive, you can load the data into R using the readRDS() function.
You can see some basic characteristics of the dataset with the dim() and str() functions.
3 http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip
> dim(chicago)
[1] 6940 8
> str(chicago)
'data.frame': 6940 obs. of 8 variables:
$ city : chr "chic" "chic" "chic" "chic" ...
$ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
$ dptp : num 31.5 29.9 27.4 28.6 28.9 ...
$ date : Date, format: "1987-01-01" "1987-01-02" ...
$ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...
$ pm10tmean2: num 34 NA 34.2 47 NA ...
$ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...
$ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...
The select() function can be used to select columns of a data frame that you want to
focus on. Often you’ll have a large data frame containing “all” of the data, but any given
analysis might only use a subset of variables or observations. The select() function
allows you to get the few columns you might need.
Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We
could for example use numerical indices. But we can also use the names directly.
> names(chicago)[1:3]
[1] "city" "tmpd" "dptp"
> subset <- select(chicago, city:dptp)
> head(subset)
city tmpd dptp
1 chic 31.5 31.500
2 chic 33.0 29.875
3 chic 33.0 27.375
4 chic 29.0 28.625
5 chic 32.0 28.875
6 chic 40.0 35.125
Note that the : normally cannot be used with names or strings, but inside the select()
function you can use it to specify a range of variable names.
You can also omit variables using the select() function by using the negative sign. With
select() you can do
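> # A sketch of the call described here: the minus sign drops the named range
> select(chicago, -(city:dptp))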
which indicates that we should include every variable except the variables city through
dptp. The equivalent code in base R would be
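> # A rough base R equivalent (a sketch): find the column positions by name,
> # then drop that range of columns
> i <- match("city", names(chicago))
> j <- match("dptp", names(chicago))
> head(chicago[, -(i:j)])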
You can also use more general regular expressions if necessary. See the help page
(?select) for more details.
4.6 filter()
The filter() function is used to extract subsets of rows from a data frame. This function
is similar to the existing subset() function in R but is quite a bit faster in my experience.
Suppose we wanted to extract the rows of the chicago data frame where the levels of
PM2.5 are greater than 30 (which is a reasonably high level). We could do
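> # A sketch of the subsetting call described here
> chic.f <- filter(chicago, pm25tmean2 > 30)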
You can see that there are now only 194 rows in the data frame and the distribution of
the pm25tmean2 values is as follows.
> summary(chic.f$pm25tmean2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
30.05 32.12 35.04 36.63 39.53 61.50
We can place an arbitrarily complex logical sequence inside of filter(), so we could for
example extract the rows where PM2.5 is greater than 30 and temperature is greater than
80 degrees Fahrenheit.
> chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
> select(chic.f, date, tmpd, pm25tmean2)
date tmpd pm25tmean2
1 1998-08-23 81 39.60000
2 1998-09-06 81 31.50000
3 2001-07-20 82 32.30000
4 2001-08-01 84 43.70000
5 2001-08-08 85 38.83750
6 2001-08-09 84 38.20000
7 2002-06-20 82 33.00000
8 2002-06-23 82 42.50000
9 2002-07-08 81 33.10000
10 2002-07-18 82 38.85000
11 2003-06-25 82 33.90000
12 2003-07-04 84 32.90000
13 2005-06-24 86 31.85714
14 2005-06-27 82 51.53750
15 2005-06-28 85 31.20000
16 2005-07-17 84 32.70000
17 2005-08-03 84 37.90000
Now there are only 17 observations where both of those conditions are met.
4.7 arrange()
The arrange() function is used to reorder rows of a data frame according to one of the
variables/columns. Reordering rows of a data frame (while preserving corresponding
order of other columns) is normally a pain to do in R. The arrange() function simplifies
the process quite a bit.
Here we can order the rows of the data frame by date, so that the first row is the earliest
(oldest) observation and the last row is the latest (most recent) observation.
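> # A sketch of reordering the rows by the date column
> chicago <- arrange(chicago, date)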
Columns can be arranged in descending order too by using the special desc() operator.
Looking at the first three and last three rows shows the dates in descending order.
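> # Descending order by date (a sketch), then a peek at the first and last
> # three rows of the reordered data frame
> chicago <- arrange(chicago, desc(date))
> head(select(chicago, date, pm25tmean2), 3)
> tail(select(chicago, date, pm25tmean2), 3)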
4.8 rename()
Renaming a variable in a data frame in R is surprisingly hard to do! The rename() function
is designed to make this process easier.
Here you can see the names of the first five variables in the chicago data frame.
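> # A peek at the first five column names (a sketch)
> names(chicago)[1:5]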
The dptp column is supposed to represent the dew point temperature and the pm25tmean2
column provides the PM2.5 data. However, these names are pretty obscure or awkward
and should probably be renamed to something more sensible.
The syntax inside the rename() function is to have the new name on the left-hand side of
the = sign and the old name on the right-hand side.
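> # A sketch of the renaming: new name on the left, old name on the right
> # (dewpoint and pm25 are the more readable names used in what follows)
> chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)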
I leave it as an exercise for the reader to figure out how you do this in base R without dplyr.
4.9 mutate()
As noted above, the mutate() function is used to add new variables or columns or to transform existing variables. For example, with air pollution data, we often want to detrend the data by subtracting the
mean from the data. That way we can look at whether a given day’s air pollution level is
higher than or less than average (as opposed to looking at its absolute level).
Here we create a pm25detrend variable that subtracts the mean from the pm25 variable.
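> # A sketch of the detrending step (uses the pm25 name created by rename() above)
> chicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))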
There is also the related transmute() function, which does the same thing as mutate()
but then drops all non-transformed variables.
Here we detrend the PM10 and ozone (O3) variables.
> head(transmute(chicago,
+ pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE),
+ o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE)))
pm10detrend o3detrend
1 -10.395206 -16.904263
2 -14.695206 -16.401093
3 -10.395206 -12.640676
4 -6.395206 -16.175096
5 -6.895206 -14.966763
6 -25.395206 -5.393846
Note that there are only two columns in the transmuted data frame.
4.10 group_by()
The group_by() function is used to generate summary statistics from the data frame
within strata defined by a variable. For example, in this air pollution dataset, you might
want to know what the average annual level of PM2.5 is. So the stratum is the year,
and that is something we can derive from the date variable. In conjunction with the
group_by() function we often use the summarize() function (or summarise() for some
parts of the world).
The general operation here is a combination of splitting a data frame into separate pieces
defined by a variable or group of variables (group_by()), and then applying a summary
function across those subsets (summarize()).
First, we can create a year variable using as.POSIXlt().
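> # Create a year variable (a sketch); as.POSIXlt() counts years from 1900,
> # hence the offset
> chicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)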
Now we can create a separate data frame that splits the original data frame by year.
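> # Split the data frame by year (a sketch)
> years <- group_by(chicago, year)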
Finally, we compute summary statistics for each year in the data frame with the
summarize() function.
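> # Annual averages of pm25, o3, and no2 (a sketch)
> summarize(years, pm25 = mean(pm25, na.rm = TRUE),
+           o3 = mean(o3tmean2, na.rm = TRUE),
+           no2 = mean(no2tmean2, na.rm = TRUE))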
summarize() returns a data frame with year as the first column, and then the annual
averages of pm25, o3, and no2.
In a slightly more complicated example, we might want to know what are the average
levels of ozone (o3) and nitrogen dioxide (no2) within quintiles of pm25. A slicker way to
do this would be through a regression model, but we can actually do this quickly with
group_by() and summarize().
Finally, we can compute the mean of o3 and no2 within quintiles of pm25.
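> # A sketch of the quintile computation: cut pm25 at its quintiles, group by
> # the resulting category, then average o3 and no2 within each group
> qq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)
> chicago <- mutate(chicago, pm25.quint = cut(pm25, qq))
> quint <- group_by(chicago, pm25.quint)
> summarize(quint, o3 = mean(o3tmean2, na.rm = TRUE),
+           no2 = mean(no2tmean2, na.rm = TRUE))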
From the table, it seems there isn’t a strong relationship between pm25 and o3, but there
appears to be a positive correlation between pm25 and no2. More sophisticated statistical
modeling can help to provide precise answers to these questions, but a simple application
of dplyr functions can often get you most of the way there.
4.11 %>%
The pipeline operator %>% is very handy for stringing together multiple dplyr functions in
a sequence of operations. Notice above that every time we wanted to apply more than one
function, the sequence gets buried in a sequence of nested function calls that is difficult
to read, i.e.
> third(second(first(x)))
This nesting is not a natural way to think about a sequence of operations. The %>%
operator allows you to string operations in a left-to-right fashion, i.e.
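> # The same sequence of calls written left to right with the pipe
> first(x) %>% second() %>% third()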
Take the example that we just did in the last section where we computed the mean of o3
and no2 within quintiles of pm25. There we had to create the quintile categories, group
the data frame by them, and then summarize, saving intermediate objects along the way.
With %>% the whole computation can be strung together:
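> # The quintile computation from the previous section written as one pipeline
> # (a sketch; qq is the vector of quintile cut points computed earlier)
> mutate(chicago, pm25.quint = cut(pm25, qq)) %>%
+     group_by(pm25.quint) %>%
+     summarize(o3 = mean(o3tmean2, na.rm = TRUE),
+               no2 = mean(no2tmean2, na.rm = TRUE))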
This way we don’t have to create a set of temporary variables along the way or create a
massive nested sequence of function calls.
Notice in the above code that I pass the chicago data frame to the first call to mutate(), but
then afterwards I do not have to pass the first argument to group_by() or summarize().
Once you travel down the pipeline with %>%, the first argument is taken to be the output
of the previous element in the pipeline.
Another example might be computing the average pollutant level by month. This could
be useful to see if there are any seasonal trends in the data.
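> # Average pollutant levels by month (a sketch; months() gives the month name
> # of a Date, and month.name orders the factor levels January through December)
> mutate(chicago, month = factor(months(date), levels = month.name)) %>%
+     group_by(month) %>%
+     summarize(pm25 = mean(pm25, na.rm = TRUE),
+               o3 = mean(o3tmean2, na.rm = TRUE),
+               no2 = mean(no2tmean2, na.rm = TRUE))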
Here we can see that o3 tends to be low in the winter months and high in the summer
while no2 is higher in the winter and lower in the summer.
4.12 Summary
The dplyr package provides a concise set of operations for managing data frames. With
these functions we can do a number of complex operations in just a few lines of code.
In particular, we can often conduct the beginnings of an exploratory analysis with the
powerful combination of group_by() and summarize().
Once you learn the dplyr grammar there are a few additional benefits
• dplyr can work with other data frame “backends” such as SQL databases. There is
an SQL interface for relational databases via the DBI package
• dplyr can be integrated with the data.table package for large fast tables
The dplyr package is a handy way to both simplify and speed up your data frame
management code. It’s rare that you get such a combination at the same time!
5. Exploratory Data Analysis Checklist
In this chapter we will run through an informal “checklist” of things to do when
embarking on an exploratory data analysis. As a running example I will use a dataset on
hourly ozone levels in the United States for the year 2014. The elements of the checklist
are

1. Formulate your question
2. Read in your data
3. Check the packaging
4. Run str()
5. Look at the top and the bottom of your data
6. Check your “n”s
7. Validate with at least one external data source
8. Try the easy solution first
9. Challenge your solution
10. Follow up
Formulating a question can be a useful way to guide the exploratory data analysis process
and to limit the exponential number of paths that can be taken with any sizeable dataset.
In particular, a sharp question or hypothesis can serve as a dimension reduction tool that
can eliminate variables that are not immediately relevant to the question.
For example, in this chapter we will be looking at an air pollution dataset from the U.S.
Environmental Protection Agency (EPA). A general question one could ask is

Are air pollution levels higher on the east coast than on the west coast?

A more specific question might be

Are hourly ozone levels on average higher in New York City than they are in
Los Angeles?
Note that both questions may be of interest, and neither is right or wrong. But the first
question requires looking at all pollutants across the entire east and west coasts, while
the second question only requires looking at a single pollutant in two cities.
It’s usually a good idea to spend a few minutes to figure out what question you’re
really interested in, and narrow it down to be as specific as possible (without becoming
uninteresting).
For this chapter, we will focus on the following question:
Which counties in the United States have the highest levels of ambient ozone
pollution?
As a side note, one of the most important questions you can answer with an exploratory
data analysis is “Do I have the right data to answer this question?” Often this question is
difficult to answer at first, but it can become clearer as we sort through and look at the
data.
The next task in any exploratory data analysis is to read in some data. Sometimes the
data will come in a very messy format and you’ll need to do some cleaning. Other times,
someone else will have cleaned up that data for you so you’ll be spared the pain of having
to do the cleaning.
We won’t go through the pain of cleaning up a dataset here, not because it’s not important,
but rather because there’s often not much generalizable knowledge to obtain from going
through it. Every dataset has its unique quirks and so for now it’s probably best to not
get bogged down in the details.
Here we have a relatively clean dataset from the U.S. EPA on hourly ozone measurements
in the entire U.S. for the year 2014. The data are available from the EPA’s Air Quality
System web page1 . I’ve simply downloaded the zip file from the web site, unzipped the
archive, and put the resulting file in a directory called “data”. If you want to run this code
you’ll have to use the same directory structure.
The dataset is a comma-separated value (CSV) file, where each row of the file contains
one hourly measurement of ozone at some location in the country.
NOTE: Running the code below may take a few minutes. There are 7,147,884 rows in
the CSV file. If it takes too long, you can read in a subset by specifying a value for the
n_max argument to read_csv() that is greater than 0.
1 https://aqs.epa.gov/aqsweb/airdata/download_files.html
> library(readr)
> ozone <- read_csv("data/hourly_44201_2014.csv",
+ col_types = "ccccinnccccccncnncccccc")
The readr package by Hadley Wickham is a nice package for reading in flat files very fast,
or at least much faster than R’s built-in functions. It makes some tradeoffs to obtain that
speed, so these functions are not always appropriate, but they serve our purposes here.
The character string provided to the col_types argument specifies the class of each
column in the dataset. Each letter represents the class of a column: “c” for character, “n”
for numeric, and “i” for integer. No, I didn’t magically know the classes of each column—
I just looked quickly at the file to see what the column classes were. If there are too many
columns to type out, you can leave col_types unspecified and read_csv() will try to figure out the column classes for you.
Just as a convenience for later, we can rewrite the names of the columns to remove any
spaces.
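> # Replace spaces (and other special characters) in the column names with dots,
> # matching the names shown in the str() output below (a sketch)
> names(ozone) <- make.names(names(ozone))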
Have you ever gotten a present before the time when you were allowed to open it? Sure,
we all have. The problem is that the present is wrapped, but you desperately want to
know what’s inside. What’s a person to do in those circumstances? Well, you can shake
the box a bit, maybe knock it with your knuckle to see if it makes a hollow sound, or even
weigh it to see how heavy it is. This is how you should think about your dataset before
you start analyzing it for real.
Assuming you don’t get any warnings or errors when reading in the dataset, you should
now have an object in your workspace named ozone. It’s usually a good idea to poke at
that object a little bit before we break open the wrapping paper.
For example, you can check the number of rows and columns.
> nrow(ozone)
[1] 7147884
> ncol(ozone)
[1] 23
Remember when I said there were 7,147,884 rows in the file? How does that match up
with what we’ve read in? This dataset also has relatively few columns, so you might be
able to check the original text file to see if the number of columns printed out (23) here
matches the number of columns you see in the original file.
Another thing you can do is run str() on the dataset. This is usually a safe operation in
the sense that even with a very large dataset, running str() shouldn’t take too long.
> str(ozone)
Classes 'tbl_df', 'tbl' and 'data.frame': 7147884 obs. of 23 variables:
$ State.Code : chr "01" "01" "01" "01" ...
$ County.Code : chr "003" "003" "003" "003" ...
$ Site.Num : chr "0010" "0010" "0010" "0010" ...
$ Parameter.Code : chr "44201" "44201" "44201" "44201" ...
$ POC : int 1 1 1 1 1 1 1 1 1 1 ...
$ Latitude : num 30.5 30.5 30.5 30.5 30.5 ...
$ Longitude : num -87.9 -87.9 -87.9 -87.9 -87.9 ...
$ Datum : chr "NAD83" "NAD83" "NAD83" "NAD83" ...
$ Parameter.Name : chr "Ozone" "Ozone" "Ozone" "Ozone" ...
$ Date.Local : chr "2014-03-01" "2014-03-01" "2014-03-01" "2014-03-01" ...
$ Time.Local : chr "01:00" "02:00" "03:00" "04:00" ...
$ Date.GMT : chr "2014-03-01" "2014-03-01" "2014-03-01" "2014-03-01" ...
$ Time.GMT : chr "07:00" "08:00" "09:00" "10:00" ...
$ Sample.Measurement : num 0.047 0.047 0.043 0.038 0.035 0.035 0.034 0.037 0.044 0.046 ...
$ Units.of.Measure : chr "Parts per million" "Parts per million" "Parts per million" "Parts per millio\
n" ...
$ MDL : num 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 ...
$ Uncertainty : num NA NA NA NA NA NA NA NA NA NA ...
$ Qualifier : chr "" "" "" "" ...
$ Method.Type : chr "FEM" "FEM" "FEM" "FEM" ...
$ Method.Name : chr "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - U\
LTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" ...
$ State.Name : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
$ County.Name : chr "Baldwin" "Baldwin" "Baldwin" "Baldwin" ...
$ Date.of.Last.Change: chr "2014-06-30" "2014-06-30" "2014-06-30" "2014-06-30" ...
The output for str() duplicates some information that we already have, like the number
of rows and columns. More importantly, you can examine the classes of each of the
columns to make sure they are correctly specified (i.e. numbers are numeric and strings
are character, etc.). Because I pre-specified all of the column classes in read_csv(), they
all should match up with what I specified.
Often, with just these simple maneuvers, you can identify potential problems with the
data before plunging head first into a complicated data analysis.
I find it useful to look at the “beginning” and “end” of a dataset right after I check the
packaging. This lets me know if the data were read in properly, things are properly
formatted, and that everything is there. If your data are time series data, then make sure
the dates at the beginning and end of the dataset match what you expect the beginning
and ending time period to be.
You can peek at the top and bottom of the data with the head() and tail() functions.
Here’s the top.
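> # A peek at the top; the columns shown here are just an illustrative subset
> head(ozone[, c("Latitude", "Longitude", "Date.Local", "Time.Local")])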
For brevity I’ve only taken a few columns. And here’s the bottom.
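> # And the bottom, using the same columns
> tail(ozone[, c("Latitude", "Longitude", "Date.Local", "Time.Local")])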
I find tail() to be particularly useful because often there will be some problem reading
the end of a dataset and if you don’t check that you’d never know. Sometimes there’s
weird formatting at the end or some extra comment lines that someone decided to stick
at the end.
Make sure to check all the columns and verify that all of the data in each column looks
the way it’s supposed to look. This isn’t a foolproof approach, because we’re only looking
at a few rows, but it’s a decent start.
In general, counting things is usually a good way to figure out if anything is wrong or not.
In the simplest case, if you’re expecting there to be 1,000 observations and it turns out
there’s only 20, you know something must have gone wrong somewhere. But there are
other areas that you can check depending on your application. To do this properly, you
need to identify some landmarks that can be used to check against your data. For example,
if you are collecting data on people, such as in a survey or clinical trial, then you should
know how many people there are in your study. That’s something you should check in
your dataset, to make sure that you have data on all the people you thought you would
have data on.
In this example, we will use the fact that the dataset purportedly contains hourly data for
the entire country. These will be our two landmarks for comparison.
Here, we have hourly ozone data that comes from monitors across the country. The
monitors should be monitoring continuously during the day, so all hours should be
represented. We can take a look at the Time.Local variable to see what time measurements
are recorded as being taken.
> table(ozone$Time.Local)
One thing we notice here is that while almost all measurements in the dataset are
recorded as being taken on the hour, some are taken at slightly different times. Such
a small number of readings are taken at these off times that we might not want to worry about them.
But it does seem a bit odd, so it might be worth a quick check.
We can take a look at which observations were measured at time “13:14”.
> library(dplyr)
> filter(ozone, Time.Local == "13:14") %>%
+ select(State.Name, County.Name, Date.Local,
+ Time.Local, Sample.Measurement)
# A tibble: 2 x 5
State.Name County.Name Date.Local Time.Local
<chr> <chr> <chr> <chr>
1 New York Franklin 2014-09-30 13:14
2 New York Franklin 2014-09-30 13:14
# … with 1 more variable:
# Sample.Measurement <dbl>
We can see that it’s a monitor in Franklin County, New York and that the measurements
were taken on September 30, 2014. What if we just pulled all of the measurements taken
at this monitor on this date?
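> # Pull every measurement from this monitor on this date (a sketch; the county
> # and date are taken from the output above)
> filter(ozone, State.Name == "New York" & County.Name == "Franklin"
+        & Date.Local == "2014-09-30") %>%
+     select(Date.Local, Time.Local, Sample.Measurement)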
Now we can see that this monitor just records its values at odd times, rather than on the
hour. It seems, from looking at the previous output, that this is the only monitor in the
country that does this, so it’s probably not something we should worry about.
Since EPA monitors pollution across the country, there should be a good representation
of states. Perhaps we should see exactly how many states are represented in this dataset.
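> # Count the number of distinct states represented (a sketch; the count of 52
> # comes from the discussion below)
> length(unique(ozone$State.Name))
[1] 52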
So it seems the representation is a bit too good—there are 52 states in the dataset, but
only 50 states in the U.S.!
We can take a look at the unique elements of the State.Name variable to see what’s going
on.
> unique(ozone$State.Name)
[1] "Alabama" "Alaska"
[3] "Arizona" "Arkansas"
[5] "California" "Colorado"
[7] "Connecticut" "Delaware"
[9] "District Of Columbia" "Florida"
[11] "Georgia" "Hawaii"
[13] "Idaho" "Illinois"
[15] "Indiana" "Iowa"
[17] "Kansas" "Kentucky"
[19] "Louisiana" "Maine"
[21] "Maryland" "Massachusetts"
[23] "Michigan" "Minnesota"
[25] "Mississippi" "Missouri"
[27] "Montana" "Nebraska"
[29] "Nevada" "New Hampshire"
[31] "New Jersey" "New Mexico"
[33] "New York" "North Carolina"
[35] "North Dakota" "Ohio"
[37] "Oklahoma" "Oregon"
[39] "Pennsylvania" "Rhode Island"
[41] "South Carolina" "South Dakota"
[43] "Tennessee" "Texas"
[45] "Utah" "Vermont"
[47] "Virginia" "Washington"
[49] "West Virginia" "Wisconsin"
[51] "Wyoming" "Puerto Rico"
Now we can see that Washington, D.C. (District of Columbia) and Puerto Rico are the
“extra” states included in the dataset. Since they are clearly part of the U.S. (but not official
states of the union) that all seems okay.
This last bit of analysis made use of something we will discuss in the next section: external
data. We knew that there are only 50 states in the U.S., so seeing 52 state names was an
immediate trigger that something might be off. In this case, all was well, but validating
your data with an external data source can be very useful.
Making sure your data matches something outside of the dataset is very important. It
allows you to ensure that the measurements are roughly in line with what they should
be and it serves as a check on what other things might be wrong in your dataset. External
validation can often be as simple as checking your data against a single number, as we
will do here.
In the U.S. we have national ambient air quality standards, and for ozone, the current
standard2 set in 2008 is that the “annual fourth-highest daily maximum 8-hr
concentration, averaged over 3 years” should not exceed 0.075 parts per million (ppm). The exact
details of how to calculate this are not important for this analysis, but roughly speaking,
the 8-hour average concentration should not be too much higher than 0.075 ppm (it can
be higher because of the way the standard is worded).
Let’s take a look at the hourly measurements of ozone.
> summary(ozone$Sample.Measurement)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.02000 0.03200 0.03123 0.04200 0.34900
From the summary we can see that the maximum hourly concentration is quite high
(0.349 ppm) but that in general, the bulk of the distribution is far below 0.075.
We can get a bit more detail on the distribution by looking at deciles of the data.
> quantile(ozone$Sample.Measurement, seq(0, 1, 0.1))
0% 10% 20% 30% 40% 50% 60% 70%
0.000 0.010 0.018 0.023 0.028 0.032 0.036 0.040
80% 90% 100%
0.044 0.051 0.349
Knowing that the national standard for ozone is something like 0.075, we can see from
the data that
• The data are at least of the right order of magnitude (i.e. the units are correct)
• The range of the distribution is roughly what we’d expect, given the regulation
around ambient pollution levels
• Some hourly levels (less than 10%) are above 0.075 but this may be reasonable given
the wording of the standard and the averaging involved.
2 http://www.epa.gov/ttn/naaqs/standards/ozone/s_o3_history.html
Recall that the question we are trying to answer in this chapter is
Which counties in the United States have the highest levels of ambient ozone
pollution?
What’s the simplest answer we could provide to this question? For the moment, don’t
worry about whether the answer is correct, but the point is how could you provide prima
facie evidence for your hypothesis or question. You may refute that evidence later with
deeper analysis, but this is the first pass.
Because we want to know which counties have the highest levels, it seems we need a list
of counties that are ordered from highest to lowest with respect to their levels of ozone.
What do we mean by “levels of ozone”? For now, let’s just blindly take the average across
the entire year for each county and then rank counties according to this metric.
To identify each county we will use a combination of the State.Name and the County.Name
variables.
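> # A sketch of the ranking: average the hourly values within each county, then
> # sort from highest to lowest
> ranking <- group_by(ozone, State.Name, County.Name) %>%
+     summarize(ozone = mean(Sample.Measurement, na.rm = TRUE)) %>%
+     as.data.frame() %>%
+     arrange(desc(ozone))
> head(ranking, 10)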
It seems interesting that all of these counties are in the western U.S., with 4 of them in
California alone.
For comparison we can look at the 10 lowest counties too.
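> # The 10 counties with the lowest average ozone (a sketch)
> tail(ranking, 10)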
Let’s take a look at one of the highest level counties, Mariposa County, California. First
let’s see how many observations there are for this county in the dataset.
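> # How many observations are there for Mariposa County, California? (a sketch)
> filter(ozone, State.Name == "California" & County.Name == "Mariposa") %>%
+     nrow()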
Always be checking. Does that number of observations sound right? Well, there’s 24
hours in a day and 365 days per year, which gives us 8760, which is close to that number
of observations. Sometimes the counties use alternate methods of measurement during
the year so there may be “extra” measurements.
We can take a look at how ozone varies through the year in this county by looking at
monthly averages. First we’ll need to convert the date variable into a Date class.
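> # Convert the date column to the Date class (a sketch)
> ozone <- mutate(ozone, Date.Local = as.Date(Date.Local))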
Then we will split the data by month to look at the average hourly levels.
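> # Monthly average ozone in Mariposa County (a sketch)
> filter(ozone, State.Name == "California" & County.Name == "Mariposa") %>%
+     mutate(month = factor(months(Date.Local), levels = month.name)) %>%
+     group_by(month) %>%
+     summarize(ozone = mean(Sample.Measurement, na.rm = TRUE))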
A few things stand out here. First, ozone appears to be higher in the summer months
and lower in the winter months. Second, there are two months missing (November and
December) from the data. It’s not immediately clear why that is, but it’s probably worth
investigating a bit later on.
Now let’s take a look at one of the lowest level counties, Caddo County, Oklahoma.
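> # How many observations are there for Caddo County, Oklahoma? (a sketch)
> filter(ozone, State.Name == "Oklahoma" & County.Name == "Caddo") %>%
+     nrow()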
Here we see that there are perhaps fewer observations than we would expect for a
monitor that was measuring 24 hours a day all year. We can check the data to see if
anything funny is going on.
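> # Monthly average ozone in Caddo County (a sketch, mirroring the check above)
> filter(ozone, State.Name == "Oklahoma" & County.Name == "Caddo") %>%
+     mutate(month = factor(months(Date.Local), levels = month.name)) %>%
+     group_by(month) %>%
+     summarize(ozone = mean(Sample.Measurement, na.rm = TRUE))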
Here we can see that the levels of ozone are much lower in this county and that also three
months are missing (October, November, and December). Given the seasonal nature of
ozone, it’s possible that the levels of ozone are so low in those months that it’s not even
worth measuring. In fact some of the monthly averages are below the typical method
detection limit of the measurement technology, meaning that those values are highly
uncertain and likely not distinguishable from zero.
The easy solution is nice because it is, well, easy, but you should never allow those results
to hold the day. You should always be thinking of ways to challenge the results, especially
if those results comport with your prior expectation.
Now, the easy answer seemed to work okay in that it gave us a listing of counties that had
the highest average levels of ozone for 2014. However, the analysis raised some issues.
For example, some counties do not have measurements every month. Is this a problem?
Would it affect our ranking of counties if we had those measurements?
Also, how stable are the rankings from year to year? We only have one year’s worth of
data for the moment, but we could perhaps get a sense of the stability of the rankings by
shuffling the data around a bit to see if anything changes. We can imagine that from
year to year, the ozone data are somewhat different randomly, but generally follow
similar patterns across the country. So the shuffling process could approximate the data
changing from one year to the next. It’s not an ideal solution, but it could give us a sense
of how stable the rankings are.
First we set our random number generator and resample the indices of the rows of
the data frame with replacement. The statistical jargon for this approach is a bootstrap
sample. We use the resampled indices to create a new dataset, ozone2, that shares many
of the same qualities as the original but is randomly perturbed.
> set.seed(10234)
> N <- nrow(ozone)
> idx <- sample(N, N, replace = TRUE)
> ozone2 <- ozone[idx, ]
Now we can reconstruct our rankings of the counties based on this resampled data.
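> # Recompute the county ranking on the resampled data (a sketch, repeating the
> # same steps used for the original ranking)
> ranking2 <- group_by(ozone2, State.Name, County.Name) %>%
+     summarize(ozone = mean(Sample.Measurement, na.rm = TRUE)) %>%
+     as.data.frame() %>%
+     arrange(desc(ozone))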
We can then compare the top 10 counties from our original ranking and the top 10
counties from our ranking based on the resampled data.
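> # Put the original and resampled top 10 side by side (a sketch)
> cbind(head(ranking, 10), head(ranking2, 10))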
4 Yavapai 0.04748795
5 Gila 0.04728284
6 San Juan 0.04665711
7 Inyo 0.04652602
8 Coconino 0.04616988
9 El Dorado 0.04611164
10 White Pine 0.04466106
We can see that the rankings based on the resampled data (columns 4–6 on the right) are
very close to the original, with the first 7 being identical. Numbers 8 and 9 get flipped in
the resampled rankings but that’s about it. This might suggest that the original rankings
are somewhat stable.
We can also look at the bottom of the list to see if there were any major changes.
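> # The bottom 10 of each ranking, side by side (a sketch)
> cbind(tail(ranking, 10), tail(ranking2, 10))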
Here we can see that the bottom 7 counties are identical in both rankings, but after that
things shuffle a bit. We’re less concerned with the counties at the bottom of the list, but
this suggests there is also reasonable stability.
In this chapter I’ve presented some simple steps to take when starting off on an
exploratory analysis. The example analysis conducted in this chapter was far from perfect,
but it got us thinking about the data and the question of interest. It also gave us a number
of things to follow up on in case we continue to be interested in this question.
At this point it’s useful to consider a few followup questions.
1. Do you have the right data? Sometimes at the conclusion of an exploratory data
analysis, the conclusion is that the dataset is not really appropriate for this question.
In this case, the dataset seemed perfectly fine for answering the question of which
counties had the highest levels of ozone.
2. Do you need other data? One sub-question we tried to address was whether the
county rankings were stable across years. We addressed this by resampling the data
once to see if the rankings changed, but the better way to do this would be to simply
get the data for previous years and re-do the rankings.
3. Do you have the right question? In this case, it’s not clear that the question we tried
to answer has immediate relevance, and the data didn’t really indicate anything to
increase the question’s relevance. For example, it might have been more interesting
to assess which counties were in violation of the national ambient air quality
standard, because determining this could have regulatory implications. However,
this is a much more complicated calculation to do, requiring data from at least 3
previous years.
The goal of exploratory data analysis is to get you thinking about your data and reasoning
about your question. At this point, we can refine our question or collect new data, all in
an iterative process to get at the truth.
6. Principles of Analytic Graphics
Watch a video of this chapter1 .
The material for this chapter is inspired by Edward Tufte’s wonderful book Beautiful
Evidence, which I strongly encourage you to buy if you are able. He discusses how to make
informative and useful data graphics and lays out six principles that are important to
achieving that goal. Some of these principles are perhaps more relevant to making “final”
graphics as opposed to more “exploratory” graphics, but I believe they are all important
principles to keep in mind.
Showing comparisons is really the basis of all good scientific investigation. Evidence
for a hypothesis is always relative to another competing hypothesis. When you say
“the evidence favors hypothesis A”, what you mean to say is that “the evidence favors
hypothesis A versus hypothesis B”. A good scientist is always asking “Compared to
What?” when confronted with a scientific claim or statement. Data graphics should
generally follow this same principle. You should always be comparing at least two things.
For example, take a look at the plot below. This plot shows the change in symptom-free
days in a group of children enrolled in a clinical trial2 testing whether an air cleaner
installed in a child’s home improves their asthma-related symptoms. This study was
conducted at the Johns Hopkins University School of Medicine and was conducted in
homes where a smoker was living for at least 4 days a week. Each child was assessed
at baseline and then 6-months later at a second visit. The aim was to improve a child’s
symptom-free days over the 6-month period. In this case, a higher number is better,
indicating that they had more symptom-free days.
1 https://youtu.be/6lOvA_y7p7w
2 http://www.ncbi.nlm.nih.gov/pubmed/21810636
There were 47 children who received the air cleaner, and you can see from the boxplot
that on average the number of symptom-free days increased by about 1 day (the solid
line in the middle of the box is the median of the data).
But the question of “compared to what?” is not answered in this plot. In particular, we
don’t know from the plot what would have happened if the children had not received the
air cleaner. But of course, we do have that data and we can show both the group that
received the air cleaner and the control group that did not.
Here we can see that on average, the control group children changed very little in terms of
their symptom free days. Therefore, compared to children who did not receive an air cleaner,
children receiving an air cleaner experienced improved asthma morbidity.
If possible, it’s always useful to show your causal framework for thinking about a
question. Generally, it’s difficult to prove that one thing causes another thing even with
the most carefully collected data. But it’s still often useful for your data graphics to
indicate what you are thinking about in terms of cause. Such a display may suggest
hypotheses or refute them, but most importantly, they will raise new questions that can
be followed up with new data or analyses.
In the plot below, which is reproduced from the previous section, I show the change in
symptom-free days for a group of children who received an air cleaner and a group of
children who received no intervention.
From the plot, it seems clear that on average, the group that received an air cleaner
experienced improved asthma morbidity (more symptom-free days, a good thing).
An interesting question might be “Why do the children with the air cleaner improve?”
This may not be the most important question—you might just care that the air cleaners
help things—but answering the question of “why?” might lead to improvements or new
developments.
The hypothesis behind air cleaners improving asthma morbidity in children is that the
air cleaners remove airborne particles from the air. Given that the homes in this study
all had smokers living in them, it is likely that there is a high level of particles in the air,
primarily from second-hand smoke.
It’s fairly well-understood that inhaling fine particles can exacerbate asthma symptoms,
so it stands to reason that reducing their presence in the air should improve asthma
symptoms. Therefore, we’d expect that the group receiving the air cleaners should on
average see a decrease in airborne particles. In this case we are tracking fine particulate
matter, also called PM2.5 which stands for particulate matter less than or equal to 2.5
microns in aerodynamic diameter.
In the plot below, you can see both the change in symptom-free days for both groups
(left) and the change in PM2.5 in both groups (right).
Now we can see from the right-hand plot that on average in the control group, the level of
PM2.5 actually increased a little bit while in the air cleaner group the levels decreased on
average. This pattern shown in the plot above is consistent with the idea that air cleaners
improve health by reducing airborne particles. However, it is not conclusive proof of this
idea because there may be other unmeasured confounding factors that can lower levels
of PM2.5 and improve symptom-free days.
The real world is multivariate. For anything that you might study, there are usually
many attributes that you can measure. The point is that data graphics should attempt
to show this information as much as possible, rather than reduce things down to one or
two features that we can plot on a page. There are a variety of ways that you can show
multivariate data, and you don’t need to wear 3-D glasses to do it.
Here is just a quick example. Below is data on daily airborne particulate matter (“PM10”)
in New York City and mortality from 1987 to 2000. Each point on the plot represents
the average PM10 level for that day (measured in micrograms per cubic meter) and
the number of deaths on that day. The PM10 data come from the U.S. Environmental
Protection Agency and the mortality data come from the U.S. National Center for Health
Statistics.
This is a bivariate plot showing two variables in this dataset. From the plot it seems that
there is a slight negative relationship between the two variables. That is, higher daily
average levels of PM10 appear to be associated with lower levels of mortality (fewer
deaths per day).
However, there are other factors that are associated with both mortality and PM10 levels.
One example is the season. It’s well known that mortality tends to be higher in the winter
than in the summer. That can be easily shown in the following plot of mortality and date.
Similarly, we can show that in New York City, PM10 levels tend to be high in the summer
and low in the winter. Here’s the plot for daily PM10 over the same time period. Note
that the PM10 data have been centered (the overall mean has been subtracted from them)
so that is why there are both positive and negative values.
From the two plots we can see that PM10 and mortality have opposite seasonality with
mortality being high in the winter and PM10 being high in the summer. What happens
if we plot the relationship between mortality and PM10 by season? That plot is below.
Interestingly, before, when we plotted PM10 and mortality by itself, the relationship
appeared to be slightly negative. However, in each of the plots above, the relationship is
slightly positive. This set of plots illustrates the effect of confounding by season, because
season is related to both PM10 levels and to mortality counts, but in different ways for
each one.
This example illustrates just one of many reasons why it can be useful to plot multivariate
data and to show as many features as intelligently possible. In some cases, you may
uncover unexpected relationships depending on how they are plotted or visualized.
Just because you may be making data graphics, doesn’t mean you have to rely solely
on circles and lines to make your point. You can also include printed numbers, words,
images, and diagrams to tell your story. In other words, data graphics should make use
of many modes of data presentation simultaneously, not just the ones that are familiar
to you or that the software can handle. One should never let the available tools drive the
analysis; one should integrate as much evidence as possible onto a graphic.
Data graphics should be appropriately documented with labels, scales, and sources. A
general rule for me is that a data graphic should tell a complete story all by itself. You
should not have to refer to extra text or descriptions when interpreting a plot, if possible.
Ideally, a plot would have all of the necessary descriptions attached to it. You might
think that this level of documentation should be reserved for “final” plots as opposed to
exploratory ones, but it’s good to get in the habit of documenting your evidence sooner
rather than later.
Imagine if you were writing a paper or a report, and a data graphic was presented to
make the primary point. Imagine the person you hand the paper/report to has very little
time and will only focus on the graphic. Is there enough information on that graphic for
the person to get the story? While it is certainly possible to be too detailed, I tend to err
on the side of more information rather than less.
In the simple example below, I plot the same data twice (this is the PM10 data from the
previous section of this chapter).
The plot on the left is a default plot generated by the plot function in R. The plot on
the right uses the same plot function but adds annotations like a title, y-axis label, x-axis
label. Key information included is where the data were collected (New York), the units of
measurement, the time scale of measurements (daily), and the source of the data (EPA).
6.7 References
This chapter is inspired by the work of Edward Tufte. I encourage you to take a look at
his books, in particular the following book:

Tufte, Edward (2006). Beautiful Evidence. Graphics Press.
7. Exploratory Graphs
For the purposes of this chapter (and the rest of this book), we will make a distinction
between exploratory graphs and final graphs. This distinction is not a very formal one,
but it serves to highlight the fact that graphs are used for many different purposes.
Exploratory graphs are usually made very quickly and a lot of them are made in the
process of checking out the data.
The goal of making exploratory graphs is usually to develop a personal understanding
of the data and to prioritize tasks for follow up. Details like axis orientation or legends,
while present, are generally cleaned up and prettified if the graph is going to be used
for communication later. Often color and plot symbol size are used to convey various
dimensions of information.
For this chapter, we will use a simple case study to demonstrate the kinds of simple
graphs that can be useful in exploratory analyses. The data we will be using come from
the U.S. Environmental Protection Agency (EPA), which is the U.S. government agency
that sets national air quality standards for outdoor air pollution.
1 https://youtu.be/ma6-0PSNLHo
2 https://youtu.be/UyopqXQ8TTM
luxuries—namely, tea and sugar—have increased 232 and 260 per
cent. respectively.
Again, statistics show that, whilst the other classes of the
community have increased in number by 335 per cent. of late years,
the working classes have only increased by 6½ per cent. In other
words, the unproductive classes have increased largely, but, whilst
there is only 6½ per cent. numerical increase in the productive
classes, their labour has decreased by 20 per cent. from shorter
hours of labour.
The drones in the hive have increased very largely, and the
workers have not done so, but have developed an alarming taste for
honey.
The question of waste of wealth would be comparatively of minor
importance were it not seriously complicated by the existence of
Free Trade; but we have now to confront the fact, that, in the
present day, we have to pay 50 per cent. more money for 20 per
cent. less labour than we did forty years ago; whilst Free Trade
brings into the market the products of the keen competition of a
thrifty and parsimonious class of workmen who accept lower wages
and work longer hours. The result must be a gradual extinction of
our industries:
Cotton and woollen industries are struggling hard for existence.
[53]
Silk manufacture is dying out.
Iron industries in a bad way.
Gloomy predictions are made respecting the shipping trade.
Agriculture is rapidly becoming extinguished.
English pluck, capital, and credit are struggling manfully against
disaster, but the struggle cannot last much longer; capital is
sustained by credit; and credit is receiving heavy and repeated blows
from unremunerative industries. Meanwhile, high wages and
extravagant habits are not the best training for the millions that will
be thrown out of employment when the crash comes.
Your prophet, Adam Smith, though an advocate for the repeal of
the Corn Laws, foresaw and forewarned you of these consequences,
as follows:—
“If the free importation of Foreign manufactures were permitted, several of the
Home manufactures would probably suffer, and some of them perhaps go to ruin
altogether.”[54]
F O OT N OT E S :
[52] ‘Political Economy,’ by J. S. Mill, Bk. I. Chap. V.
[53] Mr. S. Smith, M.P., who is connected with cotton industry,
has recently stated that “with all the toil and anxiety of those who
had conducted it, the cotton industry of Lancashire, which gave
maintenance to two or three millions of people, had not earned
so much as 5 per cent. during the past ten years. The employers
had a most anxious life; and many, after struggling for years, had
become bankrupt, and some had died of a broken heart;” and he
added that he believed “most of the leading trades to be in the
same condition.”
The cheap production of Belgian fabrics is stated by the
employers to be the cause of the depression in the cotton trade.
(Times, Dec. 1883.)
[54] ‘Wealth of Nations,’ Bk. IV. Chap. II.
[55] A writer in Vanity Fair, in analyzing the Board of Trade’s
statistics for the year ended March 31st, 1883, when compared
with those for the year ended March, 1880, or the three years of
the Gladstone Ministry, says:
“We were promised cheaper Government, cheaper food, greater
prosperity. We find that so far from these promises being verified,
they have every one been falsified by the result.
“Our Imperial Government is dearer by £8,000,000; our Imperial
and Local Government, together, is dearer by £10,000,000.
“As to food, wheat has become dearer 1s. 3d. per quarter; beef,
by from 3d. to 5d. per stone; Mutton, by 1s. 3d.; money is dearer
than 1¾ per cent.
“As to prosperity, our staple pig iron is cheaper by 22s. 2d. per
ton. We have 398,397 acres fewer under cultivation for corn,
grain and other crops; 50,077 fewer horses; 129,119 fewer cattle;
4,789,738 fewer sheep in the country. We have, in spite of the
Land Act and the allegation of increased prosperity, 18,828 more
paupers in Ireland on a decreasing population. We find that
115,092 more emigrants have left the country in a year, because
they cannot get a living in it. We lose annually 349 more vessels
and 1,534 more lives at sea. The only element of consolation that
these figures” (Board of Trade Returns) “have to show is, that we
have 778,389 more pigs and 4,627 more policemen in the
country. In fact, we are more lacking in every thing we want;
more abounding in every thing we don’t want.
“The price of everything we have to sell has gone down; the price
of everything we have to buy has gone up; and what has gone up
most is the price of Government.
“Dearer Government, dearer bread, dearer beef, dearer mutton,
dearer money; cheaper pig iron; less corn, potatoes, turnips,
grass, and hops, fewer horses, fewer cattle, fewer sheep; more
paupers, more emigrants, more losses of life and property at sea,
more pigs, more policemen.
“These are the benefits that three years of liberal rule have
conferred upon us!!!”
CHAPTER XVI.
SACRED RIGHTS OF PROPERTY.
I have already stated that Mill, when he allows that which Herbert
Spencer terms “political bias,”—and Luigi Cossa terms his “narrow
philosophic utilitarianism,” to warp his better judgment,—is guilty of
absurdities and inconsistencies that would disgrace a schoolboy. This
is notably apparent when he attempts to draw a fundamental
distinction between land and any other property, as regards its
“sacred rights.”
Mr. Mill greatly admired the prosperity of the peasant proprietors
in France and Belgium, unfortunately forgetting that a system, suited
to the sober thrifty peasantry of the Continent, might possibly not be
equally suitable to the improvident lower classes of Ireland and
England,[56] neglectful also of the sensible view taken by M. De
Lavergne that “cultivation spontaneously finds out the organization
that suits it best.”[57] He wished therefore to establish an Utopia of
peasant proprietors in England and Ireland as a panacea for the evils
which Free Trade in the first place, and mischievous legislation in the
second place, had brought upon agriculture. Without presuming to
offer an opinion on the debated subjects of “Grande” and “Petite
Culture,” or peasant and landlord proprietorship, I may say that
cultivation appears to have found out spontaneously the organization
best suited to it, and that, in England and Ireland, landlordism
seems best suited to the improvident character of the lower classes,
in providing capital to help the tenants over bad times, and enabling
improvements to be made in prosperous times.
Be this as it may, peasant proprietorship has proved to be a failure
in Ireland, and is rapidly becoming extinct.[58] Writers on the subject
state that, under that system, labour was so ill-directed, that it
required six men to provide food for ten; and consolidation of
holdings is recommended. Mr. Mill, however, thought otherwise, and
biased by this political conviction, he has propounded the following
extraordinary arguments to prove that the sacred rights of property
are not applicable in the case of landed property[59]:—
(1) “No man made the land.”
(2) It is the original inheritance of the whole species.[60]
(3) Its appropriation is wholly a question of general expediency.
(4) When private property in land is not expedient, it is unjust.
(5) It is no hardship to any one to be excluded from what others
have produced.
(6) But it is a hardship to be born into the world and to find all
nature’s gifts previously engrossed.
(7) Whoever owns land, keeps others out of the enjoyment of it.
Now let us apply Mr. Mill’s arguments to any other kind of
property.
Suppose I say to you:—“My friend! you have two coats; hand one
of them over to me! Sacred rights of property don’t apply to it; you
did not make it; and Mill says—‘it is no hardship to be excluded from
what others have produced;’ but it is some hardship to be born into
the world, and to find all nature’s gifts engrossed. Your argument
that you paid for it in hard cash is worthless. No man made silver
and gold, ‘it is the original inheritance of the whole species, the
receiver is as bad as the thief, and you have connived in the robbery
of those metals from the earth, leaving posterity yet unborn to be
under the hardship of finding all nature’s gifts engrossed.’
“The manufacture of your coat is based on robbery and injustice,
and you have connived at it; the iron and coal used in its production
were made by no man, they are the common inheritance of the
species, those who have obtained them have robbed posterity. You
have bribed them to do so by silver and gold, also robbed from
posterity.
“The very wool of which your coat is formed was made by no
man, it was robbed from a defenceless sheep. Your argument that
the sheep was the property of the shearer is useless. No man made
the sheep, it is the common inheritance of all, &c. Your argument
that his owner reared the sheep, is equally worthless. Monster! if
you find a child, have you a right to rob him and make a slave of
him? such an argument would justify slavery[61] or worse.
“When private property is not expedient it is unjust, and from my
ground of view, it is not expedient that this private property should
be yours; public only differs from private expediency in degree. ‘He
who owns property keeps others out of the enjoyment of it,’ the
sacred rights of property don’t apply to this coat; so hand it over
without any more of your absurd arguments. Nay! if you don’t, and
as I see some one is approaching who may interfere, its
appropriation is one of expediency,—individual expediency must
follow the same law as general expediency,—it is expedient that I
should draw my knife across your throat, otherwise I shall lose that
which is my inheritance in common with the rest of the species.” And
so I might argue ad infinitum.
Mr. Mill’s sophisms however are, what Cossa terms, “concessions
more apparent than real to socialism,” for further on, in his Political
Economy, he completely stultifies his argument by stating that the
principle of property gives to the landowners:—
“a right to compensation for whatever portion of their interest in the land it may
be the policy of the State to deprive them of. To that their claim is indefeasible. It
is due to landowners, and to owners of any property whatever recognised as such
by the State, that they should not be dispossessed of it without receiving its
pecuniary value.... This is due on the general principles on which property rests. If
the land was bought with the produce of the labour and abstinence of themselves
or their ancestors, compensation is due to them on that ground; even if otherwise,
it is still due on the ground of prescription.”
“Nor,” he adds, “can it ever be necessary for accomplishing an object by which
the community altogether will gain, that a particular portion of the community
should be immolated.”[62]
F O OT N OT E S :
[56] If we were to partition out England into a Mill’s Utopia of
peasant proprietors to-morrow, it would not last a week; half of
the proprietors would convert their holdings into drink, and be in
a state of intoxication until it was expended.
[57] ‘Grande and Petite Culture. Rural Economy of France.’ De
Lavergne.
[58] The yeomen and small tenant-farmers, men of little capital,
have almost disappeared, and the process of improving them off
the face of the agricultural world is still progressing to its bitter
end; homestead after homestead has been deserted, and farm
has been added to farm—a very unpleasing result of the
inexorable principle—the survival of the fittest—by means of
which even the cultivators of the soil are selected;—but a result
which, not the laws of nature, but the bungling arrangements of
human legislators, have rendered inevitable. (Bear., Fortnightly
Review, September, 1873.)
[59] ‘Mill’s Political Economy,’ Bk. II. Chap. II.
[60] The original inheritors have, through their lawfully
constituted rulers, parted with their property, having, in most
cases, received an equivalent for it in the shape, either of
eminent services rendered to the State, or else of actual
payments in hard cash; and these transactions have been
deliberately ratified and acknowledged by the laws of the country
from time immemorial. It is therefore simply childish to argue that
the land thus disposed of still belongs to the original inheritors,
after they have enjoyed for past years the proceeds for which
they have bartered the land that once belonged to them.
[61] I beg your pardon, my dear Fanatic, I see I have
unconsciously made a slight mistake. Mill says, that appropriation
is wholly a matter of general expediency, and on that ground you
may justify slavery.
[62] Mill’s Political Economy, Bk. II. Chap. II.
CHAPTER XVII.
SELECTIONS FROM JUGERNATH’S SACRED WRITINGS.
F O OT N OT E S :
[63] Adam Smith, in speaking of the class of merchants and
manufacturers, says:—“Their superiority over the country
gentleman is not so much in their knowledge of the public
interest as in their having a better knowledge of their own
interest than he has of his. It is by this superior knowledge of
their own interest that they have frequently imposed upon his
generosity and persuaded him to give up his own interest and
that of the public from a very simple but honest conviction that
their interest, and not his, was the interest of the people.”
(Wealth of Nations, Bk. I. Chap. XI.)
How true in the case of Free Trade!
[64] The landlordism of the days before Famine (1847) never
“recovered its strength or its primitive ways. For the landlord,
there came of the Famine the Encumbered Estates Court. For the
small farmer and tenant class there floated up the American
Emigrant ships.” (‘History of Our Own Times,’ Justin Macarthy.)
[65] New Ireland, by A. M. Sullivan, p. 133.
[66] Adam Smith contradicts himself about rent—in one set of
passages he says it is the cause, and in another the effect, of
prices.
[67] Macleod’s Economics, p. 117.
[68] Political Economy, by J. S. Mill, Bk. II. Chap. XVI.
[69] Profr. Bonamy Price.
[70] Profr. Bonamy Price.
[71] Political Economy, Bastiat.
[72] “Legal plunder has two roots. One of them is in human
egotism, the other is in false philanthropy.” (Political Economy,
Bastiat.)
CHAPTER XIX.
ODIMUS QUOS LÆSIMUS.
Your friend, John Bright, with his usual disregard for accuracy,
describes the large landlord as the “squanderer and absorber of
national wealth,” but seeing that the total rent of land in Great
Britain and Ireland is less than 5 per cent. of the whole national
income,[73] and that of this less than one-seventh is in the hands of
large landowners, it would require a more able statesman than Mr.
Bright to show how he can squander that, of which such a very
small proportion passes through his lands.
No? friend Bright. You and your fellow free-traders are the real
squanderers of national wealth, and you seek to shift the blame
from your own shoulders, by dishonestly laying it on those of the
landowner. I command to your perusal the graphic description of a
large landowner—the Duke of Argyle—who states that, in Trylee, by
feeding the tenantry in bad times, by assisting some to emigrate, by
introducing new methods of cultivation, by expenditure of capital in
improvements, by consolidating small holdings when too narrow for
subsistence, he has raised a community, from the lowest state of
poverty and degradation, to one of lucrative industry and prosperity.
The prosperity these tenants enjoy is due to the beneficial and
regulative power of the landlord as a capitalist. The greater the
wealth of the landlord, the greater is his beneficial and regulative
power. There were thousands of landowners who acted up to the
limits of their power in this way, until you, friend Bright, ruined them
and deprived them of the power of helping their tenants.
No, doubt, there are bad landlords, as there are bad men in all
classes, but the interests of the landowner and those of the tenant
are inseparably bound together; and the landlord is shrewd enough
to see that it is to his own interest to improve the property if he can
afford to do so.
The old classic, with his insight into human nature, in odimus quos
læsimus, shows that human nature has not altered, and it does not
surprise me that you should hold up to execration the class you have
so cruelly injured.
You, my Free-trading Fanatic, have (thanks to Mill’s unfortunate
sophisms and your leaders’ persistent misrepresentations) such a
very hazy view about landowner’s rights and duties, that I think a
few words on the subject may clear the atmosphere.
(1.) Landed property is the capital of the landlord.
(2.) Interest on capital is fair, reasonable, and consistent with general good.
(3.) Rent is interest on the capital of the landlord.
(4.) The landlord may sell[74] his land, invest the proceeds in any other way, and
thus get interest on his capital.
(5.) The tenant can get rid of rent, either:—
(a) by borrowing money to buy land, in which case he has to pay interest
on the loan;
(b) by saving sufficient money to purchase land, in which case he might,
instead of purchasing, invest the money, so that its interest would pay
the rent.
(6.) In any case the whole question of rent resolves itself into a question of
capital, and interest thereon.
(7.) Law, from time immemorial, has recognised the right of property in land.
(8.) In most cases the owner has paid hard cash both for the land and for the
improvements of it.
(9.) Land is therefore actual capital just as much as money, coal, iron, cattle, or
any other disposable commodity.
It is absurd, therefore, to say, that a man possessing capital in
land may not act in the same way as the owner of any other form of
capital. (Of course he has his moral obligations, but those are
applicable to the possession of any other form of capital.) If the
tenant desires capital, he must work for it, or obtain it in some legal
Welcome to our website – the perfect destination for book lovers and
knowledge seekers. We believe that every book holds a new world,
offering opportunities for learning, discovery, and personal growth.
That’s why we are dedicated to bringing you a diverse collection of
books, ranging from classic literature and specialized publications to
self-development guides and children's books.
ebookbell.com