Exploratory Data Analysis with R
Roger D. Peng
This book is for sale at http://leanpub.com/exdata

This version was published on 2020-05-01

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean
Publishing process. Lean Publishing is the act of publishing an in-progress ebook using
lightweight tools and many iterations to get reader feedback, pivot until you have the
right book and build traction once you do.

© 2015 - 2020 Roger D. Peng


Also By Roger D. Peng
R Programming for Data Science
The Art of Data Science
Executive Data Science
Report Writing for Data Science in R
Advanced Statistical Computing
The Data Science Salon
Conversations On Data Science
Mastering Software Development in R
Essays on Data Analysis
Contents

1. Stay in Touch! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2. Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

3. Getting Started with R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4


3.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.2 Getting started with the R interface . . . . . . . . . . . . . . . . . . . . . . . . . 4

4. Managing Data Frames with the dplyr package . . . . . . . . . . . . . . . . . . . . 5


4.1 Data Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.2 The dplyr Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.3 dplyr Grammar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.4 Installing the dplyr package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.5 select() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.6 filter() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.7 arrange() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.8 rename() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.9 mutate() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.10 group_by() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.11 %>% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.12 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5. Exploratory Data Analysis Checklist . . . . . . . . . . . . . . . . . . . . . . . . . . . 18


5.1 Formulate your question . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.2 Read in your data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.3 Check the packaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.4 Run str() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.5 Look at the top and the bottom of your data . . . . . . . . . . . . . . . . . . . 21
5.6 Check your “n”s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.7 Validate with at least one external data source . . . . . . . . . . . . . . . . . . 26
5.8 Try the easy solution first . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.9 Challenge your solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.10 Follow up questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

6. Principles of Analytic Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33



6.1 Show comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33


6.2 Show causality, mechanism, explanation, systematic structure . . . . . . . 35
6.3 Show multivariate data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.4 Integrate evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.5 Describe and document the evidence . . . . . . . . . . . . . . . . . . . . . . . . 41
6.6 Content, Content, Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

7. Exploratory Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
7.1 Characteristics of exploratory graphs . . . . . . . . . . . . . . . . . . . . . . . 43
7.2 Air Pollution in the United States . . . . . . . . . . . . . . . . . . . . . . . . . . 43
7.3 Getting the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
7.4 Simple Summaries: One Dimension . . . . . . . . . . . . . . . . . . . . . . . . 45
7.5 Five Number Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
7.6 Boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
7.7 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.8 Overlaying Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.9 Barplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7.10 Simple Summaries: Two Dimensions and Beyond . . . . . . . . . . . . . . . 56
7.11 Multiple Boxplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.12 Multiple Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.13 Scatterplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
7.14 Scatterplot - Using Color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.15 Multiple Scatterplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.16 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

8. Plotting Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
8.1 The Base Plotting System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
8.2 The Lattice System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.3 The ggplot2 System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
8.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

9. Graphics Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
9.1 The Process of Making a Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
9.2 How Does a Plot Get Created? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
9.3 Graphics File Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
9.4 Multiple Open Graphics Devices . . . . . . . . . . . . . . . . . . . . . . . . . . 74
9.5 Copying Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
9.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

10. The Base Plotting System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76


10.1 Base Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
10.2 Simple Base Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
10.3 Some Important Base Graphics Parameters . . . . . . . . . . . . . . . . . . . 80

10.4 Base Plotting Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81


10.5 Base Plot with Regression Line . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
10.6 Multiple Base Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
10.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

11. Plotting and Color in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88


11.1 Colors 1, 2, and 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
11.2 Connecting colors with data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
11.3 Color Utilities in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
11.4 colorRamp() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
11.5 colorRampPalette() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
11.6 RColorBrewer Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
11.7 Using the RColorBrewer palettes . . . . . . . . . . . . . . . . . . . . . . . . . . 95
11.8 The smoothScatter() function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
11.9 Adding transparency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
11.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

12. Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102


12.1 Hierarchical clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
12.2 How do we define close? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
12.3 Example: Euclidean distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
12.4 Example: Manhattan distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
12.5 Example: Hierarchical clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 107
12.6 Prettier dendrograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
12.7 Merging points: Complete . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
12.8 Merging points: Average . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
12.9 Using the heatmap() function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
12.10 Notes and further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

13. K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122


13.1 Illustrating the K-means algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 123
13.2 Stopping the algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
13.3 Using the kmeans() function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
13.4 Building heatmaps from K-means solutions . . . . . . . . . . . . . . . . . . . 131
13.5 Notes and further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

14. Dimension Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135


14.1 Matrix data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
14.2 Patterns in rows and columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
14.3 Related problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
14.4 SVD and PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
14.5 Unpacking the SVD: u and v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
14.6 SVD for data compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
14.7 Components of the SVD - Variance explained . . . . . . . . . . . . . . . . . . 145

14.8 Relationship to principal components . . . . . . . . . . . . . . . . . . . . . . . 148


14.9 What if we add a second pattern? . . . . . . . . . . . . . . . . . . . . . . . . . . 150
14.10 Dealing with missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
14.11 Example: Face data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
14.12 Notes and further resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

15. The ggplot2 Plotting System: Part 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160


15.1 The Basics: qplot() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
15.2 Before You Start: Label Your Data . . . . . . . . . . . . . . . . . . . . . . . . . . 163
15.3 ggplot2 “Hello, world!” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
15.4 Modifying aesthetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
15.5 Adding a geom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
15.6 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
15.7 Facets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
15.8 Case Study: MAACS Cohort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
15.9 Summary of qplot() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

16. The ggplot2 Plotting System: Part 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182


16.1 Basic Components of a ggplot2 Plot . . . . . . . . . . . . . . . . . . . . . . . . 182
16.2 Example: BMI, PM2.5, Asthma . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
16.3 Building Up in Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
16.4 First Plot with Point Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
16.5 Adding More Layers: Smooth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
16.6 Adding More Layers: Facets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
16.7 Modifying Geom Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
16.8 Modifying Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
16.9 Customizing the Smooth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
16.10 Changing the Theme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
16.11 More Complex Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
16.12 A Quick Aside about Axis Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
16.13 Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

17. Data Analysis Case Study: Changes in Fine Particle Air Pollution in the U.S. 200
17.1 Synopsis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
17.2 Loading and Processing the Raw Data . . . . . . . . . . . . . . . . . . . . . . . 200
17.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

18. About the Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211


1. Stay in Touch!
Thanks for purchasing this book. If you are interested in hearing more from me about
things that I’m working on (books, data science courses, podcast, etc.), I have a regular
podcast called Not So Standard Deviations1 that I co-host with Dr. Hilary Parker, a
Data Scientist at Stitch Fix. On this podcast, Hilary and I talk about the craft of data
science and discuss common issues and problems in analyzing data. We’ll also compare
how data science is approached in both academia and industry contexts and discuss the
latest industry trends. You can listen to recent episodes on our Libsyn page or you can
subscribe to it in iTunes2 or your favorite podcasting app.
For those of you who purchased a printed copy of this book, I encourage you to go to the
Leanpub web site and obtain the e-book version3, which is available for free. The reason
is that I will occasionally update the book with new material and readers who purchase
the e-book version are entitled to free updates (this is unfortunately not yet possible with
printed books).
Thanks again for purchasing this book and please do stay in touch!
1 http://nssdeviations.com
2 https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570
3 https://leanpub.com/exdata
2. Preface
Exploratory data analysis is a bit difficult to describe in concrete definitive terms, but I
think most data analysts and statisticians know it when they see it. I like to think of it in
terms of an analogy.
Filmmakers will shoot a lot of footage when making a movie or some film production,
not all of which will be used. In addition, the footage will typically not be shot in the
order that the storyline takes place, because of actors’ schedules or other complicating
factors. Furthermore, in some cases, it may be difficult to figure out exactly how the story
should be told while shooting the footage. Rather, it’s sometimes easier to see how the
story flows when putting the various clips together in the editing room.
In the editing room, the director and the editor can play around a bit with different
versions of different scenes to see which dialogue sounds better, which jokes are funnier,
or which scenes are more dramatic. Scenes that just “don’t work” might get dropped, and
scenes that are particularly powerful might get extended or re-shot. This “rough cut” of
the film is put together quickly so that important decisions can be made about what
to pursue further and where to back off. Finer details like color correction or motion
graphics might not be implemented at this point. Ultimately, this rough cut will help the
director and editor create the “final cut”, which is what the audience will ultimately view.
Exploratory data analysis is what occurs in the “editing room” of a research project
or any data-based investigation. EDA is the process of making the “rough cut” for a
data analysis, the purpose of which is very similar to that in the film editing room.
The goals are many, but they include identifying relationships between variables that
are particularly interesting or unexpected, checking to see if there is any evidence for
or against a stated hypothesis, checking for problems with the collected data (such as
missing data or measurement error), or identifying certain areas where more data need
to be collected. At this point, finer details of presentation of the data and evidence,
important for the final product, are not necessarily the focus.
Ultimately, EDA is important because it allows the investigator to make critical decisions
about what is interesting to follow up on and what probably isn’t worth pursuing because
the data just don’t provide the evidence (and might never provide the evidence, even with
follow up). These kinds of decisions are important to make if a project is to move forward
and remain within its budget.
This book covers some of the basics of visualizing data in R and summarizing high-
dimensional data with statistical multivariate analysis techniques. There is less of an
emphasis on formal statistical inference methods, as inference is typically not the focus

of EDA. Rather, the goal is to show the data, summarize the evidence and identify
interesting patterns while eliminating ideas that likely won’t pan out.
Throughout the book, we will focus on the R statistical programming language. We
will cover the various plotting systems in R and how to use them effectively. We will
also discuss how to implement dimension reduction techniques like clustering and the
singular value decomposition. All of these techniques will help you to visualize your data
and to make key decisions in any data analysis.
3. Getting Started with R
3.1 Installation

The first thing you need to do to get started with R is to install it on your computer. R
works on pretty much every platform available, including the widely available Windows,
Mac OS X, and Linux systems. If you want to watch a step-by-step tutorial on how to
install R for Mac or Windows, you can watch these videos:

• Installing R on Windows1
• Installing R on the Mac2

There is also an integrated development environment available for R that is built by


RStudio. I really like this IDE—it has a nice editor with syntax highlighting, there is an R
object viewer, and there are a number of other nice features that are integrated. You can
see how to install RStudio here:

• Installing RStudio3

The RStudio IDE is available from RStudio’s web site4.

3.2 Getting started with the R interface

After you install R you will need to launch it and start writing R code. Before we get to
exactly how to write R code, it’s useful to get a sense of how the system is organized. In
these two videos I talk about where to write code and how to set your working directory,
which lets R know where to find all of your files. A short console sketch follows the links below.

• Writing code and setting your working directory on the Mac5


• Writing code and setting your working directory on Windows6
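
If you want to try this at the console right away, here is a minimal sketch (the path
below is just a placeholder; substitute the directory that actually holds your files):

> getwd()                  ## print the current working directory
> setwd("~/projects/eda")  ## placeholder path; point R at your own directory
> list.files()             ## list the files R can now see in that directory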

1 http://youtu.be/Ohnk9hcxf9M
2 https://youtu.be/uxuuWXU-7UQ
3 https://youtu.be/bM7Sfz-LADM
4 http://rstudio.com
5 https://youtu.be/8xT3hmJQskU
6 https://youtu.be/XBcvH1BpIBo
4. Managing Data Frames with the
dplyr package
Watch a video of this chapter1

4.1 Data Frames

The data frame is a key data structure in statistics and in R. The basic structure of a data
frame is that there is one observation per row and each column represents a variable, a
measure, feature, or characteristic of that observation. R has an internal implementation
of data frames that is likely the one you will use most often. However, there are packages
on CRAN that implement data frames via things like relational databases that allow you
to operate on very large data frames (but we won’t discuss them here).
Given the importance of managing data frames, it’s important that we have good tools for
dealing with them. R obviously has some built-in tools like the subset() function and the
use of [ and $ operators to extract subsets of data frames. However, other operations, like
filtering, re-ordering, and collapsing, can often be tedious operations in R whose syntax
is not very intuitive. The dplyr package is designed to mitigate a lot of these problems and
to provide a highly optimized set of routines specifically for dealing with data frames.

4.2 The dplyr Package

The dplyr package was developed by Hadley Wickham of RStudio and is an optimized
and distilled version of his plyr package. The dplyr package does not provide any “new”
functionality to R per se, in the sense that everything dplyr does could already be done
with base R, but it greatly simplifies existing functionality in R.
One important contribution of the dplyr package is that it provides a “grammar” (in
particular, verbs) for data manipulation and for operating on data frames. With this
grammar, you can sensibly communicate what it is that you are doing to a data frame
that other people can understand (assuming they also know the grammar). This is useful
because it provides an abstraction for data manipulation that previously did not exist.
Another useful contribution is that the dplyr functions are very fast, as many key
operations are coded in C++.
1 https://youtu.be/aywFompr1F4

4.3 dplyr Grammar

Some of the key “verbs” provided by the dplyr package are

• select: return a subset of the columns of a data frame, using a flexible notation
• filter: extract a subset of rows from a data frame based on logical conditions
• arrange: reorder rows of a data frame
• rename: rename variables in a data frame
• mutate: add new variables/columns or transform existing variables
• summarise / summarize: generate summary statistics of different variables in the data
frame, possibly within strata
• %>%: the “pipe” operator is used to connect multiple verb actions together into a
pipeline

The dplyr package has a number of its own data types that it takes advantage of. For
example, there is a handy print method that prevents you from printing a lot of data
to the console. Most of the time, these additional data types are transparent to the user
and do not need to be worried about.
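
As a small illustration (a sketch using the built-in mtcars dataset), converting a data
frame with as_tibble() shows off the friendlier print method:

> library(dplyr)
> as_tibble(mtcars)  ## prints only the first 10 rows rather than all 32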

Common dplyr Function Properties

All of the functions that we will discuss in this Chapter will have a few common
characteristics. In particular,

1. The first argument is a data frame.


2. The subsequent arguments describe what to do with the data frame specified in the
first argument, and you can refer to columns in the data frame directly without using
the $ operator (just use the column names).
3. The return result of a function is a new data frame (see the sketch after this list).
4. Data frames must be properly formatted and annotated for this to all be useful. In
particular, the data must be tidy2 . In short, there should be one observation per row,
and each column should represent a feature or characteristic of that observation.
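
Here is a quick sketch of properties 1-3, using the built-in mtcars data so that the
example is self-contained:

> library(dplyr)
> high.mpg <- filter(mtcars, mpg > 30)  ## data frame first; mpg used without mtcars$
> class(high.mpg)                       ## the result is again a data frame
[1] "data.frame"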

4.4 Installing the dplyr package

The dplyr package can be installed from CRAN or from GitHub using the remotes
package and the install_github() function. The GitHub repository will usually contain
the latest updates to the package and the development version.
To install from CRAN, just run
2 http://www.jstatsoft.org/v59/i10/paper

> install.packages("dplyr")

To install from GitHub you can run

> remotes::install_github("tidyverse/dplyr")

After installing the package it is important that you load it into your R session with the
library() function.

> library(dplyr)

Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

filter, lag
The following objects are masked from 'package:base':

intersect, setdiff, setequal, union

You may get some warnings when the package is loaded because there are functions in
the dplyr package that have the same name as functions in other packages. For now you
can ignore the warnings.
NOTE: If you ever run into a problem where R is getting confused over which function
you mean to call, you can specify the full name of a function using the :: operator. The
full name is simply the name of the package in which the function is defined, followed by ::
and then the function name. For example, the filter function from the dplyr package
has the full name dplyr::filter. Calling functions with their full name will resolve any
confusion over which function was meant to be called.
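
For instance (a small sketch; both packages ship a function named filter):

> dplyr::filter(mtcars, mpg > 30)   ## unambiguously dplyr's row-subsetting filter
> stats::filter(1:10, rep(1/3, 3))  ## unambiguously base R's time-series filter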

4.5 select()

For the examples in this chapter we will be using a dataset containing air pollution and
temperature data for the city of Chicago3 in the U.S. The dataset is available from my
web site.
After unzipping the archive, you can load the data into R using the readRDS() function.

> chicago <- readRDS("data/chicago.rds")

You can see some basic characteristics of the dataset with the dim() and str() functions.

3 http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip

> dim(chicago)
[1] 6940 8
> str(chicago)
'data.frame': 6940 obs. of 8 variables:
$ city : chr "chic" "chic" "chic" "chic" ...
$ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
$ dptp : num 31.5 29.9 27.4 28.6 28.9 ...
$ date : Date, format: "1987-01-01" "1987-01-02" ...
$ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...
$ pm10tmean2: num 34 NA 34.2 47 NA ...
$ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...
$ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...

The select() function can be used to select columns of a data frame that you want to
focus on. Often you’ll have a large data frame containing “all” of the data, but any given
analysis might only use a subset of variables or observations. The select() function
allows you to get the few columns you might need.
Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We
could for example use numerical indices. But we can also use the names directly.

> names(chicago)[1:3]
[1] "city" "tmpd" "dptp"
> subset <- select(chicago, city:dptp)
> head(subset)
city tmpd dptp
1 chic 31.5 31.500
2 chic 33.0 29.875
3 chic 33.0 27.375
4 chic 29.0 28.625
5 chic 32.0 28.875
6 chic 40.0 35.125

Note that the : normally cannot be used with names or strings, but inside the select()
function you can use it to specify a range of variable names.
You can also omit variables using the select() function by using the negative sign. With
select() you can do

> select(chicago, -(city:dptp))

which indicates that we should include every variable except the variables city through
dptp. The equivalent code in base R would be

> i <- match("city", names(chicago))


> j <- match("dptp", names(chicago))
> head(chicago[, -(i:j)])

Not super intuitive, right?


The select() function also provides a special syntax that lets you specify variable
names based on patterns. So, for example, if you wanted to keep every variable that ends
with a “2”, we could do

> subset <- select(chicago, ends_with("2"))


> str(subset)
'data.frame': 6940 obs. of 4 variables:
$ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...
$ pm10tmean2: num 34 NA 34.2 47 NA ...
$ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...
$ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...

Or if we wanted to keep every variable that starts with a “d”, we could do

> subset <- select(chicago, starts_with("d"))


> str(subset)
'data.frame': 6940 obs. of 2 variables:
$ dptp: num 31.5 29.9 27.4 28.6 28.9 ...
$ date: Date, format: "1987-01-01" "1987-01-02" ...

You can also use more general regular expressions if necessary. See the help page
(?select) for more details.
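
For example, here is a brief sketch using the matches() helper, which takes a regular
expression; it keeps every column whose name contains “tmean” followed by a digit:

> subset <- select(chicago, matches("tmean[0-9]"))
> names(subset)
[1] "pm25tmean2" "pm10tmean2" "o3tmean2"   "no2tmean2"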

4.6 filter()

The filter() function is used to extract subsets of rows from a data frame. This function
is similar to the existing subset() function in R but is quite a bit faster in my experience.
Suppose we wanted to extract the rows of the chicago data frame where the levels of
PM2.5 are greater than 30 (which is a reasonably high level), we could do

> chic.f <- filter(chicago, pm25tmean2 > 30)


> str(chic.f)
'data.frame': 194 obs. of 8 variables:
$ city : chr "chic" "chic" "chic" "chic" ...
$ tmpd : num 23 28 55 59 57 57 75 61 73 78 ...
$ dptp : num 21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...
$ date : Date, format: "1998-01-17" "1998-01-23" ...
$ pm25tmean2: num 38.1 34 39.4 35.4 33.3 ...
$ pm10tmean2: num 32.5 38.7 34 28.5 35 ...
$ o3tmean2 : num 3.18 1.75 10.79 14.3 20.66 ...
$ no2tmean2 : num 25.3 29.4 25.3 31.4 26.8 ...

You can see that there are now only 194 rows in the data frame and the distribution of
the pm25tmean2 values is as follows.

> summary(chic.f$pm25tmean2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
30.05 32.12 35.04 36.63 39.53 61.50

We can place an arbitrarily complex logical sequence inside of filter(), so we could for
example extract the rows where PM2.5 is greater than 30 and temperature is greater than
80 degrees Fahrenheit.

> chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
> select(chic.f, date, tmpd, pm25tmean2)
date tmpd pm25tmean2
1 1998-08-23 81 39.60000
2 1998-09-06 81 31.50000
3 2001-07-20 82 32.30000
4 2001-08-01 84 43.70000
5 2001-08-08 85 38.83750
6 2001-08-09 84 38.20000
7 2002-06-20 82 33.00000
8 2002-06-23 82 42.50000
9 2002-07-08 81 33.10000
10 2002-07-18 82 38.85000
11 2003-06-25 82 33.90000
12 2003-07-04 84 32.90000
13 2005-06-24 86 31.85714
14 2005-06-27 82 51.53750
15 2005-06-28 85 31.20000
16 2005-07-17 84 32.70000
17 2005-08-03 84 37.90000

Now there are only 17 observations where both of those conditions are met.

4.7 arrange()

The arrange() function is used to reorder rows of a data frame according to one of the
variables/columns. Reordering rows of a data frame (while preserving the corresponding
order of the other columns) is normally a pain to do in R. The arrange() function simplifies
the process quite a bit.
Here we can order the rows of the data frame by date, so that the first row is the earliest
(oldest) observation and the last row is the latest (most recent) observation.

> chicago <- arrange(chicago, date)

We can now check the first few rows

> head(select(chicago, date, pm25tmean2), 3)


date pm25tmean2
1 1987-01-01 NA
2 1987-01-02 NA
3 1987-01-03 NA

and the last few rows.

> tail(select(chicago, date, pm25tmean2), 3)


date pm25tmean2
6938 2005-12-29 7.45000
6939 2005-12-30 15.05714
6940 2005-12-31 15.00000

Columns can be arranged in descending order too by using the special desc() operator.

> chicago <- arrange(chicago, desc(date))

Looking at the first three and last three rows shows the dates in descending order.

> head(select(chicago, date, pm25tmean2), 3)


date pm25tmean2
1 2005-12-31 15.00000
2 2005-12-30 15.05714
3 2005-12-29 7.45000
> tail(select(chicago, date, pm25tmean2), 3)
date pm25tmean2
6938 1987-01-03 NA
6939 1987-01-02 NA
6940 1987-01-01 NA

4.8 rename()

Renaming a variable in a data frame in R is surprisingly hard to do! The rename() function
is designed to make this process easier.
Here you can see the names of the first five variables in the chicago data frame.

> head(chicago[, 1:5], 3)


city tmpd dptp date pm25tmean2
1 chic 35 30.1 2005-12-31 15.00000
2 chic 36 31.0 2005-12-30 15.05714
3 chic 35 29.4 2005-12-29 7.45000

The dptp column is supposed to represent the dew point temperature and the pm25tmean2
column provides the PM2.5 data. However, these names are pretty obscure or awkward
and should probably be renamed to something more sensible.

> chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)


> head(chicago[, 1:5], 3)
city tmpd dewpoint date pm25
1 chic 35 30.1 2005-12-31 15.00000
2 chic 36 31.0 2005-12-30 15.05714
3 chic 35 29.4 2005-12-29 7.45000

The syntax inside the rename() function is to have the new name on the left-hand side of
the = sign and the old name on the right-hand side.
I leave it as an exercise for the reader to figure out how you do this in base R without dplyr.

4.9 mutate()

The mutate() function exists to compute transformations of variables in a data frame.


Often, you want to create new variables that are derived from existing variables and
mutate() provides a clean interface for doing that.

For example, with air pollution data, we often want to detrend the data by subtracting the
mean from the data. That way we can look at whether a given day’s air pollution level is
higher than or less than average (as opposed to looking at its absolute level).
Here we create a pm25detrend variable that subtracts the mean from the pm25 variable.

> chicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))


> head(chicago)
city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2
1 chic 35 30.1 2005-12-31 15.00000 23.5 2.531250 13.25000
2 chic 36 31.0 2005-12-30 15.05714 19.2 3.034420 22.80556
3 chic 35 29.4 2005-12-29 7.45000 23.5 6.794837 19.97222
4 chic 37 34.5 2005-12-28 17.75000 27.5 3.260417 19.28563
5 chic 40 33.6 2005-12-27 23.56000 27.0 4.468750 23.50000
6 chic 35 29.6 2005-12-26 8.40000 8.5 14.041667 16.81944
pm25detrend
1 -1.230958
2 -1.173815
3 -8.780958
4 1.519042
5 7.329042
6 -7.830958

There is also the related transmute() function, which does the same thing as mutate()
but then drops all non-transformed variables.
Here we detrend the PM10 and ozone (O3) variables.

> head(transmute(chicago,
+ pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE),
+ o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE)))
pm10detrend o3detrend
1 -10.395206 -16.904263
2 -14.695206 -16.401093
3 -10.395206 -12.640676
4 -6.395206 -16.175096
5 -6.895206 -14.966763
6 -25.395206 -5.393846

Note that there are only two columns in the transmuted data frame.

4.10 group_by()

The group_by() function is used to generate summary statistics from the data frame
within strata defined by a variable. For example, in this air pollution dataset, you might
want to know what the average annual level of PM2.5 is. So the stratum is the year,
and that is something we can derive from the date variable. In conjunction with the

group_by() function we often use the summarize() function (or summarise() for some
parts of the world).
The general operation here is a combination of splitting a data frame into separate pieces
defined by a variable or group of variables (group_by()), and then applying a summary
function across those subsets (summarize()).
First, we can create a year variable using as.POSIXlt().

> chicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)

Now we can create a separate data frame that splits the original data frame by year.

> years <- group_by(chicago, year)

Finally, we compute summary statistics for each year in the data frame with the
summarize() function.

> summarize(years, pm25 = mean(pm25, na.rm = TRUE),


+ o3 = max(o3tmean2, na.rm = TRUE),
+ no2 = median(no2tmean2, na.rm = TRUE))
# A tibble: 19 x 4
year pm25 o3 no2
* <dbl> <dbl> <dbl> <dbl>
1 1987 NaN 63.0 23.5
2 1988 NaN 61.7 24.5
3 1989 NaN 59.7 26.1
4 1990 NaN 52.2 22.6
5 1991 NaN 63.1 21.4
6 1992 NaN 50.8 24.8
7 1993 NaN 44.3 25.8
8 1994 NaN 52.2 28.5
9 1995 NaN 66.6 27.3
10 1996 NaN 58.4 26.4
11 1997 NaN 56.5 25.5
12 1998 18.3 50.7 24.6
13 1999 18.5 57.5 24.7
14 2000 16.9 55.8 23.5
15 2001 16.9 51.8 25.1
16 2002 15.3 54.9 22.7
17 2003 15.2 56.2 24.6
18 2004 14.6 44.5 23.4
19 2005 16.2 58.8 22.6

summarize() returns a data frame with year as the first column, and then the annual
averages of pm25, o3, and no2.
In a slightly more complicated example, we might want to know what the average levels
of ozone (o3) and nitrogen dioxide (no2) are within quintiles of pm25. A slicker way to

do this would be through a regression model, but we can actually do this quickly with
group_by() and summarize().

First, we can create a categorical variable of pm25 divided into quintiles.

> qq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)


> chicago <- mutate(chicago, pm25.quint = cut(pm25, qq))

Now we can group the data frame by the pm25.quint variable.

> quint <- group_by(chicago, pm25.quint)

Finally, we can compute the mean of o3 and no2 within quintiles of pm25.

> summarize(quint, o3 = mean(o3tmean2, na.rm = TRUE),


+ no2 = mean(no2tmean2, na.rm = TRUE))
# A tibble: 6 x 3
pm25.quint o3 no2
* <fct> <dbl> <dbl>
1 (1.7,8.7] 21.7 18.0
2 (8.7,12.4] 20.4 22.1
3 (12.4,16.7] 20.7 24.4
4 (16.7,22.6] 19.9 27.3
5 (22.6,61.5] 20.3 29.6
6 <NA> 18.8 25.8

From the table, it seems there isn’t a strong relationship between pm25 and o3, but there
appears to be a positive correlation between pm25 and no2. More sophisticated statistical
modeling can help to provide precise answers to these questions, but a simple application
of dplyr functions can often get you most of the way there.

4.11 %>%

The pipeline operator %>% is very handy for stringing together multiple dplyr functions in
a sequence of operations. Notice above that every time we wanted to apply more than one
function, the operations got buried in a chain of nested function calls that is difficult
to read, i.e.

> third(second(first(x)))

This nesting is not a natural way to think about a sequence of operations. The %>%
operator allows you to string operations in a left-to-right fashion, i.e.

> first(x) %>% second %>% third

Take the example that we just did in the last section where we computed the mean of o3
and no2 within quintiles of pm25. There we had to

1. create a new variable pm25.quint


2. split the data frame by that new variable
3. compute the mean of o3 and no2 in the sub-groups defined by pm25.quint

That can be done with the following sequence in a single R expression.

> mutate(chicago, pm25.quint = cut(pm25, qq)) %>%


+ group_by(pm25.quint) %>%
+ summarize(o3 = mean(o3tmean2, na.rm = TRUE),
+ no2 = mean(no2tmean2, na.rm = TRUE))
# A tibble: 6 x 3
pm25.quint o3 no2
* <fct> <dbl> <dbl>
1 (1.7,8.7] 21.7 18.0
2 (8.7,12.4] 20.4 22.1
3 (12.4,16.7] 20.7 24.4
4 (16.7,22.6] 19.9 27.3
5 (22.6,61.5] 20.3 29.6
6 <NA> 18.8 25.8

This way we don’t have to create a set of temporary variables along the way or create a
massive nested sequence of function calls.
Notice in the above code that I pass the chicago data frame to the first call to mutate(), but
then afterwards I do not have to pass the first argument to group_by() or summarize().
Once you travel down the pipeline with %>%, the first argument is taken to be the output
of the previous element in the pipeline.
Another example might be computing the average pollutant level by month. This could
be useful to see if there are any seasonal trends in the data.

> mutate(chicago, month = as.POSIXlt(date)$mon + 1) %>%


+ group_by(month) %>%
+ summarize(pm25 = mean(pm25, na.rm = TRUE),
+ o3 = max(o3tmean2, na.rm = TRUE),
+ no2 = median(no2tmean2, na.rm = TRUE))
# A tibble: 12 x 4
month pm25 o3 no2
* <dbl> <dbl> <dbl> <dbl>
1 1 17.8 28.2 25.4
2 2 20.4 37.4 26.8
3 3 17.4 39.0 26.8
4 4 13.9 47.9 25.0
5 5 14.1 52.8 24.2
6 6 15.9 66.6 25.0
7 7 16.6 59.5 22.4
8 8 16.9 54.0 23.0
9 9 15.9 57.5 24.5
10 10 14.2 47.1 24.2
11 11 15.2 29.5 23.6
12 12 17.5 27.7 24.5

Here we can see that o3 tends to be low in the winter months and high in the summer
while no2 is higher in the winter and lower in the summer.

4.12 Summary

The dplyr package provides a concise set of operations for managing data frames. With
these functions we can do a number of complex operations in just a few lines of code.
In particular, we can often conduct the beginnings of an exploratory analysis with the
powerful combination of group_by() and summarize().
Once you learn the dplyr grammar there are a few additional benefits:

• dplyr can work with other data frame “backends” such as SQL databases. There is
an SQL interface for relational databases via the DBI package (see the sketch below)
• dplyr can be integrated with the data.table package for large fast tables
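
As a rough sketch of the SQL backend (assuming the DBI, RSQLite, and dbplyr packages
are installed; the table name here is arbitrary), the same verbs can run against an
SQLite table:

> con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")  ## in-memory database
> DBI::dbWriteTable(con, "chicago", chicago)            ## copy the data frame in
> tbl(con, "chicago") %>%                               ## dplyr reference to the table
+     filter(tmpd > 80) %>%
+     select(date, tmpd, pm25) %>%
+     collect()                                         ## pull the result back into R
> DBI::dbDisconnect(con)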

The dplyr package is a handy way to both simplify and speed up your data frame manage-
ment code. It’s rare that you get such a combination at the same time!
5. Exploratory Data Analysis Checklist
In this chapter we will run through an informal “checklist” of things to do when
embarking on an exploratory data analysis. As a running example I will use a dataset on
hourly ozone levels in the United States for the year 2014. The elements of the checklist
are

1. Formulate your question


2. Read in your data
3. Check the packaging
4. Run str()
5. Look at the top and the bottom of your data
6. Check your “n”s
7. Validate with at least one external data source
8. Try the easy solution first
9. Challenge your solution
10. Follow up

5.1 Formulate your question

Formulating a question can be a useful way to guide the exploratory data analysis process
and to limit the exponential number of paths that can be taken with any sizeable dataset.
In particular, a sharp question or hypothesis can serve as a dimension reduction tool that
can eliminate variables that are not immediately relevant to the question.
For example, in this chapter we will be looking at an air pollution dataset from the U.S.
Environmental Protection Agency (EPA). A general question one could ask is

Are air pollution levels higher on the east coast than on the west coast?

But a more specific question might be

Are hourly ozone levels on average higher in New York City than they are in
Los Angeles?

Note that both questions may be of interest, and neither is right or wrong. But the first
question requires looking at all pollutants across the entire east and west coasts, while
the second question only requires looking at a single pollutant in two cities.
It’s usually a good idea to spend a few minutes figuring out what question you’re really
interested in, and narrowing it down to be as specific as possible (without becoming
uninteresting).
For this chapter, we will focus on the following question:

Which counties in the United States have the highest levels of ambient ozone
pollution?

As a side note, one of the most important questions you can answer with an exploratory
data analysis is “Do I have the right data to answer this question?” Often this question is
difficult to answer at first, but can become clearer as we sort through and look at the
data.

5.2 Read in your data

The next task in any exploratory data analysis is to read in some data. Sometimes the
data will come in a very messy format and you’ll need to do some cleaning. Other times,
someone else will have cleaned up that data for you so you’ll be spared the pain of having
to do the cleaning.
We won’t go through the pain of cleaning up a dataset here, not because it’s not important,
but rather because there’s often not much generalizable knowledge to obtain from going
through it. Every dataset has its unique quirks and so for now it’s probably best to not
get bogged down in the details.
Here we have a relatively clean dataset from the U.S. EPA on hourly ozone measurements
in the entire U.S. for the year 2014. The data are available from the EPA’s Air Quality
System web page1 . I’ve simply downloaded the zip file from the web site, unzipped the
archive, and put the resulting file in a directory called “data”. If you want to run this code
you’ll have to use the same directory structure.
The dataset is a comma-separated value (CSV) file, where each row of the file contains
one hourly measurement of ozone at some location in the country.
NOTE: Running the code below may take a few minutes. There are 7,147,884 rows in
the CSV file. If it takes too long, you can read in a subset by specifying a value for the
n_max argument to read_csv() that is greater than 0.
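
As a sketch of that shortcut (same file and col_types as the full read below, just
truncated):

> library(readr)
> ozone1k <- read_csv("data/hourly_44201_2014.csv",
+                     col_types = "ccccinnccccccncnncccccc",
+                     n_max = 1000)  ## read only the first 1,000 data rows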

1 https://aqs.epa.gov/aqsweb/airdata/download_files.html

> library(readr)
> ozone <- read_csv("data/hourly_44201_2014.csv",
+ col_types = "ccccinnccccccncnncccccc")

The readr package by Hadley Wickham is a nice package for reading in flat files very fast,
or at least much faster than R’s built-in functions. It makes some tradeoffs to obtain that
speed, so these functions are not always appropriate, but they serve our purposes here.
The character string provided to the col_types argument specifies the class of each
column in the dataset. Each letter represents the class of a column: “c” for character, “n”
for numeric, and “i” for integer. No, I didn’t magically know the classes of each column—
I just looked quickly at the file to see what the column classes were. If there are too many
columns to check by hand, you can leave col_types unspecified and read_csv() will try to figure them out for you.
Just as a convenience for later, we can rewrite the names of the columns to remove any
spaces.

> names(ozone) <- make.names(names(ozone))

5.3 Check the packaging

Have you ever gotten a present before the time when you were allowed to open it? Sure,
we all have. The problem is that the present is wrapped, but you desperately want to
know what’s inside. What’s a person to do in those circumstances? Well, you can shake
the box a bit, maybe knock it with your knuckle to see if it makes a hollow sound, or even
weigh it to see how heavy it is. This is how you should think about your dataset before
you start analyzing it for real.
Assuming you don’t get any warnings or errors when reading in the dataset, you should
now have an object in your workspace named ozone. It’s usually a good idea to poke at
that object a little bit before we break open the wrapping paper.
For example, you can check the number of rows and columns.

> nrow(ozone)
[1] 7147884
> ncol(ozone)
[1] 23

Remember when I said there were 7,147,884 rows in the file? How does that match up
with what we’ve read in? This dataset also has relatively few columns, so you might be
able to check the original text file to see if the number of columns printed out (23) here
matches the number of columns you see in the original file.

5.4 Run str()

Another thing you can do is run str() on the dataset. This is usually a safe operation in
the sense that even with a very large dataset, running str() shouldn’t take too long.

> str(ozone)
Classes 'tbl_df', 'tbl' and 'data.frame': 7147884 obs. of 23 variables:
$ State.Code : chr "01" "01" "01" "01" ...
$ County.Code : chr "003" "003" "003" "003" ...
$ Site.Num : chr "0010" "0010" "0010" "0010" ...
$ Parameter.Code : chr "44201" "44201" "44201" "44201" ...
$ POC : int 1 1 1 1 1 1 1 1 1 1 ...
$ Latitude : num 30.5 30.5 30.5 30.5 30.5 ...
$ Longitude : num -87.9 -87.9 -87.9 -87.9 -87.9 ...
$ Datum : chr "NAD83" "NAD83" "NAD83" "NAD83" ...
$ Parameter.Name : chr "Ozone" "Ozone" "Ozone" "Ozone" ...
$ Date.Local : chr "2014-03-01" "2014-03-01" "2014-03-01" "2014-03-01" ...
$ Time.Local : chr "01:00" "02:00" "03:00" "04:00" ...
$ Date.GMT : chr "2014-03-01" "2014-03-01" "2014-03-01" "2014-03-01" ...
$ Time.GMT : chr "07:00" "08:00" "09:00" "10:00" ...
$ Sample.Measurement : num 0.047 0.047 0.043 0.038 0.035 0.035 0.034 0.037 0.044 0.046 ...
$ Units.of.Measure : chr "Parts per million" "Parts per million" "Parts per million" "Parts per millio\
n" ...
$ MDL : num 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 ...
$ Uncertainty : num NA NA NA NA NA NA NA NA NA NA ...
$ Qualifier : chr "" "" "" "" ...
$ Method.Type : chr "FEM" "FEM" "FEM" "FEM" ...
$ Method.Name : chr "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - U\
LTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" ...
$ State.Name : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
$ County.Name : chr "Baldwin" "Baldwin" "Baldwin" "Baldwin" ...
$ Date.of.Last.Change: chr "2014-06-30" "2014-06-30" "2014-06-30" "2014-06-30" ...

The output for str() duplicates some information that we already have, like the number
of rows and columns. More importantly, you can examine the classes of each of the
columns to make sure they are correctly specified (i.e. numbers are numeric and strings
are character, etc.). Because I pre-specified all of the column classes in read_csv(), they
all should match up with what I specified.
Often, with just these simple maneuvers, you can identify potential problems with the
data before plunging in head first into a complicated data analysis.

5.5 Look at the top and the bottom of your data

I find it useful to look at the “beginning” and “end” of a dataset right after I check the
packaging. This lets me know if the data were read in properly, things are properly

formatted, and that everything is there. If your data are time series data, then make sure
the dates at the beginning and end of the dataset match what you expect the beginning
and ending time period to be.
You can peek at the top and bottom of the data with the head() and tail() functions.
Here’s the top.

> head(ozone[, c(6:7, 10)])


Latitude Longitude Date.Local
1 30.498 -87.88141 2014-03-01
2 30.498 -87.88141 2014-03-01
3 30.498 -87.88141 2014-03-01
4 30.498 -87.88141 2014-03-01
5 30.498 -87.88141 2014-03-01
6 30.498 -87.88141 2014-03-01

For brevity I’ve only taken a few columns. And here’s the bottom.

> tail(ozone[, c(6:7, 10)])


Latitude Longitude Date.Local
7147879 18.17794 -65.91548 2014-09-30
7147880 18.17794 -65.91548 2014-09-30
7147881 18.17794 -65.91548 2014-09-30
7147882 18.17794 -65.91548 2014-09-30
7147883 18.17794 -65.91548 2014-09-30
7147884 18.17794 -65.91548 2014-09-30

I find tail() to be particularly useful because often there will be some problem reading
the end of a dataset and if you don’t check that you’d never know. Sometimes there’s
weird formatting at the end or some extra comment lines that someone decided to stick
at the end.
Make sure to check all the columns and verify that all of the data in each column looks
the way it’s supposed to look. This isn’t a foolproof approach, because we’re only looking
at a few rows, but it’s a decent start.
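
One simple way to go a bit beyond eyeballing a few rows (a sketch using base R’s
summary()) is to summarize a column you particularly care about:

> summary(ozone$Sample.Measurement)  ## quick check of the range and of missing values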

5.6 Check your “n”s

In general, counting things is usually a good way to figure out if anything is wrong or not.
In the simplest case, if you’re expecting there to be 1,000 observations and it turns out
there’s only 20, you know something must have gone wrong somewhere. But there are
other areas that you can check depending on your application. To do this properly, you
need to identify some landmarks that can be used to check against your data. For example,
if you are collecting data on people, such as in a survey or clinical trial, then you should

know how many people there are in your study. That’s something you should check in
your dataset, to make sure that you have data on all the people you thought you would
have data on.
In this example, we will use the fact that the dataset purportedly contains hourly data for
the entire country. These will be our two landmarks for comparison.
Here, we have hourly ozone data that comes from monitors across the country. The
monitors should be monitoring continuously during the day, so all hours should be
represented. We can take a look at the Time.Local variable to see what time measurements
are recorded as being taken.
> table(ozone$Time.Local)
00:00 00:01 01:00 01:02 02:00 02:03 03:00
288698 2 290871 2 283709 2 282951
03:04 04:00 04:05 05:00 05:06 06:00 06:07
2 288963 2 302696 2 302356 2
07:00 07:08 08:00 08:09 09:00 09:10 10:00
300950 2 298566 2 297154 2 297132
10:11 11:00 11:12 12:00 12:13 13:00 13:14
2 298125 2 298297 2 299997 2
14:00 14:15 15:00 15:16 16:00 16:17 17:00
301410 2 302636 2 303387 2 303806
17:18 18:00 18:19 19:00 19:20 20:00 20:21
2 303795 2 304268 2 304268 2
21:00 21:22 22:00 22:23 23:00 23:24
303551 2 295701 2 294549 2

One thing we notice here is that while almost all measurements in the dataset are
recorded as being taken on the hour, some are taken at slightly different times. So few
readings are taken at these off times that we might be tempted to ignore them. But it
does seem a bit odd, so it's worth a quick check.
We can take a look at which observations were measured at time "13:14".
> library(dplyr)
> filter(ozone, Time.Local == "13:14") %>%
+ select(State.Name, County.Name, Date.Local,
+ Time.Local, Sample.Measurement)
# A tibble: 2 x 5
State.Name County.Name Date.Local Time.Local
<chr> <chr> <chr> <chr>
1 New York Franklin 2014-09-30 13:14
2 New York Franklin 2014-09-30 13:14
# … with 1 more variable:
# Sample.Measurement <dbl>

We can see that it’s a monitor in Franklin County, New York and that the measurements
were taken on September 30, 2014. What if we just pulled all of the measurements taken
at this monitor on this date?

> filter(ozone, State.Code == "36"
+ & County.Code == "033"
+ & Date.Local == "2014-09-30") %>%
+ select(Date.Local, Time.Local,
+ Sample.Measurement) %>%
+ as.data.frame
Date.Local Time.Local Sample.Measurement
1 2014-09-30 00:01 0.011
2 2014-09-30 01:02 0.012
3 2014-09-30 02:03 0.012
4 2014-09-30 03:04 0.011
5 2014-09-30 04:05 0.011
6 2014-09-30 05:06 0.011
7 2014-09-30 06:07 0.010
8 2014-09-30 07:08 0.010
9 2014-09-30 08:09 0.010
10 2014-09-30 09:10 0.010
11 2014-09-30 10:11 0.010
12 2014-09-30 11:12 0.012
13 2014-09-30 12:13 0.011
14 2014-09-30 13:14 0.013
15 2014-09-30 14:15 0.016
16 2014-09-30 15:16 0.017
17 2014-09-30 16:17 0.017
18 2014-09-30 17:18 0.015
19 2014-09-30 18:19 0.017
20 2014-09-30 19:20 0.014
21 2014-09-30 20:21 0.014
22 2014-09-30 21:22 0.011
23 2014-09-30 22:23 0.010
24 2014-09-30 23:24 0.010
25 2014-09-30 00:01 0.010
26 2014-09-30 01:02 0.011
27 2014-09-30 02:03 0.011
28 2014-09-30 03:04 0.010
29 2014-09-30 04:05 0.010
30 2014-09-30 05:06 0.010
31 2014-09-30 06:07 0.009
32 2014-09-30 07:08 0.008
33 2014-09-30 08:09 0.009
34 2014-09-30 09:10 0.009
35 2014-09-30 10:11 0.009
36 2014-09-30 11:12 0.011
37 2014-09-30 12:13 0.010
38 2014-09-30 13:14 0.012
39 2014-09-30 14:15 0.015
40 2014-09-30 15:16 0.016
41 2014-09-30 16:17 0.016
42 2014-09-30 17:18 0.014
43 2014-09-30 18:19 0.016
44 2014-09-30 19:20 0.013
45 2014-09-30 20:21 0.013
46 2014-09-30 21:22 0.010
47 2014-09-30 22:23 0.009
48 2014-09-30 23:24 0.009

Now we can see that this monitor just records its values at odd times, rather than on the
hour. It seems, from looking at the previous output, that this is the only monitor in the
country that does this, so it’s probably not something we should worry about.
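If we wanted to verify that claim rather than eyeball it, one option is to list every
monitor location with an off-hour time stamp. This sketch is mine, not from the original
text; it assumes the same ozone data frame and that on-the-hour times end in ":00".

> # Sketch: list all counties with measurements not taken on the hour.
> filter(ozone, !grepl(":00$", Time.Local)) %>%
+     select(State.Name, County.Name) %>%
+     unique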
Since EPA monitors pollution across the country, there should be a good representation
of states. Perhaps we should see exactly how many states are represented in this dataset.

> select(ozone, State.Name) %>% unique %>% nrow
[1] 52

So it seems the representation is a bit too good—there are 52 states in the dataset, but
only 50 states in the U.S.!
We can take a look at the unique elements of the State.Name variable to see what’s going
on.

> unique(ozone$State.Name)
[1] "Alabama" "Alaska"
[3] "Arizona" "Arkansas"
[5] "California" "Colorado"
[7] "Connecticut" "Delaware"
[9] "District Of Columbia" "Florida"
[11] "Georgia" "Hawaii"
[13] "Idaho" "Illinois"
[15] "Indiana" "Iowa"
[17] "Kansas" "Kentucky"
[19] "Louisiana" "Maine"
[21] "Maryland" "Massachusetts"
[23] "Michigan" "Minnesota"
[25] "Mississippi" "Missouri"
[27] "Montana" "Nebraska"
[29] "Nevada" "New Hampshire"
[31] "New Jersey" "New Mexico"
[33] "New York" "North Carolina"
[35] "North Dakota" "Ohio"
[37] "Oklahoma" "Oregon"
[39] "Pennsylvania" "Rhode Island"
[41] "South Carolina" "South Dakota"
[43] "Tennessee" "Texas"
[45] "Utah" "Vermont"
[47] "Virginia" "Washington"
[49] "West Virginia" "Wisconsin"
[51] "Wyoming" "Puerto Rico"

Now we can see that Washington, D.C. (District of Columbia) and Puerto Rico are the
"extra" states included in the dataset. Since they are clearly part of the U.S. (but not
official states of the union), that all seems okay.

This last bit of analysis made use of something we will discuss in the next section: external
data. We knew that there are only 50 states in the U.S., so seeing 52 state names was an
immediate trigger that something might be off. In this case, all was well, but validating
your data with an external data source can be very useful.

5.7 Validate with at least one external data source

Making sure your data matches something outside of the dataset is very important. It
allows you to ensure that the measurements are roughly in line with what they should
be and it serves as a check on what other things might be wrong in your dataset. External
validation can often be as simple as checking your data against a single number, as we
will do here.
In the U.S. we have national ambient air quality standards, and for ozone, the current
standard² set in 2008 is that the "annual fourth-highest daily maximum 8-hr concentration,
averaged over 3 years" should not exceed 0.075 parts per million (ppm). The exact
details of how to calculate this are not important for this analysis, but roughly speaking,
the 8-hour average concentration should not be too much higher than 0.075 ppm (it can
be higher because of the way the standard is worded).
Let’s take a look at the hourly measurements of ozone.
> summary(ozone$Sample.Measurement)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.02000 0.03200 0.03123 0.04200 0.34900

From the summary we can see that the maximum hourly concentration is quite high
(0.349 ppm) but that in general, the bulk of the distribution is far below 0.075.
We can get a bit more detail on the distribution by looking at deciles of the data.
> quantile(ozone$Sample.Measurement, seq(0, 1, 0.1))
0% 10% 20% 30% 40% 50% 60% 70%
0.000 0.010 0.018 0.023 0.028 0.032 0.036 0.040
80% 90% 100%
0.044 0.051 0.349

Knowing that the national standard for ozone is something like 0.075, we can see from
the data that

• The data are at least of the right order of magnitude (i.e. the units are correct)
• The range of the distribution is roughly what we’d expect, given the regulation
around ambient pollution levels
• Some hourly levels (less than 10%) are above 0.075, but this may be reasonable given
the wording of the standard and the averaging involved (a quick check of the exact
proportion is sketched below).
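For instance, the exact proportion of hourly values above the standard can be computed
directly. This one-liner is my addition, assuming the same ozone data frame as above.

> # Sketch: fraction of hourly measurements above 0.075 ppm.
> mean(ozone$Sample.Measurement > 0.075)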
2 http://www.epa.gov/ttn/naaqs/standards/ozone/s_o3_history.html

5.8 Try the easy solution first

Recall that our original question was

Which counties in the United States have the highest levels of ambient ozone
pollution?

What’s the simplest answer we could provide to this question? For the moment, don’t
worry about whether the answer is correct; the point is how you could provide prima
facie evidence for your hypothesis or question. You may refute that evidence later with
deeper analysis, but this is the first pass.
Because we want to know which counties have the highest levels, it seems we need a list
of counties that are ordered from highest to lowest with respect to their levels of ozone.
What do we mean by “levels of ozone”? For now, let’s just blindly take the average across
the entire year for each county and then rank counties according to this metric.
To identify each county we will use a combination of the State.Name and the County.Name
variables.

> ranking <- group_by(ozone, State.Name, County.Name) %>%
+ summarize(ozone = mean(Sample.Measurement)) %>%
+ as.data.frame %>%
+ arrange(desc(ozone))

Now we can look at the top 10 counties in this ranking.

> head(ranking, 10)
State.Name County.Name ozone
1 California Mariposa 0.04992485
2 California Nevada 0.04866836
3 Wyoming Albany 0.04834274
4 Arizona Yavapai 0.04746346
5 Arizona Gila 0.04722276
6 California Inyo 0.04659648
7 Utah San Juan 0.04654895
8 Arizona Coconino 0.04605669
9 California El Dorado 0.04595514
10 Nevada White Pine 0.04465562

It seems interesting that all of these counties are in the western U.S., with 4 of them in
California alone.
For comparison we can look at the 10 lowest counties too.

> tail(ranking, 10)
State.Name County.Name ozone
781 Alaska Matanuska Susitna 0.020911008
782 Washington Whatcom 0.020114267
783 Hawaii Honolulu 0.019813165
784 Tennessee Knox 0.018579452
785 California Merced 0.017200647
786 Alaska Fairbanks North Star 0.014993138
787 Oklahoma Caddo 0.014677374
788 Puerto Rico Juncos 0.013738328
789 Puerto Rico Bayamon 0.010693529
790 Puerto Rico Catano 0.004685369

Let's take a look at one of the highest level counties, Mariposa County, California. First
let's see how many observations there are for this county in the dataset.

> filter(ozone, State.Name == "California" & County.Name == "Mariposa") %>% nrow
[1] 9328

Always be checking. Does that number of observations sound right? Well, there are 24
hours in a day and 365 days per year, which gives us 8,760, close to the number of
observations we see. Sometimes the counties use alternate methods of measurement
during the year, so there may be "extra" measurements.
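One way to check that hypothesis is to count observations per measurement method at
this monitor. The sketch below is my addition; Method.Name is the column we saw in the
str() output earlier.

> # Sketch: tally observations by measurement method for this county.
> filter(ozone, State.Name == "California" & County.Name == "Mariposa") %>%
+     count(Method.Name)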
We can take a look at how ozone varies through the year in this county by looking at
monthly averages. First we’ll need to convert the date variable into a Date class.

> ozone <- mutate(ozone, Date.Local = as.Date(Date.Local))

Then we will split the data by month to look at the average hourly levels.

> filter(ozone, State.Name == "California" & County.Name == "Mariposa") %>%
+ mutate(month = factor(months(Date.Local), levels = month.name)) %>%
+ group_by(month) %>%
+ summarize(ozone = mean(Sample.Measurement))
# A tibble: 10 x 2
month ozone
* <fct> <dbl>
1 January 0.0408
2 February 0.0388
3 March 0.0455
4 April 0.0498
5 May 0.0505
6 June 0.0564
7 July 0.0522
8 August 0.0554
9 September 0.0512
10 October 0.0469

A few things stand out here. First, ozone appears to be higher in the summer months
and lower in the winter months. Second, there are two months missing (November and
December) from the data. It’s not immediately clear why that is, but it’s probably worth
investigating a bit later on.
Now let’s take a look at one of the lowest level counties, Caddo County, Oklahoma.

> filter(ozone, State.Name == "Oklahoma" & County.Name == "Caddo") %>% nrow
[1] 5666

Here we see that there are perhaps fewer observations than we would expect for a
monitor that was measuring 24 hours a day all year. We can check the data to see if
anything funny is going on.

> filter(ozone, State.Name == "Oklahoma" & County.Name == "Caddo") %>%
+ mutate(month = factor(months(Date.Local), levels = month.name)) %>%
+ group_by(month) %>%
+ summarize(ozone = mean(Sample.Measurement))
# A tibble: 9 x 2
month ozone
* <fct> <dbl>
1 January 0.0187
2 February 0.00206
3 March 0.002
4 April 0.0232
5 May 0.0242
6 June 0.0202
7 July 0.0191
8 August 0.0209
9 September 0.002

Here we can see that the levels of ozone are much lower in this county and that three
months are missing (October, November, and December). Given the seasonal nature of
ozone, it’s possible that the levels of ozone are so low in those months that it’s not even
worth measuring. In fact some of the monthly averages are below the typical method
detection limit of the measurement technology, meaning that those values are highly
uncertain and likely not distinguishable from zero.
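If we wanted to flag such months mechanically, a sketch like the following would do.
Note that the 0.005 ppm cutoff here is an illustrative assumption on my part, not a value
given in the text.

> # Sketch with an assumed detection limit of 0.005 ppm (illustrative).
> filter(ozone, State.Name == "Oklahoma" & County.Name == "Caddo") %>%
+     mutate(month = factor(months(Date.Local), levels = month.name)) %>%
+     group_by(month) %>%
+     summarize(ozone = mean(Sample.Measurement)) %>%
+     filter(ozone < 0.005)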

5.9 Challenge your solution

The easy solution is nice because it is, well, easy, but you should never allow those results
to carry the day. You should always be thinking of ways to challenge the results, especially
if those results comport with your prior expectation.
Now, the easy answer seemed to work okay in that it gave us a listing of counties that had
the highest average levels of ozone for 2014. However, the analysis raised some issues.

For example, some counties do not have measurements every month. Is this a problem?
Would it affect our ranking of counties if we had those measurements?
Also, how stable are the rankings from year to year? We only have one year’s worth of
data for the moment, but we could perhaps get a sense of the stability of the rankings by
shuffling the data around a bit to see if anything changes. We can imagine that from
year to year, the ozone data are somewhat different randomly, but generally follow
similar patterns across the country. So the shuffling process could approximate the data
changing from one year to the next. It’s not an ideal solution, but it could give us a sense
of how stable the rankings are.
First we set the seed of our random number generator and then resample the indices of the
rows of the data frame with replacement. The statistical jargon for this approach is a bootstrap
sample. We use the resampled indices to create a new dataset, ozone2, that shares many
of the same qualities as the original but is randomly perturbed.

> set.seed(10234)
> N <- nrow(ozone)
> idx <- sample(N, N, replace = TRUE)
> ozone2 <- ozone[idx, ]

Now we can reconstruct our rankings of the counties based on this resampled data.

> ranking2 <- group_by(ozone2, State.Name, County.Name) %>%
+ summarize(ozone = mean(Sample.Measurement)) %>%
+ as.data.frame %>%
+ arrange(desc(ozone))

We can then compare the top 10 counties from our original ranking and the top 10
counties from our ranking based on the resampled data.

> cbind(head(ranking, 10),
+ head(ranking2, 10))
State.Name County.Name ozone State.Name
1 California Mariposa 0.04992485 California
2 California Nevada 0.04866836 California
3 Wyoming Albany 0.04834274 Wyoming
4 Arizona Yavapai 0.04746346 Arizona
5 Arizona Gila 0.04722276 Arizona
6 California Inyo 0.04659648 Utah
7 Utah San Juan 0.04654895 California
8 Arizona Coconino 0.04605669 Arizona
9 California El Dorado 0.04595514 California
10 Nevada White Pine 0.04465562 Nevada
County.Name ozone
1 Mariposa 0.04983094
2 Nevada 0.04869841
3 Albany 0.04830520
4 Yavapai 0.04748795
5 Gila 0.04728284
6 San Juan 0.04665711
7 Inyo 0.04652602
8 Coconino 0.04616988
9 El Dorado 0.04611164
10 White Pine 0.04466106

We can see that the rankings based on the resampled data (columns 4–6 on the right) are
very close to the original, with the first 5 being identical. Numbers 6 and 7 get flipped in
the resampled rankings but that's about it. This might suggest that the original rankings
are somewhat stable.
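A natural extension, sketched here as my own addition rather than part of the original
analysis, is to repeat the resampling several times and record the top-ranked county each
time (N is the row count defined above).

> # Sketch: repeat the bootstrap and track the top-ranked county.
> top_county <- function(d) {
+     r <- group_by(d, State.Name, County.Name) %>%
+         summarize(ozone = mean(Sample.Measurement)) %>%
+         arrange(desc(ozone))
+     paste(r$State.Name[1], r$County.Name[1])
+ }
> set.seed(42)
> tops <- replicate(10, top_county(ozone[sample(N, N, replace = TRUE), ]))
> table(tops)  # output not shown; one dominant entry suggests stability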
We can also look at the bottom of the list to see if there were any major changes.

> cbind(tail(ranking, 10),
+ tail(ranking2, 10))
State.Name County.Name ozone
781 Alaska Matanuska Susitna 0.020911008
782 Washington Whatcom 0.020114267
783 Hawaii Honolulu 0.019813165
784 Tennessee Knox 0.018579452
785 California Merced 0.017200647
786 Alaska Fairbanks North Star 0.014993138
787 Oklahoma Caddo 0.014677374
788 Puerto Rico Juncos 0.013738328
789 Puerto Rico Bayamon 0.010693529
790 Puerto Rico Catano 0.004685369
State.Name County.Name ozone
781 Alaska Matanuska Susitna 0.020806642
782 Washington Whatcom 0.020043750
783 Hawaii Honolulu 0.019821603
784 Tennessee Knox 0.018814913
785 California Merced 0.016917933
786 Alaska Fairbanks North Star 0.014933125
787 Oklahoma Caddo 0.014662867
788 Puerto Rico Juncos 0.013858010
789 Puerto Rico Bayamon 0.010578880
790 Puerto Rico Catano 0.004775807

Here we can see that the bottom 10 counties are identical in both rankings. We're less
concerned with the counties at the bottom of the list, but this suggests there is also
reasonable stability.

5.10 Follow up questions

In this chapter I've presented some simple steps to take when starting off on an
exploratory analysis. The example analysis conducted in this chapter was far from perfect,
but it got us thinking about the data and the question of interest. It also gave us a number
of things to follow up on in case we continue to be interested in this question.
At this point it's useful to consider a few follow-up questions.

1. Do you have the right data? Sometimes at the conclusion of an exploratory data
analysis, the conclusion is that the dataset is not really appropriate for this question.
In this case, the dataset seemed perfectly fine for answering the question of which
counties had the highest levels of ozone.
2. Do you need other data? One sub-question we tried to address was whether the
county rankings were stable across years. We addressed this by resampling the data
once to see if the rankings changed, but the better way to do this would be to simply
get the data for previous years and re-do the rankings.
3. Do you have the right question? In this case, it’s not clear that the question we tried
to answer has immediate relevance, and the data didn’t really indicate anything to
increase the question’s relevance. For example, it might have been more interesting
to assess which counties were in violation of the national ambient air quality
standard, because determining this could have regulatory implications. However,
this is a much more complicated calculation to do, requiring data from at least 3
previous years.

The goal of exploratory data analysis is to get you thinking about your data and reasoning
about your question. At this point, we can refine our question or collect new data, all in
an iterative process to get at the truth.
6. Principles of Analytic Graphics
Watch a video of this chapter¹.
The material for this chapter is inspired by Edward Tufte’s wonderful book Beautiful
Evidence, which I strongly encourage you to buy if you are able. He discusses how to make
informative and useful data graphics and lays out six principles that are important to
achieving that goal. Some of these principles are perhaps more relevant to making “final”
graphics as opposed to more “exploratory” graphics, but I believe they are all important
principles to keep in mind.

6.1 Show comparisons

Showing comparisons is really the basis of all good scientific investigation. Evidence
for a hypothesis is always relative to another competing hypothesis. When you say
“the evidence favors hypothesis A”, what you mean to say is that “the evidence favors
hypothesis A versus hypothesis B”. A good scientist is always asking “Compared to
What?” when confronted with a scientific claim or statement. Data graphics should
generally follow this same principle. You should always be comparing at least two things.
For example, take a look at the plot below. This plot shows the change in symptom-free
days in a group of children enrolled in a clinical trial² testing whether an air cleaner
installed in a child's home improves their asthma-related symptoms. This study was
conducted at the Johns Hopkins University School of Medicine, in homes where a
smoker was living for at least 4 days a week. Each child was assessed
at baseline and then 6-months later at a second visit. The aim was to improve a child’s
symptom-free days over the 6-month period. In this case, a higher number is better,
indicating that they had more symptom-free days.
1 https://youtu.be/6lOvA_y7p7w
2 http://www.ncbi.nlm.nih.gov/pubmed/21810636

[Figure: Change in symptom-free days with air cleaner]

There were 47 children who received the air cleaner, and you can see from the boxplot
that on average the number of symptom-free days increased by about 1 day (the solid
line in the middle of the box is the median of the data).
But the question of “compared to what?” is not answered in this plot. In particular, we
don’t know from the plot what would have happened if the children had not received the
air cleaner. But of course, we do have that data and we can show both the group that
received the air cleaner and the control group that did not.
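With data like these in hand, a side-by-side boxplot is one simple way to make the
comparison. The sketch below is mine, not from the text, and the data frame and variable
names (asthma, change, group) are hypothetical stand-ins for the study data.

> # Hypothetical sketch: 'asthma' is assumed to have one row per child,
> # with the change in symptom-free days and a treatment group label.
> boxplot(change ~ group, data = asthma,
+         ylab = "Change in symptom-free days",
+         main = "Change in symptom-free days by treatment group")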

[Figure: Change in symptom-free days by treatment group]

Here we can see that on average, the control group children changed very little in terms of
their symptom-free days. Therefore, compared to children who did not receive an air cleaner,
children receiving an air cleaner experienced improved asthma morbidity.

6.2 Show causality, mechanism, explanation, systematic structure

If possible, it’s always useful to show your causal framework for thinking about a
question. Generally, it’s difficult to prove that one thing causes another thing even with
the most carefully collected data. But it’s still often useful for your data graphics to
indicate what you are thinking about in terms of cause. Such a display may suggest
hypotheses or refute them, but most importantly, it will raise new questions that can
be followed up with new data or analyses.

In the plot below, which is reproduced from the previous section, I show the change in
symptom-free days for a group of children who received an air cleaner and a group of
children who received no intervention.

[Figure: Change in symptom-free days by treatment group]

From the plot, it seems clear that on average, the group that received an air cleaner
experienced improved asthma morbidity (more symptom-free days, a good thing).
An interesting question might be “Why do the children with the air cleaner improve?”
This may not be the most important question—you might just care that the air cleaners
help things—but answering the question of “why?” might lead to improvements or new
developments.
The hypothesis behind air cleaners improving asthma morbidity in children is that the
air cleaners remove airborne particles from the air. Given that the homes in this study
all had smokers living in them, it is likely that there is a high level of particles in the air,
primarily from second-hand smoke.
It’s fairly well-understood that inhaling fine particles can exacerbate asthma symptoms,
so it stands to reason that reducing their presence in the air should improve asthma
symptoms. Therefore, we'd expect that the group receiving the air cleaners should on
average see a decrease in airborne particles. In this case we are tracking fine particulate
matter, also called PM2.5, which stands for particulate matter less than or equal to 2.5
microns in aerodynamic diameter.
In the plot below, you can see the change in symptom-free days for both groups (left)
and the change in PM2.5 for both groups (right).

[Figure: Change in symptom-free days and change in PM2.5 levels in-home]

Now we can see from the right-hand plot that on average in the control group, the level of
PM2.5 actually increased a little bit while in the air cleaner group the levels decreased on
average. This pattern shown in the plot above is consistent with the idea that air cleaners
improve health by reducing airborne particles. However, it is not conclusive proof of this
idea because there may be other unmeasured confounding factors that can lower levels
of PM2.5 and improve symptom-free days.

6.3 Show multivariate data

The real world is multivariate. For anything that you might study, there are usually
many attributes that you can measure. The point is that data graphics should attempt
to show this information as much as possible, rather than reduce things down to one or
two features that we can plot on a page. There are a variety of ways that you can show
multivariate data, and you don’t need to wear 3-D glasses to do it.

Here is just a quick example. Below is data on daily airborne particulate matter (“PM10”)
in New York City and mortality from 1987 to 2000. Each point on the plot represents
the average PM10 level for that day (measured in micrograms per cubic meter) and
the number of deaths on that day. The PM10 data come from the U.S. Environmental
Protection Agency and the mortality data come from the U.S. National Center for Health
Statistics.

[Figure: PM10 and mortality in New York City]

This is a bivariate plot showing two variables in this dataset. From the plot it seems that
there is a slight negative relationship between the two variables. That is, higher daily
average levels of PM10 appear to be associated with lower levels of mortality (fewer
deaths per day).
However, there are other factors that are associated with both mortality and PM10 levels.
One example is the season. It’s well known that mortality tends to be higher in the winter
than in the summer. That can be easily shown in the following plot of mortality and date.

[Figure: Daily mortality in New York City]

Similarly, we can show that in New York City, PM10 levels tend to be high in the summer
and low in the winter. Here’s the plot for daily PM10 over the same time period. Note
that the PM10 data have been centered (the overall mean has been subtracted from them)
so that is why there are both positive and negative values.

[Figure: Daily PM10 in New York City]

From the two plots we can see that PM10 and mortality have opposite seasonality with
mortality being high in the winter and PM10 being high in the summer. What happens
if we plot the relationship between mortality and PM10 by season? That plot is below.

[Figure: PM10 and mortality in New York City by season]

Interestingly, when we previously plotted PM10 against mortality on its own, the relationship
appeared to be slightly negative. However, in each of the plots above, the relationship is
slightly positive. This set of plots illustrates the effect of confounding by season, because
season is related to both PM10 levels and to mortality counts, but in different ways for
each one.
This example illustrates just one of many reasons why it can be useful to plot multivariate
data and to show as many features as possible, as intelligently as possible. In some cases,
you may uncover unexpected relationships depending on how they are plotted or visualized.

6.4 Integrate evidence

Just because you are making data graphics doesn't mean you have to rely solely
on circles and lines to make your point. You can also include printed numbers, words,
images, and diagrams to tell your story. In other words, data graphics should make use
of many modes of data presentation simultaneously, not just the ones that are familiar
to you or that the software can handle. One should never let the tools available drive the
analysis; one should integrate as much evidence as possible onto a graphic.

6.5 Describe and document the evidence

Data graphics should be appropriately documented with labels, scales, and sources. A
general rule for me is that a data graphic should tell a complete story all by itself. You
should not have to refer to extra text or descriptions when interpreting a plot, if possible.
Ideally, a plot would have all of the necessary descriptions attached to it. You might
think that this level of documentation should be reserved for “final” plots as opposed to
exploratory ones, but it’s good to get in the habit of documenting your evidence sooner
rather than later.
Imagine if you were writing a paper or a report, and a data graphic was presented to
make the primary point. Imagine the person you hand the paper/report to has very little
time and will only focus on the graphic. Is there enough information on that graphic for
the person to get the story? While it is certainly possible to be too detailed, I tend to err
on the side of more information rather than less.
In the simple example below, I plot the same data twice (this is the PM10 data from the
previous section of this chapter).

[Figure: Labelling and annotation of data graphics]

The plot on the left is a default plot generated by the plot function in R. The plot on
the right uses the same plot function but adds annotations like a title, a y-axis label, and an
x-axis label. Key information included is where the data were collected (New York), the units of
measurement, the time scale of measurements (daily), and the source of the data (EPA).
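As a rough illustration of the kind of annotation being described, here is a sketch using
base R. The objects dates and pm10 are hypothetical stand-ins for the data behind the
figure, not objects defined in this book.

> # Hypothetical sketch of an annotated base R plot; 'dates' and 'pm10'
> # are assumed vectors of dates and centered PM10 values.
> plot(dates, pm10, type = "l",
+      main = "Daily PM10 in New York City (source: EPA)",
+      xlab = "Date",
+      ylab = "Centered PM10 (micrograms per cubic meter)")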

6.6 Content, Content, Content

Analytical presentations ultimately stand or fall depending on the quality, relevance,
and integrity of their content. This includes the question being asked and the evidence
presented in favor of certain hypotheses. No amount of visualization magic or bells and
whistles can make poor data, or more importantly, a poorly formed question, shine with
clarity. Starting with a good question, developing a sound approach, and only presenting
information that is necessary for answering that question, is essential to every data
graphic.

6.7 References

This chapter is inspired by the work of Edward Tufte. I encourage you to take a look at
his books, in particular the following book:

Edward Tufte (2006). Beautiful Evidence, Graphics Press LLC. www.edwardtufte.com³


3 http://www.edwardtufte.com
7. Exploratory Graphs
Watch a video of this chapter: Part 1¹ Part 2²
There are many reasons to use graphics or plots in exploratory data analysis. If you just
have a few data points, you might just print them out on the screen or on a sheet of paper
and scan them over quickly before doing any real analysis (a technique I commonly use
for small datasets or subsets). If you have a dataset with more than just a few data points,
then you’ll typically need some assistance to visualize the data.
Visualizing the data via graphics can be important at the beginning stages of data analysis
to understand basic properties of the data, to find simple patterns in data, and to suggest
possible modeling strategies. In later stages of an analysis, graphics can be used to “debug”
an analysis, if an unexpected (but not necessarily wrong) result occurs, or ultimately, to
communicate your findings to others.

7.1 Characteristics of exploratory graphs

For the purposes of this chapter (and the rest of this book), we will make a distinction
between exploratory graphs and final graphs. This distinction is not a very formal one,
but it serves to highlight the fact that graphs are used for many different purposes.
Exploratory graphs are usually made very quickly and a lot of them are made in the
process of checking out the data.
The goal of making exploratory graphs is usually to develop a personal understanding
of the data and to prioritize tasks for follow-up. Details like axis orientation or legends,
while present, are generally cleaned up and prettified if the graph is going to be used
for communication later. Often color and plot symbol size are used to convey various
dimensions of information.

7.2 Air Pollution in the United States

For this chapter, we will use a simple case study to demonstrate the kinds of simple
graphs that can be useful in exploratory analyses. The data we will be using come from
the U.S. Environmental Protection Agency (EPA), which is the U.S. government agency
1 https://youtu.be/ma6-0PSNLHo
2 https://youtu.be/UyopqXQ8TTM
Random documents with unrelated
content Scribd suggests to you:
And these statements would appear to be in accord with the
figures I have given above.
The statistics of your Right Hon’ble Ruler, which you receive with
thunders of applause, are not worth the paper on which they are
written.
Again I ask your verdict—guilty or not guilty?
Now for Crime. The statistics in this case are less defensible than
in the previous case, because they involve a dishonourable
suppression of facts.
The statistics brought forward to show that a diminution of crime
has been the result of Free Trade, are as follows:
Convictions in 1859 13,470
” 1881 11,353
———
Apparent decrease of crime 2,117

Now this apparent decrease is wholly due to the “Criminal Justice


Act” of 1855, which enables Magistrates to pass short sentences;
and these, coming under the head of “Summary Convictions,” do not
appear under the head of “Convictions,” where they would have
appeared but for the “Act” of 1855.
If we take the total cases, including summary convictions, the
figures stand as follows:—
Convictions in 1859 246,227
” 1881 542,319
———–
Increase in crime 296,092

In other words, instead of your Right Hon’ble Ruler’s decrease of


2,000 convictions, we have actually an increase of nearly 300,000. Is
it possible to conceive a more glaring case of what Mr. Gladstone
himself terms “the simple but effectual plan of pure falsification?”
Now for Intemperance. The number of persons fined for
drunkenness in England:
In the year 1860 88,410
In ” 1881 174,481

or roughly speaking, the convictions for drunkenness have doubled


in twenty-one years.
Truly, my Friend, you cannot congratulate Free Trade on the
decrease of pauperism, crime, and intemperance it has produced.

F O OT N OT E S :
[45] “In fifty years, Great Britain has lifted her estimate on this
point so rapidly that she spends five times as much for a given
number of paupers? than she did fifteen years after the opening
of the century.” (‘Practical Political Economy,’ by Profr. Bonamy
Price, p. 237.)
[46] Comparative Cost of Relief to Paupers.
England £10 0
France 2 2
Belgium and Holland 1 3
(Mulhall’s Statistics, p. 346.)
[47] Expenditure in London Charities.
1859. 1881.
Orphanages £409,000 £458,000
Homes for aged 88,000 770,000
Asylums 25,000 156,000
Hospitals, &c. 301,000 596,000
——— ———–
Total 823,000 1,980,000
[48] The financial condition of many of the Trades Unions is
causing serious alarm. The drain has been so heavy on them, that
their capital is greatly reduced, and unless some change takes
place, they will become bankrupt. The increase of pauperism will
then be enormous.
[49] Fortnightly Review, January, 1871.
[50] The Mail, December 19th, 1883.
CHAPTER XIV.
JUGERNATH AFLOAT.

I see, my Friend, that you are bringing out your trump card.
“Behold!” you argue “the unfortunate condition to which America has
been reduced by her protectionist policy; she has scarcely a ship
afloat, whilst Free Trade England is carrying the commerce of the
world.”
First, I would ask, are you quite sure that all this is caused by Free
Trade?
Don’t you think that it is just within the bounds of possibility that
our shrewd American cousins may possibly find a quicker and more
remunerative investment for their capital, in encouraging their
home-productive industries, and in employing their home-labour
productively, than in a keen competition with the English for a
barren trade that is not worth having?
Are you ignorant of the fact that the shipping trade has been a
losing concern for some considerable period?
Are you unaware of the fact that wheat has been frequently
carried as ballast, and has paid no freight; that other articles have
been carried at almost nominal rates?
In the Civil and Military Gazette of 7th December, 1883, under the
Telegraphic Summary, I read—
“It is predicted that, unless freight rates to India speedily improve, a
considerable number of steamers now engaged in the trade will be laid up.”

I also read in the Madras Mail, January 9th, 1884, that an organ of
the shipping interests in London has drawn up the probable “results
of the gross working of thirteen steamers of a well-known Steam
Navigation Company, the result of which is a total loss of £34,000 in
one year’s trading.”
Are the Americans to be pitied, because they have no share in this
losing concern?
If protectionism has kept them out of it, you can scarcely blame it.
But even without such keen competition, the Americans are
justified, by the writings of your sacred shastras, as may be seen by
the following quotation:
“The capital, therefore, employed in the Home trade of any country will
generally give encouragement and support to a greater quantity of productive
labour in that country, and increase the value of its annual produce, more than an
equal capital employed in the Foreign trade of consumption; and the capital
employed in this latter trade has, in both these respects, a still greater advantage
over an equal capital engaged in the Carrying trade.”[51]

So you see that the authority of your own sacred writings is


favourable to the policy of our American cousins in this respect.

F O OT N OT E :
[51] ‘Wealth of Nations,’ by Adam Smith, Bk. II. Chap. V.
CHAPTER XV.
ADVERSE PROSPERITY.

I have a few words to say about high wages and prosperity,


before I quit the subject.
Although the rise of wages is, in fact, to some extent, the work of
protection, I am not proud of it; for trades unionism is protection of
an extreme character, generally narrow in its aims, not sufficiently
far-seeing, and consequently sometimes mischievous in its results.
The raising of wages within reasonable bounds is desirable; but, in
a Free Trade country, it is apt to be attended with serious
consequences in raising the cost of the manufactured article, when
competing against the manufacture of foreign countries, where
wages are lower and hours of work longer.
It is said by Free Trade advocates, that although the cost of
provisions has not sensibly increased, yet wages are 50 per cent.
higher, and hours of labour 20 per cent. less, than they were forty
years ago.
From the political economist’s point of view, this appears to be a
decrease of national wealth. Mill says:—
“Saving enriches, and spending impoverishes, the community along with the
individual. Society at large is richer by what it expends in maintaining and aiding
productive labour, but poorer by what it expends in its enjoyments.”[52]

Now if a stalwart race could have existed, and have done 20 per
cent. more work on the lower rate of wages,—although, doubtless,
some improvement in the condition of workmen was desirable,—50
per cent. appears to be a large margin, when we consider that the
price of provisions is said to be unaltered. The British workman is
proverbially extravagant and improvident. High wages encourage
extravagance, whilst surplus cash furnishes the means, and short
hours the leisure, for gratifying a taste for drink.
Setting aside for the moment the serious evils of intemperance,
we have practically, with high wages, the causes that lead to the
impoverishment of a community.
A glance at the statistics of Mr. Giffen seems to indicate this, for
whilst the consumption per head of those commodities which are
termed necessaries of life, have only increased 33 to 40 per cent.
respectively, the consumption of those which may be considered
luxuries—namely, tea and sugar—have increased 232 and 260 per
cent. respectively.
Again, statistics show that, whilst the other classes of the
community have increased in number by 335 per cent. of late years,
the working classes have only increased by 6½ per cent. In other
words, the unproductive classes have increased largely, but, whilst
there is only 6½ per cent. numerical increase in the productive
classes, their labour has decreased by 20 per cent. from shorter
hours of labour.
The drones in the hive have increased very largely, and the
workers have not done so, but have developed an alarming taste for
honey.
The question of waste of wealth would be comparatively of minor
importance were it not seriously complicated by the existence of
Free Trade; but we have now to confront the fact, that, in the
present day, we have to pay 50 per cent. more money for 20 per
cent. less labour than we did forty years ago; whilst Free Trade
brings into the market the products of the keen competition of a
thrifty and parsimonious class of workmen who accept lower wages
and work longer hours. The result must be a gradual extinction of
our industries:
Cotton and woollen industries are struggling hard for existence.
[53]
Silk manufacture is dying out.
Iron industries in a bad way.
Gloomy predictions are made respecting the shipping trade.
Agriculture is rapidly becoming extinguished.
English pluck, capital, and credit are struggling manfully against
disaster, but the struggle cannot last much longer; capital is
sustained by credit; and credit is receiving heavy and repeated blows
from unremunerative industries. Meanwhile, high wages and
extravagant habits are not the best training for the millions that will
be thrown out of employment when the crash comes.
Your prophet, Adam Smith, though an advocate for the repeal of
the Corn Laws, foresaw and forewarned you of these consequences,
as follows:—
“If the free importation of Foreign manufactures were permitted, several of the
Home manufactures would probably suffer, and some of them perhaps go to ruin
altogether.”[54]

Verily, my Friend, you are like a shipowner who congratulates


himself that his sailors were never so well off before—never went
aloft less—never kept fewer watches—never remained so much in
their warm beds: meanwhile the devoted ship is drifting slowly, but
surely, on to the rocks.[55]

F O OT N OT E S :
[52] ‘Political Economy,’ by J. S. Mill, Bk. I. Chap. V.
[53] Mr. S. Smith, M.P., who is connected with cotton industry,
has recently stated that “with all the toil and anxiety of those who
had conducted it, the cotton industry of Lancashire, which gave
maintenance to two or three millions of people, had not earned
so much as 5 per cent. during the past ten years. The employers
had a most anxious life; and many, after struggling for years, had
become bankrupt, and some had died of a broken heart;” and he
added that he believed “most of the leading trades to be in the
same condition.”
The cheap production of Belgian fabrics is stated by the
employers to be the cause of the depression in the cotton trade.
(Times, Dec. 1883.)
[54] ‘Wealth of Nations,’ Bk. IV. Chap. II.
[55] A writer in Vanity Fair, in analyzing the Board of Trade’s
statistics for the year ended March 31st, 1883, when compared
with those for the year ended March, 1880, or the three years of
the Gladstone Ministry, says:
“We were promised cheaper Government, cheaper food, greater
prosperity. We find that so far from these promises being verified,
they have every one been falsified by the result.
“Our Imperial Government is dearer by £8,000,000; our Imperial
and Local Government, together, is dearer by £10,000,000.
“As to food, wheat has become dearer 1s. 3d. per quarter; beef,
by from 3d. to 5d. per stone; Mutton, by 1s. 3d.; money is dearer
than 1¾ per cent.
“As to prosperity, our staple pig iron is cheaper by 22s. 2d. per
ton. We have 398,397 acres fewer under cultivation for corn,
grain and other crops; 50,077 fewer horses; 129,119 fewer cattle;
4,789,738 fewer sheep in the country. We have, in spite of the
Land Act and the allegation of increased prosperity, 18,828 more
paupers in Ireland on a decreasing population. We find that
115,092 more emigrants have left the country in a year, because
they cannot get a living in it. We lose annually 349 more vessels
and 1,534 more lives at sea. The only element of consolation that
these figures” (Board of Trade Returns) “have to show is, that we
have 778,389 more pigs and 4,627 more policemen in the
country. In fact, we are more lacking in every thing we want;
more abounding in every thing we don’t want.
“The price of everything we have to sell has gone down; the price
of everything we have to buy has gone up; and what has gone up
most is the price of Government.
“Dearer Government, dearer bread, dearer beef, dearer mutton,
dearer money; cheaper pig iron; less corn, potatoes, turnips,
grass, and hops, fewer horses, fewer cattle, fewer sheep; more
paupers, more emigrants, more losses of life and property at sea,
more pigs, more policemen.
“These are the benefits that three years of liberal rule have
conferred upon us!!!”
CHAPTER XVI.
SACRED RIGHTS OF PROPERTY.

I have already stated that Mill, when he allows that which Herbert
Spencer terms “political bias,”—and Luigi Cossa terms his “narrow
philosophic utilitarianism,” to warp his better judgment,—is guilty of
absurdities and inconsistencies that would disgrace a schoolboy. This
is notably apparent when he attempts to draw a fundamental
distinction between land and any other property, as regards its
“sacred rights.”
Mr. Mill greatly admired the prosperity of the peasant proprietors
in France and Belgium, unfortunately forgetting that a system, suited
to the sober thrifty peasantry of the Continent, might possibly not be
equally suitable to the improvident lower classes of Ireland and
England,[56] neglectful also of the sensible view taken by M. De
Lavergne that “cultivation spontaneously finds out the organization
that suits it best.”[57] He wished therefore to establish an Utopia of
peasant proprietors in England and Ireland as a panacea for the evils
which Free Trade in the first place, and mischievous legislation in the
second place, had brought upon agriculture. Without presuming to
offer an opinion on the debated subjects of “Grande” and “Petite
Culture,” or peasant and landlord proprietorship, I may say that
cultivation appears to have found out spontaneously the organization
best suited to it, and that, in England and Ireland, landlordism
seems best suited to the improvident character of the lower classes,
in providing capital to help the tenants over bad times, and enabling
improvements to be made in prosperous times.
Be this as it may, peasant proprietorship has proved to be a failure
in Ireland, and is rapidly becoming extinct.[58] Writers on the subject
state that, under that system, labour was so ill-directed, that it
required six men to provide food for ten; and consolidation of
holdings is recommended. Mr. Mill, however, thought otherwise, and
biased by this political conviction, he has propounded the following
extraordinary arguments to prove that the sacred rights of property
are not applicable in the case of landed property[59]:—
(1) “No man made the land.”
(2) It is the original inheritance of the whole species.[60]
(3) Its appropriation is wholly a question of general expediency.
(4) When private property in land is not expedient, it is unjust.
(5) It is no hardship to any one to be excluded from what others
have produced.
(6) But it is a hardship to be born into the world and to find all
nature’s gifts previously engrossed.
(7) Whoever owns land, keeps others out of the enjoyment of it.
Now let us apply Mr. Mill’s arguments to any other kind of
property.
Suppose I say to you:—“My friend! you have two coats; hand one
of them over to me! Sacred rights of property don’t apply to it; you
did not make it; and Mill says—‘it is no hardship to be excluded from
what others have produced;’ but it is some hardship to be born into
the world, and to find all nature’s gifts engrossed. Your argument
that you paid for it in hard cash is worthless. No man made silver
and gold, ‘it is the original inheritance of the whole species, the
receiver is as bad as the thief, and you have connived in the robbery
of those metals from the earth, leaving posterity yet unborn to be
under the hardship of finding all nature’s gifts engrossed.’
“The manufacture of your coat is based on robbery and injustice,
and you have connived at it; the iron and coal used in its production
were made by no man, they are the common inheritance of the
species, those who have obtained them have robbed posterity. You
have bribed them to do so by silver and gold, also robbed from
posterity.
“The very wool of which your coat is formed was made by no
man, it was robbed from a defenceless sheep. Your argument that
the sheep was the property of the shearer is useless. No man made
the sheep, it is the common inheritance of all, &c. Your argument
that his owner reared the sheep, is equally worthless. Monster! if
you find a child, have you a right to rob him and make a slave of
him? such an argument would justify slavery[61] or worse.
“When private property is not expedient it is unjust, and from my
ground of view, it is not expedient that this private property should
be yours; public only differs from private expediency in degree. ‘He
who owns property keeps others out of the enjoyment of it,’ the
sacred rights of property don’t apply to this coat; so hand it over
without any more of your absurd arguments. Nay! if you don’t, and
as I see some one is approaching who may interfere, its
appropriation is one of expediency,—individual expediency must
follow the same law as general expediency,—it is expedient that I
should draw my knife across your throat, otherwise I shall lose that
which is my inheritance in common with the rest of the species.” And
so I might argue ad infinitum.
Mr. Mill’s sophisms however are, what Cossa terms, “concessions
more apparent than real to socialism,” for further on, in his Political
Economy, he completely stultifies his argument by stating that the
principle of property gives to the landowners:—
“a right to compensation for whatever portion of their interest in the land it may
be the policy of the State to deprive them of. To that their claim is indefeasible. It
is due to landowners, and to owners of any property whatever recognised as such
by the State, that they should not be dispossessed of it without receiving its
pecuniary value.... This is due on the general principles on which property rests. If
the land was bought with the produce of the labour and abstinence of themselves
or their ancestors, compensation is due to them on that ground; even if otherwise,
it is still due on the ground of prescription.”
“Nor,” he adds, “can it ever be necessary for accomplishing an object by which
the community altogether will gain, that a particular portion of the community
should be immolated.”[62]

Unfortunately, however, his mischievous denial of the sacred rights


of property in land is eagerly read, while his subsequent qualification
of it is neglected by those who, like Mr. Bright, aim at the destruction
of a political opponent; or, like Mr. Gladstone, are bent on a
particular policy, reckless of the results in carrying it out; or, like Mr.
Parnell and his followers, whose hands itch for plunder; and it has
produced a general haziness of ideas amongst that well-meaning
class of people who are good-naturedly liberal with the property of
other people.
Yet, clothe it with what sophism you will, any attempt, whether
legalized or otherwise, to deprive the landowner of his property and
to violate his rights, is as unjustifiable as the depredations of the
burglar or the pickpocket. Nay more so; because the statesman or
political economist cannot plead poverty or want of education as his
excuse.

F O OT N OT E S :
[56] If we were to partition out England into a Mill’s Utopia of
peasant proprietors to-morrow, it would not last a week; half of
the proprietors would convert their holdings into drink, and be in
a state of intoxication until it was expended.
[57] ‘Grande and Petite Culture. Rural Economy of France.’ De
Lavergne.
[58] The yeomen and small tenant-farmers, men of little capital,
have almost disappeared, and the process of improving them off
the face of the agricultural world is still progressing to its bitter
end; homestead after homestead has been deserted, and farm
has been added to farm—a very unpleasing result of the
inexorable principle—the survival of the fittest—by means of
which even the cultivators of the soil are selected;—but a result
which, not the laws of nature, but the bungling arrangements of
human legislators, have rendered inevitable. (Bear., Fortnightly
Review, September, 1873.)
[59] ‘Mill’s Political Economy,’ Bk. II. Chap. II.
[60] The original inheritors have, through their lawfully
constituted rulers, parted with their property, having, in most
cases, received an equivalent for it in the shape, either of
eminent services rendered to the State, or else of actual
payments in hard cash; and these transactions have been
deliberately ratified and acknowledged by the laws of the country
from time immemorial. It is therefore simply childish to argue that
the land thus disposed of still belongs to the original inheritors,
after they have enjoyed for past years the proceeds for which
they have bartered the land that once belonged to them.
[61] I beg your pardon, my dear Fanatic, I see I have
unconsciously made a slight mistake. Mill says, that appropriation
is wholly a matter of general expediency, and on that ground you
may justify slavery.
[62] Mill’s Political Economy, Bk. II. Chap. II.
CHAPTER XVII.
SELECTIONS FROM JUGERNATH’S SACRED WRITINGS.

Allow me, my dear Idolator, to make a few quotations from one of


your sacred Vedas, on the subject of land.
You are fond of quoting them when it suits your purpose.
Wealth of Nations, by Adam Smith, with the Action of Free Trade.

(1.) Every improvement in the circumstances of the society tends, either directly or indirectly, to raise the real rent of land, to increase the real wealth of the landlord, his power of purchasing the labour or the produce of the labour of other people.
Action of Free Trade: Free Trade has ruined agricultural industry. Can it be an improvement in the circumstances of the society?

(2.) Every increase in the real wealth of the society, every increase in the quantity of useful labour employed within it, tends indirectly to raise the real rent of land.
Action of Free Trade: Free Trade has lowered rents. Can it have wrought an increase in the real wealth of society?

(3.) All those improvements in the productive powers of labour which tend directly to reduce the real price of manufactures, tend indirectly to raise the real rent of land.
Action of Free Trade: The improvements in machinery, science, steam, and electricity prevented the collapse of agriculture at first, and have even given a semblance of temporary prosperity; and this has been dishonestly claimed by Free-traders as their work.

(4.) Whatever reduces the real price of manufactured produce raises that of the rude produce of the landlord.
Action of Free Trade: In spite of this advantage agriculture has collapsed under Free Trade.

(5.) The neglect of cultivation and improvement, the fall in the real price of any part of the rude produce of the land ... tend to lower the real rent of land, to reduce the real wealth of the landlord, to diminish his power of purchasing either the labour or the produce of the labour of other people.
Action of Free Trade: Your Free Trade prophets, Bright and Gladstone, are unceasing in their endeavours to destroy the landlord and diminish his power of employing productive labour.

(6.) The whole annual produce of the land and labour of every country constitutes a revenue to three different orders of people, to:
1. Those who live by rent.
2. Those who live by wages.
3. Those who live by profit.
The interest of the first of these three great orders is strictly and inseparably connected with the general interests of the society. Whatever either promotes or obstructs the one, promotes or obstructs the other.
Action of Free Trade: Free Trade obstructs the interests of the first of these three great orders, and necessarily obstructs the general interests of the nation at large.

(7.) The interest of this third order has not the same connection with the general interest of the society as that of the other two. Merchants and Master Manufacturers are, in this order, the two classes of people who commonly employ the largest capitals.
Action of Free Trade: Free Trade has emanated from this order.

(8.) The proposal of any new law or regulation of commerce, which comes from this order, ought always to be listened to with great precaution, and ought never to be adopted till after having been long and carefully examined, not only with the most scrupulous, but with the most suspicious, attention.
Action of Free Trade: If attention had only been paid to Adam Smith’s warning, we should not now have to mourn the decadence of England’s industries.

(9.) It comes from an order of men whose interest is never exactly the same with that of the public; who have generally an interest to deceive and even to oppress the public, and who accordingly have, upon many occasions, both deceived and oppressed it. (Wealth of Nations, by Adam Smith, Bk. I. Chap. XI.)
Action of Free Trade: How true of your prophet Bright! Free Trade is another fearful example of the deception and oppression practised by this class.

You will probably attempt to discredit your sacred writings when they do not support your own views.
You will argue that Adam Smith wrote when the conditions of
society and commerce were very different from what they are now.
Mathematicians say, that when a formula will not accommodate
itself to altering conditions and circumstances, it is unsound. It is the
same with political science. Either the political science of Adam
Smith is unsound, and he is not reliable, or the serious indictments
against Free Trade given in the quotations above are well-founded.
CHAPTER XVIII.
THE VAMPIRE.

What is the nature of a country-life that it should breed such a vampire,—such a monster of iniquity,—such a “squanderer of
national wealth” as the landlord whom your Free-trading friends hold
up to public execration? The old classical idea “procul a negotiis”
would indicate that it had a contrary influence. How is it then that it
produces the unmitigated miscreant whom Bright delights to
denounce,—whom Gladstone loves to pursue with ruinous
enactments,—and whom Parnell, with his murderous crew, takes
pleasure in “boycotting,” maiming, and assassinating? The external
appearance of this monster gives no clue to his character. From
personal acquaintance with men of this class in England I should
have said, that, on the average, they were well-meaning, harmless,
good-natured men; not always of the widest of views, or shrewdest
intelligence, but with the best intentions, anxious in bad times to
help their tenants, and in good times to improve their property. Even
your prophet Adam Smith appears to have been deceived by them.
[63] Again, appearances are deceptive; for, to my inexperienced eye,
there seemed to be a large amount of kindly sympathy between
tenant and landlord.
I am unable to speak from personal experience respecting the
same classes in Ireland; but all novels and tales of Irish life, which
should reflect, with some degree of truth, the general aspect of
things, agree in describing scenes, probably founded on facts, from
which one would imagine that, before the present agitation and
enactments, there appeared to exist much kindly feeling and
sympathy between the peasantry and the “Masther,” who, with all
his faults, is represented as a generous, rollicking, devil-may-care
sort of fellow,[64] quite opposed to the grasping, grinding miscreant
whom your friends denounce; of course, there were exceptions.
Mr. A. M. Sullivan seems also to have been mistaken when he
says:—
“The conduct of the Irish landlords throughout the famine period has been
variously described, and has, I believe, been generally condemned. I consider the
censure visited on them too sweeping. I hold it to be in some respects cruelly
unjust.... It is impossible to contest authentic cases of brutal heartlessness here
and there; but granting all that has to be entered on the dark debtor side, the
overwhelming balance is the other way. The bulk of the resident Irish landlords
manfully did their best in that dread hour. If they did too little compared with what
the landlord class in England would have done in a similar case, it was because
little was in their power.... They were heritors of estates heavily overweighted with
the debts of a bygone generation.... To these landowners the failure of one year’s
rental receipts meant mortgage, foreclosure, and hopeless ruin. Yet cases might
be named by the score in which men scorned to avert, by pressure on their
suffering tenancy, the fate they saw impending over them. They went down with
the ship.
“No adequate tribute has ever been paid to the memory of those Irish landlords,
and they were men of every party and creed, who perished martyrs to duty, in
that awful time.”[65]

It is wonderful how, at such an awful time, the Irish landlord should have continued to mask his true character.
Still I am rather puzzled.
I quite admit that the Irish landlord is wrong in rack-renting his
tenant to the extent of grinding out of him one-third of the amount
that is cheerfully paid by tenants in protectionist countries.
I admit that he should not have tried in a Free Trade country to extort more than one-tenth of the rent paid by protectionist
tenants. Nay, I will go further. I don’t think that a tenant in Free
Trade Ireland would farm to a profit even if he had the land rent-
free. I admit also that it was selfish of the landlord to allow the
question of his own pauperism to weigh in the question of rent.
Still, after making due allowance for all these faults, I cannot quite
understand how his guilt is sufficiently proven to warrant his
continued persecution and gradual extermination, by enactment
after enactment for his ruin, should he chance to escape
assassination. A snake or a rat could not be hunted down with
greater venom. I must say that, in spite of his crimes, he is an
object of pity.
Perhaps an analysis of his villainy may help me to understand the
heinousness of his crime; let us apply, therefore, to the political
economist for the character of the rent, the instrument with which
he commits his crime—what does he say?[66]
“Rent does not affect the price of agricultural produce.”[67]
“Whoever does pay rent gets back its full value in extra advantage, and the rent
which he pays does not place him in a worse position than, but only in the same
position as, his fellow-producer who pays no rent, but whose instrument is one of
inferior efficiency.”[68]
“Rent is reached by bargaining between the landlord and tenant; bargaining
founded on the practical elements existing in the business. Profit must satisfy the
tenant, or he will not take the farm; and on the other hand, if he claim an unduly
low rent, he will find a rival competitor stepping into the farm house.... The
position of an in-coming tenant is that of a man who is buying a business for sale
(for whether he purchases the farm outright in order to cultivate it, or hires it,
makes no difference in the nature of the transaction). He is buying a specific
business in a given locality, as any man might do in a manufacturing town, and his
motive is profit. This consideration governs the whole of the negotiation between
the landowner and himself ... upon the terms of an annual payment of the means
of profit which he seeks to acquire.”[69]

Yes! This appears to me to be just and business-like; the tenant hires the land for the profit he expects to get out of it, and his rent
is a simple debt. Proceed:—
“To refuse to pay debt violently is to steal, and to permit stealing is not only to
dissolve, but to demoralize, society.”[70]
“When a portion of wealth passes out of the hands of him who has acquired it,
without his consent, and without compensation, to him who has not created it ...
plunder is perpetrated.”[71]
“Law is common force organized to prevent injustice.”[71]
“If the law itself performs the action it ought to repress, plunder is still
perpetrated under aggravated circumstances.”[71]
“To place the position itself of a landlord in an invidious light, as a man who
exacts from the labours of others that for which he has neither toiled nor spun, is
a most unwarrantable process of argumentation.”[70]
“It would be impossible to introduce into society a greater change and a greater
evil than this:—the conversion of law into an instrument of plunder.”[71]

Yes, yes! All this appears to me to be just and sensible! but pardon me, I am a little obtuse. I cannot yet see that the landlord’s
guilt is proven. Let us recapitulate:—
Rent does not raise the price of corn! The tenant gets value for his
rent! He enters into a business contract for profit! The rent is a
simple debt. To refuse it, is to steal! To assist legally at this refusal,
is to be an accomplice in the theft! In this case Government is the
accomplice, and the Government is a plunderer under aggravated
circumstances! Moreover, it not only plunders, but demoralizes
society. Mr. Gladstone represents Government. Messrs. Bright,
Parnell, Davitt and Co. assist in this legalized and illegal plunder;
thus demoralizing the society. The property of the landlord passes to
another without his consent and without compensation! Messrs.
Gladstone and Co. use that which Professor Bonamy Price terms a
most “unwarrantable process of argumentation.”
Stop! Stop!! for goodness’ sake!!! My brain is getting confused; in
my innocence, had I not been gravely assured that they were angels
of light, patriots, philanthropists,[72] I should have mistaken Messrs.
Gladstone, Bright, Parnell, Davitt, and Co. for the real criminals.

FOOTNOTES:
[63] Adam Smith, in speaking of the class of merchants and
manufacturers, says:—“Their superiority over the country
gentleman is not so much in their knowledge of the public
interest as in their having a better knowledge of their own
interest than he has of his. It is by this superior knowledge of
their own interest that they have frequently imposed upon his
generosity and persuaded him to give up his own interest and
that of the public from a very simple but honest conviction that
their interest, and not his, was the interest of the people.”
(Wealth of Nations, Bk. I. Chap. XI.)
How true in the case of Free Trade!
[64] The landlordism of the days before Famine (1847) never
“recovered its strength or its primitive ways. For the landlord,
there came of the Famine the Encumbered Estates Court. For the
small farmer and tenant class there floated up the American
Emigrant ships.” (‘History of Our Own Times,’ Justin McCarthy.)
[65] New Ireland, by A. M. Sullivan, p. 133.
[66] Adam Smith contradicts himself about rent—in one set of
passages he says it is the cause, and in another the effect, of
prices.
[67] Macleod’s Economics, p. 117.
[68] Political Economy, by J. S. Mill, Bk. II. Chap. XVI.
[69] Profr. Bonamy Price.
[70] Profr. Bonamy Price.
[71] Political Economy, Bastiat.
[72] “Legal plunder has two roots. One of them is in human
egotism, the other is in false philanthropy.” (Political Economy,
Bastiat.)
CHAPTER XIX.
ODIMUS QUOS LÆSIMUS.

Your friend, John Bright, with his usual disregard for accuracy,
describes the large landlord as the “squanderer and absorber of
national wealth,” but seeing that the total rent of land in Great
Britain and Ireland is less than 5 per cent. of the whole national
income,[73] and that of this less than one-seventh (that is, less than three-quarters of one per cent. of the whole income) is in the hands of large landowners, it would require a more able statesman than Mr. Bright to show how he can squander that of which so very small a proportion passes through his hands.
No, friend Bright. You and your fellow Free-traders are the real
squanderers of national wealth, and you seek to shift the blame
from your own shoulders, by dishonestly laying it on those of the
landowner. I commend to your perusal the graphic description of a large landowner—the Duke of Argyle—who states that, in Tiree, by
feeding the tenantry in bad times, by assisting some to emigrate, by
introducing new methods of cultivation, by expenditure of capital in
improvements, by consolidating small holdings when too narrow for
subsistence, he has raised a community, from the lowest state of
poverty and degradation, to one of lucrative industry and prosperity.
The prosperity these tenants enjoy is due to the beneficial and
regulative power of the landlord as a capitalist. The greater the
wealth of the landlord, the greater is his beneficial and regulative
power. There were thousands of landowners who acted up to the
limits of their power in this way, until you, friend Bright, ruined them
and deprived them of the power of helping their tenants.
No doubt there are bad landlords, as there are bad men in all
classes, but the interests of the landowner and those of the tenant
are inseparably bound together; and the landlord is shrewd enough
to see that it is to his own interest to improve the property if he can
afford to do so.
The old classic, with his insight into human nature, in odimus quos
læsimus, shows that human nature has not altered, and it does not
surprise me that you should hold up to execration the class you have
so cruelly injured.
You, my Free-trading Fanatic, have (thanks to Mill’s unfortunate
sophisms and your leaders’ persistent misrepresentations) such a
very hazy view about landowners’ rights and duties, that I think a
few words on the subject may clear the atmosphere.
(1.) Landed property is the capital of the landlord.
(2.) Interest on capital is fair, reasonable, and consistent with general good.
(3.) Rent is interest on the capital of the landlord.
(4.) The landlord may sell[74] his land, invest the proceeds in any other way, and
thus get interest on his capital.
(5.) The tenant can get rid of rent, either:—
(a) by borrowing money to buy land, in which case he has to pay interest
on the loan;
(b) by saving sufficient money to purchase land, in which case he might,
instead of purchasing, invest the money, so that its interest would pay
the rent.
(6.) In any case the whole question of rent resolves itself into a question of capital, and interest thereon; a worked figure follows the list below.
(7.) Law, from time immemorial, has recognised the right of property in land.
(8.) In most cases the owner has paid hard cash both for the land and for the
improvements of it.
(9.) Land is therefore actual capital just as much as money, coal, iron, cattle, or
any other disposable commodity.
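To give point (6.) a concrete shape, take a worked figure; the amounts and the rate assumed here are illustrative only, and are drawn from no authority quoted above. Write P for the price the landlord has paid for his land, r for the current rate of interest, and R for the yearly rent. Suppose the land to have cost £2,000, and suppose money to yield 3 per cent.; rent being interest on the landlord’s capital,

$$R = rP = 0.03 \times \pounds 2{,}000 = \pounds 60 \text{ per annum.}$$

The tenant who pays £60 a year of rent stands precisely where he would stand had he borrowed £2,000 at 3 per cent. to buy the farm, or had he invested £2,000 of his own savings and paid the rent out of the interest; which is points (5.) and (6.) over again.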
It is absurd, therefore, to say, that a man possessing capital in
land may not act in the same way as the owner of any other form of
capital. (Of course he has his moral obligations, but those are
applicable to the possession of any other form of capital.) If the
tenant desires capital, he must work for it, or obtain it in some legal way.