Preview (2023) Introduction To Environmental Data Science in R 33p
Data Science
Introduction to Environmental Data Science focuses on data science methods in the R language
applied to environmental research, with sections on exploratory data analysis in R, including data
abstraction, transformation, and visualization; spatial data analysis in vector and raster models;
statistics and modeling, ranging from exploratory analysis through confirmatory statistics and
extending to machine learning models; time series analysis, focusing especially on carbon and
micrometeorological flux; and communication. It is an ideal textbook for teaching undergraduate-
to graduate-level students in environmental science, environmental studies, geography, earth
science, and biology, but can also serve as a reference for environmental professionals working in
consulting, NGOs, and government agencies at the local, state, federal, and international levels.
Features
• Gives thorough consideration to the needs of environmental research in both spatial and
temporal domains.
• Features examples of applications involving field-collected data, ranging from individual
observations to data logging.
• Includes examples of applications involving government and NGO sources, ranging
from satellite imagery to environmental data collected by regulators such as the EPA.
• Contains class-tested exercises in all chapters other than case studies. A solutions manual
is available for instructors.
• All examples and exercises make use of a GitHub package for functions and especially data.
Taylor & Francis
Taylor & Francis Group
http://taylorandfrancis.com
Introduction to Environmental
Data Science
Jerry D. Davis
Designed cover image: By Anna Studwell and Jerry D. Davis
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright
holders if permission to publish in this form has not been obtained. If any copyright material has not been
acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including
photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.
DOI: 10.1201/9781003317821
Typeset in LM Roman
by KnowledgeWorks Global Ltd.
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
“Dandelion fluff – Ephemeral stalk sheds seeds to the universe” by Anna Studwell
Contents
List of Figures
3 Data Abstraction
3.1 The Tidyverse
3.2 Tibbles
3.2.1 Building a tibble from vectors
3.2.2 tribble
3.2.3 read_csv
3.3 Summarizing variable distributions
3.3.1 Stratifying variables by site using a Tukey box plot
3.4 Database operations with dplyr
3.4.1 Select, mutate, and the pipe
3.4.2 filter
3.4.3 Writing a data frame to a csv
3.4.4 Summarize by group
3.4.5 Count
3.4.6 Sorting after summarizing
3.4.7 The dot operator
3.5 String abstraction
3.5.1 Detecting matches
3.5.2 Subsetting strings
3.5.3 String length
3.5.4 Replacing substrings with other text (“mutating” strings)
3.5.5 Concatenating and splitting
3.6 Dates and times with lubridate
3.7 Calling functions explicitly with ::
3.8 Exercises: Data Abstraction
4 Visualization
4.1 plot in base R
4.2 ggplot2
4.3 Plotting one variable
4.3.1 Histogram
4.3.2 Density plot
4.3.3 Boxplot
4.4 Plotting Two Variables
4.4.1 Two continuous variables
4.4.2 Two variables, one discrete
4.4.3 Color systems
4.4.4 Trend line
4.5 General Symbology
4.5.1 Categorical symbology
4.5.2 Log scales instead of transform
4.6 Graphs from Grouped Data
4.6.1 Faceted graphs
4.7 Titles and Subtitles
4.8 Pairs Plot
4.9 Exercises: Visualization
II Spatial
6 Spatial Data and Maps
6.1 Spatial Data
6.1.1 Simple geometry building in sf
6.1.2 Building points from a data frame
6.1.3 SpatVectors in terra
6.1.4 Creating features from shapefiles
6.2 Coordinate Referencing Systems
6.3 Creating sf Data from Data Frames
6.3.1 Removing geometry
6.4 Base R’s plot() with terra
6.4.1 Using maptiles to create a basemap
6.5 Raster data
6.5.1 Building rasters
6.5.2 Vector to raster conversion
6.6 ggplot2 for Maps
6.6.1 Rasters in ggplot2
6.7 tmap
6.8 Interactive Maps
6.8.1 Leaflet
6.8.2 Mapview
6.8.3 tmap (view mode)
6.8.4 Interactive mapping of individual penguins abstracted from a big dataset
6.9 Exercises: Spatial Data and Maps
6.9.1 Project preparation
11 Modeling
11.1 Some Common Statistical Models
11.2 Linear Model (lm)
11.3 Spatial Influences on Statistical Analysis
11.3.1 Mapping residuals
11.4 Analysis of Covariance
11.5 Generalized linear model (GLM)
11.5.1 Binomial family: logistic GLM with streams
References
Index
Author/editor biographies
List of Figures
3.1 Visualization of some abstracted data from the EPA Toxic Release Inventory
3.2 Euc-Oak paired plot runoff and erosion study (Thompson, Davis, and Oliphant (2016))
3.3 Eucalyptus/Oak paired site locations
3.4 Tukey boxplot of runoff under eucalyptus canopy
4.1 Flipper length by mass and by species, base plot system. The Antarctic peninsula penguin data set is from @palmer.
4.2 Simple bar graph of meadow vegetation samples
4.3 Distribution of NDVI, Knuthson Meadow
4.4 Distribution of Average Monthly Temperatures, Sierra Nevada
4.5 Cumulative Distribution of Average Monthly Temperatures, Sierra Nevada
4.6 Density plot of NDVI, Knuthson Meadow
4.7 Comparative density plot using alpha setting
4.8 Runoff under eucalyptus and oak in Bay Area sites
4.9 Boxplot of runoff by site
4.10 Runoff at Bay Area Sites, colored as eucalyptus and oak
4.11 Marble Valley, Marble Mountains Wilderness, California
4.12 Marble Mountains soil gas sampling sites, with surface topographic features and cave passages
4.13 Visualizing soil CO2 data with a Tukey box plot
4.14 Scatter plot of discharge (Q) and specific electrical conductance (EC) for Sagehen Creek, California
4.15 Q and EC for Sagehen Creek, using log10 scaling on both axes
4.16 Setting one color for all points
4.17 Two variables, one discrete
4.18 Using aesthetics settings for both points and lines
4.19 Color set within aes()
4.20 Streamflow (Q) and specific electrical conductance (EC) for Sagehen Creek, colored by temperature
4.21 Channel slope as range from green to red, vertices sized by elevation
4.22 Channel slope as range of line colors on a longitudinal profile
4.23 Channel slope by longitudinal distance as scatter points colored by slope
4.24 Trend line with a linear model
4.25 EPA TRI, categorical symbology for industry sector
4.26 Using log scales instead of transforming
4.27 NDVI symbolized by vegetation in two seasons
4.28 Eucalyptus and oak: rainfall and runoff
4.29 Faceted graph alternative to color grouping (note that the y scale is the same for each)
4.30 Titles added
4.31 Pairs plot for Sierra Nevada stations variables
4.32 Enhanced GGally pairs plot for palmerpenguin data
6.1 A simple ggplot2 map built from scratch with hard-coded data as simple feature columns
6.2 Using an sf class to build a map in ggplot2, displaying an attribute
6.3 Base R plot of one attribute from two states
6.4 Points created from a dataframe with Simple Features
6.5 Simple plot of SpatVector point data with labels (note that overlapping labels may result, as seen here)
6.6 ggplot of twostates and stations
6.7 Base R plot of twostates and stations SpatVectors
6.8 A simple plot of polygon data by default shows all variables
6.9 A single map with a legend is produced when a variable is specified
6.10 Points created from data frame with coordinate variables
6.11 Plotting SpatVector data with base R plot system
6.12 Features added to the map using the base R plot system
6.13 Using maptiles for a base map
6.14 Converted sf data for map with tiles
6.15 Simple plot of a worldwide SpatRaster of 30-degree cells, with SpatVector of CA and NV added
6.16 Stream raster converted from stream features, with 30 m cells from an elevation raster template
6.17 Shuttle Radar Topography Mission (SRTM) image of Virgin River Canyon area, southern Utah
6.18 simple ggplot map
6.19 labels added
6.20 repositioned legend
7.1 Plotting filtered data: above 2,000 m and 38°N latitude with a basemap
7.2 A Bodie scene, from Bodie State Historic Park (https://www.parks.ca.gov/)
7.3 Sierra data
7.4 Northern Sierra stations and places
7.5 California county centroids
7.6 Map scaled to cover Bay Area tracts using a bbox
7.7 Nile River points, colored by channel slope
7.8 Nile River channel slope as range of colors from green to red, with great circle channel distances derived using the haversine method
7.9 Selection of soil CO2 sampling sites, July 1995
7.10 Selection of soil CO2 and in-cave water samples
7.11 Distance from CO2 samples to closest streams (not including lakes)
7.12 Distance to towns (places) from weather stations
7.13 100 m trail buffer, Marble Mountains
7.14 Unioned trail buffer, dissolving boundaries
7.15 Intersection of trail and stream buffers
7.16 Union of two sets of buffer polygons
7.17 Cropping with specified x and y limits
7.18 TRI points with census variables added via a spatial join
7.19 Transect Buffers (goal)
13.1 Red Clover Valley eddy covariance flux tower installation
13.2 Loney Meadow net ecosystem exchange (NEE) results (Blackburn et al. 2021)
13.3 Time series of Nile River flows
13.4 Decomposition of Mauna Loa CO2 data
13.5 Seasonal decomposition of time series using loess (stl) applied to CO2
13.6 San Francisco monthly highs and lows as time series
13.7 SF data with yearly period
13.8 Greenhouse gases with 20 year observations, so 0.05 annual frequency
13.9 Monthly sunspot activity from 1749 to 2013
13.10 Monthly sunspot activity from 1940 to 1970
13.11 Sunspots of the first 20 years of data
13.12 11-year sunspot cycle decomposition
13.13 San Pedro Creek E. coli time series
13.14 Decomposition of weekly E. coli data, annual period (frequency 52)
13.15 Moving average (order=15) of E. coli data
13.16 GHG CO2 time series
13.17 Moving average (order=7) of CO2 time series
13.18 Random variation seen by subtracting moving average
13.19 Decomposition using stl of a 15th-order moving average of E. coli data
13.20 Marble Mountains resurgence data logger design
13.21 Marble Mountains resurgence data logger equipment
13.22 Data logger data from the Marbles resurgence
13.23 stl decomposition of Marbles water level time series
13.24 Flux tower installed at Loney Meadow, 2016. Photo credit: Darren Blackburn
13.25 Facet plot with free y scale of Loney flux tower parameters
13.26 Scatter plot of Bugac solar radiation and air temperature
13.27 Solstice 8-day time series of solar radiation and temperature
13.28 Bugac solar radiation and temperature
13.29 Manaus ensemble averages with error bars
13.30 Facet graph of Marble Mountains resurgence data (goal)
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms
and systems to extract knowledge and insights from noisy, structured and unstructured data
(Wikipedia). A data science approach is especially suitable for applications involving large
and complex data sets, and environmental data is a prime example, with rapidly growing
collections from automated sensors in space and time domains.
The methods needed for environmental research are varied because environmental data
are varied, and they include environmental measurements in both space and time
domains.
Background, Goals and Data
1.3 Goals
While the methodological reach of data science is very great, and the spectrum of
environmental data is as well, our goal is to lay the foundation and provide useful introductory
methods in the areas outlined above, but as a “live” book be able to extend into more
advanced methods and provide a growing suite of research examples with associated data
sets. We’ll briefly explore some data mining methods that can be applied to so-called “big
data” challenges, but our focus is on exploratory data analysis in general, applied to
environmental data in space and time domains. For clarity in understanding the methods and
products, much of our data will in fact be quite small, derived from field-based
environmental measurements where we can best understand how the data were collected, but these
methods extend to much larger data sets. It will primarily be in the areas of time series and
imagery, where automated data capture and machine learning are employed, that we’ll dip
our toes into big data.
Exploratory Data Analysis
Machine Learning: building a model using training data in order to make predictions
without being explicitly programmed to do so. Related to artificial intelligence methods.
Used in:
Big Data: data having a size or complexity too big to be processed effectively by traditional
software
Exploratory Data Analysis: procedures for analyzing data, techniques for interpreting
the results of such procedures, ways of structuring data to make its analysis easier
• summarizing
• restructuring
• visualization
Just as exploration is a part of what National Geographic has long covered, it’s an important
part of geographic and environmental science research. Exploratory data analysis
is exploration applied to data, and has grown as an alternative approach to traditional
statistical analysis. This basic approach perhaps dates back to the work of Thomas Bayes
in the eighteenth century, but Tukey (1962) may have best articulated the basic goals of
this approach in defining the “data analysis” methods he was promoting: “Procedures for
analyzing data, techniques for interpreting the results of such procedures, ways of planning
the gathering of data to make its analysis easier, more precise or more accurate, and all the
machinery and results of (mathematical) statistics which apply to analyzing data.” Some
years later Tukey (1977) followed up with Exploratory Data Analysis.
Exploratory data analysis (EDA) is an approach to analyzing data via summaries and graphics.
The key word is exploratory, and while one might view this in contrast to confirmatory
statistics, in fact they are highly complementary. The objectives of EDA include (a) suggesting
hypotheses; (b) assessing assumptions on which inferences will be based; (c) selecting
appropriate statistical tools; and (d) guiding further data collection. This philosophy led to
the development of S at Bell Labs (led by John Chambers, 1976), then to R.
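As a minimal illustration of EDA by summary, the sketch below applies a few base R functions to the Ozone variable of the airquality data set that ships with R. The choice of data set and variable is an assumption made for illustration, not an example taken from the book.

```r
# Illustrative data (an assumption): daily ozone readings from R's built-in airquality data
ozone <- datasets::airquality$Ozone

# Mean, quartiles, range, and a count of missing values
summary(ozone)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(ozone)   # missing values are dropped by default
five

# Values beyond the whiskers of a Tukey box plot
# (more than 1.5 times the hinge spread past the hinges)
outliers <- boxplot.stats(ozone)$out
outliers
```

Summaries like these serve objectives (a) and (b) above: they hint at skew and unusual values before any model is chosen.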
First, we’re going to use the R language, designed for statistical computing and graphics. It’s
not the only way to do data analysis – Python is another important data science language
– but R with its statistical foundation is an important language for academic research,
especially in the environmental sciences.
## [1] "This book was produced in RStudio using R version 4.2.1 (2022-06-23 ucrt)"
For a start, you’ll need to have R and RStudio installed, then you’ll need to install various
packages to support specific chapters and sections.
• In Introduction to R (Chapter 2), we will mostly use the base installation of R, with
a few packages to provide data and enhanced table displays:
– igisci
– palmerpenguins
– DT
– knitr
• In Abstraction (Chapter 3) and Transformation (Chapter 5), we’ll start making a
lot of use of tidyverse (Section 3.1) packages such as:
– ggplot2
– dplyr
– stringr
– tidyr
– lubridate
• In Visualization (Chapter 4), we’ll mostly use ggplot2, but also some specialized visu-
alization packages such as:
– GGally
• In Spatial (starting with Chapter 6), we’ll add some spatial data, analysis and mapping
packages:
– sf
– terra
– tmap
– leaflet
• In Statistics and Modeling (starting with Chapter 10), no additional packages are
needed, as we can rely on base R’s rich statistical methods and ggplot2’s visualization.
• In Time Series (Chapter 13), we’ll find a few other packages handy:
– xts (Extensible Time Series)
– forecast (for a few useful functions like a moving average)
There will certainly be other packages we’ll explore along the way. Install each one when
you first need it, which will typically be when you first see a library() call in the code,
when a function is prefaced with the package name (something like dplyr::select()), or
when R raises an error that it can’t find a function you’ve called or that the package isn’t
installed. One of the earliest we’ll need is the suite of packages in the “tidyverse”
(Wickham and Grolemund (2016)), which includes some of the ones listed above: ggplot2,
dplyr, stringr, and tidyr. You can install these individually, or all at once with:
`install.packages("tidyverse")`
This is usually done from the console in RStudio and not included in an R script or mark-
down document, since you don’t want to be installing the package over and over again. You
can also respond to a prompt from RStudio when it detects a package called in a script you
open that you don’t have installed.
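One way to avoid reinstalling packages is to check first which ones are missing. The following is only a sketch: missing_pkgs() is a hypothetical helper written for this illustration, and the vector of names simply repeats some of the packages listed earlier.

```r
# Some of the packages named in the chapter list above
wanted <- c("tidyverse", "GGally", "sf", "terra", "tmap", "leaflet", "xts", "forecast")

# Hypothetical helper: which of the requested packages are not yet installed?
missing_pkgs <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_pkgs(wanted)
to_install

# Install only what is missing, and only in an interactive session
# (for example from the RStudio console), never on every run of a script
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```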
From time to time, you’ll want to update your installed packages. The need usually arises
when something doesn’t work, perhaps because a change in one package has broken another
package that depends on it. Fortunately, in the R world, especially at the main
repository at CRAN, there’s a lot of effort put into making sure packages work together, so
usually there are no surprises if you’re using the most current versions. There can
be exceptions, however: occasionally new package versions will create problems with other
packages due to inter-package dependencies and the introduction of functions with names
that duplicate those in other packages. The packages installed for this book were current as of
R version 4.2.1, so later package versions may occasionally introduce errors.
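When tracking down a version problem, it helps to know exactly which versions are in use. This sketch relies only on functions that ship with R; the stats package stands in for whichever package is of interest, and the version threshold is an arbitrary example.

```r
# The version of R in use
r_version <- R.version.string
r_version

# The installed version of a single package; "stats" ships with R,
# so it stands in here for any package of interest
pkg_version <- packageVersion("stats")
pkg_version

# Compare against a minimum version (the threshold is an arbitrary example)
pkg_version >= "3.0.0"

# Updating contacts CRAN, so it is left commented out and is best run from the console:
# update.packages(ask = FALSE)
```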
Once a package like dplyr is installed, you can access all of its functions and data by adding
a library call, like …
library(dplyr)
… which you will want to include in your code, or to provide access to multiple libraries in
the tidyverse, you can use library(tidyverse). Alternatively, if you’re only using maybe one
function out of an installed package, you can call that function with the :: separator, like
dplyr::select(). This method has another advantage in avoiding problems with duplicate
names – and for instance we’ll generally call dplyr::select() this way.
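The duplicate-name problem can be seen with base R alone. In this sketch, a user-defined filter() (an assumption made purely for illustration) masks the filter() function in the stats package, and the :: prefix still reaches the original.

```r
# A user-defined function that happens to share its name with stats::filter()
filter <- function(x) x[x > 2]

x <- c(1, 2, 3, 4, 5)

# Calls our own definition: keeps values greater than 2
kept <- filter(x)
kept

# The :: prefix reaches the version in the stats package regardless of masking;
# here it computes a centered 3-point moving average (the two ends are NA)
ma <- stats::filter(x, rep(1/3, 3))
as.numeric(ma)
```

The same situation arises when two attached packages export functions with the same name, which is why an explicit prefix is a safe habit for commonly duplicated names.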
1.5.1 Data
We’ll be using data from various sources, including data packages on CRAN, which you install
the same way as the code packages above, so use install.packages("palmerpenguins").
We’ve also created a repository on GitHub that includes data we’ve developed in the Insti-
tute for Geographic Information Science (iGISc) at SFSU, and you’ll need to install that
package a slightly different way.
GitHub packages require a bit more work on the user’s part since we need to first install
remotes (see footnote 1), then use that to install the GitHub data package:
install.packages("remotes")
remotes::install_github("iGISc/igisci")
Then you can access it just like other built-in data by including:
library(igisci)
To see what’s in it, list the various datasets with:
data(package="igisci")
For instance, Figure 1.2 is a map of California counties using the CA_counties sf feature
data. We’ll be looking at the sf (Simple Features) package later in the Spatial section of the
book, but when you see library(sf), that is a sign you need to have installed another
package, with install.packages("sf").
The package datasets can be used directly as sf data or data frames. And similarly to
functions, you can access the (previously installed) data set by prefacing with igisci:: this
way, without having to load the library. This might be useful in a one-off operation:
mean(igisci::sierraFeb$LATITUDE)
## [1] 38.3192
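The same two steps, listing a package's data sets and using one of them without attaching the package, can be tried with the datasets package that ships with R. This is a sketch only: datasets and its airquality data stand in for igisci, which must be installed separately.

```r
# List the data sets documented in a package; "datasets" stands in for igisci here
listing <- data(package = "datasets")
head(listing$results[, c("Item", "Title")])

# Use one data set without attaching the package, via the :: prefix
mean_temp <- mean(datasets::airquality$Temp)
mean_temp
```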
Raw data such as .csv files can also be read from the extdata folder that is installed on
your computer when you install the package, using code such as:
1 Note: you can also use devtools instead of remotes if you have that installed. They do the same thing;
remotes is a subset of devtools. If you see a message about Rtools, you can ignore it since that is only needed
for building packages from source code such as C++.
[Figure 1.2: map of California counties from the CA_counties sf feature data; latitude axis from 34°N to 42°N]
We’ll find that wrapping most of the above arcana in a function will help. We’ll look
at functions later, but here’s a function that we’ll use a lot for setting up reading data from
the extdata folder:
ex <- function(dta){system.file("extdata",dta,package="igisci")}
This ex() function is needed so often that it’s installed in the igisci package, so if you
have library(igisci) in effect, you can just use it like this:
But how do we see what’s in the extdata folder? We can’t use the data() function, so we
would have to dig for the folder where the igisci package gets installed, which is buried
pretty deeply in your user profile. So I wrote another function, exfiles(), that creates a
data frame showing all of the files and the paths to use. In RStudio you could access it
with View(exfiles()) or we could use a datatable (you’ll need to have installed “DT”).
You can pass the path returned by ex() to any function that needs it to read data,
like read.csv(ex('CA/CA_ClimateNormals.csv')), or just enter that ex() call in the console,
like ex('CA/CA_ClimateNormals.csv'), to display where on your computer the installed data
reside.
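To see how this works underneath, the sketch below uses system.file() directly. Both ex_path() and list_extdata() are hypothetical helpers written for this illustration; they are not the igisci functions, and the columns returned by list_extdata() are an assumption rather than what exfiles() actually produces.

```r
# Hypothetical generalization of ex(): path to a file in any package's extdata folder
ex_path <- function(dta, pkg) {
  system.file("extdata", dta, package = pkg)
}

# Hypothetical helper in the spirit of exfiles(): list files under extdata with their paths
list_extdata <- function(pkg) {
  dir <- system.file("extdata", package = pkg)
  if (!nzchar(dir)) {
    # system.file() returns "" when nothing is found
    return(data.frame(file = character(0), path = character(0)))
  }
  files <- list.files(dir, recursive = TRUE)
  data.frame(file = files, path = file.path(dir, files))
}

# system.file() finds files that exist in an installed package...
system.file("DESCRIPTION", package = "stats")

# ...and returns an empty string for a file that is not there
ex_path("no_such_file.csv", "stats")

# With igisci installed, these would mirror ex() and exfiles():
# read.csv(ex_path("CA/CA_ClimateNormals.csv", "igisci"))
# list_extdata("igisci")
```

Checking for the empty string before reading is a cheap way to catch a mistyped file name early.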
1.6 Acknowledgements
This book was immensely aided by extensive testing by students in San Francisco State’s
GEOG 604/704 Environmental Data Science class, including specific methodological
contributions from some of the students and a data wrangling exercise in Chapter 5 contributed
by one student from the first offering (Josh von Nonn). Thanks to Andrew Oliphant, Chair of the
Department of Geography and Environment, for supporting the class (as long as I included
time series) and then coming through with some great data sets from eddy covariance flux
towers as well as guest lectures. Many thanks to Adam Davis, California Energy Commission,
for suggestions on R spatial methods and package development, among other things
in the R world. Thanks to Anna Studwell, recent Associate Director of the iGISc, for ideas
on statistical modeling of birds and marine environments, and for the nice watercolor for the
front cover. And a lot of thanks go to Nancy Wilkinson, who put up with my obsessing
over R coding puzzles at all hours and pretended to be impressed with what you can do with
R Markdown.
Cover art “Dandelion fluff – Ephemeral stalk sheds seeds to the universe” by Anna Studwell.