DataVis Cheat Sheet
Hello
If you’re interested in data visualization, then working through this cheat sheet is a good place to start.
You’ll find examples and code that you can practice at home. The data that I use is available to you on your
computer right now. If you want to see a comprehensive list of practice data sets on your computer, simply
type data() into RStudio. As you install packages, the list of datasets available to you will increase.
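As a small sketch, data() can also be pointed at a single package. The example below lists what ships with the built-in datasets package; the variable name available is only illustrative.

```r
# List the practice datasets that ship with one package (here: base R's "datasets")
available <- data(package = "datasets")$results

# Each row describes one dataset: "Item" is its name, "Title" a short description
head(available[, c("Item", "Title")])
```

Swapping "datasets" for another installed package name (for example "gapminder") should list that package's datasets instead.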
The examples that I provide will 1) walk you through the basics of using ggplot to create a data visualization,
2) help you understand which plots to use, given different combinations of data that you might want to look
at, and 3) provide some code for beautiful examples of data visualization that you might want to use in your
own work.
This sheet won’t cover everything. For a more comprehensive overview of data visualization using R, please
visit www.learnmore365.com
Load packages
If you haven’t ever installed these packages, then you need to do so. You only ever have to install a package
on your computer once, using the function install.packages("package_name").
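One way to avoid reinstalling packages you already have is to check first. This is a minimal sketch, assuming the package list used in this cheat sheet; the names pkgs, is_installed and to_install are illustrative.

```r
# Packages used in this cheat sheet
pkgs <- c("tidyverse", "ggridges", "patchwork",
          "viridis", "hrbrthemes", "gapminder")

# TRUE for each package that is already installed
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!is_installed]

# Install only in an interactive session, and only what is missing
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```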
Here are all of the packages used to develop the graphics in this cheat sheet. I’ve also included a line of code
that sets the theme for all plots to “black and white”.
library(tidyverse)
library(ggridges)
library(patchwork)
library(viridis)
library(hrbrthemes)
library(gapminder)
theme_set(theme_bw())
Using ggplot2
To create graphics using ggplot2, you need to understand “the grammar of graphics”. All plots have three
principal components:
1. data
2. mapping
3. geometry
Data
The data is simply the dataset that you are using (nothing complicated about that).
Mapping
Mapping refers to how each variable that you are going to use in your visualization relates to a particular
aesthetic. This could be color, shape, size, x-axis, y-axis and others.
Geometry
Geometry refers to the type of plot that will be used to represent the dataset. For example, you might want
a boxplot, a histogram, a scatter plot, etc.
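The three components can be seen in the smallest possible plot. This sketch assumes the mpg dataset that ships with ggplot2 (it is not one of the datasets used elsewhere in this sheet) and labels each component in a comment.

```r
library(ggplot2)

p <- ggplot(data = mpg,                            # 1. data
            mapping = aes(x = displ, y = hwy)) +   # 2. mapping
  geom_point()                                     # 3. geometry

p
```

Changing only the last line (for example to geom_smooth()) changes the geometry while the data and mapping stay the same.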
In the example below, I use the gapminder dataset (which is available to you once you’ve installed the
gapminder package) to represent 5 variables at the same time: four numeric variables (GDP per capita,
life expectancy, year and population size) and one categorical variable (continent).
Each of the numeric variables is mapped to a specific aesthetic and the categorical variable is used to
disaggregate the data into facets.
gapminder %>%
  filter(continent %in% c("Africa", "Europe")) %>%
  filter(gdpPercap < 30000) %>%
  ggplot(aes(x = gdpPercap,
             y = lifeExp,
             size = pop,
             color = year)) +
  geom_point() +
  facet_wrap(~continent) +
  labs(title = "Life expectancy explained by GDP per capita",
       x = "GDP per capita",
       y = "Life expectancy")
[Figure: Life expectancy explained by GDP per capita. Facets: Africa, Europe. x-axis: GDP per capita; y-axis: Life expectancy; point size: pop; point colour: year.]
starwars %>%
  select(name, height, mass, gender, hair_color) %>%
  head()
## # A tibble: 6 x 5
##   name           height  mass gender    hair_color
##   <chr>           <int> <dbl> <chr>     <chr>
## 1 Luke Skywalker    172    77 masculine blond
## 2 C-3PO             167    75 masculine <NA>
## 3 R2-D2              96    32 masculine <NA>
## 4 Darth Vader       202   136 masculine none
## 5 Leia Organa       150    49 feminine  brown
## 6 Owen Lars         178   120 masculine brown, grey
Single numeric variable
Let’s start with a single numeric variable (height). In this figure we’ve created a histogram, a density plot,
a boxplot and a violin plot with the data. Here are the code and the outputs.
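A minimal sketch of how these four plots could be built from the starwars heights. The bin count, the panel layout and the object names (heights, p_hist, p_box, p_dens, p_violin, single_numeric) are illustrative assumptions, not necessarily the settings behind the figure.

```r
library(tidyverse)
library(patchwork)

# Drop characters with no recorded height
heights <- starwars %>% drop_na(height)

p_hist <- ggplot(heights, aes(height)) +
  geom_histogram(bins = 10) +
  labs(title = "Histogram", x = "Height", y = "Count")

p_box <- ggplot(heights, aes(height)) +
  geom_boxplot() +
  labs(title = "Boxplot", x = "Height")

p_dens <- ggplot(heights, aes(height)) +
  geom_density() +
  labs(title = "Density plot", x = "Height", y = "Density")

# A violin needs a grouping axis, so use a single empty group and flip the axes
p_violin <- ggplot(heights, aes(x = "", y = height)) +
  geom_violin() +
  coord_flip() +
  labs(title = "Violin plot", x = NULL, y = "Height")

# patchwork: | places plots side by side, / stacks them
single_numeric <- (p_hist | p_box) / (p_dens | p_violin) +
  plot_annotation(title = "Single numeric variable")

single_numeric
```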
[Figure: Single numeric variable. Four panels: Histogram (x-axis: Height; y-axis: Count), Boxplot (x-axis: Height), density plot (x-axis: Height) and violin plot (x-axis: Height).]
p5a <- starwars %>%
  drop_na(eye_color, gender) %>%
  filter(eye_color %in% c("black", "brown", "blue", "yellow")) %>%
  ggplot(aes(eye_color, fill = gender)) +
  geom_bar(stat = "count", alpha = 0.5,
           position = "dodge",
           show.legend = FALSE) +
  labs(title = "Grouped barplot",
       x = "Eye colour",
       y = "Count")
[Figure: One or more categorical variable. Four panels, including Barplot and Stacked barplot. x-axis: Eye colour (black, blue, brown, yellow); y-axis: Count.]
p15 <- starwars %>%
  drop_na(hair_color, gender) %>%
  filter(hair_color %in% c("black", "brown")) %>%
  ggplot(aes(height, fill = gender)) +
  geom_density(alpha = 0.3) +
  facet_wrap(~hair_color) +
  labs(title = "Density plot of a numeric variable",
       subtitle = "disaggregated by two categorical variables",
       x = "Height",
       y = "Probability") +
  theme(legend.position = "none")
# p14a, p14b and p16 are the other three panels (their code is not shown here)
((p14b / p14a) | (p15 / p16)) +
  plot_annotation(title = "One numeric and two categorical variables",
                  theme = theme(plot.title = element_text(size = 18,
                                                          colour = "blue"))) +
  theme(text = element_text('mono'))
[Figure: One numeric and two categorical variables. Four panels: Density plot of a numeric variable disaggregated by one categorical variable; Density plot of a numeric variable disaggregated by two categorical variables (facets: black, brown); Boxplot of a numeric variable disaggregated by one categorical variable; Boxplot of a numeric variable disaggregated by two categorical variables (facets: black, brown). x-axis: Height; y-axis of the density plots: Probability.]
       subtitle = "disaggregated by colour",
       x = "Height",
       y = "Mass")
(p6 | p7 / p7a) +
  plot_annotation(title = "Two numeric and one categorical variable",
                  theme = theme(plot.title = element_text(size = 18,
                                                          colour = "blue"))) +
  theme(text = element_text('mono'))
[Figure: Two numeric and one categorical variable. Panels include a scatter plot disaggregated by colour and facets (facets and legend: feminine, masculine). y-axis: Mass.]
Lollipop graphic
Instead of plotting a simple boxplot for each category, I’ve plotted the individual data points and the average
for each category, then joined each category average to a line that represents the average for the entire dataset.
chickwts %>%
  group_by(feed) %>%
  mutate(mean_by_feed = mean(weight)) %>%   # average weight per feed group
  ungroup() %>%
  mutate(feed = fct_reorder(feed, mean_by_feed)) %>%
  ggplot(aes(feed, weight, colour = feed)) +
  coord_flip() +
  geom_jitter(show.legend = FALSE,
              size = 4,
              alpha = 0.2,
              width = 0.05) +
  stat_summary(fun = mean, geom = "point", size = 8, show.legend = FALSE) +
  # after ungroup(), mean(weight) is the average of the entire dataset
  # (linewidth needs ggplot2 3.4.0 or later; older versions use size)
  geom_hline(aes(yintercept = mean(weight)),
             colour = "gray70",
             linewidth = 0.9) +
  geom_segment(aes(x = feed, xend = feed,
                   y = mean(weight), yend = mean_by_feed),
               linewidth = 2, show.legend = FALSE) +
  labs(title = "Weight of chickens by feed group",
       x = "Feed",
       y = "Weight of chickens") +
  # set the complete theme first so it does not reset legend.position
  theme_bw() +
  theme(legend.position = "none")
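To check the numbers that the large points and the grey line represent, the same averages can be tabulated directly. This is a small sketch; feed_means and overall_mean are illustrative names.

```r
library(tidyverse)

# Average weight per feed group, smallest first
feed_means <- chickwts %>%
  group_by(feed) %>%
  summarise(mean_by_feed = mean(weight)) %>%
  arrange(mean_by_feed)

# Average weight across the entire dataset (the grey reference line)
overall_mean <- mean(chickwts$weight)

feed_means
overall_mean
```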
[Figure: Weight of chickens by feed group. Feed groups from top to bottom: sunflower, casein, meatmeal, soybean, linseed, horsebean. Vertical axis: Feed.]
Using ridges
In this visualization I’ve shown the distribution of temperatures that occurred within each month (as a density
plot) and then compared the months by plotting their density plots on the same canvas.
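A minimal sketch of a ridgeline plot like this one, assuming the lincoln_weather dataset that ships with the ggridges package. The scale, rel_min_height and alpha values are illustrative choices, and p_ridges is an illustrative name.

```r
library(ggplot2)
library(ggridges)

# One density ridge per month; the column name needs backticks
# because it contains spaces and brackets
p_ridges <- ggplot(lincoln_weather,
                   aes(x = `Mean Temperature [F]`, y = Month)) +
  geom_density_ridges(scale = 3,              # how much ridges overlap
                      rel_min_height = 0.01,  # trim the long thin tails
                      alpha = 0.7) +
  labs(title = "Temperatures in Lincoln NE in 2016",
       x = "Mean Temperature [F]",
       y = "Month")

p_ridges
```

A larger scale makes neighbouring ridges overlap more, which is what gives the plot its stacked look.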
[Figure: Temperatures in Lincoln NE in 2016. Ridgeline density plots, one per month from January to December. x-axis: Mean Temperature [F]; y-axis: Month.]