
Week10 Slides Updated

The document introduces ggplot2, a data visualization package in R that implements the Grammar of Graphics, allowing users to create high-quality graphs. It covers the basics of creating scatter plots, mapping aesthetics, and addressing issues like over-plotting, while also providing guidance on adding titles and labels. Additionally, it discusses the use of smooth lines to depict trends and relationships in data.


DSA2101

Essential Data Analytics Tools: Data Visualization

Yuting Huang

AY24/25

Week 10: Introduction to ggplot2

1 / 80
Programming and data visualization
Two of the most popular programming languages for data science
are Python and R.
▶ Developmental milestones over the years.
▶ Easy-to-use functions for data visualization in an efficient and
reproducible manner.

Source: John F. Ouyang.

2 / 80
Introduction

▶ Just as the grammar of a language helps us construct meaningful
sentences out of words, the Grammar of Graphics helps us
construct graphs out of different visual elements.
▶ ggplot2 implements the Grammar of Graphics.
▶ Produces very high quality graphs.
▶ Bear in mind though, this is not the only method for visualization
with R.

3 / 80
We start by loading the required package. Note that ggplot2 is
included in tidyverse.
library(tidyverse)

Artwork by Allison Horst

4 / 80
The mpg data set

Let’s make a first plot using this package.


▶ The mpg data frame in ggplot2 contains characteristics on 38 car
models in 1999 and 2008.
▶ For now, let’s work with just two variables:
▶ displ, car’s engine size in litres.
▶ hwy, fuel efficiency of the car on a highway in miles per gallon.

data(mpg)
head(mpg, 2)

## # A tibble: 2 x 11
## manufacturer model displ year cyl trans drv cty hwy f
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p

5 / 80
The mpg data set

The ggplot() call renders a blank slate plot.


▶ No layer was specified with a geom function, so nothing is drawn
except for a grey background.
ggplot()

6 / 80
The mpg data set

A scatterplot, with displ on the x-axis and hwy on the y-axis.


ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))

[Figure: scatter plot of hwy (y-axis) against displ (x-axis)]

7 / 80
Breaking down the syntax

The function ggplot() creates a coordinate system that we can add
layers to.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))

▶ data = mpg specifies the data set.


▶ geom_point() adds a layer of points to the plot, thus creating a
scatterplot.
▶ displ is mapped to the x-axis and hwy to the y-axis.

8 / 80
ggplot() template

Every ggplot2 plot has three key components:


ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

1. data
2. At least one layer which describes how to render each observation.
Layers are usually created with a geom function.
3. A set of aesthetic mappings between variables in the data and
visual elements in the geom function.
In ggplot2, we create graphs by adding (+) layers.
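
The template can be checked by building a plot in two steps. This is a minimal sketch, assuming only that ggplot2 is installed; the object names p and p_points are illustrative and do not come from the slides.

```r
library(ggplot2)

# Start from the data and coordinate system only; no layer exists yet.
p <- ggplot(data = mpg)

# Adding a geom with "+" appends one layer to the plot object.
p_points <- p + geom_point(mapping = aes(x = displ, y = hwy))

length(p$layers)         # number of layers before adding the geom
length(p_points$layers)  # number of layers after adding the geom
```

Printing p shows the blank grey panel, while printing p_points shows the scatter plot.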

9 / 80
Source: Adapted from Tanya Shapiro.

10 / 80
Choosing the right plot

There are many geom functions available in ggplot2.


The choice of which one to use largely depends on two questions:
▶ What are you trying to communicate?
▶ What type of variable(s) do you want to show?

Source: Adapted from John F. Ouyang.

11 / 80
Outline

1. Aesthetics and geometrical objects


▶ Scatterplot
▶ Smoother line
▶ Histogram and density plot
▶ Line plot
▶ Text annotations
▶ Bar plot
▶ Maps

2. Miscellaneous tasks
▶ Themes
▶ Layouts
▶ Common layers

12 / 80
Geometrical objects

A geom refers to the geometrical object used to represent data.


In natural language, we typically use the geom to refer to a particular
type of graph:
▶ Scatter plot geom_point()
▶ Smoother line geom_smooth()
▶ Histogram geom_histogram()
▶ Density plot geom_density()
▶ ...

13 / 80
Aesthetic mappings

An aesthetic describes how properties of the data connect to visual
properties of the graph, such as
▶ The position of a point.
▶ The size, shape, or color of the points.
▶ The type of line (solid, dashed, etc), color and thickness of the line.
Source: Jacques Bertin (1967).

14 / 80
Scatterplot: geom_point()

geom_point() is used to create scatter plots.


The aesthetics associated with it are
▶ x (required)
▶ y (required)
▶ alpha
▶ color
▶ fill
▶ group
▶ shape
▶ size
The defining characteristic of a point is its position, hence the x and y
aesthetics are required. Others are optional.

15 / 80
How to map an aesthetic
To map an aesthetic to a variable, associate the aesthetic to the name
of the variable inside aes()
▶ ggplot2 will automatically assign a unique value of the aesthetic
to each unique value of the variable.
▶ A scatter plot we made under ggplot2:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))

[Figure: scatter plot of hwy (y-axis) against displ (x-axis)]

16 / 80
A basic scatterplot

The three calls below are equivalent:


# 1
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))

# 2
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy))

# 3
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()

Pay attention to the structure of # 3:


▶ Data and aesthetic mappings are supplied in ggplot().
▶ Layer(s) are added on with +.

17 / 80
Global aesthetic mapping

# 3
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()

This is an important pattern.


▶ As we learn more about ggplot2, we will construct increasingly
sophisticated plots with multiple layers.
▶ If many layers map the same variables to x and y, repeating these
aesthetics in every layer can be tedious.
▶ We can simplify this by a global aesthetic mapping – supply
the aesthetics in ggplot(), instead of individual geom functions.
This way, all functions that are added as layers will default to the
global aesthetic mappings.
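
As a sketch of the global mapping at work, both layers below receive x and y from ggplot() without repeating them. geom_smooth() is covered later in these slides; the object name p is illustrative.

```r
library(ggplot2)

# x and y are declared once, in ggplot(), and inherited by every layer.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

names(p$mapping)   # the aesthetics stored at the plot level
length(p$layers)   # two layers share that mapping
```

A layer can still add or override aesthetics in its own aes() call when it needs something different.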

18 / 80
Mapping color
We can further visualize the class type of a car.
▶ It classifies cars into groups such as compact, midsize, and SUV.
▶ In the following, we map class to the color aesthetic.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class))

[Figure: scatter plot of hwy against displ, with points colored by class (2seater, compact, midsize, minivan, pickup, subcompact, suv)]

19 / 80
Braking distance and speed
In Week 2, we created a scatter plot of the relationship between
braking distance and speed using the base R plot() function.
data(cars)
new_red <- rgb(1, 0, 0, alpha = 0.4)
plot(cars, col = new_red, pch = 19, cex = 1.8, bty = "n",
xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
main = "Relationship between Speed and Braking")
[Figure: base R scatter plot titled "Relationship between Speed and Braking"; x-axis: Speed (mph), y-axis: Stopping distance (ft)]

20 / 80
We can recreate the plot with ggplot2:
▶ Adjust the point size with size.
▶ Set the point colors to be red using the color aesthetic.
▶ Add transparency to the points with alpha.
ggplot(cars, aes(x = speed, y = dist)) +
geom_point(size = 4, color = "red", alpha = 0.5)
[Figure: ggplot2 scatter plot of dist against speed, with large semi-transparent red points]

21 / 80
Issues with over-plotting

Over-plotting occurs when multiple data points overlap with each
other.
▶ Identical, or very similar x and y values.
▶ This can make it difficult to distinguish the number of points or
identify patterns in the data.

Solutions:
▶ Adjust the opacity (alpha) of the point.
▶ Add a small random variation to the location of each point
(jitter). This helps to separate the overlapping points.

22 / 80
▶ Opacity: alpha
▶ Jittering: position = "jitter"
ggplot(cars, aes(x = speed, y = dist)) +
geom_point(size = 4, color = "red", alpha = 0.5, position = "jitter")
[Figure: the scatter plot of dist against speed, with semi-transparent, jittered red points]

23 / 80
Over-plotting: Comparison

[Figure: three scatter plots side by side: the original scatter plot, the plot with opacity, and the plot with opacity and jitter]
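
position = "jitter" uses a default amount of random noise. As a hedged sketch, the amount can also be set explicitly with position_jitter(); the width, height, and seed values below are arbitrary choices for illustration, not values from the slides.

```r
library(ggplot2)

# width and height bound the random shift in data units;
# seed makes the random shifts reproducible across renderings.
p_jit <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5,
             position = position_jitter(width = 0.1, height = 0.5, seed = 2101))

# layer_data() returns the jittered positions that would be drawn.
first_draw  <- layer_data(p_jit)
second_draw <- layer_data(p_jit)
identical(first_draw$x, second_draw$x)
```

Fixing the seed is useful when a figure has to look the same every time a report is rebuilt.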

24 / 80
1. Anything wrong with this plot?
ggplot(mpg, aes(x = displ, y = hwy, color = "steelblue")) +
geom_point(size = 4, alpha = 0.7, position = "jitter")

[Figure: jittered scatter plot of hwy against displ, with a legend titled "colour" containing the single entry "steelblue"]

25 / 80
▶ The following code produces the expected result.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(color = "steelblue",
size = 4, alpha = 0.7, position = "jitter")

[Figure: jittered scatter plot of hwy against displ, with steel-blue points and no legend]
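
The difference between mapping and setting a color can be inspected directly. This is a sketch that assumes ggplot2's layer_data() helper is used to look at the values actually drawn; the object names are illustrative.

```r
library(ggplot2)

# Inside aes(): "steelblue" is treated as a constant data value, so ggplot2
# builds a one-level discrete color scale and a legend for it.
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy, color = "steelblue")) +
  geom_point()

# Outside aes(): the color is set directly as a fixed property of the layer.
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "steelblue")

unique(layer_data(p_mapped)$colour)  # a color picked by the default scale
unique(layer_data(p_set)$colour)     # the color that was requested
```

The rule of thumb: variables go inside aes(), fixed values go outside it.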

26 / 80
2. The color aesthetic can be mapped to logical expressions.
▶ In this example, the class == "suv" takes values of TRUE and
FALSE.
▶ Each value will be mapped to a unique color.

mpg %>%
filter(manufacturer == "chevrolet") %>%
ggplot(aes(x = displ, y = hwy, color = class == "suv")) +
geom_point(size = 4, alpha = 0.7, position = "jitter")

[Figure: jittered scatter plot of hwy against displ for Chevrolet cars, colored by class == "suv" (FALSE, TRUE)]

27 / 80
Title and labels

▶ To include title and labels in ggplot2, we specify a labs() layer:


▶ title
▶ subtitle
▶ x
▶ y
▶ caption

28 / 80
Title and labels
mpg %>%
filter(manufacturer == "chevrolet") %>%
ggplot(aes(x = displ, y = hwy, color = class == "suv")) +
geom_point(size = 4, alpha = 0.7, position = "jitter") +
labs(title = "Fuel efficiency and engine size",
subtitle = "... for Chevrolet",
x = "Engine size (litres)",
y = "Highway fuel efficiency (mpg)",
caption = "Source: Environmental Protection Agency")
[Figure: the Chevrolet scatter plot with the title, subtitle, axis labels, and caption added]

29 / 80
Smooth line: geom_smooth()

Many times, we can improve scatter plots with smooth line(s).


▶ Allows the eye to see the patterns in the data.
▶ Depict the trends in time series data.
▶ Detect (nonlinear) relationships between variables.

There are several types of smoothers, using different criteria to fit the
lines of best fit. We shall study just a couple of them in brief detail:
▶ Linear regression models.
▶ Loess smoother.

30 / 80
Aesthetics

Some aesthetics that this geom understands are


▶ x (required)
▶ y (required)
▶ alpha
▶ color
▶ fill
▶ linetype

31 / 80
▶ Let us add a smooth linear regression model (lm) line to the data.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(position = "jitter") +
geom_smooth(method = "lm", formula = y ~ x) +
labs(x = "Engine Displacement (l)", y = "Highway Miles per Gallon")

[Figure: jittered scatter plot with a fitted straight line and confidence band; x-axis: Engine Displacement (l), y-axis: Highway Miles per Gallon]

32 / 80
Linear regression smoother

method = "lm" invokes a simple linear regression smoother.


▶ The blue line is the line of best fit, with gray regions representing
95% confidence intervals.
▶ The straight line does not appear suitable for this data set, which
shows some nonlinearity.

To allow for nonlinearity, we have a couple of options:


1. A higher-order polynomial term in the linear regression, or
2. A loess smoother – a nonparametric method that uses locally
weighted regression to fit a smooth curve.

33 / 80
mpg data, higher order polynomials

▶ Fitting a quadratic function:


ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(position = "jitter") +
geom_smooth(method = "lm", formula = y ~ poly(x, 2)) +
labs(x = "Engine Displacement (l)", y = "Highway Miles per Gallon")

[Figure: jittered scatter plot with a fitted quadratic curve and confidence band; x-axis: Engine Displacement (l), y-axis: Highway Miles per Gallon]

34 / 80
method = "loess" fits a line to a scatter plot that helps us see the
overall trend.
▶ A locally estimated regression fit.
▶ For every $x_i$ in the data, loess takes the fraction of the data
closest to $x_i$ (the span, between 0 and 1) and fits a local
regression using only those points.
▶ The fitted value at $x_i$ becomes the estimate $\hat{f}(x_i)$.
▶ It then connects all the estimates $\hat{f}(x_i)$ to form a smooth curve.

Source: Rafael A. Irizarry

35 / 80
mpg data, loess smoother

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = "jitter") +
  geom_smooth(method = "loess", formula = y ~ x) +
  labs(x = "Engine Displacement (l)", y = "Highway Miles per Gallon")

[Figure: jittered scatter plot with a loess curve and confidence band; x-axis: Engine Displacement (l), y-axis: Highway Miles per Gallon]

36 / 80
loess smoother with different span

Smoothness is a relative term. Different spans give us different
estimates.
[Figure: three panels of the loess fit: default (optimal) span, span = 0.2, and span = 0.9]
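
The slides do not show the code behind this comparison. A plausible sketch, assuming the span argument of geom_smooth() is what was varied, is given below; the object names are illustrative.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = "jitter")

# A small span uses few neighbouring points per local fit: a wigglier curve.
p_wiggly <- base + geom_smooth(method = "loess", formula = y ~ x, span = 0.2)

# A large span uses most of the data for each local fit: a smoother curve.
p_smooth <- base + geom_smooth(method = "loess", formula = y ~ x, span = 0.9)

# The span is stored with the smoothing layer's statistic.
p_wiggly$layers[[2]]$stat_params$span
p_smooth$layers[[2]]$stat_params$span
```

Printing p_wiggly and p_smooth side by side shows how strongly the choice of span shapes the fitted curve.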

37 / 80
mpg data, loess smoother by group

The new curve reflects the presence of several cars that have large
engines and efficient highway mileage.
Now suppose that we want to study how this relationship varies with
the drive type:
▶ Front-wheel drive
▶ Rear-wheel drive, and
▶ Four-wheel drive

38 / 80
Loess smoother by group

ggplot(mpg, aes(x = displ, y = hwy, group = drv)) +
  geom_point(position = "jitter") +
  geom_smooth(formula = y ~ x, method = "loess") +
  labs(x = "Engine Displacement (l)", y = "Highway Miles per Gallon")

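[Figure: jittered scatter plot with a separate loess curve for each drive type, all drawn in the same color; x-axis: Engine Displacement (l), y-axis: Highway Miles per Gallon]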

39 / 80
Loess smoother by group with colors
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
geom_point(position = "jitter") +
geom_smooth(formula = y ~ x, method = "loess")+
labs(x = "Engine Displacement (l)", y = "Highway Miles per Gallon")+
scale_color_discrete(name = "Drive type",
labels = c("4-wheel", "Front-wheel", "Rear-wheel")) +
theme(legend.position = "top")
[Figure: jittered scatter plot with loess curves colored by drive type (4-wheel, Front-wheel, Rear-wheel), legend on top; x-axis: Engine Displacement (l), y-axis: Highway Miles per Gallon]

40 / 80
Other considerations

The loess smoother is computationally intensive.
When the number of data points is large (≥ 1000 observations),
geom_smooth() uses a generalized additive model by default instead.
▶ Other smoothers can be used to model binary data (e.g., logistic
regression).
▶ We won’t go further into these. Once you take a class on them, you
will be able to use them with confidence.

41 / 80
Histogram: geom_histogram()

A histogram allows us to visualize the distribution of a single


continuous variable.
▶ The x-axis will first be divided into bins. Then the number of
observations in each bin will be counted.

We will cover two related geoms:


▶ geom_histogram() displays the counts in each bin with bars.
▶ geom_density() computes and draws a kernel density estimate. It
is a smoothed version of the histogram.
▶ Allows comparison between distribution of a variable conditioned
on a categorical one, e.g., income distribution for male and female.

42 / 80
Aesthetics

Some of the aesthetics for this geom:


▶ x (required)
▶ alpha
▶ color
▶ fill

Apart from the aesthetics, we also need to consider the following issues:
▶ The width of the bins.
▶ The number of bins.
▶ The location of the bins.
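
The three bin choices correspond to arguments of geom_histogram(): binwidth (width), bins (number), and boundary or center (location). A minimal sketch follows; hwy from the mpg data is used only as a convenient stand-in variable, and the values 5 and 0 are arbitrary.

```r
library(ggplot2)

p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 5,   # width of each bin (use bins = k for a count)
                 boundary = 0)   # bin edges fall on multiples of the width

# layer_data() exposes the computed bins: edges and counts.
bins <- layer_data(p_hist)
head(bins[, c("xmin", "xmax", "count")])
```

Inspecting the computed bins this way is a quick check that the edges fall where intended.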

43 / 80
Distribution of earnings
Let us revisit the histogram of earnings using base R.
heights <- read.csv("../data/heights.csv",
header = TRUE, stringsAsFactors = TRUE)
hist(heights$earn/1000, freq = FALSE,
main = "Histogram of Earnings",
xlab = "Earnings Per Annum (in Thousands)",
col = "maroon", border = "white",
breaks = seq(0, 200, by = 10))
[Figure: base R histogram titled "Histogram of Earnings"; x-axis: Earnings Per Annum (in Thousands), y-axis: Density]

44 / 80
Distribution of earnings

Here is the first attempt to recreate it in ggplot2.


ggplot(heights, aes(x = earn/1000)) +
geom_histogram()

[Figure: default ggplot2 histogram; x-axis: earn/1000, y-axis: count]

45 / 80
Distribution of earnings (revised)
▶ Adjust the bin width, interior fill, titles and labels, . . .
▶ Adjust the boundary of the first bin so it reflects the lower limit of
the data.
ggplot(heights, aes(x = earn/1000)) +
geom_histogram(binwidth = 10, fill = "maroon", boundary = 0) +
labs(title = "Histogram of Earnings",
x = "Earnings Per Annum (in Thousands)", y = "Frequency")
[Figure: maroon histogram titled "Histogram of Earnings"; x-axis: Earnings Per Annum (in Thousands), y-axis: Frequency]

46 / 80
Distribution of earnings (revised)
In Week 3, we used the density for each bin instead of counts. This
makes the histogram closer in spirit to a probability density function.
▶ We can tell ggplot2 to use density instead of count with y =
after_stat(density).
ggplot(heights, aes(x = earn/1000, y = after_stat(density))) +
geom_histogram(binwidth = 10, fill = "maroon", boundary = 0) +
labs(title = "Histogram of Earnings",
x = "Earnings Per Annum (in Thousands)", y = "Density")
[Figure: maroon histogram titled "Histogram of Earnings"; x-axis: Earnings Per Annum (in Thousands), y-axis: Density]
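
On the density scale, the bar areas (height times width) sum to one, which is what makes the histogram comparable to a probability density function. A small sketch to confirm this, using hwy from the mpg data as a stand-in because heights.csv is not bundled with ggplot2:

```r
library(ggplot2)

p_dens <- ggplot(mpg, aes(x = hwy, y = after_stat(density))) +
  geom_histogram(binwidth = 5, boundary = 0)

bars <- layer_data(p_dens)

# Total area of the bars: sum of height * width over all bins.
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
total_area
```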

47 / 80
In Week 3, we found that there was a stark difference between males
and females in terms of income earned.
▶ We can present this information by mapping the variable sex to
the fill aesthetics.
ggplot(heights,
aes(x = earn/1000, y = after_stat(density), fill = sex)) +
geom_histogram(binwidth = 10, boundary = 0) +
labs(title = "Histogram of Earnings",
x = "Earnings Per Annum (in Thousands)", y = "Density")
[Figure: stacked histogram titled "Histogram of Earnings", filled by sex (female, male); x-axis: Earnings Per Annum (in Thousands), y-axis: Density]

48 / 80
▶ To create a side-by-side bar chart, use position = "dodge".
ggplot(heights,
aes(x = earn/1000, y = after_stat(density), fill = sex)) +
geom_histogram(binwidth = 10, boundary = 0, position = "dodge") +
labs(title = "Histogram of Earnings",
x = "Earnings Per Annum (in Thousands)", y = "Density")
[Figure: side-by-side (dodged) histogram titled "Histogram of Earnings", filled by sex (female, male); x-axis: Earnings Per Annum (in Thousands), y-axis: Density]

49 / 80
Earnings, smooth density

To compare distributions conditional on a categorical variable, we would
be better off using smooth density plots.
▶ The smooth density is the curve that passes through the tops of the
histogram bars when the bins are very, very small.
[Figure: three panels of density against earn/1000: binwidth = 10, binwidth = 5, and smooth density]

50 / 80
Earnings, smooth density
Compare densities using the geom_density() function:
ggplot(heights, aes(x = earn/1000, fill = sex)) +
geom_density(alpha = 0.2) +
labs(title = "Smooth Density Plots of Earnings",
x = "Earnings Per Annum (in Thousands)", y = "Density") +
scale_fill_discrete(name = "Gender", labels = c("Female", "Male"))
[Figure: overlapping density curves titled "Smooth Density Plots of Earnings", filled by Gender (Female, Male); x-axis: Earnings Per Annum (in Thousands), y-axis: Density]

51 / 80
Earnings, smooth density
Note that smoothness is a relative term. We can actually control it
through an option in the geom_density() function.
▶ The option that controls the smoothing bandwidth is bw.
▶ We should select a degree of smoothness that we can defend as
being representative of the underlying data.
[Figure: three panels of density against earn/1000: default (optimal), oversmoothing, and undersmoothing]
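
The code for these three panels is not shown in the slides. One way they could be produced, sketched here with hwy from the mpg data as a stand-in and with arbitrary bandwidth values:

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = hwy))

p_default <- base + geom_density()          # bandwidth chosen automatically
p_over    <- base + geom_density(bw = 5)    # large bandwidth: oversmoothing
p_under   <- base + geom_density(bw = 0.3)  # small bandwidth: undersmoothing

# The bandwidth is stored with the density layer's statistic.
p_over$layers[[1]]$stat_params$bw
p_under$layers[[1]]$stat_params$bw
```

bw is measured in the units of the x variable, so sensible values depend on the scale of the data.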

52 / 80
Line: geom_line()

The line geom connects observations in the order of the variable on the
x-axis (usually date and time).
▶ Suitable for plotting time-series data
▶ The aesthetics that the line geom uses are
▶ x (required)
▶ y (required)
▶ alpha
▶ color
▶ linetype or lty
▶ linewidth or lwd

53 / 80
Resale flat price trends

Let’s visualize the price trends in resale flats using the data set,
resales2024.csv from Week 7.
▶ First, let’s compute the median resale price across months in
selected flat types.

resale <- read.csv("../data/resales2024.csv", header = TRUE) %>%
  mutate(month = ymd(month)) %>%
  filter(flat_type %in% c("3 ROOM", "4 ROOM", "5 ROOM")) %>%
  group_by(flat_type, month) %>%
  summarize(med_resale_price = median(resale_price), .groups = "drop")
glimpse(resale)

## Rows: 132
## Columns: 3
## $ flat_type <chr> "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 R
## $ month <date> 2021-01-01, 2021-02-01, 2021-03-01, 2021-04
## $ med_resale_price <dbl> 320000, 325000, 318000, 307000, 323000, 3280

54 / 80
Resale flat price trends
ggplot(resale, aes(x = month, y = med_resale_price/1000,
color = flat_type)) +
geom_line(lwd = 1) +
labs(title = "Resale flat price trends, 2021 - 2024",
x = "Year", y = "Median resale price (thousands)",
color = "Flat type")
[Figure: line chart titled "Resale flat price trends, 2021 - 2024", colored by flat type (3 ROOM, 4 ROOM, 5 ROOM); x-axis: Year, y-axis: Median resale price (thousands)]

55 / 80
Resale flat price trends (revised)

The colors alone are not helpful enough.
They do not tell viewers directly which flat type each line is
associated with.
▶ Instead, we shall annotate the lines with texts, corresponding to
the name of the variable.
▶ . . . with geom_text() or geom_label().
▶ We shall also make space for it by adding vertical and horizontal
adjustments to nudge the texts/labels from the line geom.

56 / 80
Let’s prepare the data for geom_label() at the end of each line.
▶ Three required aesthetics: x, y, and label.
resale_text <- filter(resale, month == "2024-08-01")
resale_text

## # A tibble: 3 x 3
## flat_type month med_resale_price
## <chr> <date> <dbl>
## 1 3 ROOM 2024-08-01 420000
## 2 4 ROOM 2024-08-01 612500
## 3 5 ROOM 2024-08-01 673000
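
Instead of hard-coding the last date, the label positions can be derived from the data. The sketch below uses a small made-up data frame (toy, with invented prices) because the resale file is not available here; slice_max() from dplyr keeps the latest month within each flat type.

```r
library(ggplot2)
library(dplyr)

# Hypothetical data: two flat types observed over four months.
toy <- data.frame(
  month = rep(seq(as.Date("2024-01-01"), by = "month", length.out = 4), times = 2),
  flat_type = rep(c("3 ROOM", "4 ROOM"), each = 4),
  price = c(400, 405, 410, 420, 590, 600, 605, 612)
)

# Keep only the last observation of each line as the label position.
toy_text <- toy %>%
  group_by(flat_type) %>%
  slice_max(month, n = 1) %>%
  ungroup()

p_toy <- ggplot(toy, aes(x = month, y = price, color = flat_type)) +
  geom_line(show.legend = FALSE) +
  geom_label(data = toy_text, aes(label = flat_type), show.legend = FALSE)
```

This keeps the labels at the end of each line even after the data set is updated with newer months.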

57 / 80
▶ Notice that in the geom_label() layer, we override the global
data with resale_text and add a label mapping; x, y, and color are
still inherited from the global mapping.
ggplot(resale, aes(x = month, y = med_resale_price/1000,
color = flat_type)) +
geom_line(lwd = 1, show.legend = FALSE) +
geom_label(data = resale_text, aes(label = flat_type),
show.legend = FALSE, size = 2.7,
vjust = "top", hjust = "middle",
nudge_y = -10, nudge_x = 15) +
labs(title = "Resale flat price trends, 2021 - 2024",
x = "Year", y = "Median resale price (thousands)")
[Figure: line chart titled "Resale flat price trends, 2021 - 2024", with the labels 3 ROOM, 4 ROOM, and 5 ROOM placed at the end of each line and no legend; x-axis: Year, y-axis: Median resale price (thousands)]
58 / 80
Reference lines

To add a reference line, we can use one of the following:


▶ geom_vline() for vertical lines
▶ geom_hline() for horizontal lines
▶ geom_abline() for straight lines defined by a slope and an intercept
▶ ggplot2 uses ab in the name to remind us that we are supplying
the intercept (a) and slope (b).
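
A short sketch of the three reference-line geoms on the mpg scatter plot. The positions of the lines (x = 4, the mean of hwy, and the line 35 - 3x) are arbitrary values chosen for illustration.

```r
library(ggplot2)

p_ref <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_vline(xintercept = 4, linetype = 2) +              # vertical line at x = 4
  geom_hline(yintercept = mean(mpg$hwy), linetype = 2) +  # horizontal line at mean hwy
  geom_abline(intercept = 35, slope = -3, linetype = 3)   # line y = 35 - 3x

length(p_ref$layers)   # one point layer plus three reference lines
```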

59 / 80
Line types

▶ The argument lty or linetype specifies the type of the line.


▶ The argument lwd or linewidth controls the thickness of the line.

60 / 80
ggplot(resale, aes(x = month, y = med_resale_price/1000,
color = flat_type)) +
geom_line(lwd = 1, show.legend = FALSE) +
geom_label(data = resale_text, aes(label = flat_type),
show.legend = FALSE, size = 2.5,
vjust = "top", hjust = "middle",
nudge_y = -10, nudge_x = 15) +
geom_hline(aes(yintercept = 600), lty = 2, lwd = 0.3) +
labs(title = "Resale flat price trends, 2021 - 2024",
x = "Year", y = "Median resale price (thousands)")
[Figure: the labeled line chart "Resale flat price trends, 2021 - 2024" with a thin dashed horizontal reference line at 600; x-axis: Year, y-axis: Median resale price (thousands)]

61 / 80
First summary on ggplot2

Summary on some of the geom functions we have learned this week.


[Figure: icons for ggplot + geom_point + geom_smooth + geom_histogram + geom_density + geom_line + geom_label]

62 / 80
Variations
By combining geom functions, we can create variations of the basic
plots, such as annotated line chart, lollipop chart (or dumbbell chart),
and slope graph.

63 / 80
Common problems

As you start to use ggplot(), you are likely to run into problems. It
happens to everyone.
R is extremely picky. A misplaced character can make all the
difference.
▶ Make sure that every opening bracket ( is matched with a closing
bracket ); every " is paired with another ".
▶ Check that the + comes at the end of the line, not the start.
▶ If you are stuck, read the error message carefully, then read the
function documentation.
▶ You can also Google the error message, as it is highly likely that
someone else has encountered the same issue and has gotten
help online.
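
The point about the position of + can be illustrated as follows. The incorrect version is left commented out so that the sketch still runs.

```r
library(ggplot2)

# Correct: each line that continues the plot ends with "+".
p_ok <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Incorrect (commented out): R treats the first line as a complete
# statement, so the second line no longer adds a layer to that plot.
# ggplot(mpg, aes(x = displ, y = hwy))
#   + geom_point()

length(p_ok$layers)
```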

64 / 80
Case study: US gun murders
▶ Last week, we examined the components of a graph on US gun
murders.
▶ We now construct this plot layer-by-layer in ggplot2.

65 / 80
US gun murders

We start by loading the data set, murders.csv.


murders <- read.csv("../data/murders.csv")
glimpse(murders)

## Rows: 51
## Columns: 5
## $ state <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Calif
## $ abb <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "D
## $ region <chr> "South", "West", "West", "South", "West", "West",
## $ population <int> 4779736, 710231, 6392017, 2915918, 37253956, 50291
## $ total <int> 135, 19, 232, 93, 1257, 65, 97, 38, 99, 669, 376,

66 / 80
US gun murders

1. Aesthetic mappings in a point geom.


ggplot(murders, aes(x = population/1000000, y = total)) +
geom_point()

[Figure: scatter plot of total (y-axis) against population/1e+06 (x-axis)]

67 / 80
US gun murders
2. A second layer of the plot.
▶ Labels to each point to identify the state.

ggplot(murders, aes(x = population/1000000, y = total)) +
  geom_point() +
  geom_text(aes(label = abb))

[Figure: scatter plot of total against population/1e+06, with state abbreviations drawn on top of the points; many labels overlap near the origin]

68 / 80
3. Tweaking the arguments to make the plot easier to read.
▶ Adjust the point size using the size argument in geom_point().
▶ Adjust the text positions slightly to the right or to the left using
nudge_x in geom_text().
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_point(size = 3) +
geom_text(aes(label = abb), nudge_x = 1.5)

[Figure: scatter plot of total against population/1e+06, with larger points and state abbreviations nudged to the right]

69 / 80
4. Transformation and scales.
▶ Both variables have highly skewed distributions.
▶ We can use the scale_*_log10() functions to apply a log
transformation.
▶ Because we are in log-scale now, the nudge must be made smaller.
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_point(size = 3) +
geom_text(aes(label = abb), nudge_x = 0.06)+
scale_x_log10() + scale_y_log10()

[Figure: scatter plot of total against population/1e+06 on log10 axes (x breaks: 1, 3, 10, 30; y breaks: 10, 100, 1000), with state abbreviations next to the points]
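
Why must the nudge shrink? Assuming, as ggplot2 appears to do, that nudging is applied to the transformed values, nudge_x = 0.06 is measured in log10 units. On the original scale this is a multiplicative shift, as the small calculation below shows; the example populations are arbitrary.

```r
nudge_x <- 0.06                # nudge used in the slides, in log10 units
shift_factor <- 10^nudge_x     # equivalent multiplicative shift on the original scale

x <- c(1, 10, 30)              # example populations in millions
label_x <- x * shift_factor    # where each label would be centred

# Every label moves by the same distance on the log10 axis.
log10(label_x) - log10(x)
```

By the same reasoning, the earlier value nudge_x = 1.5 would move each label by a factor of 10^1.5 on a log10 axis, which is far too much.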
70 / 80
5. Next, add descriptive labels and a title.
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_point(size = 3) +
geom_text(aes(label = abb), nudge_x = 0.06)+
scale_x_log10() + scale_y_log10() +
labs(title = "US Gun Murders in 2010",
x = "Population in millions (log scale)",
y = "Total number of murders (log scale)", color = "Region")
[Figure: the log-scale scatter plot titled "US Gun Murders in 2010"; x-axis: Population in millions (log scale), y-axis: Total number of murders (log scale)]

71 / 80
6. Categories as colors.
▶ Map the region variable to the color aesthetic for geom_point().
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_point(aes(color = region), size = 3) +
geom_text(aes(label = abb), nudge_x = 0.06) +
scale_x_log10() + scale_y_log10() +
labs(title = "US Gun Murders in 2010",
x = "Population in millions (log scale)",
y = "Total number of murders (log scale)", color = "Region")
[Figure: "US Gun Murders in 2010" with points colored by Region (North Central, Northeast, South, West); x-axis: Population in millions (log scale), y-axis: Total number of murders (log scale)]

72 / 80
US gun murders

7. Reference line.
▶ Next, we want to add a reference line that represents the average
murder rate for the entire country.
▶ The line is defined by the formula: y = rx.
▶ In log-10 scale, this line turns into log(y) = log(r) + log(x).
▶ So in our plot, it is a line with slope 1 and intercept log(r).

r <- murders %>%
  summarize(rate = sum(total) / (sum(population/1e6))) %>%
  pull(rate) # extract rate as a single number
log10_r <- log10(r)
log10_r

## [1] 1.482095
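
The algebra behind the reference line can be checked numerically. The sketch below starts from the value of log10_r printed above rather than from murders.csv, and uses the x-axis breaks as example populations.

```r
log10_r <- 1.482095        # value printed above
r <- 10^log10_r            # murders per million people, roughly 30

x <- c(1, 3, 10, 30)       # populations in millions
y <- r * x                 # points on the line y = r x

# On log10 axes the same points satisfy log10(y) = log10(r) + 1 * log10(x),
# i.e. a straight line with slope 1 and intercept log10(r).
all.equal(log10(y), log10_r + log10(x))
```

Because x is measured in millions, the intercept is the value of log10(y) at a population of one million, where log10(x) = 0.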

73 / 80
US gun murders

7. Reference line.
▶ To add the line, we use the geom_abline() function.
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_point(aes(color = region), size = 3) +
geom_text(aes(label = abb), nudge_x = 0.06) +
scale_x_log10() + scale_y_log10() +
labs(title = "US Gun Murders in 2010",
x = "Population in millions (log scale)",
y = "Total number of murders (log scale)", color = "Region") +
geom_abline(slope = 1, intercept = log10_r, linetype = 2)

74 / 80
US gun murders

[Figure: "US Gun Murders in 2010" with points colored by Region and a dashed reference line drawn over the points; x-axis: Population in millions (log scale), y-axis: Total number of murders (log scale)]

▶ Next, we can adjust the order of the geom layers: Draw the dashed
line first, so it doesn’t go over the points.

75 / 80
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_abline(slope = 1, intercept = log10_r, linetype = 2) +
geom_point(aes(color = region), size = 3) +
geom_text(aes(label = abb), nudge_x = 0.06) +
scale_x_log10() + scale_y_log10() +
labs(title = "US Gun Murders in 2010",
x = "Population in millions (log scale)",
y = "Total number of murders (log scale)", color = "Region")
[Figure: "US Gun Murders in 2010" with the dashed reference line drawn underneath the points, which are colored by Region; x-axis: Population in millions (log scale), y-axis: Total number of murders (log scale)]

76 / 80
US gun murders

8. ggplot2 extensions:
The power of ggplot2 is augmented further due to the availability
of extension packages.
▶ ggthemes contains many popular themes such as
theme_economist() and theme_wsj().
▶ ggrepel stands for repulsive textual annotations. It includes a
geometry that adds labels while ensuring that they do not fall on
top of each other.
# install.packages(c("ggthemes", "ggrepel"))
library(ggthemes)
library(ggrepel)

Gallery of themes: All your figures belong to us.

77 / 80
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_abline(slope = 1, intercept = log10_r, linetype = 2) +
geom_point(aes(color = region), size = 3) +
geom_text(aes(label = abb), nudge_x = 0.07) +
scale_x_log10() + scale_y_log10() +
labs(title = "US Gun Murders in 2010",
x = "Population in millions (log scale)",
y = "Total number of murders (log scale)", color = "Region") +
theme_economist()

[Figure: "US Gun Murders in 2010" with theme_economist() applied and the Region legend on top; x-axis: Population in millions (log scale), y-axis: Total number of murders (log scale)]

78 / 80
US gun murders

Final touch:
▶ Replace geom_text() with geom_text_repel().
▶ Save the plot to a file with ggsave().
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_abline(slope = 1, intercept = log10_r, linetype = 2) +
geom_point(aes(color = region), size = 3) +
geom_text_repel(aes(label = abb), color = "black") +
scale_x_log10() + scale_y_log10() +
labs(title = "US Gun Murders in 2010",
x = "Population in millions (log scale)",
y = "Total number of murders (log scale)", color = "Region") +
theme_economist()

# Save plot to a file


ggsave("../figures/wk10_murders.png")
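
By default ggsave() saves the last plot displayed. As a sketch, the plot and output size can also be given explicitly; the file name, dimensions, and resolution below are arbitrary, and a temporary directory is used so that the example does not depend on a ../figures folder.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Write to a temporary location; replace with a project path as needed.
out_file <- file.path(tempdir(), "example_plot.png")

# plot = p saves this specific object instead of the last plot shown.
ggsave(out_file, plot = p, width = 6, height = 4, units = "in", dpi = 300)

file.exists(out_file)
```

Passing the plot object explicitly avoids saving the wrong figure when several plots are created in the same session.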

79 / 80
US gun murders (final plot)

[Figure: final plot "US Gun Murders in 2010" with theme_economist() and non-overlapping state labels from geom_text_repel(); x-axis: Population in millions (log scale), y-axis: Total number of murders (log scale)]

80 / 80
