Week10 Slides Updated

The document introduces ggplot2, a data visualization package in R that implements the Grammar of Graphics, allowing users to create high-quality graphs. It covers the basics of creating scatter plots, mapping aesthetics, and addressing issues like over-plotting, while also providing guidance on adding titles and labels. Additionally, it discusses the use of smooth lines to depict trends and relationships in data.

Uploaded by

Tùng Nguyễn

DSA2101

Essential Data Analytics Tools: Data Visualization

Yuting Huang

AY24/25

Week 10: Introduction to ggplot2

1 / 80
Programming and data visualization
Two of the most popular programming languages for data science
would be Python and R.
▶ Developmental milestones over the years.
▶ Easy-to-use functions for data visualization in an efficient and
reproducible manner.

Source: John F. Ouyang.

2 / 80
Introduction

▶ Just as the grammar of a language helps us construct meaningful
sentences out of words, the Grammar of Graphics helps us construct
graphs out of different visual elements.
▶ ggplot2 implements the Grammar of Graphics.
▶ Produces very high quality graphs.
▶ Bear in mind though, this is not the only method for visualization
with R.

3 / 80
We start by loading the required package. Note that ggplot2 is
included in tidyverse.
library(tidyverse)

Artwork by Allison Horst

4 / 80
The mpg data set

Let’s make a first plot using this package.

▶ The mpg data frame in ggplot2 contains characteristics on 38 car
models in 1999 and 2008.
▶ For now, let’s work with just two variables:
▶ displ, car’s engine size in litres.
▶ hwy, fuel efficiency of the car on a highway in miles per gallon.

data(mpg)
head(mpg, 2)

## # A tibble: 2 x 11
## manufacturer model displ year cyl trans drv cty hwy f
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p

5 / 80
The mpg data set

The ggplot() call renders a blank slate plot.

▶ No layer has been specified with a geom function, so nothing is drawn
except for a grey background.
ggplot()

6 / 80
The mpg data set

A scatterplot, with displ on the x-axis and hwy on the y-axis.

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))

[Figure: scatter plot of hwy (y-axis) against displ (x-axis).]

7 / 80
Breaking down the syntax

The function ggplot() creates a coordinate system that we can add
layers to.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))

▶ data = mpg specifies the data set.

▶ geom_point() adds a layer of points to the plot, thus creating a
scatterplot.
▶ displ is mapped to the x-axis and hwy to the y-axis.

8 / 80
ggplot() template

Every ggplot2 plot has three key components:

ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

1. data
2. At least one layer which describes how to render each observation.
Layers are usually created with a geom function.
3. A set of aesthetic mappings between variables in the data and
visual elements in the geom function.
In ggplot2, we create graphs by adding (+) layers.

9 / 80
Source: Adapted from Tanya Shapiro.

10 / 80
Choosing the right plot

There are many geom functions available in ggplot2.

The choice of which one to use largely depends on two questions:
▶ What are you trying to communicate?
▶ What type of variable(s) do you want to show?

Source: Adapted from John F. Ouyang.

11 / 80
Outline

1. Aesthetics and geometrical objects

▶ Scatterplot
▶ Smoother line
▶ Histogram and density plot
▶ Line plot
▶ Text annotations
▶ Bar plot
▶ Maps

2. Miscellaneous tasks
▶ Themes
▶ Layouts
▶ Common layers

12 / 80
Geometrical objects

A geom refers to the geometrical object used to represent data.

In natural language, we typically use the geom to refer to a particular
type of graph:
▶ Scatter plot geom_point()
▶ Smoother line geom_smooth()
▶ Histogram geom_histogram()
▶ Density plot geom_density()
▶ ...

13 / 80
Aesthetics mappings

An aesthetic describes how a property of the data connects to a visual
property of the graph, such as
▶ The position of a point.
▶ The size, shape, or color of the points.
▶ The type of line (solid, dashed, etc), color and thickness of the line.
Source: Jacques Bertin (1967).

14 / 80
Scatterplot: geom_point()

geom_point() is used to create scatter plots.

The aesthetics associated with it are
▶ x (required)
▶ y (required)
▶ alpha
▶ color
▶ fill
▶ group
▶ shape
▶ size
The defining characteristic of a point is its position, hence the x and y
aesthetics are required. Others are optional.

15 / 80
How to map an aesthetic
To map an aesthetic to a variable, associate the aesthetic to the name
of the variable inside aes()
▶ ggplot2 will automatically assign a unique value of the aesthetic
to each unique value of the variable.
▶ A scatter plot we made under ggplot2:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))

[Figure: scatter plot of hwy (y-axis) against displ (x-axis).]

16 / 80
A basic scatterplot

The three calls below are equivalent:

# 1
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))

# 2
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy))

# 3
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()

Pay attention to the structure of # 3:

▶ Data and aesthetic mappings are supplied in ggplot().
▶ Layer(s) are added on with +.

17 / 80
Global aesthetic mapping

# 3
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()

This is an important pattern.

▶ As we learn more about ggplot2, we will construct increasingly
sophisticated plots with multiple layers.
▶ If many layers map the same variables to x and y, naming these
aesthetics in every layer can be tedious.
▶ We can simplify this by a global aesthetic mapping – supply
the aesthetics in ggplot(), instead of individual geom functions.
This way, all functions that are added as layers will default to the
global aesthetic mappings.
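
A brief sketch of how layers inherit a global mapping, and how a single layer can add to it. It reuses the mpg data; the choice of drv for the color mapping and the use of geom_smooth() (introduced later in these slides) are only for illustration.

```r
library(ggplot2)

# x and y are declared once in ggplot() and inherited by both layers.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +                # adds a color mapping for this layer only
  geom_smooth(method = "lm", formula = y ~ x)   # uses the global x and y; sees no color mapping

p
```

Because color is mapped only inside geom_point(), the smoother fits a single line through all points rather than one line per drive type.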

18 / 80
Mapping color
We can further visualize the class type of a car.
▶ It classifies cars into groups such as compact, midsize, and SUV.
▶ In the following, we map class to the color aesthetics.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class))

[Figure: scatter plot of hwy against displ, with points colored by class (2seater, compact, midsize, minivan, pickup, subcompact, suv).]

19 / 80
Braking distance and speed
In Week 2, we created a scatter plot of the relationship between
braking distance and speed using base R plotting functions.
data(cars)
new_red <- rgb(1, 0, 0, alpha = 0.4)
plot(cars, col = new_red, pch = 19, cex = 1.8, bty = "n",
xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
main = "Relationship between Speed and Braking")
[Figure: base R scatter plot titled "Relationship between Speed and Braking"; x-axis: Speed (mph), y-axis: Stopping distance (ft).]

20 / 80
We can recreate the plot with ggplot2:
▶ Adjust the point size with size.
▶ Set the point color to red with color (set outside aes(), so it is a fixed value rather than a mapping).
▶ Add transparency to the points with alpha.
ggplot(cars, aes(x = speed, y = dist)) +
geom_point(size = 4, color = "red", alpha = 0.5)
[Figure: scatter plot of dist against speed, with large semi-transparent red points.]

21 / 80
Issues with over-plotting

Over-plotting occurs when multiple data points overlap with each
other.
▶ Identical, or very similar x and y values.
▶ This can make it difficult to distinguish the number of points or
identify patterns in the data.

Solutions:
▶ Adjust the opacity (alpha) of the point.
▶ Add a small random variation to the location of each point
(jitter). This helps to separate the overlapping points.
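
Both remedies can be sketched as follows. This is a minimal illustration on the mpg data; the particular alpha, width, height, and seed values are assumptions chosen for the example, not recommendations from the slides.

```r
library(ggplot2)

# Remedy 1: semi-transparent points, so overlapping points appear darker.
p_alpha <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3)

# Remedy 2: jitter with an explicit amount and a seed.
# width and height are in data units; seed makes the random shifts repeatable.
p_jitter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = position_jitter(width = 0.05, height = 0.5, seed = 1))
```

Fixing the seed means the plot looks the same each time it is rendered, which helps when a figure has to be reproduced exactly.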

22 / 80
▶ Opacity: alpha
▶ Jittering: position = "jitter"
ggplot(cars, aes(x = speed, y = dist)) +
geom_point(size = 4, color = "red", alpha = 0.5, position = "jitter")
[Figure: scatter plot of dist against speed, with semi-transparent red points and jitter.]

23 / 80
Over-plotting: Comparison

[Figure: three panels comparing the original scatter plot, the same plot with opacity, and the same plot with opacity and jitter.]

24 / 80
1. Anything wrong with this plot?
ggplot(mpg, aes(x = displ, y = hwy, color = "steelblue")) +
geom_point(size = 4, alpha = 0.7, position = "jitter")

[Figure: scatter plot of hwy against displ, with a legend titled "colour" that has a single entry, "steelblue".]

25 / 80
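▶ Inside aes(), "steelblue" is treated as data: a constant variable that
ggplot2 maps to its default color scale and lists in a legend.
▶ To set a fixed color, pass it to the geom function, outside aes(). The
following code produces the expected result.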
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(color = "steelblue",
size = 4, alpha = 0.7, position = "jitter")

[Figure: scatter plot of hwy against displ, with steel blue points and no legend.]

26 / 80
2. The color aesthetic can be mapped to logical expressions.
▶ In this example, the expression class == "suv" takes the values TRUE
and FALSE.
▶ Each value will be mapped to a unique color.

mpg %>%
filter(manufacturer == "chevrolet") %>%
ggplot(aes(x = displ, y = hwy, color = class == "suv")) +
geom_point(size = 4, alpha = 0.7, position = "jitter")

[Figure: scatter plot of hwy against displ for Chevrolet cars, with points colored by class == "suv" (FALSE, TRUE).]

27 / 80
Title and labels

▶ To include title and labels in ggplot2, we specify a labs() layer:

▶ title
▶ subtitle
▶ x
▶ y
▶ caption

28 / 80
Title and labels
mpg %>%
filter(manufacturer == "chevrolet") %>%
ggplot(aes(x = displ, y = hwy, color = class == "suv")) +
geom_point(size = 4, alpha = 0.7, position = "jitter") +
labs(title = "Fuel efficiency and engine size",
subtitle = "... for Chevrolet",
x = "Engine size (litres)",
y = "Highway fuel efficiency (mpg)",
caption = "Source: Environmental Protection Agency")
[Figure: scatter plot titled "Fuel efficiency and engine size", subtitle "... for Chevrolet"; x-axis: Engine size (litres), y-axis: highway fuel efficiency; points colored by class == "suv" (FALSE, TRUE); the caption gives the data source.]

29 / 80
Smooth line: geom_smooth()

Often, we can improve a scatter plot by adding smooth line(s). A smooth line
▶ Helps the eye see patterns in the data.
▶ Depicts trends in time series data.
▶ Reveals (nonlinear) relationships between variables.

There are several types of smoothers, using different criteria to fit the
lines of best fit. We shall study just a couple of them in brief detail:
▶ Linear regression models.
▶ Loess smoother.

30 / 80
Aesthetics

Some aesthetics that this geom understands are

▶ x (required)
▶ y (required)
▶ alpha
▶ color
▶ fill
▶ linetype

31 / 80
▶ Let us add a smooth linear regression model (lm) line to the data.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(position = "jitter") +
geom_smooth(method = "lm", formula = y ~ x) +
labs(x = "Engine Displacement (l)", y = "Highway Miles per Gallon")

[Figure: jittered scatter plot with a fitted linear regression line; x-axis: Engine Displacement (l), y-axis: Highway Miles per Gallon.]

32 / 80
Linear regression smoother

method = "lm" invokes a simple linear regression smoother.

▶ The blue line is the line of best fit, with the gray region representing
the 95% confidence interval.
▶ The line does not appear to be suitable for this data set, which
shows some nonlinearity.

To allow for nonlinearity, we have a couple of options:

1. A higher-order polynomial term in the linear regression, or
2. A loess smoother – a nonparametric method that uses locally
weighted regression to fit a smooth curve.

33 / 80
mpg data, higher order polynomials

▶ Fitting a quadratic function:

ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(position = "jitter") +
geom_smooth(method = "lm", formula = y ~ poly(x, 2)) +
labs(x = "Engine Displacement (l)", y = "Highway Miles per Gallon")

[Figure: jittered scatter plot with a fitted quadratic curve; x-axis: Engine Displacement (l), y-axis: Highway Miles per Gallon.]

34 / 80
method = "loess" fits a smooth curve to a scatter plot that helps us see the
overall trend.
▶ A locally estimated regression fit.
▶ For every $x_i$ in the data, loess takes a window around $x_i$ whose size
is set by the span (between 0 and 1) and fits a line using the data within
that window.
▶ The fitted value at $x_i$ becomes the estimate $\hat{f}(x_i)$.
▶ It then connects all the estimates $\hat{f}(x_i)$ to form a smooth curve.

Source: Rafael A. Irizarry
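
The procedure described above can be sketched outside ggplot2 with base R's loess() function. This is an illustration under stated assumptions: span = 0.75 is used as an example value, and degree = 1 is chosen so that each local fit is a line, as in the description (loess() itself defaults to a local quadratic).

```r
library(ggplot2)  # only for the mpg data

# span = 0.75: each local fit uses the nearest 75% of the observations.
# degree = 1: fit a local line (the loess() default is degree = 2).
fit <- loess(hwy ~ displ, data = mpg, span = 0.75, degree = 1)

# Evaluate the fitted curve on a grid of engine sizes within the data range.
grid <- data.frame(displ = seq(min(mpg$displ), max(mpg$displ), length.out = 50))
grid$hwy_hat <- predict(fit, newdata = grid)

head(grid)
```

Plotting hwy_hat against displ traces a smooth curve of the kind that geom_smooth(method = "loess") draws.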

35 / 80
mpg data, loess smoother

ggplot(mpg, aes(x = displ, y = hwy)) +

geom_point(position = "jitter") +
geom_smooth(method = "loess", formula = y ~ x) +
labs(x = "Engine Displacement (l)", y = "Highway Miles per Gallon")

[Figure: jittered scatter plot with a loess curve; x-axis: Engine Displacement (l), y-axis: Highway Miles per Gallon.]

36 / 80
loess smoother with different span

Smoothness is a relative term. Different spans give us different
estimates.
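
A hedged sketch of how the span can be changed: geom_smooth() accepts a span argument for loess fits, with a default of 0.75. The two values below are examples only.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = "jitter")

# Small span: each local fit uses few neighbours, giving a wiggly curve.
p_wiggly <- base + geom_smooth(method = "loess", formula = y ~ x, span = 0.2)

# Large span: each local fit uses most of the data, giving a smoother curve.
p_smooth <- base + geom_smooth(method = "loess", formula = y ~ x, span = 0.9)
```

With a very small span, loess may print warnings on data with many tied x values, such as displ; the curve is still drawn.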
[Figure: three panels of loess fits on the mpg data, titled "Default (optimal)", "span = 0.2", and "span = 0.9".]

37 / 80
mpg data, loess smoother by group

The new curve reflects the presence of several cars that have large
engines and efficient highway mileage.
Now suppose that we want to study how this relationship varies with
the drive type:
▶ Front-wheel drive
▶ Rear-wheel drive, and
▶ Four-wheel drive

38 / 80
Loess smoother by group

ggplot(mpg, aes(x = displ, y = hwy, group = drv)) +

geom_point(position = "jitter") +
geom_smooth(formula = y ~ x, method = "loess") +
labs(x = "Engine Displacement (l)", y = "Highway Miles per Gallon")

[Figure: jittered scatter plot with one loess curve per drive type; x-axis: Engine Displacement (l), y-axis: Highway Miles per Gallon.]

39 / 80
Loess smoother by group with colors
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
geom_point(position = "jitter") +
geom_smooth(formula = y ~ x, method = "loess")+
labs(x = "Engine Displacement (l)", y = "Highway Miles per Gallon")+
scale_color_discrete(name = "Drive type",
labels = c("4-wheel", "Front-wheel", "Rear-wheel")) +
theme(legend.position = "top")
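[Figure: jittered scatter plot with one loess curve per drive type, colored by drive type; legend "Drive type" (4-wheel, Front-wheel, Rear-wheel) placed at the top; x-axis: Engine Displacement (l), y-axis: Highway Miles per Gallon.]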

40 / 80
Other considerations

The loess smoother is computationally intensive.

When the number of data points is large (≥ 1000 observations),
geom_smooth() will use a generalized additive model by default.
▶ Other smoothers can be used to model binary data (e.g., logistic
regression).
▶ We won’t go further into these. Once you have taken a class on them,
you will be able to use them with confidence.

41 / 80
Histogram: geom_histogram()

A histogram allows us to visualize the distribution of a single
continuous variable.
▶ The x-axis will first be divided into bins. Then the number of
observations in each bin will be counted.

We will cover two related geoms:

▶ geom_histogram() displays the counts in each bin with bars.
▶ geom_density() computes and draws a kernel density estimate. It
is a smoothed version of the histogram.
▶ Allows comparison between distributions of a variable conditioned
on a categorical one, e.g., income distribution for males and females.

42 / 80
Aesthetics

Some of the aesthetics for this geom:

▶ x (required)
▶ alpha
▶ color
▶ fill

Apart from the aesthetics, we also need to consider the following issues:
▶ The width of the bins.
▶ The number of bins.
▶ The location of the bins.
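
The three choices correspond to arguments of geom_histogram(). The sketch below uses the hwy variable from mpg as a stand-in (an assumption for illustration); the values 5 and 15 are arbitrary.

```r
library(ggplot2)

# Width: every bin spans 5 mpg.
p_width <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 5, boundary = 0)

# Number: ask for 15 bins and let ggplot2 derive the width.
p_bins <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 15)

# Location: boundary = 0 puts bin edges at multiples of the width,
# whereas center = 0 puts bin centres there instead.
p_center <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 5, center = 0)
```

Only one of binwidth and bins needs to be given, and only one of boundary and center.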

43 / 80
Distribution of earnings
Let us revisit the histogram of earnings using base R.
heights <- read.csv("../data/heights.csv",
header = TRUE, stringsAsFactors = TRUE)
hist(heights$earn/1000, freq = FALSE,
main = "Histogram of Earnings",
xlab = "Earnings Per Annum (in Thousands)",
col = "maroon", border = "white",
breaks = seq(0, 200, by = 10))
[Figure: base R histogram titled "Histogram of Earnings"; x-axis: Earnings Per Annum (in Thousands), y-axis: Density.]

44 / 80
Distribution of earnings

Here is the first attempt to recreate it in ggplot2.

ggplot(heights, aes(x = earn/1000)) +
geom_histogram()

[Figure: default ggplot2 histogram; x-axis: earn/1000, y-axis: count.]

45 / 80
Distribution of earnings (revised)
▶ Adjust the bin width, interior fill, titles and labels, . . .
▶ Adjust the boundary of the first bin so it reflects the lower limit of
the data.
ggplot(heights, aes(x = earn/1000)) +
geom_histogram(binwidth = 10, fill = "maroon", boundary = 0) +
labs(title = "Histogram of Earnings",
x = "Earnings Per Annum (in Thousands)", y = "Frequency")
[Figure: histogram titled "Histogram of Earnings"; x-axis: Earnings Per Annum (in Thousands), y-axis: Frequency.]

46 / 80
Distribution of earnings (revised)
In Week 3, we used the density for each bin instead of the count. This
makes the histogram closer in spirit to a probability density function.
▶ We can tell ggplot2 to use density instead of count with y =
after_stat(density).
ggplot(heights, aes(x = earn/1000, y = after_stat(density))) +
geom_histogram(binwidth = 10, fill = "maroon", boundary = 0) +
labs(title = "Histogram of Earnings",
x = "Earnings Per Annum (in Thousands)", y = "Density")
[Figure: histogram titled "Histogram of Earnings"; x-axis: Earnings Per Annum (in Thousands), y-axis: Density.]
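
A small check of what the density scale means: for each bin, density = count / (total count × bin width), so the bar areas add up to 1. The heights data is not bundled here, so the sketch uses hwy from mpg as a stand-in (an assumption for illustration).

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = hwy, y = after_stat(density))) +
  geom_histogram(binwidth = 5, boundary = 0)

# layer_data() returns the values ggplot2 computed for the bars.
bars <- layer_data(p)

# density equals count / (total count * bin width) ...
all.equal(bars$density, bars$count / (sum(bars$count) * 5))

# ... so the bar areas (height * width) add up to 1.
sum(bars$density * 5)
```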

47 / 80
In Week 3, we found that there was a stark difference between males
and females in terms of income earned.
▶ We can present this information by mapping the variable sex to
the fill aesthetics.
ggplot(heights,
aes(x = earn/1000, y = after_stat(density), fill = sex)) +
geom_histogram(binwidth = 10, boundary = 0) +
labs(title = "Histogram of Earnings",
x = "Earnings Per Annum (in Thousands)", y = "Density")
[Figure: stacked histogram titled "Histogram of Earnings", filled by sex (female, male); x-axis: Earnings Per Annum (in Thousands), y-axis: Density.]

48 / 80
▶ To place the bars for each group side by side instead of stacking them, use position = "dodge".
ggplot(heights,
aes(x = earn/1000, y = after_stat(density), fill = sex)) +
geom_histogram(binwidth = 10, boundary = 0, position = "dodge") +
labs(title = "Histogram of Earnings",
x = "Earnings Per Annum (in Thousands)", y = "Density")
[Figure: side-by-side histogram titled "Histogram of Earnings", filled by sex (female, male); x-axis: Earnings Per Annum (in Thousands), y-axis: Density.]

49 / 80
Earnings, smooth density

To compare distributions conditional on a categorical variable, we would
be better off using smooth density plots.
▶ The smooth density is the curve that passes through the tops of the
histogram bars when the bins are very, very small.
[Figure: three panels of density against earn/1000, titled "binwidth = 10", "binwidth = 5", and "Smooth density".]

50 / 80
Earnings, smooth density
Compare densities using the geom_density() function:
ggplot(heights, aes(x = earn/1000, fill = sex)) +
geom_density(alpha = 0.2) +
labs(title = "Smooth Density Plots of Earnings",
x = "Earnings Per Annum (in Thousands)", y = "Density") +
scale_fill_discrete(name = "Gender", labels = c("Female", "Male"))
[Figure: overlaid density curves titled "Smooth Density Plots of Earnings", filled by Gender (Female, Male); x-axis: Earnings Per Annum (in Thousands), y-axis: Density.]

51 / 80
Earnings, smooth density
Note that smoothness is a relative term. We can actually control it
through an option in the geom_density() function.
▶ The option that controls the smoothing bandwidth is bw.
▶ We should select a degree of smoothness that we can defend as
being representative of the underlying data.
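
A minimal sketch of the bw option, using hwy from mpg as a stand-in for the earnings data (an assumption for illustration). The bandwidth values are arbitrary examples of too much and too little smoothing.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = hwy))

p_default <- base + geom_density()          # bandwidth chosen by the default rule ("nrd0")
p_over    <- base + geom_density(bw = 5)    # large bandwidth: flatter, may hide features
p_under   <- base + geom_density(bw = 0.3)  # small bandwidth: spiky, follows noise

# adjust rescales the default bandwidth instead of fixing it,
# e.g. half of the default:
p_half <- base + geom_density(adjust = 0.5)
```

Using adjust is often convenient because it stays relative to the data-driven default rather than depending on the units of the variable.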
[Figure: three panels of density against earn/1000, titled "Default (optimal)", "Oversmoothing", and "Undersmoothing".]

52 / 80
Line: geom_line()

The line geom connects observations in the order of the variable on the
x-axis (usually date and time).
▶ Suitable for plotting time-series data
▶ The aesthetics that the line geom uses are
▶ x (required)
▶ y (required)
▶ alpha
▶ color
▶ linetype or lty
▶ linewidth or lwd
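
A short sketch of a time-series line plot. It uses the economics data set that ships with ggplot2 rather than the course data (an assumption for illustration); unemploy is recorded in thousands, so dividing by 1000 gives millions.

```r
library(ggplot2)

# economics: monthly US indicators, one row per month.
p <- ggplot(economics, aes(x = date, y = unemploy / 1000)) +
  geom_line(color = "steelblue", linewidth = 0.8) +
  labs(x = "Year", y = "Unemployed (millions)")

p
```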

53 / 80
Resale flat price trends

Let’s visualize the price trends in resale flats using the data set,
resales2024.csv from Week 7.
▶ First, let’s compute the median resale price across months in
selected flat types.

resale <- read.csv("../data/resales2024.csv", header = TRUE) %>%

mutate(month = ymd(month)) %>%
filter(flat_type %in% c("3 ROOM", "4 ROOM", "5 ROOM")) %>%
group_by(flat_type, month) %>%
summarize(med_resale_price = median(resale_price), .groups = "drop")
glimpse(resale)

## Rows: 132
## Columns: 3
## $ flat_type <chr> "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 R
## $ month <date> 2021-01-01, 2021-02-01, 2021-03-01, 2021-04
## $ med_resale_price <dbl> 320000, 325000, 318000, 307000, 323000, 3280

54 / 80
Resale flat price trends
ggplot(resale, aes(x = month, y = med_resale_price/1000,
color = flat_type)) +
geom_line(lwd = 1) +
labs(title = "Resale flat price trends, 2021 - 2024",
x = "Year", y = "Median resale price (thousands)",
color = "Flat type")
[Figure: line plot titled "Resale flat price trends, 2021 - 2024", one line per flat type (3 ROOM, 4 ROOM, 5 ROOM) with a color legend; x-axis: Year, y-axis: Median resale price (thousands).]

55 / 80
Resale flat price trends (revised)

The colors alone are not helpful enough.

They do not tell viewers directly which flat type each line is
associated with.
▶ Instead, we shall annotate the lines with text corresponding to
the flat type.
▶ . . . with geom_text() or geom_label().
▶ We shall also make space for the text by adding vertical and horizontal
adjustments that nudge the labels away from the lines.

56 / 80
Let’s prepare the data for geom_label() at the end of each line.
▶ Three required aesthetics: x, y, and label.
resale_text <- filter(resale, month == "2024-08-01")
resale_text

## # A tibble: 3 x 3
## flat_type month med_resale_price
## <chr> <date> <dbl>
## 1 3 ROOM 2024-08-01 420000
## 2 4 ROOM 2024-08-01 612500
## 3 5 ROOM 2024-08-01 673000
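
Filtering on a hard-coded date works when every series ends in the same month. As a hedged alternative, the last observation of each group can be picked programmatically. The sketch uses economics_long, a data set that ships with ggplot2, because the resale data is not bundled here; line_ends is a name made up for this example.

```r
library(dplyr)
library(ggplot2)

# economics_long: one row per (variable, date) pair.
line_ends <- economics_long %>%
  group_by(variable) %>%
  slice_max(date, n = 1) %>%   # keep the latest month of each series
  ungroup()

ggplot(economics_long, aes(x = date, y = value01, color = variable)) +
  geom_line(show.legend = FALSE) +
  geom_label(data = line_ends, aes(label = variable),
             show.legend = FALSE, hjust = "right")
```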

57 / 80
▶ Notice that the geom_label() layer uses its own data set (resale_text)
and adds a label mapping; x, y, and color are inherited from the global
mapping.
ggplot(resale, aes(x = month, y = med_resale_price/1000,
color = flat_type)) +
geom_line(lwd = 1, show.legend = FALSE) +
geom_label(data = resale_text, aes(label = flat_type),
show.legend = FALSE, size = 2.7,
vjust = "top", hjust = "middle",
nudge_y = -10, nudge_x = 15) +
labs(title = "Resale flat price trends, 2021 - 2024",
x = "Year", y = "Median resale price (thousands)")
[Figure: line plot titled "Resale flat price trends, 2021 - 2024", with each line labelled at its end (3 ROOM, 4 ROOM, 5 ROOM) and no legend; x-axis: Year, y-axis: Median resale price (thousands).]
58 / 80
Reference lines

To add a reference line, we can use one of the following:

▶ geom_vline() for vertical lines
▶ geom_hline() for horizontal lines
▶ geom_abline() for straight lines defined by a slope and an intercept
▶ ggplot2 uses ab in the name to remind us that we are supplying
the intercept (a) and slope (b).
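
A minimal sketch that uses all three functions on the mpg data. The positions of the lines (mean of hwy, a 4-litre engine, a fitted regression line) are chosen only for illustration.

```r
library(ggplot2)

fit <- lm(hwy ~ displ, data = mpg)   # supplies an intercept (a) and a slope (b)
a <- unname(coef(fit)[1])
b <- unname(coef(fit)[2])

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = mean(mpg$hwy), linetype = "dashed") +  # horizontal: mean hwy
  geom_vline(xintercept = 4, linetype = "dotted") +              # vertical: 4-litre engine
  geom_abline(intercept = a, slope = b)                          # y = a + b * x
```

Passing yintercept, xintercept, intercept, and slope as plain arguments (outside aes()) draws each reference line once.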

59 / 80
Line types

▶ The argument lty or linetype specifies the type of the line.

▶ The argument lwd or linewidth controls the thickness of the line.

60 / 80
ggplot(resale, aes(x = month, y = med_resale_price/1000,
color = flat_type)) +
geom_line(lwd = 1, show.legend = FALSE) +
geom_label(data = resale_text, aes(label = flat_type),
show.legend = FALSE, size = 2.5,
vjust = "top", hjust = "middle",
nudge_y = -10, nudge_x = 15) +
geom_hline(yintercept = 600, lty = 2, lwd = 0.3) +
labs(title = "Resale flat price trends, 2021 - 2024",
x = "Year", y = "Median resale price (thousands)")
[Figure: line plot titled "Resale flat price trends, 2021 - 2024", with labelled lines (3 ROOM, 4 ROOM, 5 ROOM) and a dashed horizontal reference line at 600; x-axis: Year, y-axis: Median resale price (thousands).]

61 / 80
First summary on ggplot2

Summary on some of the geom functions we have learned this week.

[Figure: example plots for ggplot(), geom_point(), geom_smooth(), geom_histogram(), geom_density(), geom_line(), and geom_label().]

62 / 80
Variations
By combining geom functions, we can create variations of the basic
plots, such as annotated line chart, lollipop chart (or dumbbell chart),
and slope graph.

63 / 80
Common problems

As you start to use ggplot(), you are likely to run into problems. It
happens to everyone.
R is extremely picky. A misplaced character can make all the
difference.
▶ Make sure that every opening bracket ( is matched with a closing
bracket ); every " is paired with another ".
▶ Check that the + comes at the end of the line, not the start.
▶ If you are stuck, read the error message carefully. Then read the
function documentation.
▶ You can also Google the error message, as it is highly likely that
someone else has encountered the same issue and has gotten
help online.

64 / 80
Case study: US gun murders
▶ Last week, we examined the components of a graph on US gun
murders.
▶ We now construct this plot layer-by-layer in ggplot2.

65 / 80
US gun murders

We start by loading the data set, murders.csv.

murders <- read.csv("../data/murders.csv")
glimpse(murders)

## Rows: 51
## Columns: 5
## $ state <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Calif
## $ abb <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "D
## $ region <chr> "South", "West", "West", "South", "West", "West",
## $ population <int> 4779736, 710231, 6392017, 2915918, 37253956, 50291
## $ total <int> 135, 19, 232, 93, 1257, 65, 97, 38, 99, 669, 376,

66 / 80
US gun murders

1. Aesthetic mappings in a point geom.

ggplot(murders, aes(x = population/1000000, y = total)) +
geom_point()

[Figure: scatter plot of total (y-axis) against population/1e+06 (x-axis).]

67 / 80
US gun murders
2. A second layer of the plot.
▶ Labels to each point to identify the state.

ggplot(murders, aes(x = population/1000000, y = total)) +

geom_point() +
geom_text(aes(label = abb))

[Figure: scatter plot of total against population/1e+06, with state abbreviations drawn on top of the points; many labels overlap.]

68 / 80
3. Tweaking the arguments to make the plot easier to read.
▶ Adjust the point size using the size argument in geom_point().
▶ Adjust the text positions slightly to the right or to the left using
nudge_x in geom_text().
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_point(size = 3) +
geom_text(aes(label = abb), nudge_x = 1.5)

[Figure: scatter plot of total against population/1e+06, with larger points and state abbreviations nudged to the right.]

69 / 80
4. Transformation and scales.
▶ Both variables have highly skewed distributions.
▶ We can use the scale_*_log10() functions to apply a log
transformation.
▶ Because we are on the log scale now, the nudge must be made smaller.
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_point(size = 3) +
geom_text(aes(label = abb), nudge_x = 0.06)+
scale_x_log10() + scale_y_log10()

[Figure: scatter plot of total against population/1e+06 on log10 scales, with state abbreviations next to the points.]
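
Why the nudge shrinks: position adjustments are applied after the scale transformation, so nudge_x = 0.06 moves each label by 0.06 log10 units, which multiplies its x position by 10^0.06 (about 1.15) wherever the point sits. The sketch below checks this on a tiny made-up data frame (hypothetical values, not the murders data).

```r
library(ggplot2)

# Hypothetical toy data, for illustration only.
toy <- data.frame(pop = c(1, 10, 30),
                  total = c(10, 100, 1000),
                  abb = c("A", "B", "C"))

p <- ggplot(toy, aes(x = pop, y = total)) +
  geom_point() +
  geom_text(aes(label = abb), nudge_x = 0.06) +
  scale_x_log10() + scale_y_log10()

# Label positions are stored on the log10 scale: log10(pop) + 0.06.
layer_data(p, 2)$x

# On the original scale, this is the same relative shift for every label.
10^0.06
```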
70 / 80
5. Next, add descriptive labels and a title.
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_point(size = 3) +
geom_text(aes(label = abb), nudge_x = 0.06)+
scale_x_log10() + scale_y_log10() +
labs(title = "US Gun Murders in 2010",
x = "Population in millions (log scale)",
y = "Total number of murders (log scale)", color = "Region")
[Figure: scatter plot titled "US Gun Murders in 2010" with state abbreviations; x-axis: Population in millions (log scale), y-axis: Total number of murders (log scale).]

71 / 80
6. Categories as colors.
▶ Map the region variable to the color aesthetic for geom_point().
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_point(aes(color = region), size = 3) +
geom_text(aes(label = abb), nudge_x = 0.06) +
scale_x_log10() + scale_y_log10() +
labs(title = "US Gun Murders in 2010",
x = "Population in millions (log scale)",
y = "Total number of murders (log scale)", color = "Region")
[Figure: scatter plot titled "US Gun Murders in 2010" with state abbreviations and points colored by Region (North Central, Northeast, South, West); x-axis: Population in millions (log scale), y-axis: Total number of murders (log scale).]

72 / 80
US gun murders

7. Reference line.
▶ Next, we want to add a reference line that represents the average
murder rate for the entire country.
▶ The line is defined by the formula y = rx, where r is the average rate.
▶ On the log10 scale, this line becomes log10(y) = log10(r) + log10(x).
▶ So in our plot, it is a line with slope 1 and intercept log10(r).

r <- murders %>%

summarize(rate = sum(total) / (sum(population/1e6))) %>%
pull(rate) # extract rate as a single number
log10_r <- log10(r)
log10_r

## [1] 1.482095
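
The steps behind the slope and intercept can be written out as follows. The last line back-transforms the printed intercept; the rounded value is my own arithmetic from the output above, with the rate measured in murders per million people as in the code.

```latex
\begin{aligned}
y &= r\,x \\
\log_{10} y &= \log_{10} r + 1 \cdot \log_{10} x \\
\log_{10} r &\approx 1.482 \;\Longrightarrow\; r = 10^{1.482} \approx 30.3
\end{aligned}
```

On the log-log plot the coefficient of log10(x) is the slope (1) and log10(r) is the intercept, which is why geom_abline() needs log10_r rather than r itself.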

73 / 80
US gun murders

7. Reference line.
▶ To add the line, we use the geom_abline() function.
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_point(aes(color = region), size = 3) +
geom_text(aes(label = abb), nudge_x = 0.06) +
scale_x_log10() + scale_y_log10() +
labs(title = "US Gun Murders in 2010",
x = "Population in millions (log scale)",
y = "Total number of murders (log scale)", color = "Region") +
geom_abline(slope = 1, intercept = log10_r, linetype = 2)

74 / 80
US gun murders

[Figure: scatter plot titled "US Gun Murders in 2010" with state abbreviations, points colored by Region (North Central, Northeast, South, West), and a dashed reference line drawn over the points; x-axis: Population in millions (log scale), y-axis: Total number of murders (log scale).]

▶ Next, we can adjust the order of the geom layers: Draw the dashed
line first, so it doesn’t go over the points.

75 / 80
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_abline(slope = 1, intercept = log10_r, linetype = 2) +
geom_point(aes(color = region), size = 3) +
geom_text(aes(label = abb), nudge_x = 0.06) +
scale_x_log10() + scale_y_log10() +
labs(title = "US Gun Murders in 2010",
x = "Population in millions (log scale)",
y = "Total number of murders (log scale)", color = "Region")
[Figure: scatter plot titled "US Gun Murders in 2010" with state abbreviations, points colored by Region (North Central, Northeast, South, West), and the dashed reference line drawn beneath the points; x-axis: Population in millions (log scale), y-axis: Total number of murders (log scale).]

76 / 80
US gun murders

8. ggplot2 extensions:
The power of ggplot2 is augmented further due to the availability
of extension packages.
▶ ggthemes contains many popular themes such as
theme_economist() and theme_wsj().
▶ ggrepel stands for repulsive textual annotations. It includes a
geometry that adds labels while ensuring that they do not fall on
top of each other.
# install.packages(c("ggthemes", "ggrepel"))
library(ggthemes)
library(ggrepel)

Gallery of themes: All your figures belong to us.

77 / 80
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_abline(slope = 1, intercept = log10_r, linetype = 2) +
geom_point(aes(color = region), size = 3) +
geom_text(aes(label = abb), nudge_x = 0.07) +
scale_x_log10() + scale_y_log10() +
labs(title = "US Gun Murders in 2010",
x = "Population in millions (log scale)",
y = "Total number of murders (log scale)", color = "Region") +
theme_economist()

[Figure: the same plot with theme_economist() applied, titled "US Gun Murders in 2010"; the Region legend (North Central, Northeast, South, West) sits at the top; x-axis: Population in millions (log scale), y-axis: Total number of murders (log scale).]

78 / 80
US gun murders

Final touch:
▶ Replace geom_text() with geom_text_repel().
▶ Save the plot to a file with ggsave().
ggplot(murders, aes(x = population/1e6, y = total)) +
geom_abline(slope = 1, intercept = log10_r, linetype = 2) +
geom_point(aes(color = region), size = 3) +
geom_text_repel(aes(label = abb), color = "black") +
scale_x_log10() + scale_y_log10() +
labs(title = "US Gun Murders in 2010",
x = "Population in millions (log scale)",
y = "Total number of murders (log scale)", color = "Region") +
theme_economist()

# Save plot to a file

ggsave("../figures/wk10_murders.png")
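
By default, ggsave() saves the last plot that was displayed. A hedged sketch of a more explicit call is shown below: it names the plot object and the output size. The plot, the size, and the temporary PDF file are assumptions for this example so that it runs anywhere; they are not the course's file layout.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Hypothetical output location; a temporary file keeps the sketch self-contained.
out_file <- tempfile(fileext = ".pdf")

# plot = p names the plot explicitly instead of relying on the last one displayed.
# width and height are in inches by default; the format follows the file extension.
ggsave(out_file, plot = p, width = 6, height = 4)
```

Naming the plot avoids saving the wrong figure when several plots are drawn in the same script.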

79 / 80
US gun murders (final plot)

[Figure: final plot titled "US Gun Murders in 2010", with theme_economist() and non-overlapping state labels from geom_text_repel(); Region legend (North Central, Northeast, South, West) at the top; x-axis: Population in millions (log scale), y-axis: Total number of murders (log scale).]

80 / 80

No ratings yet
ProgrammingForDS15_dataviz (1)
40 pages
Data Visualization in R Sem-III 2021 PDF
No ratings yet
Data Visualization in R Sem-III 2021 PDF
57 pages
MTH 4407 - Group 2 (Dr. Farid Zamani) - Lecture 2
No ratings yet
MTH 4407 - Group 2 (Dr. Farid Zamani) - Lecture 2
25 pages
Ggplot
No ratings yet
Ggplot
10 pages
04 Visualizing Data
No ratings yet
04 Visualizing Data
145 pages
Exercise-9..Study and Implementation of Data Visulization With Ggplot
No ratings yet
Exercise-9..Study and Implementation of Data Visulization With Ggplot
1 page
Ggplot2 Cheatsheet PDF
No ratings yet
Ggplot2 Cheatsheet PDF
2 pages
22MSM40206 Data Visualisation
No ratings yet
22MSM40206 Data Visualisation
13 pages
Ggplot 2: Elegant Graphics For Data Analysis. Second Edition.
No ratings yet
Ggplot 2: Elegant Graphics For Data Analysis. Second Edition.
277 pages
DSR_Unit 2-2.1 ExploringBasicgraphs
No ratings yet
DSR_Unit 2-2.1 ExploringBasicgraphs
51 pages
Content: Dplyr, Readr, TM, Ggplot2/+ggforce/, Tidyr, Broom Dplyr
No ratings yet
Content: Dplyr, Readr, TM, Ggplot2/+ggforce/, Tidyr, Broom Dplyr
8 pages
Data Visualization With Ggplot2::: Cheat Sheet
No ratings yet
Data Visualization With Ggplot2::: Cheat Sheet
2 pages
Visualization in R
No ratings yet
Visualization in R
44 pages
Ggplot2 Cheatsheet
No ratings yet
Ggplot2 Cheatsheet
2 pages
How To Make Any Plot in Ggplot2?: Topics
No ratings yet
How To Make Any Plot in Ggplot2?: Topics
18 pages
Ultimate Cheat SHEET - Analysis in R
No ratings yet
Ultimate Cheat SHEET - Analysis in R
17 pages
Cheat Sheet Ggplot2
No ratings yet
Cheat Sheet Ggplot2
2 pages
Data Visualization
No ratings yet
Data Visualization
2 pages
Data Visualization
No ratings yet
Data Visualization
30 pages
R
No ratings yet
R
5 pages
Visualizing Data in R 4: Graphics Using the base, graphics, stats, and ggplot2 Packages 1st Edition Margot Tollefson All Chapters Instant Download
100% (1)
Visualizing Data in R 4: Graphics Using the base, graphics, stats, and ggplot2 Packages 1st Edition Margot Tollefson All Chapters Instant Download
40 pages
Learning Ggplot2
No ratings yet
Learning Ggplot2
16 pages
Tableau
No ratings yet
Tableau
5 pages
Ggplot 2
No ratings yet
Ggplot 2
48 pages
compiler_queue_syntax_hash_integer_stack
No ratings yet
compiler_queue_syntax_hash_integer_stack
3 pages
Exploratory_Data_Analysis_Course_Notes
No ratings yet
Exploratory_Data_Analysis_Course_Notes
55 pages
Ggplot2 Elegant Graphics For Data Analysis (2016, Springer) PDF
No ratings yet
Ggplot2 Elegant Graphics For Data Analysis (2016, Springer) PDF
281 pages
Computer Vision Graph Cuts: Exploring Graph Cuts in Computer Vision
From Everand
Computer Vision Graph Cuts: Exploring Graph Cuts in Computer Vision
Fouad Sabry
No ratings yet
Histogram Equalization: Enhancing Image Contrast for Enhanced Visual Perception
From Everand
Histogram Equalization: Enhancing Image Contrast for Enhanced Visual Perception
Fouad Sabry
No ratings yet
Scanline Rendering: Exploring Visual Realism Through Scanline Rendering Techniques
From Everand
Scanline Rendering: Exploring Visual Realism Through Scanline Rendering Techniques
Fouad Sabry
No ratings yet
Raster Graphics Editor: Transforming Visual Realities: Mastering Raster Graphics Editors in Computer Vision
From Everand
Raster Graphics Editor: Transforming Visual Realities: Mastering Raster Graphics Editors in Computer Vision
Fouad Sabry
No ratings yet
Solidworks 2018 Learn by Doing - Part 3: DimXpert and Rendering
From Everand
Solidworks 2018 Learn by Doing - Part 3: DimXpert and Rendering
Tutorial Books
No ratings yet
SOLIDWORKS 2017 Learn by doing - Part 3
From Everand
SOLIDWORKS 2017 Learn by doing - Part 3
Tutorial Books
No ratings yet
Vertex Computer Graphics: Exploring the Intersection of Vertex Computer Graphics and Computer Vision
From Everand
Vertex Computer Graphics: Exploring the Intersection of Vertex Computer Graphics and Computer Vision
Fouad Sabry
No ratings yet
Week5 Slides
No ratings yet
Week5 Slides
72 pages
Week6 Slides Updated
No ratings yet
Week6 Slides Updated
57 pages
Week12 Slides
No ratings yet
Week12 Slides
46 pages
Week11 Slides
No ratings yet
Week11 Slides
27 pages
Week3 Slides
No ratings yet
Week3 Slides
36 pages
Week13 Slides Review
No ratings yet
Week13 Slides Review
23 pages
Week2 Slides
No ratings yet
Week2 Slides
76 pages
2 Quiz 1: Platform As A Service (Paas)
100% (1)
2 Quiz 1: Platform As A Service (Paas)
9 pages
5.4.6 Packet Tracer - Explore A Simple Network - ILM
No ratings yet
5.4.6 Packet Tracer - Explore A Simple Network - ILM
4 pages
EDC Differntiator and High Pass Filter
100% (1)
EDC Differntiator and High Pass Filter
4 pages
Loyalty Management From Loyalty Programs To Omnichannel Customer Experiences (Cristina Ziliani, Marco Ieva)
No ratings yet
Loyalty Management From Loyalty Programs To Omnichannel Customer Experiences (Cristina Ziliani, Marco Ieva)
261 pages
2022 Ct505ni LB6 210495981 C10
No ratings yet
2022 Ct505ni LB6 210495981 C10
16 pages
This Study Resource Was: Assessment Task 1 Establish Team Performance Plan
No ratings yet
This Study Resource Was: Assessment Task 1 Establish Team Performance Plan
4 pages
Kunci Gitar Calum Scott - You Are The Reason Chord Dasar Mudah @ PDF
No ratings yet
Kunci Gitar Calum Scott - You Are The Reason Chord Dasar Mudah @ PDF
3 pages
Optimality Conditions
No ratings yet
Optimality Conditions
5 pages
Evaluation of TCM and CRCM Modulation For Totem Pole PFC
No ratings yet
Evaluation of TCM and CRCM Modulation For Totem Pole PFC
7 pages
Calculating Swper Index
100% (1)
Calculating Swper Index
6 pages
12 Sdms 05 Foc Splicing
No ratings yet
12 Sdms 05 Foc Splicing
23 pages
Chapter - 01 Course Introduction
No ratings yet
Chapter - 01 Course Introduction
94 pages
LEM Unilap Geo Specs
No ratings yet
LEM Unilap Geo Specs
6 pages
Analysis of Rigid Jointed Non Sway Frame Using Stiffness Method
No ratings yet
Analysis of Rigid Jointed Non Sway Frame Using Stiffness Method
11 pages
Sieps80000098c 2 0 PDF
No ratings yet
Sieps80000098c 2 0 PDF
363 pages
IE2052 - Advanced Networking Technologies: Virtual Local Area Networks (VLAN) Ms - Hansika Mahaadikara
No ratings yet
IE2052 - Advanced Networking Technologies: Virtual Local Area Networks (VLAN) Ms - Hansika Mahaadikara
50 pages
8a Android Menus and Dialogs
No ratings yet
8a Android Menus and Dialogs
14 pages
AX58100 Datasheet v107
No ratings yet
AX58100 Datasheet v107
73 pages
Manufacturing of Sulfuric Acid by Lead Chamber Process and Contact Process
No ratings yet
Manufacturing of Sulfuric Acid by Lead Chamber Process and Contact Process
14 pages
Sfu Thesis Approval Page
100% (3)
Sfu Thesis Approval Page
4 pages
CSS Shruti Experiment No. 01
No ratings yet
CSS Shruti Experiment No. 01
6 pages
Mean Stack-Sample-1
No ratings yet
Mean Stack-Sample-1
8 pages
System Reliability Theory Models Statistical Methods and Applications 3rd edition by Marvin Rausand, Anne Barros, Arnljot Hoyland 9781119373957 1119373956 - The ebook in PDF/DOCX format is ready for download now
100% (18)
System Reliability Theory Models Statistical Methods and Applications 3rd edition by Marvin Rausand, Anne Barros, Arnljot Hoyland 9781119373957 1119373956 - The ebook in PDF/DOCX format is ready for download now
81 pages
Sober: Open Dsus4 1 D 2 A 6 D
No ratings yet
Sober: Open Dsus4 1 D 2 A 6 D
12 pages
PRAMEYA ACADEMY Maths Test 1
No ratings yet
PRAMEYA ACADEMY Maths Test 1
3 pages
Create Schemas Script
No ratings yet
Create Schemas Script
13 pages
5G Air Interface
100% (1)
5G Air Interface
53 pages
Subhasis CV FP&A
No ratings yet
Subhasis CV FP&A
3 pages
Copy of M2
No ratings yet
Copy of M2
18 pages
How Has The Development of Personal Computer Hardware and Software Reversed Some of The Trends Brought On by The Industrial Revolution?
No ratings yet
How Has The Development of Personal Computer Hardware and Software Reversed Some of The Trends Brought On by The Industrial Revolution?
9 pages