Distributions Plotting
Jan Anders
11/07/2023
There are multiple approaches to displaying a distribution and clarifying concepts to oneself. The approach
you are most likely already familiar with is simulating a random sample that follows a particular distribution
and then plotting it. In its simplest form, this looks like this:
# Simulate a sample of size n = 1000 (default: mean 0, variance 1)
x <- rnorm(1000)
hist(x, freq = FALSE)
[Figure: histogram of x; x-axis: x, y-axis: Density]
In the real world, this is exactly the first thing we do when approaching a problem (plotting,
exploratory data analysis). When we try to explore a theoretical concept, however, it presents
several problems:
• The empirical distribution does not match the true theoretical density closely enough (at least not
for reasonably small sample sizes). This will become especially problematic when you’re dealing with
distributions that are highly skewed or you are interested in the density of extreme values, since you
will have to simulate very often and at high sample sizes to estimate these properly.
• Out of the box, you have to rely on binning to get an understanding of the shape of the density.
You do not get a proper value for the density at any given value of x.
• This is not a clean (mathematical) way to approach the problem.
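To make the first two points concrete, here is a minimal sketch that puts the histogram's bar heights next to the theoretical density at the bin midpoints. The seed and the names h, true_density and comparison are assumptions chosen for illustration; they are not part of the original material.

```r
# Sketch: compare histogram bar heights with the true normal density.
set.seed(42)                        # assumed seed, only for reproducibility
x <- rnorm(1000)
h <- hist(x, plot = FALSE)          # compute the bins without drawing them
true_density <- dnorm(h$mids)       # theoretical density at the bin midpoints
comparison <- data.frame(mid = h$mids,
                         empirical = h$density,
                         theoretical = true_density)
print(comparison)
```

The two columns are usually close in the centre and noticeably off in the tails, where only a handful of observations fall into each bin.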
What is a better way to plot a density function?
There are three other functions for any given distribution. Remember?
• d (dnorm, dbinom, dpois) The density
• p (pnorm, pbinom, ppois) The cumulative distribution function
• q (qnorm, qbinom, qpois) The quantile for a given cumulative probability
The documentation of these functions is arguably a bit lacking, so here’s what they do:
d
Get the density of any (well known) probability distribution for a given parametrisation. Very simple example:
The density (probability) of getting 0 in a Bernoulli experiment with p = 0.5 is 0.5.
dbinom(0, 1, 0.5)
## [1] 0.5
The highest density of the normal distribution is at its mean:
dnorm(0, mean = 0, sd = 1)
## [1] 0.3989423
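As a sanity check, the d-functions can be compared against the textbook formulas. This is only a sketch: the values of n, k, p, x0, mu and sigma below are arbitrary choices for illustration.

```r
# Binomial: P(X = k) = choose(n, k) * p^k * (1 - p)^(n - k)
n <- 10; k <- 3; p <- 0.5
by_hand_binom <- choose(n, k) * p^k * (1 - p)^(n - k)
all.equal(dbinom(k, n, p), by_hand_binom)

# Normal: f(x) = exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
x0 <- 0.7; mu <- 0; sigma <- 1
by_hand_norm <- exp(-(x0 - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
all.equal(dnorm(x0, mu, sigma), by_hand_norm)
```

Both comparisons should return TRUE, which shows that dbinom returns an actual probability while dnorm returns the height of the density curve.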
These functions (like most functions in R) are vectorized, so they can handle multiple values at once, which we
can make use of to create a plot:
x <- c(-1, -0.5, 0, 0.5, 1)
y <- dnorm(x, 0, 1)
plot(x, y, type = "l")
[Figure: line plot of y against x through five points; x-axis: x, y-axis: y]
A bit coarse right now. Let’s use some helper functions to get this to look better. Most useful in this context is
probably the seq function. It creates a sequence of x-values that you can then use to calculate the
densities:
seq(0, 1, by = 0.05)
## [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70
## [16] 0.75 0.80 0.85 0.90 0.95 1.00
Let’s use this to create a plot of the normal density. Don’t forget to set the type of the plot to "l" (lines). The
default just plots the points.
x <- seq(-5, 5, by = 0.05)
y <- dnorm(x, 0, 1)
plot(x, y, type = "l")
[Figure: smooth standard normal density curve; x-axis: x, y-axis: y]
Much better.
Of course, sometimes we want multiple curves in the same diagram to compare different concepts. This is
likely what you’ll want to do most often, since in this course we’re mostly using R to build intuition.
The lines function comes in handy:
# create a plot of the standard normal density we just saw
x <- seq(-5, 5, by = 0.1)
y <- dnorm(x, 0, 1)
plot(x, y, type = "l")
[Figure: standard normal density; x-axis: x, y-axis: y]
Problem: The original plot dictates the ranges of the x- and y-axis, so a curve added afterwards may be cut off.
The quick fix is to manually specify the ranges.
# y2 holds the densities of the second curve that is added with lines()
plot(x, y, type = "l", ylim = c(0, max(y, y2)))
[Figure: both density curves in one plot with a manually specified y-axis range]
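Putting the pieces together, a complete version of this workflow could look like the sketch below. The second curve (a normal density with sd = 0.5), the line types and the legend are assumptions made for illustration; any second distribution works the same way.

```r
x  <- seq(-5, 5, by = 0.1)
y  <- dnorm(x, 0, 1)      # standard normal density
y2 <- dnorm(x, 0, 0.5)    # assumed second curve: same mean, sd = 0.5

# Open the plot with a y-range large enough for both curves ...
plot(x, y, type = "l", ylim = c(0, max(y, y2)))
# ... then draw the second curve on top of the existing plot.
lines(x, y2, lty = 2)
legend("topright", legend = c("sd = 1", "sd = 0.5"), lty = c(1, 2))
```

Without the ylim argument, the peak of the narrower curve would lie above the plotting region, because plot only looks at y when it chooses the axis limits.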
p
Let’s get back to working with distributions. There are two other functions left we need to look at. The
p-function works like the d-function, but it evaluates the cumulative distribution function instead of the
density. It is really helpful when you need to calculate cumulative probabilities (such as in typical probability
theory, testing, ...). You can control the side of the cumulative probability mass with the lower.tail argument
(or you just compute 1 - p). Careful with discrete distributions, though: lower.tail = FALSE gives P(X > q),
which is not the same as P(X ≥ q).
For example, we know that 50% of the probability mass of the normal distribution is left of the mean value.
If we plug in the value 0 for x we get the probability of obtaining 0 or a value smaller than that. In other
words, the probability mass left of 0:
pnorm(0, mean = 0, sd = 1, lower.tail = TRUE) # lower.tail = TRUE is the default
## [1] 0.5
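The effect of lower.tail is easiest to see side by side. The following is a small sketch; the cut-off values 1.96 and 5 and the variable names are arbitrary choices for illustration.

```r
# Continuous case: both ways of getting the upper tail agree.
upper_a <- pnorm(1.96, lower.tail = FALSE)
upper_b <- 1 - pnorm(1.96)
all.equal(upper_a, upper_b)

# Discrete case, X ~ Binomial(10, 0.5): ">" and ">=" are different events.
p_greater  <- pbinom(5, 10, 0.5, lower.tail = FALSE)  # P(X > 5)
p_at_least <- pbinom(4, 10, 0.5, lower.tail = FALSE)  # P(X >= 5)
p_at_least - p_greater                                # same as dbinom(5, 10, 0.5)
```

For a continuous distribution a single point carries no probability mass, so the distinction does not matter. For a discrete one, the mass sitting exactly on the cut-off has to be assigned to one side.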
Let’s repeat the same procedure for plotting we saw with the density function here as well:
x <- seq(-5, 5, by = 0.05)
y_mass <- pnorm(x)
plot(x, y_mass, type = "l")
[Figure: cumulative distribution function of the standard normal; x-axis: x, y-axis: y_mass]
pbinom(5, 10, 0.5, lower.tail = FALSE) + 1/2 * dbinom(5, 10, 0.5)
## [1] 0.5
If you can explain why the result of the above is 0.5, then you’ve probably understood the probability and
density function (and the binomial distribution).
q
One function to go: The quantile function. This one is basically the inverse of the probability (cdf)
function. It gives you the corresponding value of x for a given cumulative probability. Simple example again:
The 50% quantile of the standard normal is at 0 and the 2.5% quantile at about -1.96. By symmetry, the 97.5%
quantile is at about 1.96 (rule of thumb: 2):
qnorm(0.5, 0, 1)
## [1] 0
qnorm(0.025, 0, 1)
## [1] -1.959964
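That q is the inverse of p can be checked directly by feeding the quantiles back into pnorm. A minimal sketch, with the vector of probabilities chosen arbitrarily:

```r
p <- c(0.025, 0.5, 0.975)
q <- qnorm(p)            # quantiles of the standard normal
pnorm(q)                 # maps the quantiles back to the probabilities
all.equal(pnorm(q), p)
```

The round trip works in both directions for continuous distributions such as the normal.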
This function is really helpful in confidence intervals and testing, where you often need to reach a certain
minimum probability and want to know the corresponding value of your random variable. For example, say we
want to know the interval that contains the central 95% of the probability mass of a normally distributed
variable with µ = 50 and Var = 100. We just get the 2.5% and 97.5% quantiles of this variable:
qnorm(c(0.025, 0.975), mean = 50, sd = sqrt(100))
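The same bounds can be derived by hand from the standard normal quantile, and pnorm confirms how much probability mass lies between them. This is a sketch; the names mu, sigma, bounds and by_hand are introduced here for illustration.

```r
mu <- 50
sigma <- sqrt(100)       # sd is the square root of the variance

bounds  <- qnorm(c(0.025, 0.975), mean = mu, sd = sigma)
by_hand <- mu + c(-1, 1) * qnorm(0.975) * sigma   # mu -/+ 1.96 * sigma
all.equal(bounds, by_hand)

# Probability mass between the two bounds
diff(pnorm(bounds, mean = mu, sd = sigma))
```

Note that qnorm expects the standard deviation, not the variance, which is why the call uses sd = sqrt(100).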
r
This one you are likely most familiar with. It simulates a random sample from the distribution. Arguably the
most fun out of the four functions.
rnorm(5)
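Because the output changes with every call, it helps to fix the seed of the random number generator when results need to be reproducible. A short sketch; the seed 123 and the parameters of the larger sample are arbitrary choices for illustration.

```r
set.seed(123)            # assumed seed; any fixed integer works
first <- rnorm(5)
set.seed(123)
second <- rnorm(5)
identical(first, second) # same seed, same draws

# With a large sample, the empirical moments approach the parameters.
big <- rnorm(100000, mean = 50, sd = 10)
c(mean = mean(big), sd = sd(big))
```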
[Figure: curve of dnorm(x); x-axis: x, y-axis: dnorm(x)]
There are many more distribution functions available in R. The nice thing about curve, which plots an
expression in x over a given range, is that it also has an "add" argument, so you can stack as many curves
as you want:
# unregister x
rm(x)
left <- -2
right <- 6
[Figure: several stacked curves; x-axis from -2 to 6]
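A minimal sketch of stacking curves with add = TRUE might look as follows. The three normal densities, their means and the line types are assumptions for illustration; only left and right are taken from the code above.

```r
left <- -2
right <- 6

# The first call sets up the axes; later calls with add = TRUE draw on top.
curve(dnorm(x, mean = 0, sd = 1), from = left, to = right, ylab = "density")
curve(dnorm(x, mean = 2, sd = 1), from = left, to = right, add = TRUE, lty = 2)
curve(dnorm(x, mean = 4, sd = 1), from = left, to = right, add = TRUE, lty = 3)
```

As with plot and lines, the first call fixes the axis ranges, so it should be the curve with the largest values, or it should get an explicit ylim.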
Of course, everything shown here can be done much more professionally: there are density objects and
professional plotting libraries like ggplot2 (in which you can of course also build line plots using the methods
from here). These are sometimes a bit overkill and can take quite long to get right, which is tedious
when you just want to quickly check something. Base R plotting will be sufficient for everything
we do in this course, so we’ll stick to it for now.