Plotting Statistical Distributions and Properties

Jan Anders

11/07/2023

There are multiple approaches to displaying a distribution and clarifying concepts to oneself. The approach
you are most likely already familiar with is simulating a random sample that follows a particular distribution
and then plotting it. In its simplest form, this looks like this:
# Simulate a sample of size n = 1000 (default: mean 0, variance 1)
x <- rnorm(1000)
hist(x, freq = FALSE)

[Figure: histogram titled "Histogram of x"; x-axis: x, y-axis: Density]

In the real world, this is exactly the first thing we do when approaching a problem (plotting,
exploratory data analysis). When we try to explore a theoretical concept, however, it presents us with
several problems:
• The empirical distribution does not match the true theoretical density closely enough (at least not
for reasonably small sample sizes). This becomes especially problematic when you’re dealing with
distributions that are highly skewed or when you are interested in the density of extreme values, since you
would have to simulate many times and at large sample sizes to estimate these properly.
• Out of the box, you have to rely on binning to get an understanding of the shape of the density.
You will not have a proper value for the density at any given value of x.
• This is not a clean (mathematical) way to approach the problem.
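The first point can be made concrete with a short sketch. It compares the bin heights of a simulated histogram with the true density at the bin midpoints, and a simulated tail probability with its exact value. The seed, the sample size and the variable names (h, gap, tail_sim, tail_true) are arbitrary choices for illustration, and the numbers will differ with another seed.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rnorm(1000)

# Compute the histogram without drawing it
h <- hist(x, plot = FALSE)

# Gap between the empirical bin heights and the true density at the bin midpoints
gap <- abs(h$density - dnorm(h$mids))
max(gap)

# Tail probability P(X > 3): simulated share vs. exact value
tail_sim  <- mean(x > 3)
tail_true <- pnorm(3, lower.tail = FALSE)
c(simulated = tail_sim, exact = tail_true)
```

With only 1000 draws, very few (often zero) observations fall beyond 3, so the simulated tail probability is a poor estimate of the exact one.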

What is a better way to plot a density function?
There are three other functions for any given distribution, remember?
• d (dnorm, dbinom, dpois) The density (for discrete distributions: the probability mass)
• p (pnorm, pbinom, ppois) The cumulative distribution function
• q (qnorm, qbinom, qpois) The quantile for a given cumulative probability
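As a rough orientation before the details, the following sketch checks how the three families relate to each other. The evaluation points (1 for the normal, 0:3 and lambda = 2 for the Poisson) and the variable name area are arbitrary choices for illustration.

```r
# p is the integral of d (continuous case)
area <- integrate(dnorm, lower = -Inf, upper = 1)$value
c(integrated = area, pnorm = pnorm(1))

# q is the inverse of p
qnorm(pnorm(1))

# For a discrete distribution, p is the running sum of d
cumsum(dpois(0:3, lambda = 2))
ppois(0:3, lambda = 2)
```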
The documentation of these functions is arguably a bit lacking, so here’s what they do:

d
Get the density of any (well known) probability distribution for a given parametrisation. Very simple example:
The density (probability) of getting 0 in a Bernoulli experiment with p = 0.5 is 0.5.
dbinom(0, 1, 0.5)

## [1] 0.5
The highest density of the normal distribution is at its mean:
dnorm(0, mean = 0, sd = 1)

## [1] 0.3989423
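One detail that is easy to miss: for a discrete distribution the d-function returns probabilities, while for a continuous distribution it returns a density, which is not a probability and can be larger than 1. A small sketch (the standard deviation of 0.1 and the variable names are arbitrary choices for illustration):

```r
# Discrete: d returns probabilities, which sum to 1 over the support
total_prob <- sum(dbinom(0:10, size = 10, prob = 0.5))
total_prob

# Continuous: d returns a density, which can exceed 1 for a narrow distribution
peak <- dnorm(0, mean = 0, sd = 0.1)
peak

# Only the area under the density is a probability (here: between -1 and 1)
area <- diff(pnorm(c(-1, 1), mean = 0, sd = 0.1))
area
```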
These functions (like most functions in R) are vectorized, so they can handle multiple values at once, which we can make
use of to create a plot:
x <- c(-1, -0.5, 0, 0.5, 1)
y <- dnorm(x, 0, 1)
plot(x, y, type = "l")
[Figure: line plot of y against x through five points; x-axis: x from −1 to 1, y-axis: y]

A bit coarse right now. Let’s use some helper functions to get this to look better. Most useful in this context is
probably the sequence function seq. It creates a sequence of x-values that you can then use to calculate the
densities:

seq(0, 1, by = 0.05)

## [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70
## [16] 0.75 0.80 0.85 0.90 0.95 1.00
Let’s use this to create a plot of the normal density. Don’t forget to set the type of the plot to “line”. The
default just plots the points.
x <- seq(-5, 5, by = 0.05)
y <- dnorm(x, 0, 1)
plot(x, y, type = "l")
[Figure: line plot of the standard normal density; x-axis: x from −5 to 5, y-axis: y]

Much better.
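The same recipe works for discrete distributions such as the binomial, except that a connected line would suggest values between the integers. One possible sketch uses vertical spikes instead (type = "h"). The parameters size = 20 and prob = 0.3 and the names k and pmf are arbitrary choices for illustration.

```r
k <- 0:20                                    # support of the binomial
pmf <- dbinom(k, size = 20, prob = 0.3)      # probability of each value

plot(k, pmf, type = "h", ylab = "P(X = k)")  # one vertical spike per value
points(k, pmf, pch = 16)                     # mark the probabilities
```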
Of course, sometimes we want multiple curves in the same diagram to compare different concepts. This is
likely what you’ll most often want to do, since in this course we’re mostly using R to build intuition.
The lines function comes in handy:
# create a plot of the standard normal density we just saw
x <- seq(-5, 5, by = 0.1)
y <- dnorm(x, 0, 1)
plot(x, y, type = "l")

# create new values for a different distribution


y2 <- dnorm(x, 1, 0.5)
lines(x, y2, col = "red")

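[Figure: standard normal density with the second density in red, cut off at the top; x-axis: x, y-axis: y from 0 to 0.4]

Problem: The original plot dictates the ranges of the x- and y-axis, so the taller red curve is cut off.
The quick fix is to manually specify ranges.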
plot(x, y, type = "l", ylim = c(0, max(y, y2)))

lines(x, y2, type = "l", col = "red")


[Figure: both normal densities fully visible; x-axis: x, y-axis: y from 0 to 0.8]
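With more than two curves, it can help to collect them first, derive the y-range from all of them and add a legend. The following is only one way to do it. The list name curves, the labels and y_range are choices made for this sketch.

```r
x <- seq(-5, 5, by = 0.1)

# Collect all curves first so the axis range can be derived from all of them
curves <- list(
  "N(0, 1)"   = dnorm(x, 0, 1),
  "N(1, 0.5)" = dnorm(x, 1, 0.5)
)
y_range <- range(unlist(curves))

plot(x, curves[[1]], type = "l", ylim = y_range, ylab = "density")
lines(x, curves[[2]], col = "red")
legend("topleft", legend = names(curves), col = c("black", "red"), lty = 1)
```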

p
Let’s get back to working with distributions. There are two other functions left we need to look at. The
p-function works like the d-function, but evaluates the cumulative distribution function (CDF) instead of
the density. It is really helpful when you need to calculate cumulative probabilities (such as in typical probability
theory, testing, . . . ). You can control the side of the cumulative probability mass with the lower.tail argument
(or you just use 1 - p; careful with discrete distributions though: the upper tail P(X > x) does not include x itself, so it differs from P(X ≥ x)).
For example, we know that 50% of the probability mass of the normal distribution is left of the mean value.
If we plug in the value 0 for x we get the probability of obtaining 0 or a value smaller than that. In other
words, the probability mass left of 0:
pnorm(0, mean = 0, sd = 1, lower.tail = TRUE) # lower tail = TRUE is default

## [1] 0.5
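The difference between the two tails is easiest to see side by side. The sketch below uses the value 1.96 and a Binomial(10, 0.5) variable as arbitrary examples. The names p_gt5 and p_ge5 are made up for this illustration.

```r
# Continuous: both ways give the same upper-tail probability
pnorm(1.96, lower.tail = FALSE)
1 - pnorm(1.96)

# Discrete: lower.tail = FALSE gives P(X > 5), which excludes 5 itself
p_gt5 <- pbinom(5, size = 10, prob = 0.5, lower.tail = FALSE)

# P(X >= 5) needs the boundary shifted down by one
p_ge5 <- pbinom(4, size = 10, prob = 0.5, lower.tail = FALSE)

# The two differ by exactly P(X = 5)
c(p_gt5 = p_gt5, p_ge5 = p_ge5, p_eq5 = dbinom(5, 10, 0.5))
```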
Let’s repeat the same procedure for plotting we saw with the density function here as well:
x <- seq(-5, 5, by = 0.05)
y_mass <- pnorm(x)
plot(x, y_mass, type = "l")

#adding the density for visualizing the relationship


lines(x, dnorm(x))
[Figure: normal CDF (y_mass) rising from 0 to 1, with the normal density added for comparison; x-axis: x from −5 to 5, y-axis: y_mass]

pbinom(5, 10, 0.5, lower.tail = FALSE) + 1/2 * dbinom(5, 10, 0.5)

## [1] 0.5
If you can explain why the result of the above is 0.5, then you’ve probably understood the probability and
density function (and the binomial distribution).

q
One function to go: the quantile function. This one is basically the inverse of the probability (CDF)
function. It gives you the corresponding value of x for a given cumulative probability. Simple example again: The 50%
quantile of the standard normal is at 0, the 2.5% quantile at −1.96 (and, by symmetry, the 97.5% quantile at 1.96; rule of thumb: 2):
qnorm(0.5, 0, 1)

## [1] 0
qnorm(0.025, 0, 1)

## [1] -1.959964
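That q undoes p can be checked directly. For discrete distributions the inverse is not exact, since the CDF jumps: there, the q-function returns the smallest value x whose cumulative probability reaches the requested level. A short sketch (the probabilities and the name q_med are arbitrary choices for illustration):

```r
p <- c(0.025, 0.5, 0.975)
pnorm(qnorm(p))                  # recovers p

# Discrete case: smallest x with P(X <= x) >= 0.5
q_med <- qbinom(0.5, size = 10, prob = 0.5)
q_med

# The CDF just below and at that value brackets 0.5
pbinom(c(q_med - 1, q_med), size = 10, prob = 0.5)
```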

This function is really helpful in confidence intervals and testing, where you often need to reach a certain
probability and want to know the corresponding value of your random variable.
For example, say we want to know the central 95% interval of a normally distributed variable with
µ = 50 and Var = 100. We just get the 2.5% and 97.5% quantiles of this variable:
qnorm(c(0.025, 0.975), mean = 50, sd = sqrt(100))

## [1] 30.40036 69.59964


This way, you can get rid of the tedious transformation to the standard normal (at least when you’re allowed
to use R).
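For comparison, this is what the transformation by hand would look like: take the quantiles z of the standard normal and rescale them as µ + σ · z. The variable names mu, sigma, z, by_hand and direct are chosen for this sketch.

```r
mu    <- 50
sigma <- sqrt(100)

# By hand: standard normal quantiles, then shift and scale
z       <- qnorm(c(0.025, 0.975))
by_hand <- mu + sigma * z

# Directly: let qnorm handle the parametrisation
direct <- qnorm(c(0.025, 0.975), mean = mu, sd = sigma)

all.equal(by_hand, direct)

# Probability mass between the two bounds
diff(pnorm(direct, mean = mu, sd = sigma))
```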
Plotting quantiles on a normal density:
plot(x, dnorm(x), type = "l")
abline(v = qnorm(c(0.025, 0.975), 0, 1))
[Figure: standard normal density with vertical lines at the 2.5% and 97.5% quantiles; x-axis: x, y-axis: dnorm(x)]
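If you want to see the probability mass between the two quantiles rather than just its borders, the area can be shaded with the base R function polygon. This is one possible sketch. The grid xs, the grey colour and the variable names are arbitrary choices.

```r
x <- seq(-5, 5, by = 0.05)
q <- qnorm(c(0.025, 0.975))

plot(x, dnorm(x), type = "l")

# Fine grid between the two quantiles, closed along the x-axis to form a polygon
xs <- seq(q[1], q[2], length.out = 200)
polygon(c(q[1], xs, q[2]), c(0, dnorm(xs), 0), col = "grey80", border = NA)

lines(x, dnorm(x))               # redraw the curve on top of the shading
abline(v = q, lty = 2)

# The shaded area corresponds to this probability
shaded <- diff(pnorm(q))
shaded
```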

r
This one you are likely most familiar with. It simulates a random sample from the distribution. Arguably the most
fun of the four functions.
rnorm(5)

## [1] 0.2159454 0.2187151 -2.2901460 0.2321641 -0.6583549


I would highly encourage you to use the proper function for a given task. Not only will it make your
code cleaner, you will also revise probability theory while doing it.
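As an illustration of what "the proper function" means: a probability such as P(X > 2) can be approximated by simulation, but the p-function gives it exactly and without random noise. The seed, the sample size and the names sim, estimate and exact are arbitrary choices for this sketch.

```r
set.seed(123)                    # arbitrary seed, only for reproducibility
sim <- rnorm(1e5)

estimate <- mean(sim > 2)                 # simulated, changes with the seed
exact    <- pnorm(2, lower.tail = FALSE)  # exact value from the CDF

c(simulated = estimate, exact = exact)
```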
One more note: A nicer way of plotting a function is the curve function. It takes away the need to define
a sequence of x with proper spacing (you can still control the number of points it calculates in the background
with the parameter n). As its first argument, curve expects either an expression written in terms of x or the
name of a function that returns a value for a given x, so you can also plot any function that you define
yourself.
curve(dnorm(x), from = -5, to = 5)

[Figure: standard normal density drawn with curve; x-axis: x from −5 to 5, y-axis: dnorm(x)]

There are many more distribution functions available in R. The nice thing about curve is that it also has an “add”
parameter, so you can stack as many curves as you want:
# unregister x
rm(x)
left <- -2
right <- 6

curve(dnorm, from = left, to = right, col = 1, ylim = c(0,1))


curve(dexp(x, 1), from = left, to = right, add = TRUE, col = 4)
curve(dgamma(x, 1, 0.5), from = left, to = right, add = TRUE, col = 5)
curve(dweibull(x, 2, 1), from = left, to = right, add = TRUE, col = 6)
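[Figure: densities of the normal, exponential, gamma and Weibull distributions in one plot; x-axis: x from −2 to 6, y-axis: dnorm(x) from 0 to 1]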
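Since curve accepts any function of x, it also works with densities you define yourself. The sketch below plots a made-up mixture of two normals. The function dmix and its parameters are hypothetical and only serve as an example, and n = 501 simply requests a finer grid than the default.

```r
# Hypothetical example: 50/50 mixture of two normal densities
dmix <- function(x) 0.5 * dnorm(x, -1, 0.5) + 0.5 * dnorm(x, 2, 1)

curve(dmix, from = -4, to = 6, n = 501, ylab = "density")

# Add a single normal for comparison, written as an expression in x
curve(dnorm(x, 0.5, 1.8), add = TRUE, col = 2, lty = 2)

# Sanity check: a density should integrate to 1
total <- integrate(dmix, -Inf, Inf)$value
total
```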

Of course, everything shown here can be done much more professionally: there are density objects and
dedicated plotting libraries like ggplot2 (in which you can of course also build line plots using the methods
from here). These are sometimes a bit overkill and it can take quite long to get something right, which is
tedious when you just want to quickly check something. Base R plotting will be sufficient for everything
we do in this course, so we’ll stick to it for now.
