Lecture 1
(ETC2420/ETC5242)
Dan Simpson
Week 1: Random things
Welcome to Statistical Thinking
What are we doing here?
Who am I?
How do you contact me?
The discussion forum: Everything that isn’t secret and all questions about
course materials and the content of assessment
Office Hour: The hour after the lecture (same Zoom link). This won’t be
recorded but it will be an easy chance to talk and will go for as long as it
needs to.
Course Email: [email protected]. This is for everything
that isn’t for the discussion forum. It is only accessed by me and the head
tutor Lachlan Macquire.
(Only for something that can’t be read by Lachlan) My email:
[email protected]
What is the assessment?
Assessment policies
Lateness: 10% per day or part thereof. (There will be a very small amount of
grace granted for things that could have been conceivably held up by the
submission process. But we are talking a few minutes only.)
Extensions: Please follow the Special Considerations policy.
Academic honesty
Computing
RMarkdown
What is in the course?
Random variables
Week 1 Learning Goals (Lab)
Week 1 Learning Goals (Lecture)
What is a random variable?
If you put more than one random variable together, you get a random vector.
E.g. if X is as above and Y is a variable indicating whether my mother will win
the lottery this year, then (X, Y) is a random vector.
Example: A coin flip
= (1 − p)^n
Pr(X = 0) = (1 − p)^2
Pr(X = 1) = p(1 − p)
Pr(X = 2) = p^2
Is this correct?
Ok but maths is annoying. Let’s use a computer
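One way to let the computer check the coin-flip probabilities is to simulate the experiment many times. The sketch below assumes X is the number of heads in two independent flips; the value of p, the number of simulations, and the seed are illustrative choices, not values from the slides. Comparing the rows is one way to answer "Is this correct?".

```r
set.seed(1)            # fixed seed so the sketch is reproducible
p <- 0.3               # ASSUMPTION: probability of heads, for illustration
n_sims <- 100000       # ASSUMPTION: number of simulated experiments

# X = number of heads in two independent flips
x <- rbinom(n_sims, size = 2, prob = p)

# Empirical frequency of each outcome 0, 1, 2
empirical <- as.numeric(table(factor(x, levels = 0:2))) / n_sims

# The three formulas as written on the slide, and R's binomial pmf
slide <- c((1 - p)^2, p * (1 - p), p^2)
exact <- dbinom(0:2, size = 2, prob = p)

print(round(rbind(empirical, slide, exact), 3))

# Probabilities over all outcomes should add up to one
c(slide_total = sum(slide), exact_total = sum(exact))
```

If a set of probabilities over all possible outcomes does not sum to one, something is missing from it.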
Lesson: Everything about probability is a statement about realisations
of random variables
What about continuous outcomes
When outcomes are discrete (e.g. number of heads), we can enumerate them
and look at the probability of each outcome.
But when they are continuous (e.g. height), we cannot do this.
So instead we look at the cumulative distribution function, or CDF:
F(x) = Pr(X ≤ x)
F(x) always tends to zero as x → −∞ and to one as x → ∞
F(x) never decreases. Why?
For example: The exponential distribution
F(x) = 1 − e^{−λx} for x ≥ 0,
for some λ > 0.
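The properties of the CDF listed above can be checked numerically for the exponential case. This is a minimal sketch: the rate `lambda = 2`, the grid, and the helper name `exp_cdf` are assumptions made for illustration.

```r
lambda <- 2   # ASSUMPTION: rate chosen only for illustration

# Hand-written exponential CDF: F(x) = 1 - exp(-lambda * x) for x >= 0, else 0
exp_cdf <- function(x, lambda) ifelse(x < 0, 0, 1 - exp(-lambda * x))

x <- seq(0, 5, by = 0.01)
F_x <- exp_cdf(x, lambda)

# The formula should agree with R's built-in pexp()
agrees_with_pexp <- isTRUE(all.equal(F_x, pexp(x, rate = lambda)))

# A CDF never decreases: successive differences are non-negative
never_decreases <- all(diff(F_x) >= 0)

c(agrees_with_pexp = agrees_with_pexp, never_decreases = never_decreases)
```

Calling `plot(x, F_x, type = "l")` afterwards draws the curve.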
We can plot the CDF!
[Figure: plot of the CDF; x-axis X, y-axis y]
But is this really how we usually visualise a distribution?
data %>%
ggplot(aes(x = X)) + geom_histogram(...)
[Figure: histogram of X; x-axis X, y-axis density]
This histogram “is” the derivative of the CDF
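The link between the histogram and the CDF can be sketched in two steps: the slope of the CDF matches the density, and the height of each histogram bar is (approximately) the change in the CDF across the bin divided by the bin width. The rate, step size, bin width, and sample size below are assumptions chosen for illustration.

```r
set.seed(1)
lambda <- 2          # ASSUMPTION: rate, for illustration only
h <- 1e-5            # small step for the finite difference

# 1. Slope of the CDF versus the density
x <- seq(0.1, 4, by = 0.1)
cdf_slope <- (pexp(x + h, rate = lambda) - pexp(x - h, rate = lambda)) / (2 * h)
density <- dexp(x, rate = lambda)
max(abs(cdf_slope - density))   # should be very small

# 2. Histogram bar heights versus differences of the CDF over each bin
bin_width <- 0.25
draws <- rexp(100000, rate = lambda)
breaks <- seq(0, max(draws) + bin_width, by = bin_width)
bar_height <- hist(draws, breaks = breaks, plot = FALSE)$density
cdf_change <- diff(pexp(breaks, rate = lambda)) / bin_width
max(abs(bar_height - cdf_change))   # small, up to simulation noise
```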
Here’s a plot
[Figure: x-axis X, y-axis density]
A confession: That last plot was not very easy to make
What is the value of simulation?
Doing it with simulations
[Figure: x-axis x, y-axis density]
But we can do much more complex things
Scenario: You are in your mum’s car and she has the radio tuned to an oldies
station. You notice that they play Meat Loaf’s Paradise By The Dashboard Light
every time the disk jockey needs a break. Curious, you time the intervals between
plays and you notice that the gaps are roughly exponentially distributed with
rate parameter λ = 0.01.
1. Assuming the gaps are independent, what is the median amount of time
between spins of Paradise by the Dashboard Light?
2. If you listen to the station long enough for Paradise by the Dashboard Light
to be played five times, what is the distribution of the smallest gap between
plays? (Assuming playing times are independent.)
The exact solutions: Part 1.
The median of a continuous distribution is the point x_{0.5} such that F(x_{0.5}) = 0.5.
Plugging in the CDF for the exponential distribution, we get
F(x_{0.5}) = 1/2
1 − e^{−λ x_{0.5}} = 1/2
1/2 = e^{−λ x_{0.5}}
−log 2 = −λ x_{0.5}
x_{0.5} = (log 2)/λ
In this case, we get x_{0.5} ≈ 69.3.
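The median can be cross-checked by simulation. A minimal sketch, using the scenario's rate λ = 0.01; the number of simulated gaps and the seed are assumptions.

```r
set.seed(1)
lambda <- 0.01                 # rate from the scenario
n_sims <- 100000               # ASSUMPTION: number of simulated gaps

# Simulate many gaps between plays
gaps <- rexp(n_sims, rate = lambda)

exact_median <- log(2) / lambda    # the formula derived above
sim_median <- median(gaps)         # sample median of the simulated gaps

# qexp() is R's built-in exponential quantile function
c(exact = exact_median, simulated = sim_median, qexp = qexp(0.5, rate = lambda))
```

The three numbers should agree up to simulation noise.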
Part 2 is harder.
Task: Use a simulation to validate this result
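One possible way to attempt the task is sketched below. It relies on two assumptions that are not in the slide text: that five plays are separated by four gaps (`n_gaps = 4`; change it if you count differently), and that the result to validate is the standard fact that the minimum of k iid Exp(λ) variables is Exp(kλ).

```r
set.seed(1)
lambda <- 0.01     # rate from the scenario
n_gaps <- 4        # ASSUMPTION: five plays leave four gaps between them
n_sims <- 100000   # ASSUMPTION: number of simulated listening sessions

# Each replicate is one session: draw n_gaps independent gaps, keep the smallest
min_gap <- replicate(n_sims, min(rexp(n_gaps, rate = lambda)))

# Standard result: the minimum of k iid Exp(lambda) variables is Exp(k * lambda)
c(simulated_mean = mean(min_gap), theory_mean = 1 / (n_gaps * lambda))

# Compare the empirical CDF with the Exp(n_gaps * lambda) CDF at a few points
t_grid <- c(10, 25, 50, 100)
comparison <- rbind(simulated = ecdf(min_gap)(t_grid),
                    theory = pexp(t_grid, rate = n_gaps * lambda))
print(round(comparison, 3))
```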
[Figure: y-axis density]
Simulations and hypothesis tests
What is a hypothesis test (a reminder)
Hypothesis tests are used to assess whether or not a data set is meaningfully
different from some baseline distribution.
Often, the baseline distribution corresponds to the least interesting case.
In this situation, the hypothesis test can be used to assess if the data is
different from the least interesting case.
But how do you assess difference? We need a single number summary.
This is the test statistic. It usually measures something relevant.
For instance, if our null hypothesis was that the mean of a random variable
was zero, then a test statistic might be T(y) = ȳ
A note of caution
A hypothesis test
Imagine you have a sample of 100 times between plays of Paradise by the
Dashboard Light and you want to test whether the average time between plays
is more than one hour.
H0 : µ = 60 vs H1 : µ > 60.
Furthermore, we know that whatever µ is, the data is iid Exp(µ^{−1}).
Our test statistic is T(y) = ȳ − 60
For our observed data, we have T(y) = 11
Is this observed data unusual under H0 ?
How do we answer this with simulation?
We want to look at Pr(T(y) > 11) when y is 100 iid samples from Exp(1/60).
We can compute this if we have samples from T(y).
We can get one sample from T(y) under H0 as follows:
1. Sample 100 iid exponentials with mean 60: y = rexp(100, 1/60)
2. Compute mean(y) - 60
If we get a lot of samples from T(y) under H0, we can assess how unusual the
value of 11 is.
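The two steps above can be repeated many times with `replicate()`. This is a minimal sketch: the number of simulated data sets, the seed, and the variable names are assumptions; the sample size, null mean, and observed statistic come from the slides.

```r
set.seed(1)
n_obs <- 100        # sample size from the slide
mu0 <- 60           # mean under H0
t_obs <- 11         # observed test statistic from the slide
n_sims <- 10000     # ASSUMPTION: number of simulated data sets

# One draw of T(y) under H0: mean of 100 iid Exp(1/60) values, minus 60
sim_T <- replicate(n_sims, mean(rexp(n_obs, rate = 1 / mu0)) - mu0)

# Monte Carlo estimate of Pr(T(y) > 11) under H0
p_value <- mean(sim_T > t_obs)
p_value
```

A histogram of `sim_T` with the observed value marked shows the same information graphically.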
A plot
[Figure: histogram; x-axis test, y-axis count]