Lecture 1


Statistical Thinking

(ETC2420/ETC5242)

Dan Simpson
Week 1: Random things
Welcome to Statistical Thinking

What are we doing here?

We are learning about randomness


We are learning about using computers for statistics
We are learning how to leverage computational tools to understand the
statistical properties of things
We are going to be using a lot of R!

Who am I?

Name: Daniel (Dan) Simpson (he/him)


From: Queensland originally, but most recently Canada.
How long have I been here? Since May.
Favourite Liza Minnelli Album: Results

How do you contact me?

The discussion forum: Everything that isn’t secret and all questions about
course materials and the content of assessment
Office Hour: The hour after the lecture (same Zoom link). This won’t be
recorded but it will be an easy chance to talk and will go for as long as it
needs to.
Course Email: [email protected]. This is for everything
that isn’t for the discussion forum. It is only accessed by me and the head
tutor Lachlan Macquire.
(Only for something that can’t be read by Lachlan) My email:
[email protected]

What is the assessment?

Week 4: 24-hour take-home quiz (electronic). 15%


Week 7: Assignment 1. Assigned groups of 3-4. 20%
Week 11: Assignment 2. Assigned groups of 3-4. 20%
Final Exam. 2 hours. 45%. (40% hurdle.)

Assessment policies

Lateness: 10% per day or part thereof. (There will be a very small amount of
grace granted for things that could have been conceivably held up by the
submission process. But we are talking a few minutes only.)
Extensions: Please follow the Special Considerations policy.

Academic honesty

Do. Not. Cheat.


If you copy someone’s work, that is cheating.
If you copy an answer from the internet without a clear reference, that is
cheating.
Check the policies for a more thorough definition of cheating and the
consequences.
Do. Not. Cheat. It will ruin your semester and it will ruin mine.

Computing

The course will be run using R.


You are expected to use R and RStudio. Please follow the Week 0
instructions if you have not installed it.
We will assume that you have the latest versions of both R (4.1.0) and
RStudio (1.4.1717).
We also assume you are using the most recent version of all of the packages
(if in doubt re-install them!)
The best resource for R help is always Google. You can also try our
discussion forum.

RMarkdown

We want you to use RMarkdown for your assignments and recommend
using it for the labs and the exercises.
There will be examples throughout the semester on how to use it (e.g. the
Week 1 bonus video).
For the assignments, we may allow the submission of documents that
weren’t prepared in RMarkdown, but there will be extra requirements.

What is in the course?

Random variables

Week 1 Learning Goals (Lab)

The learning goals for the lab in Week 1:

Learn how to set up R and RStudio on your own device.


Learn to install and load R packages.
Learn what R Markdown files and reproducible research are.
Learn what ‘the tidyverse’ is.
Learn some basic R commands to manipulate and plot data.

Week 1 Learning Goals (Lecture)

The learning goals for the lecture in Week 1:

Learn what a random variable is.


Learn what a cumulative distribution function and a probability density function are.
Use simulations to reason about random variables

What is a random variable?

A random variable X is a variable that could potentially take a number of
possible values at random.
An example is X = {Number of heads from 3 coin flips}.

The realisation x of the random variable X is the value of the random
variable after it has been observed.
In the above example, x = 0 is a realisation of the random variable.

If you put more than one random variable together, you get a random vector.
E.g. if X is as above and Y is a variable indicating if my mother will win the
lottery this year, then (X, Y) is a random vector.
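
To make the difference between a random variable and its realisation concrete, here is a minimal sketch in R. It assumes the coin is fair, and the lottery win probability is a made-up number chosen purely for illustration.

```r
set.seed(1)  # assumption: fixed seed so the run is reproducible

# One realisation x of X = number of heads in 3 flips (assuming a fair coin)
x <- rbinom(1, size = 3, prob = 0.5)

# One realisation y of Y = 1 if the lottery is won, 0 otherwise
# (the probability 1e-7 is hypothetical, not taken from the lecture)
y <- rbinom(1, size = 1, prob = 1e-7)

# A realisation of the random vector (X, Y)
xy <- c(X = x, Y = y)
print(xy)
```

Running the last four lines again without resetting the seed gives a new realisation of the same random vector.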

Example: A coin flip

Imagine a biased coin with probability p of seeing a head from a flip.
We can characterize the random variable X (the number of heads) that we get by
flipping the coin n times by considering the probability that it is equal to
0, 1, ..., n.

$$
\begin{aligned}
\Pr(X = 0) &= \Pr(\text{all coins land on tails}) \\
&= \underbrace{(1 - p) \times (1 - p) \times \cdots \times (1 - p)}_{n \text{ times}} \\
&= (1 - p)^n
\end{aligned}
$$

Can we get the probability of exactly k heads?

$$
\begin{aligned}
\Pr(X = k) &= \Pr(k \text{ coins land on heads and } (n - k) \text{ coins land on tails}) \\
&= \underbrace{p \times p \times \cdots \times p}_{k \text{ times}} \times \underbrace{(1 - p) \times (1 - p) \times \cdots \times (1 - p)}_{(n - k) \text{ times}} \\
&= p^k (1 - p)^{n - k}
\end{aligned}
$$

Is this correct?
Let’s do some maths

Obviously we want to check this is correct!
So let’s check the example when n = 2.

$$
\begin{aligned}
\Pr(X = 0) &= (1 - p)^2 \\
\Pr(X = 1) &= p(1 - p) \\
\Pr(X = 2) &= p^2
\end{aligned}
$$

Is this correct?

Ok but maths is annoying. Let’s use a computer

N = 1e+06 # number of simulations
p = 0.3 # Any number between 0 and 1 will work

## Simulate N trials using rbinom
X = rbinom(N, size = 2, prob = p)

## Compare the simulated proportion with the formula for each k
paste("k = 0: ", mean(X == 0), (1 - p)^2)

[1] "k = 0: 0.490152 0.49"

paste("k = 1: ", mean(X == 1), p * (1 - p))

[1] "k = 1: 0.419769 0.21"

paste("k = 2: ", mean(X == 2), p^2)

[1] "k = 2: 0.090079 0.09"


So obviously something went wrong

Pr(X = 1) is twice what we expected.


Why? Because there are 2 ways it could happen: HT or TH!
Each of these events has probability p(1 − p).
So the probability of either HT or TH happening is 2p(1 − p).
Checking your maths with simulations saves a lot of heartache!
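
The same fix extends beyond n = 2: there are choose(n, k) orderings of k heads among n flips, so the naive formula needs that factor in front. The sketch below checks this by simulation; the values n = 5 and p = 0.3 and the seed are arbitrary choices for illustration.

```r
set.seed(42)  # assumption: fixed seed for reproducibility
N <- 1e6      # number of simulations
n <- 5        # number of flips (illustrative choice)
p <- 0.3      # probability of a head (illustrative choice)

X <- rbinom(N, size = n, prob = p)
k <- 0:n

# The formula from the slides, which ignores the ordering of heads and tails
naive <- p^k * (1 - p)^(n - k)

# Multiply by the number of ways to place k heads among n flips
corrected <- choose(n, k) * naive

# Proportion of simulations with exactly k heads
simulated <- sapply(k, function(j) mean(X == j))

print(round(cbind(k, naive, corrected, simulated), 4))
```

The corrected column should match the simulated column closely, and it agrees with R's built-in dbinom(k, n, p).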

Lesson: Everything about probability is a statement about realisations
of random variables

The probability of an event is the idealised proportion of times the event
happens.
It corresponds to how often it would happen on average if we tried an
infinite number of times.
We definitely cannot try things an infinite number of times.
But we can definitely try them a large and finite number of times.
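
A quick sketch of what "a large and finite number of times" buys us: the running proportion of heads settles down near p as the number of flips grows. The value p = 0.3 and the seed are assumptions made for illustration.

```r
set.seed(2021)  # assumption: fixed seed for reproducibility
p <- 0.3        # probability of a head (illustrative choice)

# 100,000 flips of a biased coin: 1 = head, 0 = tail
flips <- rbinom(1e5, size = 1, prob = p)

# Proportion of heads after the first 1, 2, 3, ... flips
running_prop <- cumsum(flips) / seq_along(flips)

# The proportion after 10, 100, ..., 100000 flips
print(running_prop[c(10, 100, 1000, 10000, 100000)])
```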

What about continuous outcomes?

When outcomes are discrete (e.g. number of heads), we can enumerate them
and look at the probability of each outcome.
But when the outcome is continuous (e.g. height), we cannot do this.
So instead we look at the cumulative distribution function or CDF
F(x) = Pr(X ≤ x)
F(x) is always zero at −∞ and always one at ∞.
F(x) never decreases. Why?
For example: The exponential distribution
$F(x) = 1 - e^{-\lambda x}$ for $x \ge 0$,
for some $\lambda > 0$.
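
Here is a minimal sketch that checks the exponential CDF formula against R's built-in pexp and against the empirical CDF of simulated draws. The rate λ = 1 and the helper name exp_cdf are assumptions for illustration, and the plot uses base R rather than ggplot2 to keep the example self-contained.

```r
lambda <- 1  # assumption: illustrative rate parameter

# The CDF written out by hand (hypothetical helper; zero for negative x)
exp_cdf <- function(x, lambda) ifelse(x < 0, 0, 1 - exp(-lambda * x))

x <- seq(0, 5, by = 0.5)
print(cbind(x, formula = exp_cdf(x, lambda), pexp = pexp(x, rate = lambda)))

# Compare with the empirical CDF of a large sample
set.seed(1)  # assumption: fixed seed for reproducibility
draws <- rexp(1e5, rate = lambda)
empirical <- ecdf(draws)
print(max(abs(empirical(x) - exp_cdf(x, lambda))))

# Draw the CDF when running interactively
if (interactive()) curve(exp_cdf(x, lambda), from = 0, to = 5, ylab = "F(x)")
```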

We can plot the CDF!

[Figure: plot of the CDF; x-axis: X (0 to 4), y-axis: y (0 to 1)]

But is this really how we usually visualise a distribution?

data %>%
ggplot(aes(x = X)) + geom_histogram(...)

[Figure: histogram of X; x-axis: X (0 to 6), y-axis: density]

This histogram “is” the derivative of the CDF

We call this the probability density function (PDF).
We usually write it as p(x) and it satisfies
$$
\Pr(X \le x) = \int_{-\infty}^{x} p(t)\,dt
$$
By the fundamental theorem of calculus, p(x) is the derivative of the CDF.
So for the exponential distribution with $F(x) = 1 - e^{-\lambda x}$, the PDF is
$$
p(x) = \lambda e^{-\lambda x}
$$
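
The relationship between the two functions can be checked numerically. This sketch differentiates the CDF with a central difference and integrates the PDF with integrate(); the rate λ = 2, the step size h and the grid are arbitrary choices for illustration.

```r
lambda <- 2   # assumption: illustrative rate parameter
h <- 1e-5     # step size for the numerical derivative
x <- seq(0.1, 3, by = 0.1)

# Central-difference approximation to the derivative of the CDF
numeric_deriv <- (pexp(x + h, rate = lambda) - pexp(x - h, rate = lambda)) / (2 * h)

# The PDF from the formula
pdf_formula <- lambda * exp(-lambda * x)
print(max(abs(numeric_deriv - pdf_formula)))

# Going the other way: integrating the PDF up to 1.5 recovers F(1.5)
# (the lower limit is 0 because the density is zero for negative values)
area <- integrate(function(t) lambda * exp(-lambda * t), lower = 0, upper = 1.5)$value
print(c(integral = area, cdf = pexp(1.5, rate = lambda)))
```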

Here’s a plot

[Figure: density plot; x-axis: X (0 to 6), y-axis: density (0 to 1)]

A confession: That last plot was not very easy to make

What is the value of simulation?

We can use simulations to look at weird functions of random variables.
This is important because a test statistic in a hypothesis test is a function of
the data and the data is assumed to be iid realisations of a random variable.
An example: If $z_1, z_2 \sim N(0, 1)$ independently, what is the density of
$x = z_1^2 + z_2^2$?
(You can prove using calculus that x has a Chi-squared distribution with 2
degrees of freedom)
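
A minimal sketch of how this could be checked by simulation: draw many independent pairs of standard normals, form the sum of squares, and compare its quantiles with those of a Chi-squared distribution with 2 degrees of freedom. The number of simulations and the seed are assumptions, and the plot uses base R rather than ggplot2.

```r
set.seed(123)  # assumption: fixed seed for reproducibility
N <- 1e5       # number of simulations (illustrative choice)

z1 <- rnorm(N)
z2 <- rnorm(N)
x <- z1^2 + z2^2

# Compare simulated quantiles with the Chi-squared(2) quantiles
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
print(rbind(simulated = quantile(x, probs), chisq_2 = qchisq(probs, df = 2)))

# Histogram with the Chi-squared(2) density on top, when running interactively
if (interactive()) {
  hist(x, breaks = 100, freq = FALSE)
  curve(dchisq(x, df = 2), add = TRUE)
}
```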

Doing it with simulations

[Figure: density of x from simulation; x-axis: x (0 to 10), y-axis: density (0 to 0.5)]

But we can do much more complex things

Scenario: You are in your mum’s car and she has the radio tuned to an oldies
station. You notice that they play Meat Loaf’s Paradise By The Dashboard Light
every time the disc jockey needs a break. Curious, you time the intervals between
plays and you notice that the gaps are roughly exponentially distributed with
rate parameter λ = 0.01.

1. Assuming the gaps are independent, what is the median amount of time
between spins of Paradise by the Dashboard Light?
2. If you listen to the station long enough for Paradise by the Dashboard Light
to be played five times, what is the distribution of the smallest gap between
plays? (Assuming playing times are independent)

The exact solutions: Part 1.

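The median of a continuous distribution is the point $x_{0.5}$ such that $F(x_{0.5}) = 0.5$.
Plugging in the CDF for the exponential distribution, we get
$$
\begin{aligned}
F(x_{0.5}) &= \frac{1}{2} \\
1 - e^{-\lambda x_{0.5}} &= \frac{1}{2} \\
\frac{1}{2} &= e^{-\lambda x_{0.5}} \\
-\log 2 &= -\lambda x_{0.5} \\
x_{0.5} &= \frac{\log 2}{\lambda}
\end{aligned}
$$
In this case, we get $x_{0.5} \approx 69.3$.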
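
The median can be double-checked in R in three ways: the formula above, the built-in quantile function qexp, and the median of a large simulated sample. The sample size and seed are assumptions for illustration.

```r
lambda <- 0.01  # rate parameter from the scenario

# 1. The formula derived above
median_formula <- log(2) / lambda

# 2. R's quantile function for the exponential distribution
median_qexp <- qexp(0.5, rate = lambda)

# 3. The median of many simulated gaps
set.seed(7)  # assumption: fixed seed for reproducibility
gaps <- rexp(1e5, rate = lambda)
median_sim <- median(gaps)

print(c(formula = median_formula, qexp = median_qexp, simulated = median_sim))
```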

Part 2 is harder.

It is possible to work out this answer mathematically.

Let $X_1, \ldots, X_4$ be the lengths of the four gaps.
We can assume that $X_1, \ldots, X_4 \overset{\text{iid}}{\sim} \text{Exp}(0.01)$.
Let $Y = \min\{X_1, X_2, X_3, X_4\}$.
Then we can show (how?) that
$$
\Pr(Y \le y) = 1 - e^{-0.04 y}
$$
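
One way to see this (a sketch, using only independence and the exponential CDF) is to work with the complement: the minimum exceeds y exactly when every gap exceeds y. For y ≥ 0,

```latex
\begin{aligned}
\Pr(Y > y) &= \Pr(X_1 > y,\, X_2 > y,\, X_3 > y,\, X_4 > y) \\
&= \prod_{i=1}^{4} \Pr(X_i > y) && \text{(independence)} \\
&= \left(e^{-0.01 y}\right)^4 && \text{(since } \Pr(X_i > y) = 1 - F(y) = e^{-0.01 y}\text{)} \\
&= e^{-0.04 y},
\end{aligned}
\qquad\text{so}\qquad
\Pr(Y \le y) = 1 - e^{-0.04 y}.
```

In other words, Y is itself exponentially distributed with rate 0.04.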

Task: Use a simulation to validate this result

[Figure: density of y; x-axis: y (0 to 200), y-axis: density (0 to 0.04)]
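
One possible way to tackle this task, sketched under the assumption that 100,000 simulated listening sessions are enough: simulate four gaps per session, take the smallest, and compare the proportion of minima below a few thresholds with the claimed CDF. The thresholds in y_grid and the seed are arbitrary choices.

```r
set.seed(99)  # assumption: fixed seed for reproducibility
N <- 1e5      # number of simulated sessions (illustrative choice)

# Each row holds the four gaps from one session
gaps <- matrix(rexp(4 * N, rate = 0.01), nrow = N, ncol = 4)

# Smallest gap in each session
Y <- pmin(gaps[, 1], gaps[, 2], gaps[, 3], gaps[, 4])

# Compare the simulated Pr(Y <= y) with the claimed formula
y_grid <- c(10, 25, 50, 100, 150)
simulated <- sapply(y_grid, function(y) mean(Y <= y))
claimed <- 1 - exp(-0.04 * y_grid)

print(cbind(y = y_grid, simulated, claimed))
```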

Simulations and hypothesis tests

What is a hypothesis test? (A reminder)

Hypothesis tests are used to assess whether or not a data set is meaningfully
different from some baseline distribution.
Often, the baseline distribution corresponds to the least interesting case.
In this situation, the hypothesis test can be used to assess if the data is
different from the least interesting case.
But how do you assess difference? We need a single number summary.
This is the test statistic. It usually measures something relevant.
For instance, if our null hypothesis was that the mean of a random variable
was zero, then a test statistic might be T(y) = ȳ.

A note of caution

A hypothesis test is only as meaningful as its assumptions.


The distribution under the null hypothesis needs to be plausible.
The test statistic needs to be useful (e.g. two distributions can have the same
mean but be very, very different).

A hypothesis test

Imagine you have a sample of 100 times between plays of Paradise by the
Dashboard Light and you wanted to test if the average time between plays was
more than one hour (60 minutes).
$H_0: \mu = 60$ vs $H_1: \mu > 60$.
Furthermore, we know that whatever $\mu$ is, the data is iid $\text{Exp}(\mu^{-1})$.
Our test statistic is $T(y) = \bar{y} - 60$.
For our observed data, we have $T(y) = 11$.
Is this observed data unusual under $H_0$?

How do we answer this with simulation?

We want to look at Pr(T(y) > 11) when y is 100 iid samples from Exp(1/60).
We can compute this if we have samples from T(y).
We can get one sample from T(y) under H0 as follows:
1. Sample 100 iid exponentials with mean 60: y = rexp(100, 1/60)
2. Compute mean(y) - 60
If we get a lot of samples from T(y) under H0, we can assess how unusual the
value of 11 is.

A plot

[Figure: histogram of the simulated test statistic; x-axis: test (−20 to 20), y-axis: count]

[1] "Prob T(y) > 11 = 0.03971"
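
A minimal sketch of how a number like this could be produced, following the two steps described for sampling T(y) under H0. The number of simulations (100,000) and the seed are assumptions, so the estimate will not match the value above digit for digit.

```r
set.seed(2420)   # assumption: fixed seed for reproducibility
n_sims <- 1e5    # number of simulated data sets (assumption)
n_obs <- 100     # sample size from the example
t_obs <- 11      # observed value of the test statistic

# Each replicate: sample 100 iid exponentials with mean 60, then compute T(y)
T_null <- replicate(n_sims, mean(rexp(n_obs, rate = 1 / 60)) - 60)

# Proportion of simulated test statistics that exceed the observed one
p_value <- mean(T_null > t_obs)
print(paste("Prob T(y) > 11 =", p_value))

# Histogram of the simulated test statistic, when running interactively
if (interactive()) hist(T_null, breaks = 50, xlab = "test", main = "")
```

Since the estimated probability is small, the observed value of 11 would be fairly unusual if the mean really were 60.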


This is a useful general technique that we will look at more

This is a generic procedure that lets us perform a hypothesis test without
needing to know the precise distribution of the test statistic!
We need 2 ingredients: A test statistic and a null hypothesis that specifies the
distribution of the data under the null.
This procedure is often called a Monte Carlo hypothesis test.
