
Two-Sample T-tests

Decisions with Data | Inference for means

© University of Sydney MATH1062/1005


15 October 2024

Module 4: Decisions with Data

The z-test
How can we make evidence-based decisions? Is an observed result due to chance or something else? How can we test whether a population has a certain proportion?

The t-test
How can we test whether an unknown population has a certain mean?

The two-sample test


How can we test whether two variables have the same mean?

The χ²-test
How can we compare frequencies across categories?
2/40

Today’s outline

Exam marks: bootstrap simulation

Comparing two sample means


The classical two-sample t-test

The Welch test

3/40
Exam marks: bootstrap simulation
Exam marks: bootstrap simulation
· What about a simulation? We repeatedly sample from the data, and compute the
value taken by the t-statistic.

n = length(marks)
sig.hat = sd(marks)
t = (mean(marks) - 65)/(sig.hat/sqrt(n)) # obtaining the original t statistic
t

## [1] -2.371497

T.stats.sim = numeric(10000)  # producing several simulated/bootstrap t statistics
for (i in 1:10000) {
    samp = sample(marks, size = n, replace = T)
    sig.hat = sd(samp)
    T.stats.sim[i] = (mean(samp) - mean(marks))/(sig.hat/sqrt(n))  # mean(marks) is the 'population mean' here
}

5/40
Exam marks: bootstrap simulation
hist(T.stats.sim, breaks = 50, pr = T)
curve(dt(x, df = n - 1), add = T, lty = 2)
legend("topright", legend = c("Student's t-dist. with 99 d.f."), lty = 2)

6/40
Exam marks: bootstrap p-value
· How significant is our observed t-statistic value of -2.371, based on the simulation?
· What proportion of the values in T.stats.sim exceed abs(t) = 2.371 in absolute value?

mean(abs(T.stats.sim) > abs(t))

## [1] 0.0209

· We get an observed t-statistic of −2.37 and a p-value of 0.0209 .

7/40
Exam marks: bootstrap confidence interval
· We firstly get the upper and lower 2.5% points from T.stats.sim :

u.l = quantile(T.stats.sim, prob = c(0.975, 0.025))
u.l

## 97.5% 2.5%
## 2.022866 -1.966097

· These are then used to construct the interval [X̄ − u·σ̂/√n, X̄ − ℓ·σ̂/√n]:

mean(marks) - u.l * sd(marks)/sqrt(n)

## 97.5% 2.5%
## 60.29340 64.56579

8/40
Reflecting on the three different methods
· The p-values of each test:

Test z-test t-test bootstrap


p-value 0.0111 0.0197 0.0209
· What can we conclude? All three methods produce a significant result.
· The bootstrap makes the fewest assumptions, and the z-test and t-test results are close to it.
 - So the assumptions were probably reasonable.

9/40
Comparing two sample means
Red Bull example
· Red Bull is an energy drink advertised to “give you wings”.

· What does research say about the medical effects of drinking a Red Bull?

· Consider the following data on heart rates (beats per minute), for 2 independent
groups of Sydney students, collected 20 minutes after the ‘RedBull’ group had
drunk a 250ml cold can of Red Bull.

No Red Bull 84 76 68 80 64 62 74 84 68 96 80 64 65 66
Red Bull 72 88 72 88 76 75 84 80 60 96 80 84 - -

11/40
Red Bull example
No_RB <- c(84, 76, 68, 80, 64, 62, 74, 84, 68, 96, 80, 64, 65, 66)
RB <- c(72, 88, 72, 88, 76, 75, 84, 80, 60, 96, 80, 84)
boxplot(No_RB, RB, names = c("No RB", "RB"), horizontal = T)

· The Red Bull group seems to have a higher heart rate.


· Is the apparent difference significant?

12/40
Two-box model
· We can model the two groups as samples taken from two separate boxes
(independently of each other).
· the No Red Bull group is considered as a random sample 𝑋1 , … , 𝑋𝑚 with
replacement from a box with:
- mean 𝜇𝑋 and
- SD 𝜎𝑋
· the Red Bull group is considered as a random sample 𝑌1 , … , 𝑌𝑛 with
replacement from a box with:
- mean 𝜇𝑌 and
- SD 𝜎𝑌
· We wish to make a statement about the difference in population means μ_X − μ_Y, based on the sample mean difference X̄ − Ȳ.

13/40
Expected value and SE of X̄ − Ȳ
· E(X̄ − Ȳ) = E(X̄) + E(−Ȳ) = E(X̄) − E(Ȳ) = μ_X − μ_Y
· where the second equality follows from E(aX) = aE(X) for a random draw X.
· Most importantly,

SE(X̄ − Ȳ)² = SE(X̄)² + SE(−Ȳ)² = σ²_X/m + σ²_Y/n.

· where the second equality follows from SE(−X) = SE(X) for a random draw X.
14/40
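The SE formula above can be checked with a quick Monte Carlo sketch (not from the slides; the means, box SDs, and seed below are hypothetical, with sample sizes borrowed from the Red Bull example):

```r
# Draw many independent pairs of samples and compare the empirical SD of
# Xbar - Ybar with the theoretical SE sqrt(sig.X^2/m + sig.Y^2/n).
set.seed(1)
m = 14; n = 12            # sample sizes as in the Red Bull example
sig.X = 10; sig.Y = 9     # hypothetical box SDs
diffs = replicate(20000, mean(rnorm(m, 70, sig.X)) - mean(rnorm(n, 75, sig.Y)))
sd(diffs)                     # empirical SE of Xbar - Ybar
sqrt(sig.X^2/m + sig.Y^2/n)   # theoretical SE, about 3.73
```

The two numbers agree to about two decimal places, which is what the addition-of-variances argument predicts for independent samples.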
Two-sample test statistic
· We wish to test the null hypothesis H₀: μ_X = μ_Y.
· An observed z-statistic is usually given by

z = (observed − expectation) / (standard error)

where the expectation and standard error are taken assuming H₀ is true.
· The observation is the observed sample mean difference x̄ − ȳ.
· The expectation is the difference μ_X − μ_Y = 0 under H₀.
· The standard error is √(σ²_X/m + σ²_Y/n).
· This gives us

z = (x̄ − ȳ) / √(σ²_X/m + σ²_Y/n)
15/40
Two-sample test statistic
· Assuming σ_X and σ_Y are known, the Z-statistic is distributed

Z = (X̄ − Ȳ) / √(σ²_X/m + σ²_Y/n) ∼ N(0, 1)

 - using the fact that X̄ and Ȳ are approximately normal by the CLT.
 - using the fact (which you have not seen) that the difference of two (independent) normal draws is also normal.
· In general, σ_X and σ_Y are both unknown.
· In this case we have two options:
 - assume σ_X = σ_Y = σ is the same in both boxes: the classical two-sample t-test
 - do not assume σ_X = σ_Y = σ is the same in both boxes: the Welch test.
16/40
The classical two-sample t-test
Equal variance assumption
· In some cases it is reasonable to assume σ_X = σ_Y = σ.
 - This is often called an equal variances assumption, i.e. σ²_X = σ²_Y.
· Then the SE may be written as

SE(X̄ − Ȳ) = σ √(1/m + 1/n).
18/40
Extra assumptions: Student’s t-distribution
· In this case, if
 - it is also assumed the boxes are (approx.) normal-shaped, and
 - a special “combined” or “pooled” estimate σ̂_p of the common σ is used,
then Student’s theory can be applied to show the statistic

T = (X̄ − Ȳ) / (σ̂_p √(1/m + 1/n)) ∼ t_{m+n−2}

i.e. has Student’s-t distribution with m + n − 2 degrees of freedom.
19/40
The pooled estimate σ̂_p
· The form of the pooled estimate of σ is given by

σ̂_p = √( [ Σᵢ₌₁ᵐ (Xᵢ − X̄)² + Σⱼ₌₁ⁿ (Yⱼ − Ȳ)² ] / (m + n − 2) )
    = √( [ (m − 1)σ̂²_X + (n − 1)σ̂²_Y ] / (m + n − 2) ).

· Written this way, we see σ̂²_p is a weighted average of σ̂²_X and σ̂²_Y.
 - The bigger sample gets more weight.
 - The estimate from the larger sample is “more trustworthy”.
· Why do we have this strange denominator m + n − 2?

20/40
Squared estimate σ̂²_p is “on target” for σ²
· Recall that each sample variance (squared sample SD) estimates σ² “on average”, in that

E(σ̂²_X) = E( (1/(m − 1)) Σᵢ₌₁ᵐ (Xᵢ − X̄)² ) = σ²

and so

E( (m − 1)σ̂²_X ) = E( Σᵢ₌₁ᵐ (Xᵢ − X̄)² ) = (m − 1)σ²

· Similarly we have

E( (n − 1)σ̂²_Y ) = E( Σⱼ₌₁ⁿ (Yⱼ − Ȳ)² ) = (n − 1)σ²

21/40
Squared estimate σ̂²_p is “on target” for σ²
· Then the numerator inside the √· has

E( Σᵢ₌₁ᵐ (Xᵢ − X̄)² + Σⱼ₌₁ⁿ (Yⱼ − Ȳ)² ) = E( (m − 1)σ̂²_X ) + E( (n − 1)σ̂²_Y )
  = (m − 1)σ² + (n − 1)σ²
  = (m + n − 2)σ².

· Dividing through by m + n − 2 we get

E(σ̂²_p) = σ²,

so σ̂²_p shares the “on-target on average” property that σ̂²_X and σ̂²_Y have.
· Hence m + n − 2 is the number of degrees of freedom for the pooled estimate of the variance, σ̂²_p.
22/40
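The “on target” property can also be seen in a small simulation sketch (not from the slides; the sample sizes, means, common SD, and seed below are hypothetical):

```r
# When both boxes share SD sigma, the pooled variance estimate
# ((m-1)*var(x) + (n-1)*var(y)) / (m + n - 2) averages to sigma^2.
set.seed(2)
m = 14; n = 12; sigma = 10    # hypothetical sizes and common SD
pooled.var = replicate(20000, {
    x = rnorm(m, 70, sigma)
    y = rnorm(n, 75, sigma)
    ((m - 1) * var(x) + (n - 1) * var(y)) / (m + n - 2)
})
mean(pooled.var)   # close to sigma^2 = 100
```

Note that dividing by m + n − 1 or m + n instead would make this average drift below σ², which is one way to see why the denominator must be m + n − 2.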
Red Bull example
· Based on the boxplots, we see that
- each looks reasonably symmetric;
- the spreads are similar
· It may therefore be reasonable to assume we have samples obtained from
approximate normal boxes with a common SD.

sd(No_RB)

## [1] 10.07363

sd(RB)

## [1] 9.452833

Also the sample SDs are similar.

23/40
Red Bull example: pooled estimate
m = length(No_RB)
n = length(RB)
print(c(m, n))

## [1] 14 12

numer = (m - 1) * (sd(No_RB)^2) + (n - 1) * (sd(RB)^2)
denom = m + n - 2
sig.hat.p = sqrt(numer/denom)
sig.hat.p

## [1] 9.793984

24/40
Red Bull example: test statistic
· We therefore compute the value taken by the (Classical) Two-Sample T-statistic:

est.SE = sig.hat.p * sqrt((1/m) + (1/n))
est.SE

## [1] 3.852933

mean.diff = mean(No_RB) - mean(RB)
mean.diff

## [1] -5.940476

stat = mean.diff/est.SE
stat

## [1] -1.541806

25/40
Red Bull example: p-value
· Is this a one- or two-sided test?
· As originally phrased, i.e. “is the apparent difference significant?”, it is (strictly
speaking) two-sided.
· A two-sided P-value is thus

2 * pt(abs(stat), df = m + n - 2, lower.tail = F)

## [1] 0.1362041

· This is not small, so the apparent difference is not significant.
· One-sided p-values can be obtained as usual, by using either:
 - pt(stat, df=m+n-2) for H₁: μ_X < μ_Y and
 - pt(stat, df=m+n-2, lower.tail=F) for H₁: μ_X > μ_Y

26/40
Using t.test()
· Of course, the t.test() function can do all of this in one line;
- we must supply the var.equal=T parameter:

t.test(No_RB, RB, var.equal = T)

##
## Two Sample t-test
##
## data: No_RB and RB
## t = -1.5418, df = 24, p-value = 0.1362
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -13.892538 2.011586
## sample estimates:
## mean of x mean of y
## 73.64286 79.58333

27/40
Confidence interval
· Note that the confidence interval given here is obtained in the familiar way.
· Instead of referencing a normal curve and using qnorm() we use qt() .
· Specifically, we use 𝑚 + 𝑛 − 2 degrees of freedom:

qt(0.975, df = m + n - 2)

## [1] 2.063899

mean.diff + c(-1, 1) * qt(0.975, df = m + n - 2) * est.SE

## [1] -13.892538 2.011586

· In general the confidence interval computed using the observed data x₁, …, x_m and y₁, …, y_n and q = qt(1-alpha/2, df=m+n-2) is given by:

(x̄ − ȳ) ± q · σ̂_p √(1/m + 1/n)
28/40
The Welch Test
Relaxing the equal variance assumption
· Do we really need to assume 𝜎𝑋 = 𝜎𝑌 ?
- Yes, if we want to apply Student’s theory directly.
· What if it is not reasonable to assume this?
- E.g. the two boxplots may have very different spread.
· An “obvious” approach would be to instead consider the statistic

T = (X̄ − Ȳ) / √(σ̂²_X/m + σ̂²_Y/n),

which just plugs the two sample SD estimates in for σ_X and σ_Y.
· How do we get a p-value?
 - How is it distributed under the assumption that H₀: μ_X = μ_Y is true?

30/40
Welch’s paper
· In 1947 (some time after Student’s paper), B. L. Welch “solved” this problem.

31/40
Approximate Student’s-𝑡 distribution
· Welch found that the statistic behaved approximately like a Student’s-t distribution whose degrees of freedom is a complicated function of m, n, σ_X and σ_Y.
· He also proposed implementing the test by “plugging in” 𝜎̂𝑋 and 𝜎̂𝑌 .
· The Welch Test thus obtains a p-value, etc. by imagining the statistic 𝑇 has a
Student’s-𝑡 distribution with a data-dependent degrees of freedom.

32/40
The formula for the degrees of freedom at a glance!

df = (σ̂²_X/m + σ̂²_Y/n)² / [ (σ̂²_X/m)²/(m − 1) + (σ̂²_Y/n)²/(n − 1) ]

 - A bit more complicated than the one-sample t-test’s df = n − 1 or the classical two-sample t-test’s df = m + n − 2.

33/40
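The formula is easy to evaluate in R. A sketch applied to the Red Bull data (variable names `vx` and `vy` are my own shorthand for σ̂²_X/m and σ̂²_Y/n):

```r
# Welch's data-dependent degrees of freedom for the Red Bull example.
No_RB <- c(84, 76, 68, 80, 64, 62, 74, 84, 68, 96, 80, 64, 65, 66)
RB <- c(72, 88, 72, 88, 76, 75, 84, 80, 60, 96, 80, 84)
m = length(No_RB); n = length(RB)
vx = var(No_RB)/m   # sigma.hat_X^2 / m
vy = var(RB)/n      # sigma.hat_Y^2 / n
df = (vx + vy)^2 / (vx^2/(m - 1) + vy^2/(n - 1))
df                  # about 23.776
```

This matches the df reported by `t.test(No_RB, RB)` on a later slide, and sits just below the classical m + n − 2 = 24.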
Default two-sample t.test()
· It turns out Welch’s procedure works very well,
- i.e. the “approximate” p-values returned have nice properties
- rejection rates are in line with the desired false-alarm rate when simulating
from normal boxes.
· It works so well that R uses the Welch test as the default two-sample t-test:

t.test(No_RB, RB) # note: data-dependent d.f. close to Classical (which was 24 d.f.)

##
## Welch Two Sample t-test
##
## data: No_RB and RB
## t = -1.5497, df = 23.776, p-value = 0.1344
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -13.856127 1.975175
## sample estimates:
## mean of x mean of y
## 73.64286 79.58333

34/40
Using bootstrap simulation
· The Welch test does not assume 𝜎𝑋 = 𝜎𝑌 .
· But it does still assume the two boxes are “approximately normal”.
· What if we are uncomfortable making this assumption?
· We can try simulating from two “best guess” boxes which
- have “similar shapes” to the “true” boxes that generated our data;
- have equal means.

35/40
Centre the samples before simulating
· We thus sample from each observed sample with replacement, but we subtract the means so both “populations” have the same mean, i.e. zero.

Welch.stats.sim = numeric(10000)
for (i in 1:10000) {
    samp.x = sample(No_RB - mean(No_RB), size = m, replace = T)  # both 'boxes' have
    samp.y = sample(RB - mean(RB), size = n, replace = T)        # mean zero
    est.SE = sqrt((sd(samp.x)^2)/m + (sd(samp.y)^2)/n)
    Welch.stats.sim[i] = (mean(samp.x) - mean(samp.y))/est.SE
}

36/40
The histogram
hist(Welch.stats.sim, breaks = 50, pr = T)
curve(dt(x, df = 23.776), add = T, lty = 2) # data-dependent d.f. from original sample
legend("topleft", legend = c("Students-t with 23.776 d.f."), lty = 2)

37/40
The histogram
· The histogram is a little skewed.
 - This may be due to the large positive near-outlier in the No_RB sample.
- Simulated samples without that value chosen will have a smaller mean, giving
a cluster of statistic values less than zero.

38/40
Two-sided p-value by simulation
est.SE = sqrt((sd(No_RB)^2)/m + (sd(RB)^2)/n)
stat = (mean(No_RB) - mean(RB))/est.SE
stat

## [1] -1.549672

mean(abs(Welch.stats.sim) >= abs(stat))

## [1] 0.1361

This is very close to the earlier p-values:

· the classical two-sample t-test had 0.1362
· the Welch t-test had 0.1344

39/40
Confidence interval by simulation
· We use the simulated values in Welch.stats.sim to approximate the “true
distribution” of the Welch statistic when 𝜇𝑋 = 𝜇𝑌 :

u.l = quantile(Welch.stats.sim, prob = c(0.975, 0.025))
u.l

## 97.5% 2.5%
## 1.878815 -2.285400

· That these are not the same magnitude indicates the slight lack of symmetry.

mean(No_RB) - mean(RB) - u.l * est.SE

## 97.5% 2.5%
## -13.142681 2.820322

· The interval is quite close to those obtained by (both versions of) t.test(), but is slightly shifted to the right;
 - this indicates the influence of the large positive value in No_RB.
40/40
