Study Guide Biostatistic
HAP 602
Population vs. Sample – Example
• NH primary exit poll
- 59% of Republican men supported T vs. 51% of women supported H
- “These are results from a survey of 2,129 voters as they exited randomly
selected voting sites in New Hampshire on Jan. 23, 2024. The poll was
conducted by Edison Research for the National Election Pool…”
• Null hypothesis?
• Alternative hypothesis?
Hypotheses – Example (2)
• Younger nurses are more likely to complain about adverse working conditions
- H0?
- H1?
• Paid sick leave increases the probability that workers use more primary care
- H0?
- H1?
Descriptive vs. Inferential – Examples (2)
• 1,000 30-year-old women were followed for 20 years. At the end of the study,
75% of unmarried women were still alive, and 86% of married women were
still alive.
- Q. What parts of the statement include descriptive statistics?
- Q. Can any inferences be drawn from this statement?
Variables
Variables – Examples
• Q. Identify the following as qualitative or quantitative variables. For
quantitative variables, indicate whether they are continuous or discrete.
• The data show SAT scores of selected students (N=321 each) at 2 large high
schools in Virginia. One group participates in a newly-developed curriculum,
while the other participates in a more traditional curriculum
• Determine whether there is a significant difference in the mean SAT scores
between the two schools at a 5% level of significance
• Hypotheses?
• Results?
Two-Sample T Test – Example Practice
• Determine whether there is a significant difference in the mean SAT scores
between the two schools at a 5% level of significance
• Hypotheses?
- H0: Mean SAT for traditional curriculum = Mean SAT for new curriculum
- Ha: Mean SAT for traditional curriculum ≠ Mean SAT for new curriculum
• Results?
- The t statistic (3.094) is larger than the critical t value (1.96) for a two-sided test
- Thus, we reject the null
- There is a statistically significant difference
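The decision rule above can be sketched in a few lines. This is only an illustration: it uses the reported t statistic (3.094) and, because each group has N=321, approximates the t distribution with the standard normal, which is where the 1.96 critical value comes from. The variable names are made up for the example.

```python
from statistics import NormalDist

t_stat = 3.094   # t statistic reported in the results above
alpha = 0.05     # two-sided level of significance

std_normal = NormalDist()
# Two-sided critical value; with samples this large, t is close to z (about 1.96)
critical = std_normal.inv_cdf(1 - alpha / 2)
# Two-sided p-value under the same normal approximation
p_value = 2 * (1 - std_normal.cdf(abs(t_stat)))
reject_null = abs(t_stat) > critical

print(f"critical = {critical:.2f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

Comparing the statistic with the critical value and comparing the p-value with 0.05 are two views of the same decision.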
Distribution of Sample Means – Practice
- Assume there are 3 numbers: 10, 20, 30
- Q. Take the means of each sample
• (10, 10) = 10
• (10, 20) = 15
• (10, 30) = 20
[Figure: sample means plotted for Sample 1, Sample 2, Sample 3; vertical axis from 0 to 20]
z = (X − X̄) / s
- z: z-score
- X: individual value
- X̄: sample mean
- s: standard deviation
Z-Score & Probability – Example
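- Q. P(X > 6.00 | X̄ = 4.43 and s = 1.32) = ?
- z = (6.00 − 4.43) / 1.32 ≈ 1.19; the table gives an area of 0.3830 between the mean and z = 1.19
- So, now we subtract 0.3830 from 0.5 to get the small area under the curve
beyond 6.00: 0.5 − 0.3830 = 0.1170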
- Z – Score = 1
- Now look at the table to get the area under the curve
- Area under the curve = .3413
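As a cross-check on the table lookups, the same areas can be computed directly. This is a minimal sketch that assumes the scores are normally distributed; `area_mean_to_z` is a hypothetical helper that mimics a table listing the area between the mean and z.

```python
from statistics import NormalDist

std_normal = NormalDist()

def area_mean_to_z(z):
    """Area under the standard normal curve between the mean (z = 0) and z."""
    return std_normal.cdf(abs(z)) - 0.5

# Question above: P(X > 6.00) with mean 4.43 and standard deviation 1.32
z = round((6.00 - 4.43) / 1.32, 2)   # rounded to two decimals, as in a z table
tail_area = 0.5 - area_mean_to_z(z)  # small area beyond 6.00

print(f"z = {z}, area mean-to-z = {area_mean_to_z(z):.4f}, tail = {tail_area:.4f}")
print(f"area between mean and z = 1: {area_mean_to_z(1):.4f}")
```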
Z – Scores Table
z = (X − X̄) / s
- z: z-score
- X: individual value
- X̄: sample mean
- s: standard deviation
- Z – Score for 8.5 = 0.49; area between the mean and z = 0.49 is .1879 (18.79 %)
- Z – Score for 10 = 1.17; area between the mean and z = 1.17 is .3790 (37.90 %)
The area between 8.5 and 10 = 0.3790 – 0.1879 = 0.1911 (19.11 %)
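The area between two scores is the difference between their cumulative probabilities, so the subtraction above can be checked without a table. A minimal sketch, taking the two z-scores from the slide as given (the underlying mean and standard deviation are not stated in the text):

```python
from statistics import NormalDist

std_normal = NormalDist()

z_low, z_high = 0.49, 1.17  # z-scores for 8.5 and 10, as given above
# Both scores lie above the mean, so the area between them is a difference of CDFs
area_between = std_normal.cdf(z_high) - std_normal.cdf(z_low)

print(f"area between z = {z_low} and z = {z_high}: {area_between:.4f}")
```

If one score were below the mean and the other above it, the two table areas would be added rather than subtracted; the CDF difference handles both cases.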
Z-Score & Probability – Practice (3)
- Suppose you have healthcare utilization records of all Mason students
(N=3,000) as shown in the table below
- A column headed “Mean” presents the mean score for each variable
- A column headed “SD” presents the standard deviation for each variable
Variable                              Mean   SD
Outpatient visits / year              3.2    2.0
Hospitalization / year                0.4    0.8
Emergency department visits / year    1.3    1.2
Z-Score & Probability – Practice (3)
- Suppose you randomly selected three students with the following annual
outpatient visits:
- Student A: 4
- Student B: 5
- Student C: 1
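The question that went with these three students is not preserved in the text. One natural exercise is to convert each student's visits to a z-score using the outpatient mean (3.2) and SD (2.0) from the utilization table; the sketch below does only that, and the variable names are illustrative.

```python
mean_visits, sd_visits = 3.2, 2.0        # outpatient visits / year, from the table
students = {"A": 4, "B": 5, "C": 1}      # annual outpatient visits per student

# z = (individual value - mean) / standard deviation
z_scores = {name: (x - mean_visits) / sd_visits for name, x in students.items()}

for name, z in z_scores.items():
    print(f"Student {name}: z = {z:+.2f}")
```

A positive z means the student used more outpatient care than the average Mason student, and a negative z means less.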
• Regulators need to balance social welfare and potential harm by finding the
optimal regulation
Type I & II Errors – Role of Sample Size
• The trade-off between the two errors is NOT
inevitable
• Central limit theorem: the means of random samples of size n are
approximately normally distributed with mean μ and variance σ²/n
• Thus, the larger the sample size, the smaller
the variance of the sampling distribution of
the mean
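A small simulation can make the role of sample size concrete. This is a sketch under assumed conditions, not part of the original slides: the population is uniform on [0, 1] (so σ² = 1/12), and `variance_of_sample_means` is a hypothetical helper. The variance of the sample means should come out near σ²/n.

```python
import random
from statistics import pvariance

rng = random.Random(0)  # fixed seed so the run is repeatable

def variance_of_sample_means(n, reps=5000):
    """Draw `reps` samples of size n from Uniform(0, 1); return the variance of their means."""
    means = [sum(rng.uniform(0, 1) for _ in range(n)) / n for _ in range(reps)]
    return pvariance(means)

sigma2 = 1 / 12  # variance of Uniform(0, 1)
var_n4 = variance_of_sample_means(4)
var_n64 = variance_of_sample_means(64)

print(f"n = 4:  simulated {var_n4:.5f}, theory {sigma2 / 4:.5f}")
print(f"n = 64: simulated {var_n64:.5f}, theory {sigma2 / 64:.5f}")
```

With a tighter sampling distribution, a given true effect is easier to tell apart from chance, which is why a larger sample can reduce both error types at once.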
• Traditional campaign: “buy insurance, it could save your $ and your life”
• Alternative campaign: “buy insurance, you save your $ AND you can protect
others who are not able to buy it” (because healthy people’s uptake would
lower the community-rated premium)
Sample size calculation – Example
• General research idea; comparing insurance uptake (%) between traditional
campaign group vs. alternative campaign group (where each individual will
be randomly assigned to either group)
• How big should my sample be? We can’t request a million dollars from the
funder without justification
• Set α & power (1 – β): typically, 0.05 for α & 0.8 for power
• Predict effect size; literature on the impact of traditional campaign (about
8%)
- I expect the intervention would increase uptake by 0.8 ppt -> 38K people
needed
• What if…
- Actual effect were smaller (0.4 ppt) -> 147K
- With lower α (0.01) & unchanged expected effect (+0.8 ppt) -> 56K
- With higher power (0.9) & unchanged expected effect (+0.8 ppt) -> 50K
• Assuming that the experiment costs $10 per person, the expected budget can
change by about $1 million
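These figures can be roughly reproduced with the standard normal-approximation formula for comparing two proportions. The sketch below rests on assumptions that the slides do not state: a baseline uptake of 8% (reading "about 8%" as uptake under the traditional campaign), equal group sizes, and a two-sided test. `total_sample_size` is a hypothetical helper, not the tool used to produce the slide numbers.

```python
from math import ceil
from statistics import NormalDist

def total_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Total N (both groups) to detect p1 vs. p2 with a two-sided test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # e.g. 0.84 for power = 0.80
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n_per_group = (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2
    return 2 * ceil(n_per_group)

baseline = 0.08  # assumed uptake under the traditional campaign
scenarios = {
    "+0.8 ppt": total_sample_size(baseline, baseline + 0.008),
    "+0.4 ppt": total_sample_size(baseline, baseline + 0.004),
    "+0.8 ppt, alpha 0.01": total_sample_size(baseline, baseline + 0.008, alpha=0.01),
    "+0.8 ppt, power 0.9": total_sample_size(baseline, baseline + 0.008, power=0.90),
}
for label, n in scenarios.items():
    print(f"{label}: about {n / 1000:.0f}K people, ${10 * n:,} at $10 per person")
```

Halving the expected effect roughly quadruples the required sample because the effect size enters the formula squared.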
Sample Size Calculation – Implication
• Even randomized controlled trials (“gold standard”) cannot detect a
statistically significant effect with an insufficient sample size
• When reading research articles, always ask: “do you have enough statistical
power to detect a statistically significant effect?” (= “is your sample
big enough?”)
One-Sample Z test
Z-Test Statistic – Example
• Compare mean math score of sample to the population (all students) mean
- H0?
- H1?
- Z=?
- Conclusion (at α=0.05)?
             Size     Mean   Standard Deviation
Sample       25       100    5.0
Population   1,000    99     2.5
Solution
- H0: X̄ = μ
- H1: X̄ ≠ μ (two-sided test)
- z = (X̄ − μ) / SEM, where SEM = σ / √n
- X̄; mean of the sample
- μ; mean of the population
- SEM (standard error of the mean); the value we would expect by chance, given
all the variability that surrounds the selection of all possible sample means
from a population
- σ; standard deviation of population
- n; size of the sample
- Conclusion (at α=0.05)?
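The one-sample z test for the math score example can be sketched as below. It assumes the usual form z = (sample mean − population mean) / SEM with SEM = σ/√n, using the population standard deviation; `one_sample_z` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, pop_mean, pop_sd, n):
    """z statistic comparing a sample mean with a known population mean."""
    sem = pop_sd / sqrt(n)   # standard error of the mean
    return (sample_mean - pop_mean) / sem

# Values from the table: sample (n = 25, mean 100), population (mean 99, SD 2.5)
z = one_sample_z(sample_mean=100, pop_mean=99, pop_sd=2.5, n=25)
critical = NormalDist().inv_cdf(0.975)        # two-sided test at alpha = 0.05
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_null = abs(z) > critical

print(f"z = {z:.2f}, critical = {critical:.2f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

Under these assumptions z is 2.0, just beyond the 1.96 cutoff. For a one-sided test the cutoff would instead be `NormalDist().inv_cdf(0.95)`, about 1.645.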
Z-Test Statistic – Example
- H0?
- H1?
- Z=?
- Conclusion (at α=0.05)? (Z-score for one-sided test with α=5% is 1.645)
             Size      Mean   Standard Deviation
Sample       100       45     5
Population   100,000   42     8
(Independent/Unpaired) Two-Sample T test
Independent T-Test – Steps
• State null & research hypotheses
- If you have not done so yet, load the “Analysis ToolPak” add-in in Excel
Independent T Test – Practice
• Open the “SAT” Excel file from Blackboard (“Week4”)
• The data show SAT scores of selected students at 2 large high schools in
Virginia. One participates in a newly-developed curriculum, while the other
participates in a more traditional curriculum
• Determine whether there is a significant difference in the mean SAT scores
between the two schools
- At a 5% level of significance
- At a 1% level of significance
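For readers without Excel at hand, the same test can be sketched in Python. The SAT file itself is not reproduced here, so the two score lists below are hypothetical numbers chosen only to show the mechanics; the sketch also assumes the equal-variance (pooled) form of the test, and `pooled_t` is an illustrative helper name.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(group_a, group_b):
    """Independent two-sample t statistic (equal variances assumed) and its df."""
    n1, n2 = len(group_a), len(group_b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

# Hypothetical scores, NOT the data in the course file
new_curriculum = [520, 540, 560, 580, 600]
traditional = [500, 510, 520, 530, 540]

t, df = pooled_t(new_curriculum, traditional)
# Two-sided critical values for df = 8, taken from a standard t table
reject_at_5 = abs(t) > 2.306
reject_at_1 = abs(t) > 3.355

print(f"t = {t:.3f}, df = {df}, reject at 5%: {reject_at_5}, reject at 1%: {reject_at_1}")
```

In this made-up example the difference is significant at 5% but not at 1%, which shows why the conclusion has to be stated together with the level of significance.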
Statistical Significance
Statistical Significance – Interpretation
• Assume a researcher conducted a
study on caffeine’s behavioral effect
on 150 students
- For two groups: high-caffeine vs. no
caffeine group
- And conducted independent-
sample t test
- On 4 outcome measures
• t statistic: t = ΣD / √[(nΣD² − (ΣD)²) / (n − 1)]
- D; difference between each individual’s score from point 1 (before) to point 2
(after)
- ΣD; the sum of all the differences between groups of scores
- ΣD²; the sum of the differences squared between groups of scores
- n; the number of “pairs” of observations
Dependent T Test – Steps
• State hypotheses; for a two-tailed test, H0: mean difference = 0 and H1: mean difference ≠ 0
• Set the level of risk (the level of significance/Type I error): typically 0.05
• Determine the degrees of freedom (df)= n – 1 (where n is the number of pairs
of observations)
• Compare the obtained t statistic and the critical value (Table B.2. or t table); if
your obtained value is larger (more extreme) than the critical value, reject
the null hypothesis
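The steps above can be sketched with the paired t formula t = ΣD / √[(nΣD² − (ΣD)²) / (n − 1)]. The before/after ratings below are hypothetical (seven pairs, invented for illustration and not taken from any course file), and `paired_t` is an illustrative helper name.

```python
from math import sqrt

def paired_t(before, after):
    """Dependent (paired) t statistic and degrees of freedom."""
    diffs = [a - b for b, a in zip(before, after)]   # D = after - before
    n = len(diffs)
    sum_d = sum(diffs)                               # sum of D
    sum_d2 = sum(d * d for d in diffs)               # sum of D squared
    t = sum_d / sqrt((n * sum_d2 - sum_d ** 2) / (n - 1))
    return t, n - 1

# Hypothetical ratings for seven users
before = [3, 4, 2, 5, 3, 4, 3]
after = [5, 5, 4, 6, 3, 6, 4]

t, df = paired_t(before, after)
critical_one_tailed = 1.943   # t table, df = 6, one-tailed alpha = 0.05
reject_null = t > critical_one_tailed

print(f"t = {t:.2f}, df = {df}, reject H0 (one-tailed, 5%): {reject_null}")
```

Taking D as after minus before makes an improvement show up as a positive t, which matches a one-tailed test for improvement.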
Dependent T Test – Example
• Open the “InsuranceRating” Excel file from Blackboard (“Week4”)
• The data contain ratings provided by seven users of an insurance program
before and after the company made some administrative changes in the
process of filing claims
• At a 5% level of significance, test whether there was a statistically significant
improvement in ratings after the change was made
• Tip: when running a one-tailed test, put the “after” column first in “Variable 1 range”
Dependent T Test – Example
• State your null and research hypotheses (Excel tool: “t-Test: Paired Two Sample for Means”)
• Cronbach’s alpha: α = (k / (k − 1)) × (1 − Σs²ᵢ / s²ₜ)
- k is # of items
- Σs²ᵢ; sum of variance of each item
- s²ₜ; variance for “overall score”
- α > 0.7 is considered “acceptable” & α < 0.5 is poor/unacceptable
• Data show 10 patients’ satisfaction scores (1 (lowest) to 5 (highest)) for each
item (doctor, nurse, location)
ID   Item 1   Item 2   Item 3
1    3        5        2
2    4        4        3
3    3        4        4
4    3        3        3
5    3        4        3
6    4        5        5
7    2        5        5
8    3        4        4
9    3        5        4
10   3        3        2
- k is # of items
- Σs²ᵢ; sum of variance of each item
- s²ₜ; variance for “overall (total) score”
• First, get “total scores” by summing up each person’s scores
• Then, get Σs²ᵢ and s²ₜ (use Excel’s “var.s” function)
ID   Item 1   Item 2   Item 3   Total
1    3        5        2        10
2    4        4        3        11
3    3        4        4        11
4    3        3        3        9
5    3        4        3        10
6    4        5        5        14
7    2        5        5        12
8    3        4        4        11
9    3        5        4        12
10   3        3        2        8
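The two steps above can be sketched with the satisfaction data from the table. The sketch assumes the reliability measure being computed is Cronbach's alpha in its usual form, α = (k / (k − 1)) × (1 − sum of item variances / variance of total score), with sample variances as in Excel's VAR.S.

```python
from statistics import variance

# Item scores for the 10 patients, copied from the table above
item1 = [3, 4, 3, 3, 3, 4, 2, 3, 3, 3]
item2 = [5, 4, 4, 3, 4, 5, 5, 4, 5, 3]
item3 = [2, 3, 4, 3, 3, 5, 5, 4, 4, 2]
items = [item1, item2, item3]

k = len(items)                                       # number of items
totals = [sum(scores) for scores in zip(*items)]     # each person's total score
sum_item_var = sum(variance(item) for item in items) # sample variances (like VAR.S)
total_var = variance(totals)

alpha = (k / (k - 1)) * (1 - sum_item_var / total_var)

print(f"totals = {totals}")
print(f"sum of item variances = {sum_item_var:.3f}, total variance = {total_var:.3f}")
print(f"alpha = {alpha:.2f}")
```

Under these assumptions alpha comes out near 0.39, which falls below the 0.5 threshold for poor reliability.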
Validity vs. Reliability
• Think of throwing a dart
- Let’s say score 10 is “true value” and
your dart is “sample measurement”
- Your goal is getting all 10 or very close
to 10
• Observed agreement (Po)
• Chance agreement of “yes”: (30 × 70) / 100 = 21
• Chance agreement of “no”: (70 × 30) / 100 = 21
• Overall chance agreement (Pe)
                Test B
                yes   no   total
Test A   Yes    25    5    30
         No     45    25   70
         total  70    30   100
Validity – Practice
total 50 50 100
=
K = 1-1 = 0
Validity – Example (Table 9)
• Cohen’s Kappa (κ)
• Chance agreement of “no”: 0.439
• Overall chance agreement (Pe)
• κ = 0.72
                yes   no    total
Saliva   Yes    88    25    113
         No     15    192   207
         total  103   217   320
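The kappa calculation can be sketched for both 2×2 tables in this section (Test A vs. Test B, and the saliva table). `cohens_kappa` is a hypothetical helper; it takes the four cell counts and follows the observed-agreement and chance-agreement steps listed above.

```python
def cohens_kappa(yes_yes, yes_no, no_yes, no_no):
    """Cohen's kappa from the four cell counts of a 2x2 agreement table."""
    n = yes_yes + yes_no + no_yes + no_no
    observed = (yes_yes + no_no) / n                               # observed agreement
    chance_yes = ((yes_yes + yes_no) / n) * ((yes_yes + no_yes) / n)
    chance_no = ((no_yes + no_no) / n) * ((yes_no + no_no) / n)
    expected = chance_yes + chance_no                              # overall chance agreement
    return (observed - expected) / (1 - expected)

kappa_tests = cohens_kappa(25, 5, 45, 25)     # Test A vs. Test B table
kappa_saliva = cohens_kappa(88, 25, 15, 192)  # saliva table

print(f"Test A vs. Test B: kappa = {kappa_tests:.2f}")
print(f"Saliva table:      kappa = {kappa_saliva:.2f}")
```

The saliva table reproduces the 0.72 shown above, while the Test A vs. Test B table gives a much lower kappa because observed agreement (0.50) is barely above chance agreement (0.42).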
ANOVA – Example (“CognitiveScore” File)
• The file contains cognitive scores (continuous from 0 to 15) of 25 people in 3
groups
• Test if mean cognitive scores were different between the groups
- H0: μ1 = μ2 = μ3
• In this example,
- H1: μ1 ≠ μ2 or μ1 ≠ μ3 or μ2 ≠ μ3
- A: 3, 3, 3 (mean=3)
- B: 5, 5, 5 (mean=5)
- C: 7, 7, 7 (mean=7)
[Figure: scores plotted by group (a, b, c)]
- A: 1, 3, 5 (mean=3)
- B: 3, 5, 7 (mean=5)
- C: 5, 7, 9 (mean=7)
[Figure: scores plotted by group (a, b, c)]
- A: 3, 3, 3 (mean=3)
- B: 3, 3, 3 (mean=3)
- C: 5, 6, 7 (mean=6)
[Figure: scores plotted by group (a, b, c)]
- A: 3, 3, 3 (mean=3)
- B: 3, 3, 3 (mean=3)
- C: 3, 3, 9 (mean=5)
[Figure: scores plotted by group (a, b, c)]
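The four small data sets above can be run through a one-way ANOVA to see how between-group and within-group variation drive the F statistic. This is a minimal sketch assuming the usual F = MS(between) / MS(within); `one_way_f` is a hypothetical helper, and the critical value 5.14 is the F table entry for (2, 6) degrees of freedom at α = 0.05.

```python
from statistics import mean

def one_way_f(groups):
    """One-way ANOVA F statistic; returns infinity when there is no within-group variation."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    if ss_within == 0:
        return float("inf")
    return (ss_between / df_between) / (ss_within / df_within)

scenarios = {
    "no spread within groups": [[3, 3, 3], [5, 5, 5], [7, 7, 7]],
    "same means, wide spread": [[1, 3, 5], [3, 5, 7], [5, 7, 9]],
    "only C differs, tight":   [[3, 3, 3], [3, 3, 3], [5, 6, 7]],
    "only C differs, outlier": [[3, 3, 3], [3, 3, 3], [3, 3, 9]],
}
f_values = {label: one_way_f(groups) for label, groups in scenarios.items()}

F_CRITICAL = 5.14  # F table, df = (2, 6), alpha = 0.05
for label, f in f_values.items():
    print(f"{label}: F = {f:.2f}, significant at 5%: {f > F_CRITICAL}")
```

The first two data sets have identical group means, yet the second gives a much smaller F because the spread within each group is large relative to the gap between groups.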