Basics of Biostatistics Course
1 Introduction
5 Sampling Distribution
6 References
What is Statistics?
Data Collection
Organization and Presentation of data
Data Analysis
Interpretation of the results
Limitation of Statistics
Classification of Statistics
Descriptive Statistics
It helps to describe a given set of data without going beyond that data.
It consists of the collection, organization, summarization, and analysis of data.
Inferential Statistics
It helps to make inference/conclusion about a population based on
the selected sample
It consists of predicting and forecasting values of population parameters, testing
hypotheses about values of population parameters, and making decisions.
What is Probability?
Random Experiments
Sample space
Events
1 Mutually exclusive events (Disjoint events)
2 Equally likely events - equal chance to occur.
3 Favourable events - the number of outcomes favourable to an event
in an experiment is the number of outcomes which entail the
happening of the event.
4 Exhaustive events - outcomes are said to be exhaustive when they
include all possible outcomes.
5 Independent events - if the occurrence or non-occurrence of an
event does not affect the occurrence or non-occurrence of the other.
Random variable
Binomial distribution
1 Mean = $\mu = np$
2 Variance = $\sigma^2 = npq$
3 Standard Deviation = $\sigma = \sqrt{npq}$
4 Moment Generating function (MGF) = $(q + pe^{t})^{n}$
5 Characteristic function (CF) = $(q + pe^{it})^{n}$
6 Skewness = $\frac{q - p}{\sqrt{npq}}$
7 Kurtosis = $\frac{1 - 6pq}{npq}$
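The closed-form moments above can be checked numerically. The following is a minimal sketch, not part of the original slides: it computes the moments directly from the binomial probabilities and compares them with the formulas. The values n = 20 and p = 0.3 are illustrative assumptions, and the kurtosis formula is treated here as the excess kurtosis (kurtosis minus 3).

```python
from math import comb, sqrt

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binomial_summary(n: int, p: float) -> dict:
    """Mean, variance, skewness and excess kurtosis computed from the pmf."""
    pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
    mean = sum(x * px for x, px in enumerate(pmf))
    var = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))
    sd = sqrt(var)
    skew = sum(((x - mean) / sd) ** 3 * px for x, px in enumerate(pmf))
    kurt = sum(((x - mean) / sd) ** 4 * px for x, px in enumerate(pmf)) - 3
    return {"mean": mean, "variance": var, "skewness": skew, "kurtosis": kurt}

# Illustrative values (assumed, not taken from the slides)
n, p = 20, 0.3
q = 1 - p
numeric = binomial_summary(n, p)
closed_form = {
    "mean": n * p,
    "variance": n * p * q,
    "skewness": (q - p) / sqrt(n * p * q),
    "kurtosis": (1 - 6 * p * q) / (n * p * q),
}
for name in closed_form:
    print(f"{name}: numeric={numeric[name]:.6f} formula={closed_form[name]:.6f}")
```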
...Example 1
$$P(X = x) = \binom{10}{x}\left(\frac{1}{2}\right)^{x}\left(\frac{1}{2}\right)^{10-x} = \binom{10}{x}\left(\frac{1}{2}\right)^{10}$$
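As a small illustration of the probability function in this example (n = 10 trials, p = 1/2), the sketch below tabulates P(X = x) exactly using fractions. The helper name `pmf_fair` is hypothetical and only used for this illustration.

```python
from fractions import Fraction
from math import comb

def pmf_fair(x: int, n: int = 10) -> Fraction:
    """P(X = x) = C(n, x) * (1/2)^n for a binomial with p = 1/2."""
    return Fraction(comb(n, x), 2**n)

# Exact probabilities for x = 0, 1, ..., 10
probs = {x: pmf_fair(x) for x in range(11)}
for x, px in probs.items():
    print(x, px, float(px))
```

Using exact fractions makes it easy to confirm that the probabilities sum to one.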
...Example 3
Examples of Poisson Distribution
1 Number of Network Failures per Week
2 Number of Bankruptcies Filed per Month
3 Number of Website Visitors per Hour
4 Number of Arrivals at a Restaurant
5 Number of Calls per Hour at a Call Center
6 Number of Books Sold per Week
7 Average Number of Storms in a City
8 Number of Emergency Calls Received by a Hospital Every Minute
Poisson distribution
1 Mean = $\lambda$
2 Variance = $\lambda$
3 Standard Deviation = $\sqrt{\lambda}$
4 Moment Generating function (MGF) = $\exp[\lambda(e^{t} - 1)]$
5 Characteristic function (CF) = $\exp[\lambda(e^{it} - 1)]$
6 Skewness = $\frac{1}{\sqrt{\lambda}}$
7 Kurtosis = $\frac{1}{\lambda}$
Example 2: A travel company has two cars for hire. The demand for a
car on each day is distributed as a Poisson variate with mean 1.5.
Calculate the proportion of days on which
1 Neither car is used
2 Some demand is refused
Solution
Let X be a random variable representing the number of demands
for cars, with $\lambda = 1.5$.
$$P(X = x) = P(x \text{ demands in a day}) = \frac{e^{-\lambda}\lambda^{x}}{x!} = \frac{e^{-1.5}(1.5)^{x}}{x!}$$
...Example 2
1 The proportion of days on which neither car is used:
$$P(X = 0) = \frac{e^{-1.5}(1.5)^{0}}{0!} = e^{-1.5} = 0.2231$$
2 The proportion of days on which some demand is refused.
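Both parts of this example can be evaluated numerically. The sketch below assumes that demand is refused exactly when more than two cars are requested in a day (the company has two cars), so part 2 is P(X > 2) = 1 − P(X ≤ 2). That interpretation is an assumption made here, not a step stated on the slide.

```python
from math import exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

lam = 1.5  # mean daily demand from the example

# Part 1: no car is requested
p_no_car_used = poisson_pmf(0, lam)

# Part 2 (assumed interpretation): more than two cars requested
p_demand_refused = 1 - sum(poisson_pmf(x, lam) for x in range(3))

print(f"P(X = 0) = {p_no_car_used:.4f}")
print(f"P(X > 2) = {p_demand_refused:.4f}")
```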
...Example 3
$$P(\text{exactly 3 suffer}) = P(X = 3) = \frac{e^{-2}\,2^{3}}{3!} = 0.180447$$
...Example 4
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty$$
The parameter µ is the mean or expectation of the distribution
(and also its median and mode), while the parameter σ is its
standard deviation.
The variance of the distribution is σ2.
A random variable with a Gaussian distribution is said to be
normally distributed, and is called a normal deviate.
Normal distribution
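1 Mean = $\mu$
2 Variance = $\sigma^2$
3 Standard Deviation = $\sigma$
4 Moment Generating function (MGF) = $\exp\left[\mu t + \frac{\sigma^{2}t^{2}}{2}\right]$
The standard normal density is
$$\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2}, \quad -\infty < z < \infty$$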
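The density and the table look-ups used in the worked examples can be reproduced in a few lines. This is a minimal sketch using only the standard library; the function names `phi`, `Phi` and `normal_pdf` are chosen here for illustration and do not come from the slides.

```python
from math import erf, exp, pi, sqrt

def phi(z: float) -> float:
    """Standard normal density."""
    return exp(-z * z / 2) / sqrt(2 * pi)

def Phi(z: float) -> float:
    """Standard normal cumulative distribution function, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2), obtained by standardizing x."""
    return phi((x - mu) / sigma) / sigma

# Area between 0 and z, the quantity tabulated in many standard normal tables
for z in (0.8, 1.2, 2.0, 3.0):
    print(f"P(0 <= Z <= {z}) = {Phi(z) - 0.5:.4f}")
```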
1 Birthweight of Babies
2 Height of Babies
3 Shoe Sizes
4 Blood Pressure
5 Students' marks
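$$P(66 \le X \le 71) = P\left(\frac{66-\mu}{\sigma} \le \frac{X-\mu}{\sigma} \le \frac{71-\mu}{\sigma}\right)$$
$$= P\left(\frac{66-68}{2.5} \le Z \le \frac{71-68}{2.5}\right)$$
$$= P(-0.8 \le Z \le 1.20)$$
$$= P(0 \le Z \le 0.8) + P(0 \le Z \le 1.20)$$
$$= 0.2881 + 0.3849$$
$$= 0.673$$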
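The same probability can be obtained without tables. The sketch below uses `statistics.NormalDist` from the Python standard library with µ = 68 and σ = 2.5, the values that appear in the working above.

```python
from statistics import NormalDist

# Normal distribution with the mean and standard deviation used in the working
X = NormalDist(mu=68, sigma=2.5)

# P(66 <= X <= 71) as a difference of cumulative probabilities
p_between = X.cdf(71) - X.cdf(66)
print(f"P(66 <= X <= 71) = {p_between:.4f}")
```

The small difference from 0.673 comes only from rounding in the printed table values.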
2 X ≤ 20
3 0 ≤ X ≤ 12
Solution:
$$P(X \ge 20) = P\left(\frac{X-\mu}{\sigma} \ge \frac{20-\mu}{\sigma}\right)$$
$$= P\left(Z \ge \frac{20-12}{4}\right)$$
$$= P(Z \ge 2)$$
$$= 0.5 - 0.4772$$
$$= 0.0228$$
Zeytu Gashaw Asfaw (PhD), Department of Biostatistics and Epidemiology, School of Public Health, Addis Ababa University
Basics for Biostatistics, June 29, 2023
Probability distributions (Normal, Binomial, Poisson)
...Example 4 Solution:
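$$P(0 \le X \le 12) = P\left(\frac{0-\mu}{\sigma} \le \frac{X-\mu}{\sigma} \le \frac{12-\mu}{\sigma}\right)$$
$$= P\left(\frac{0-12}{4} \le Z \le \frac{12-12}{4}\right)$$
$$= P(-3 \le Z \le 0)$$
$$= P(0 \le Z \le 3)$$
$$= 0.4987$$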
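As a cross-check of the two table-based results in this example, the sketch below evaluates both probabilities with `statistics.NormalDist`, using µ = 12 and σ = 4 as in the solution.

```python
from statistics import NormalDist

X = NormalDist(mu=12, sigma=4)

# Upper-tail probability P(X >= 20)
p_at_least_20 = 1 - X.cdf(20)

# Interval probability P(0 <= X <= 12)
p_0_to_12 = X.cdf(12) - X.cdf(0)

print(f"P(X >= 20)      = {p_at_least_20:.4f}")
print(f"P(0 <= X <= 12) = {p_0_to_12:.4f}")
```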
What is statistical data?
When census data cannot be collected, statisticians collect sample data
by developing specific experiment designs and survey samples.
What is Sampling?
Why do we need it?
When do we need it?
How do we get it?
How much of it do we need?
Parameter Vs Statistic
Sample                      Population
n    ←− size −→             N
X̄    ←− mean −→             µ
S²   ←− variance −→         σ²
s    ←− st.dev −→           σ
p̂    ←− proportion −→       P
Impossibility of sampling.
Chances of bias. The serious limitation of the sampling method
is that it involves biased selection and thereby leads us to draw
erroneous conclusions.
Difficulties in selecting a truly representative sample.
Inadequate knowledge in the subject.
Changeability of units.
Types of Sampling Methods
Probability sampling involves random selection, allowing you to make
strong statistical inferences about the whole group.
Non-probability sampling involves non-random selection based on
convenience or other criteria, allowing you to easily collect data.
Multistage Sampling
Complex form of cluster sampling in which two or more levels of
units are embedded one in the other.
In the first stage, a random number of districts is chosen in all states.
This is followed by a random number of villages.
The third-stage units will then be houses.
All ultimate units (houses, for instance) selected at last step are
surveyed.
...Multistage Sampling
This technique is essentially the process of taking random samples
of preceding random samples.
It is not as effective as true random sampling, but it probably solves more
of the problems inherent to random sampling.
It is an effective strategy because it banks on multiple randomizations.
As such, it is extremely useful.
Multistage sampling is used frequently when a complete list of all
members of the population does not exist or is inappropriate.
Moreover, by avoiding the use of all sample units in all selected
clusters, multistage sampling avoids the large, and perhaps
unnecessary, costs associated with traditional cluster sampling.
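The staged selection described above can be sketched in code. This is only an illustration under stated assumptions: the sampling frame is a hypothetical nested dictionary (district, then village, then houses), the state level is omitted for brevity, and the function name `multistage_sample` and its parameters are invented for this example.

```python
import random

def multistage_sample(frame, n_districts, n_villages, seed=None):
    """Two random stages (districts, then villages); every house in a
    selected village is surveyed, as in the final step described above."""
    rng = random.Random(seed)
    selected = []
    for district in rng.sample(sorted(frame), n_districts):
        villages = frame[district]
        k = min(n_villages, len(villages))
        for village in rng.sample(sorted(villages), k):
            selected.extend((district, village, house) for house in villages[village])
    return selected

# Hypothetical frame: 5 districts, 4 villages each, 3 houses per village
frame = {
    f"district_{i}": {
        f"village_{i}_{j}": [f"house_{i}_{j}_{k}" for k in range(3)]
        for j in range(4)
    }
    for i in range(5)
}

sample = multistage_sample(frame, n_districts=2, n_villages=2, seed=1)
print(len(sample), "houses selected")
```

Only the selected districts and villages need a complete list of their units, which is why no full population list is required.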
Multiphase Sampling
Part of the information collected from whole sample and part from
subsample.
In a TB survey: MT in all cases - Phase I
X-ray chest in MT +ve cases - Phase II
Sputum examination in X-ray +ve cases - Phase III
A survey by such a procedure is less costly, less laborious and
more purposeful.
Panel Sampling
Method of first selecting a group of participants through a random
sampling method and then asking that group for the same
information again several times over a period of time.
Therefore, each participant is given the same survey or interview at two
or more time points; each period of data collection is called a "wave".
This sampling methodology is often chosen for large-scale or nationwide
studies in order to gauge changes in the population with regard to
any number of variables, from chronic illness to job stress to weekly
food expenditures.
Panel sampling can also be used to inform researchers about
within-person health changes due to age or help explain changes
in continuous dependent variables such as spousal interaction.
There have been several proposed methods of analyzing panel sample
data, including growth curves.
Non-probability Sampling
What are the factors that could affect the choice of sampling
method?
Level of Precision
Degree of Variability
β and Power
β: The probability of failing to reject the null hypothesis when it
is false (or the probability of making a Type II error)
Power: The probability of correctly rejecting the null hypothesis when
it is false; commonly denoted by 1 − β
Study Designs
Cross-sectional studies
Case control studies
Cohort studies
where
Z is the value from the standard normal distribution reflecting
the confidence level that will be used (e.g., Z = 1.96 for 95%)
E is the desired margin of error (absolute error or precision). It has to be
decided by the researcher.
P is the proportion of successes in the population. Here we are
planning a study to generate a 95% confidence interval for the
unknown population proportion, p.
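To make the definitions above concrete, here is a minimal sketch of the calculation. It assumes the usual formula n = Z² p(1 − p) / E², which matches the one-sample dichotomous entry in the summary table later in these slides. The inputs p = 0.5 and E = 0.05 are illustrative assumptions, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_proportion(p: float, margin: float, confidence: float = 0.95) -> int:
    """n = Z^2 * p * (1 - p) / E^2, rounded up to a whole subject."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return ceil(z**2 * p * (1 - p) / margin**2)

# Illustrative inputs (assumed): anticipated proportion 0.5, margin of error 0.05
n = sample_size_proportion(p=0.5, margin=0.05)
print(n)
```

Taking p = 0.5 gives the largest value of p(1 − p), so it is a common conservative choice when no prior estimate is available.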
Sample Size Calculation
Example 1
$$n = \frac{Z_{\alpha/2}^{2}\, SD^{2}}{E^{2}}$$
where
Z is the value from the standard normal distribution reflecting
the confidence level that will be used (e.g., Z = 1.96 for 95%)
E is the desired margin of error (absolute error or precision). It has to be
decided by the researcher.
SD is the standard deviation of the variable. Its value can
be taken from a previous study or from a pilot study.
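The formula above translates directly into code. In this sketch the function name and the inputs (SD = 15, E = 3) are hypothetical values chosen only to show the arithmetic; they are not taken from the slides.

```python
from math import ceil
from statistics import NormalDist

def sample_size_mean(sd: float, margin: float, confidence: float = 0.95) -> int:
    """n = Z^2 * SD^2 / E^2, rounded up to a whole subject."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z**2 * sd**2 / margin**2)

# Hypothetical inputs: SD = 15 (e.g. from a pilot study), margin of error E = 3
n = sample_size_mean(sd=15, margin=3)
print(n)
```

Halving the margin of error roughly quadruples the required sample size, because E enters the formula squared.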
Example 2
1(0.35−0.20)2
$$n = \frac{\left[Z_{\alpha}\sqrt{\left(1+\frac{1}{m}\right)p^{*}(1-p^{*})} + Z_{\beta}\sqrt{\frac{p_1(1-p_1)}{m} + p_2(1-p_2)}\right]^{2}}{(p_1-p_2)^{2}}$$
$$n = \frac{2 \times 25^{2}\,(1.96+0.84)^{2}}{10^{2}} = 98$$
So in this case the researcher needs 98 subjects per
For this type of study, the following formula can be used to calculate
the sample size.
$$n = \frac{2\,[Z_{\alpha/2} + Z_{\beta}]^{2}\, P(1-P)}{[p_1 - p_2]^{2}}$$
The researcher feels that if the drug being tested increases survival
to 30% then the finding can be considered as clinically significant.
The effect size is the difference between the proportions: 0.2 − 0.3 = −0.1.
Pooled prevalence = (0.20 + 0.30)/2 = 0.25.
At the 5% significance level and 80% power the sample size will be
$$n = \frac{2\,(1.96+0.84)^{2}\,(0.25)(1-0.25)}{[-0.1]^{2}} \approx 294$$
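The arithmetic for this example can be reproduced as follows. This is a sketch under stated assumptions: the result is read as the size of each of two equal groups, the first call uses the rounded table values Z = 1.96 and Z = 0.84, and the second call uses exact normal quantiles for comparison. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_two_proportions(p1: float, p2: float, z_alpha: float, z_beta: float) -> float:
    """n = 2 (Z_alpha/2 + Z_beta)^2 P (1 - P) / (p1 - p2)^2, with P the pooled proportion."""
    p_pooled = (p1 + p2) / 2
    return 2 * (z_alpha + z_beta) ** 2 * p_pooled * (1 - p_pooled) / (p1 - p2) ** 2

# Rounded table values for a 5% significance level and 80% power
n_table = n_two_proportions(0.20, 0.30, z_alpha=1.96, z_beta=0.84)

# Exact quantiles from the standard normal distribution
z = NormalDist().inv_cdf
n_exact = n_two_proportions(0.20, 0.30, z_alpha=z(0.975), z_beta=z(0.80))

print(round(n_table, 2), ceil(n_exact))
```

The exact quantiles give a slightly larger value than the rounded table values, so the final figure depends on which convention is used.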
...Summary
The following table summarizes the sample size formulas for each scenario
described here. The formulas are organized by the proposed analysis,
a confidence interval estimate or a test of hypothesis.

Situation | To estimate CI | To conduct HT
Continuous, one sample | $n = \frac{(Z_{\alpha/2}\,\sigma)^{2}}{E^{2}}$ | $n = \left(\frac{Z_{1-\alpha/2} + Z_{1-\beta}}{|\mu_1 - \mu_0|/\sigma}\right)^{2}$
Continuous, two independent samples | $n_i = 2\,\frac{(Z_{\alpha/2}\,\sigma)^{2}}{E^{2}}$ | $n_i = 2\left(\frac{Z_{1-\alpha/2} + Z_{1-\beta}}{|\mu_1 - \mu_2|/\sigma}\right)^{2}$
Continuous, two pairs | $n = \frac{(Z_{\alpha/2}\,\sigma_d)^{2}}{E^{2}}$ | $n = \left(\frac{Z_{1-\alpha/2} + Z_{1-\beta}}{\mu_d/\sigma_d}\right)^{2}$
Dichotomous, one sample | $n = p(1-p)\,\frac{(Z_{\alpha/2})^{2}}{E^{2}}$ | $n = \left(\frac{Z_{1-\alpha/2} + Z_{1-\beta}}{|p_1 - p_0|/\sqrt{p_0(1-p_0)}}\right)^{2}$
Dichotomous, two independent samples | $n_i = [p_1(1-p_1) + p_2(1-p_2)]\,\frac{(Z_{\alpha/2})^{2}}{E^{2}}$ | $n_i = 2\left(\frac{Z_{1-\alpha/2} + Z_{1-\beta}}{|p_1 - p_2|/\sqrt{p(1-p)}}\right)^{2}$

In the last row, $p$ is the mean of $p_1$ and $p_2$.
...Summary
Estimating the difference between two population proportions
with specified absolute precision
$$n = \frac{Z_{1-\alpha/2}^{2}\,[P_1(1-P_1) + P_2(1-P_2)]}{e^{2}}$$
Hypothesis tests for two population proportions
For a one-sided test
$$n = \frac{\left[Z_{1-\alpha}\sqrt{2\bar{P}(1-\bar{P})} + Z_{1-\beta}\sqrt{P_1(1-P_1) + P_2(1-P_2)}\right]^{2}}{[P_1 - P_2]^{2}},$$
where $\bar{P} = \frac{P_1 + P_2}{2}$
For a two-sided test
$$n = \frac{\left[Z_{1-\alpha/2}\sqrt{2\bar{P}(1-\bar{P})} + Z_{1-\beta}\sqrt{P_1(1-P_1) + P_2(1-P_2)}\right]^{2}}{[P_1 - P_2]^{2}}$$
For a one-sided test for small proportion
$$n = \frac{[Z_{1-\alpha} + Z_{1-\beta}]^{2}}{0.00061\,\left[\arcsin\sqrt{P_2} - \arcsin\sqrt{P_1}\right]^{2}}$$
For a two-sided test for small proportion
$$n = \frac{[Z_{1-\alpha/2} + Z_{1-\beta}]^{2}}{0.00061\,\left[\arcsin\sqrt{P_2} - \arcsin\sqrt{P_1}\right]^{2}}$$
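The one-sided and two-sided hypothesis-test formulas for two proportions differ only in the critical value used, which the following sketch makes explicit. The function name, its default arguments, and the inputs P1 = 0.20 and P2 = 0.30 are assumptions for illustration; the small-proportion (arcsine) variants are not covered here.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_two_proportions_test(p1, p2, alpha=0.05, power=0.80, two_sided=True):
    """Sample size per group for testing P1 = P2 (unpooled + pooled variance form)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)  # Z_{1-beta}, since power = 1 - beta
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return numerator / (p1 - p2) ** 2

# Illustrative inputs (assumed): P1 = 0.20, P2 = 0.30
n_two_sided = n_two_proportions_test(0.20, 0.30, two_sided=True)
n_one_sided = n_two_proportions_test(0.20, 0.30, two_sided=False)
print(ceil(n_two_sided), ceil(n_one_sided))
```

A one-sided test needs fewer subjects for the same significance level and power because its critical value is smaller.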
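...Summary
Estimating an odds ratio with specified relative precision
$$n = \frac{Z_{1-\alpha/2}^{2}\left[\frac{1}{P_1^{*}(1-P_1^{*})} + \frac{1}{P_2^{*}(1-P_2^{*})}\right]}{[\log_e(1-e)]^{2}}$$
where $e$ is the specified relative precision.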
...Summary
where $\bar{P} = \frac{P_1 + P_2}{2}$
...Summary
Estimating an incidence rate with specified relative precision
$$n = \left[\frac{Z_{1-\alpha/2}}{e}\right]^{2}$$
Hypothesis tests for an incidence rate
For a one-sided test
$$n = \frac{[Z_{1-\alpha}\lambda_0 + Z_{1-\beta}\lambda_a]^{2}}{[\lambda_0 - \lambda_a]^{2}}$$
For a two-sided test
$$n = \frac{[Z_{1-\alpha/2}\lambda_0 + Z_{1-\beta}\lambda_a]^{2}}{[\lambda_0 - \lambda_a]^{2}}$$
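The incidence-rate formulas above can be sketched as two small functions. The function names and all numeric inputs (relative precision 0.10, rates 0.05 and 0.10) are hypothetical values used only to illustrate the calculation.

```python
from math import ceil
from statistics import NormalDist

z = NormalDist().inv_cdf  # standard normal quantile function

def n_incidence_rate_estimate(rel_precision: float, confidence: float = 0.95) -> float:
    """n = (Z_{1-alpha/2} / e)^2 for estimating a rate with relative precision e."""
    return (z(1 - (1 - confidence) / 2) / rel_precision) ** 2

def n_incidence_rate_test(lam0, lam_a, alpha=0.05, power=0.80, two_sided=False):
    """n = (Z_alpha * lam0 + Z_{1-beta} * lam_a)^2 / (lam0 - lam_a)^2."""
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    return (z_alpha * lam0 + z(power) * lam_a) ** 2 / (lam0 - lam_a) ** 2

# Hypothetical inputs
n_est = n_incidence_rate_estimate(rel_precision=0.10)
n_test = n_incidence_rate_test(lam0=0.05, lam_a=0.10)
print(ceil(n_est), ceil(n_test))
```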
Hypothesis tests for two incidence rates in follow-up (cohort)
studies (study duration not fixed)
For a one-sided test
$$n_1 = \frac{\left[Z_{1-\alpha}\sqrt{(1+k)\bar{\lambda}^{2}} + Z_{1-\beta}\sqrt{k\lambda_1^{2} + \lambda_2^{2}}\right]^{2}}{k\,[\lambda_1 - \lambda_2]^{2}}$$
where $\bar{\lambda} = \frac{\lambda_1 + \lambda_2}{2}$
and $k$ is the ratio of the sample size for the second group
of subjects ($n_2$) to that for the first group ($n_1$).
Sampling Distribution
Sampling distribution
Examples
Example 1: Suppose we have a hypothetical population of size 3,
consisting of three children: A is 3 years old, B is 6 years old and C
is 9 years old. Construct sampling distribution of the sample mean
of size 2 using sampling without replacement and with replacement.
Solution:
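The mean and variance of the population are 6 and 6, respectively.
If sampling is without replacement we will have $^{3}C_{2} = 3$
possible samples, with $E(\bar{X}) = 6$ and $V(\bar{X}) = \frac{\sigma^{2}}{n}\cdot\frac{N-n}{N-1} = \frac{6}{2}\cdot\frac{1}{2} = 1.5$
If sampling is with replacement we will have $N^{n} = 3^{2} = 9$
possible samples, with $E(\bar{X}) = 6$ and $V(\bar{X}) = \frac{\sigma^{2}}{n} = \frac{6}{2} = 3$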
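Because the population has only three members, the sampling distribution of the mean can be built by listing every possible sample. The sketch below does this for both sampling schemes; the variable names are chosen here for illustration.

```python
from itertools import combinations, product
from statistics import mean, pvariance

ages = [3, 6, 9]  # children A, B and C

# Sample means for every sample of size 2
means_without = [sum(s) / 2 for s in combinations(ages, 2)]   # without replacement
means_with = [sum(s) / 2 for s in product(ages, repeat=2)]    # with replacement

print("population:          mean", mean(ages), "variance", pvariance(ages))
print("without replacement:", means_without,
      "E =", mean(means_without), "V =", pvariance(means_without))
print("with replacement:   ", means_with,
      "E =", mean(means_with), "V =", pvariance(means_with))
```

In both schemes the expected value of the sample mean equals the population mean, while the variance is smaller without replacement because of the finite population correction.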
Sampling distribution of mean
Example 2: Let $\bar{X}$ be the mean of a random sample of size 50 drawn
from a population with mean 112 and standard deviation 40.
Find the mean and standard deviation of $\bar{X}$.
Solution:
$$\mu_{\bar{X}} = \mu = 112$$
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{40}{\sqrt{50}} = 5.65685$$
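The two results of this example follow directly from the population parameters, as the short sketch below shows. The variable names are illustrative.

```python
from math import sqrt

mu, sigma, n = 112, 40, 50  # population mean, population SD, sample size

mean_of_sample_mean = mu              # the sample mean is unbiased for mu
se_of_sample_mean = sigma / sqrt(n)   # standard deviation of the sample mean

print(mean_of_sample_mean, round(se_of_sample_mean, 5))
```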
References
1 Mukhopadhyay, Nitis. Probability and statistical inference/Nitis
and Zimmerman D. L.
4 Mathematical Statistics: A Textbook, S. Biswas and
G.L.Sriwastav, Narosa
5 Cai J, Zeng D. Sample size/power calculation for case-cohort studies.
Biometrics 2004;60:1015-24.
6 S. K. Lwanga and S. Lemeshow. Sample size determination in