Chapter 102 Biostatistics
Introduction to Biostatistics
What is Statistics?
- (singular): the science that deals with the collection, classification, analysis, and
interpretation of numerical facts or data, and that, by use of mathematical
theories of probability, imposes order and regularity on aggregates of more or
less disparate elements.
- This includes design issues as well.
- (plural): the numerical facts or data themselves. (Webster's Dictionary)
Statistics is about common sense and good design! (Campbell and Machin 1983)
Biostatistics:
Statistics applied to biological (life) problems, including: Medicine, Public health,
Ecological and environmental.
Statistical Analysis:
Descriptive Statistics:
o Describe the sample.
Inference:
o Make inferences about the population.
o Primarily performed in two ways:
Hypothesis testing.
Estimation (more important!!).
Prediction.
Bwasiq 86 45
Chapter two: Biostatistics
Collect Data.
Analyze Data.
o Descriptive statistics.
o Statistical Inference.
DATA
The vast majority of errors in research arise from poor planning (e.g., of data
collection).
Fancy statistical methods cannot rescue garbage data.
Collect exact values whenever possible.
Types of Data
…the selection of an appropriate statistical technique is determined by the
research design, hypothesis, and the data collected
Definitions
Variable: Characteristic or attribute that can assume different values.
Random Variable: A variable whose values are determined by
chance.
Population: All subjects possessing a common characteristic that is
being studied.
Sample: A subgroup or subset of the population.
Types of data:
1- Constant 2- Variable
Types of variable:
1- Qualitative (nominal and ordinal)
2- Quantitative (continuous and discrete)
1. Qualitative data
A. Nominal data
Data that can be classified into two or more categories, e.g. blood type, race.
No meaningful order.
B. Ordinal data
Data that can be classified into categories that have a natural ordering,
e.g. very satisfied, satisfied, neutral, unsatisfied, very unsatisfied.
2. Quantitative data
A. Discrete data
Each element of the set lies at one of a few isolated points.
Countable variables, in integer form.
Numbers of things, e.g. age in completed years, number of men.
B. Continuous data
Each element of the set can theoretically lie anywhere on the number
scale.
Measurable variables, e.g. blood pressure, weight (kg), time (hours, years).
Often recorded rounded to the nearest integer.
Inferential statistics
It uses probability theory to draw conclusions about a population from data
obtained in a sample.
It is rarely feasible to study an entire population, so we study samples.
Methods for estimation and hypothesis testing are the main tools for drawing
inferences.
Descriptive Statistics
Use of numerical information to summarize, simplify, and present masses of
data.
Data are organized and summarized for clearer presentation
and for ease of communication.
Data may come from studies of populations (often called a census study) or
samples.
- A numerical summary of a population is a parameter.
- A numerical summary of a sample is a statistic.
- Often, either is called a statistic.
Descriptive Methods for Continuous Data
Statistical procedures used to summarise, organise, and simplify data. This
process should be carried out in such a way that reflects overall findings
Raw data is made more manageable
Raw data is presented in a logical form
Patterns can be seen from organised data
- Tables
- Graphical techniques
- Measures of Central Tendency
- Measures of Dispersion (variability).
- Coefficient of Correlation.
Describing data with tables
1) frequency table
2) relative and cumulative frequency
3) grouped frequency
4) open-ended groups
5) cross-tabulation
6) tables that are not contingency tables
4) Open-ended group:
One or two values, called outliers, may lie a long way from the general mass of
the data. Group them into an open-ended class using ≤ or ≥.
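5) Cross-tabulation
Breast lump diagnosis by whether the woman has had 2 or fewer children.
Column percentages are in round brackets, row percentages in square brackets.

Diagnosis    Yes               No                Totals
Benign       21 (84%) [66%]    11 (73%) [34%]    32 [100%]
Malignant     4 (16%) [50%]     4 (27%) [50%]     8 [100%]
Totals       25 (100%)         15 (100%)         40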
1. Numbers, percentages and proportions
Numbers: the numerical summaries of data.
A percentage is a proportion multiplied by 100 (used with categorical data).
1) Prevalence: number of existing cases in some population at a given time.
2) Incidence (inception): the number of new cases occurring per 100, or
per 1000, of the population, during some period of time.
$$\sum_{i=1}^{n} X_i = X_1 + X_2 + X_3 + \dots + X_n$$
Mean from a Positively Skewed Distribution:
When the data are positively skewed, analyses are commonly done on the log scale.
This is done to minimize the effect of extreme observations.
Method of obtaining the mean:
1. Take the log of each data value.
2. Calculate the mean on the log scale.
3. Take the antilog of the mean to return to the original scale of measurement.
This is called the "GEOMETRIC" mean.
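The three steps above can be sketched in Python. This is a minimal illustration, not part of the original notes: the function name `geometric_mean` and the sample values are assumptions chosen only to show how one extreme observation pulls the arithmetic mean upward while the geometric mean stays near the bulk of the data.

```python
import math

def geometric_mean(values):
    """Mean computed on the log scale, then back-transformed (antilog)."""
    if any(v <= 0 for v in values):
        raise ValueError("all values must be positive to take logs")
    logs = [math.log(v) for v in values]   # step 1: log of each data value
    mean_log = sum(logs) / len(logs)       # step 2: mean on the log scale
    return math.exp(mean_log)              # step 3: antilog of that mean

# Hypothetical positively skewed sample (not from the text).
sample = [1, 2, 4, 8, 100]
arithmetic = sum(sample) / len(sample)
geometric = geometric_mean(sample)
print(arithmetic, geometric)
```

The single value of 100 dominates the arithmetic mean, whereas the geometric mean remains close to the middle of the other observations.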
Advantages:
- Simple and easy
- Most widely used
- Can be used for further statistical tests
- All values are included
- Does not need arrangement of data
Disadvantages:
- Affected by extreme values
- Sometimes looks ridiculous, e.g. average number of children = 2.7
Median
The value that divides the data into two equal parts after the data are arranged in
ascending or descending order.
If the number of observations in the dataset is odd, the median is the ½(n+1)th
observation.
If the number of observations is even, the median is defined as the average of the
(½n)th and the (½n+1)th observations.
e.g. for {8,5,4,12,15,7,28} the median is 8.
- First put the observations in order: 4,5,7,8,12,15,28
- Find the ½(n+1)th (here the 4th) observation.
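The odd and even cases of the median rule can be sketched as follows. The helper name `median` and the six-value list are illustrative assumptions; the seven-value list is the example from the text.

```python
def median(values):
    """Median by sorting, handling odd and even sample sizes separately."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n+1)/2-th observation (index mid when counting from 0).
        return ordered[mid]
    # Even n: average of the (n/2)-th and (n/2 + 1)-th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_median = median([8, 5, 4, 12, 15, 7, 28])   # example from the text
even_median = median([8, 5, 4, 12, 15, 7])      # hypothetical even-sized sample
print(odd_median, even_median)
```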
Advantages:
- Not affected by extreme values
- Used for growth curves and income
- Can be determined graphically
Disadvantages:
- Needs arrangement of data
- Difficult to calculate from large amounts of data
- Not all values are represented
Mode
It is the most common value found in the dataset (fashionable value)
o Hb level of 5 pregnant women
12, 12.5, 11, 13, 12.5 Mode = 12.5
More than one mode may occur (bimodal, trimodal). Sometimes there is no
mode.
Advantages: Not affected by extreme values
Disadvantages: Not all values are represented
Distribution Characteristics
Mode: peak(s). Median: equal-areas point. Mean: balancing point.
4. Measures of Dispersion
These measure the degree of variation or dispersion around the mean.
The measurement of dispersion (or variation) plays an important role in the
methods of statistical inference.
We will discuss: range, variance, and standard deviation.
Range:
Difference between highest and lowest value
Range = largest value-smallest value
o E.g: Hb level of 5 pregnant women 12, 12.5, 11, 13, 12.5
Range = 13-11 = 2
Advantages: Easy to calculate
Disadvantages
o It is affected by extreme values.
o Value of range is only determined by two values.
o The interpretation of the range is difficult.
o It does not provide information about other values and how dispersed
they are.
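The variance is obtained by taking the deviation of each value from the mean, squaring these deviations and dividing their sum by
one less than n. The standard deviation (SD) is the square root of the variance.

x     x − x̄    (x − x̄)²
8       0        0
5      −3        9
4      −4       16
12      4       16
15      7       49
5      −3        9
7      −1        1

Mean x̄ = 8; sum of squared deviations = 100; S² = 100/6 = 16.67; SD = √16.67 = 4.08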
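The same calculation can be checked step by step in Python. This sketch uses the seven data values from the table; the variable names are illustrative.

```python
import math

data = [8, 5, 4, 12, 15, 5, 7]           # values from the table above
n = len(data)
mean = sum(data) / n                     # sample mean
deviations = [x - mean for x in data]    # deviation of each value from the mean
sum_sq = sum(d ** 2 for d in deviations) # sum of squared deviations
variance = sum_sq / (n - 1)              # divide by one less than n
sd = math.sqrt(variance)                 # standard deviation
print(mean, sum_sq, round(variance, 2), round(sd, 2))
```

Dividing by n − 1 rather than n is what makes this the sample variance described in the text.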
Coefficient of Variation
o Measure of spread that is independent of the units in which the variable is
measured.
o We divide the standard deviation by the mean and express this quotient
as a percentage: CV = (SD / mean) × 100%.
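A short sketch of the coefficient of variation, using the Hb levels of the 5 pregnant women quoted earlier in the text. The function name and the rescaling by 10 (standing in for a change of measurement units) are assumptions made for illustration.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return stdev(values) / mean(values) * 100

hb = [12, 12.5, 11, 13, 12.5]                      # Hb levels from the text
cv_original = coefficient_of_variation(hb)
# Rescaling every value (as a change of units would) leaves the CV unchanged.
cv_rescaled = coefficient_of_variation([x * 10 for x in hb])
print(round(cv_original, 2), round(cv_rescaled, 2))
```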
Coefficient of Correlation
o Measure of linear association between 2 continuous variables.
Setting:
o two measurements are made for each observation.
o Sample consists of pairs of values and you want to determine the
association between the variables.
Association Examples
o Example 1: Association between a mother's weight and the birth weight of her child.
2 measurements: mother's weight and baby's weight. Both are continuous
measures.
o Example 2: Association between a risk factor and a disease.
2 measurements: disease status and risk factor status. Both are dichotomous
measurements.
Correlation Analysis
o When you have 2 continuous measurements you use correlation
analysis to determine the relationship between the variables. Through
correlation analysis you can calculate a number that relates to the
strength of the linear association.
Scatter Plots and Association
o You can plot the 2 variables in a scatter plot (one of the chart types
in SPSS/Excel).
The pattern of the "dots" in the plot indicates the statistical relationship between the
variables (the strength and the direction).
Positive relationship – pattern goes from lower left to upper right.
Negative relationship – pattern goes from upper left to lower right.
The more the dots cluster around a straight line the stronger the linear relationship.
Pearson Correlation Coefficient
Interpretation:
values near 1 indicate strong positive linear relationship
values near –1 indicate strong negative linear relationship
values near 0 indicate a weak linear association
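The notes do not show the formula for the coefficient, so the sketch below assumes the standard product-moment definition, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²). The function name and the mother/baby weights are hypothetical values invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired measurements."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # co-variation
    sxx = sum((x - mx) ** 2 for x in xs)                    # variation in x
    syy = sum((y - my) ** 2 for y in ys)                    # variation in y
    return sxy / math.sqrt(sxx * syy)

# Hypothetical pairs: mother's weight (kg) and baby's birth weight (kg).
mother = [52, 58, 61, 67, 73, 80]
baby = [2.9, 3.0, 3.3, 3.2, 3.6, 3.8]
r = pearson_r(mother, baby)
print(round(r, 3))
```

A value close to 1 here would correspond to dots clustering around a line running from lower left to upper right.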
Probability theory
It is also of interest to investigate how the information contained in a sample can
be used to infer the characteristics of the population from which it was drawn.
The foundation for statistical inference is the theory of probability.
Proportion – the relative size of the portion of a population with a certain
characteristic.
Random selection – a selection where each person has an equal chance of
being selected.
The chance depends on the size of the sub-population to which he/she belongs.
The chance is measured by the proportion, a number between 0 and 1, called the
probability.
Proportion measures size (it is a descriptive statistic), whereas probability measures chance.
Sampling
We can't study the entire population, so we rely on a sample (a subgroup of the
population under investigation). By calculating probabilities we can describe what has
happened and predict what should happen in the future under the same conditions.
Probability and Random Sampling
Suppose that out of N=100,000 persons a total of 5500 are positive on a certain
screening test. The probability of a randomly selected person from the target population
having a positive test result is then 5500/100,000 = 0.055, or 5.5%.
Rationale: on an initial draw the person may or may not have a positive test. However,
when this process is repeated a large number of times, the relative
frequency of positive people will approximate 0.055.
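The relative-frequency rationale can be illustrated with a small simulation. This is only a sketch: the function name, the fixed seed, the number of draws, and the convention of labelling the first 5,500 person ids as test-positive are all assumptions made for the example.

```python
import random

N = 100_000        # population size from the text
POSITIVES = 5_500  # persons with a positive screening test

def relative_frequency(draws, seed=0):
    """Draw persons at random (with replacement); return the share testing positive."""
    rng = random.Random(seed)          # fixed seed, chosen arbitrarily for reproducibility
    hits = 0
    for _ in range(draws):
        person = rng.randrange(N)      # every person is equally likely to be drawn
        if person < POSITIVES:         # ids 0..5499 are treated as test-positive
            hits += 1
    return hits / draws

true_probability = POSITIVES / N
estimate = relative_frequency(200_000)
print(true_probability, estimate)
```

With many repetitions the estimate settles close to 0.055, while a handful of draws can land far from it.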
Cancer Screening Example
               Test result X
Disease Y      +         -          Total
+              154       225        379
-              362       23,362     23,724
Total          516       23,587     24,103
Each member of the population is characterized by two variables:
1. Test result – X
2. True disease status - Y
Marginal Probabilities
P(X=+) = probability of a positive test = 516/24,103 = 0.021
P(X=-) = probability of a negative test = 23,587/24,103 = 0.979
P(Y=+) = probability of having the disease = 379/24,103 = 0.016
P(Y=-) = probability of not having the disease = 23,724/24,103 = 0.984
Joint Probabilities
P(X=+, Y=+) = probability of a positive test and having the disease = 154/24,103 =
0.006
P(X=+, Y=-) = probability of a positive test and not having the disease = 362/24,103 =
0.015
P(X=-, Y=+) = probability of a negative test and having the disease = 225/24,103 =
0.009
P(X=-, Y=-) = probability of a negative test and of not having the disease =
23,362/24,103 = 0.969
Summary (entries are rounded to three decimals, so rows and columns may not sum exactly):
Disease Y    X = +                 X = -                 Total
+            P(X=+,Y=+)=0.006      P(X=-,Y=+)=0.009      P(Y=+)=0.016
-            P(X=+,Y=-)=0.015      P(X=-,Y=-)=0.969      P(Y=-)=0.984
Total        P(X=+)=0.021          P(X=-)=0.979
Conditional Probabilities
P(X=+ | Y=+) = probability of a positive test given cancer is present = 154/379 = 0.406
(this is the SENSITIVITY of the test)
P(X=- | Y=-) = probability of a negative test given cancer is not present
= 23,362/23,724 = 0.985 (this is the SPECIFICITY of the test)
P(Y=+ | X=+) = probability cancer is present given a positive test
= 154/516 = 0.298 (this is the POSITIVE PREDICTIVITY of the test)
P(Y=- | X=-) = probability cancer is not present given a negative test
= 23,362/23,587 = 0.990 (this is the NEGATIVE PREDICTIVITY of the test)
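Predictive Values
$$\text{positive predictivity} = \frac{(\text{prevalence})(\text{sensitivity})}{(\text{prevalence})(\text{sensitivity}) + (1 - \text{prevalence})(1 - \text{specificity})}$$
$$\text{negative predictivity} = \frac{(1 - \text{prevalence})(\text{specificity})}{(1 - \text{prevalence})(\text{specificity}) + (\text{prevalence})(1 - \text{sensitivity})}$$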
These formulas, an application of "Bayes' Theorem", allow us to calculate the predictive values
from the prevalence, sensitivity and specificity without having the data from the 2x2 table.
If a test is applied to a target population with a low disease prevalence, the positive
predictive value will be low.
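As a check on the predictive-value formulas, the sketch below recomputes both values from the prevalence, sensitivity and specificity of the cancer screening table and compares them with the direct ratios 154/516 and 23,362/23,587. The function name and the low-prevalence value of 0.001 are illustrative assumptions.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Positive and negative predictive values via Bayes' theorem."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    true_neg = (1 - prevalence) * specificity
    false_neg = prevalence * (1 - sensitivity)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Quantities taken from the cancer screening table in the text.
prevalence = 379 / 24103
sensitivity = 154 / 379
specificity = 23362 / 23724
ppv, npv = predictive_values(prevalence, sensitivity, specificity)

# Hypothetical rarer disease: same test, prevalence of 1 in 1000.
ppv_rare, _ = predictive_values(0.001, sensitivity, specificity)
print(round(ppv, 3), round(npv, 3), round(ppv_rare, 3))
```

Lowering the prevalence while keeping the test unchanged drives the positive predictive value down, which is the point made in the text.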
Conditional Probability
The probability that an event B will happen given that we already know the outcome of
another event A. We are looking to see if the prior occurrence of A causes the
probability of B to change.
If P(B|A)=P(B) then we say that the two events A and B are independent and
P(A,B)=P(A)*P(B).
Back to the example: P(X+|Y+) = 0.406 and P(X+) = 0.021.
Therefore X+ and Y+ are not independent: knowing that cancer
is present changes the probability of the test being positive.
Relative Risk
The relative risk (RR) is the chance that a member of a group receiving some exposure will develop a disease
relative to the chance that a member of an unexposed group will develop the same
disease: RR = P(disease | exposed) / P(disease | unexposed).
Recall: a RR of 1.0 indicates that the probabilities of disease in the exposed and
unexposed groups are identical – an association between exposure and disease does
not exist.
Sampling
Sampling In Quantitative Research
1. Total population: the total collection of units, elements or individuals that you
want to analyse.
2. Representative sample
3. Probability Sampling
4. Non-Probability Sampling
5. Sample Size
Sample
A sample is a group of units selected from a larger group (the population). Samples
are selected because the population is too large to study in its entirety.
It is important that the researcher carefully and completely defines the population, including
a description of the members to be included.
Representative sample
A sample whose characteristics correspond to, or reflect, those of the original
population or reference population
Probability Sampling
A probability provides a quantitative description of the likely occurrence of a particular
event. A probability sampling method is any method of sampling that uses some form of
random selection. In order to have a random selection method, you must set up some
process or procedure that assures that the different units in your population have equal
probabilities of being chosen (Clark 2002: 37).
Stratified Random Sampling
Often there are factors which divide the population into sub-populations (groups / strata). A
stratified sample is obtained by taking samples from each stratum or sub-group of a
population.
Non-Probability Sampling
Convenience/ opportunity/accidental sampling.
Purposive/ judgemental sampling.
Quota sampling.
Snowball sampling.
Sample Size
In general, the larger the sample size (selected with the use of probability
techniques) the better. The more heterogeneous a population is on a variety of
characteristics (e.g. race, age, sexual orientation, religion), the larger the sample
needed to reflect that diversity. (Papadopoulos 2003)
Response rates vary with the type of survey.
Response rate = number of respondents / sample size.
Estimation Techniques for Binary Data
Numerical Methods for Binary Data
A special case of discrete data is binary data, where each outcome has only 2
possible values. When outcomes are classified as belonging to one of two
possible categories, usually one of them is considered to be of primary
interest (e.g. presence of disease).
Standard Deviation vs. Standard Error
Standard deviation measures the variability in the population or sample
Standard error measures the precision of a statistic—such as the sample mean or
proportion—as an estimate of the population mean or population proportion
Proportions (P)
For each individual in the study, we record a binary outcome (Yes/No;
Success/Failure) rather than a continuous measurement.
Compute a sample proportion, $\hat{p}$ (pronounced "p-hat"), by taking the observed
number of "yes" outcomes divided by the total sample size.
The sample proportion can be viewed as a special case of the sample mean
when the data are coded as 0 or 1.
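Variance of Dichotomous Data
Consider data that are coded "0" or "1".
Write out the variance $s^2$ using the shortcut formula,
but using $n$ instead of $n - 1$:
$$s^2 = \frac{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}}{n}$$
Since the data are coded 0 or 1, $x_i^2 = x_i$, so
$$s^2 = \frac{\sum x_i - \dfrac{\left(\sum x_i\right)^2}{n}}{n} = \frac{\sum x_i}{n}\left(1 - \frac{\sum x_i}{n}\right) = p(1 - p)$$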
In other words, the statistic p(1-p) can be used in place of $s^2$ as a measure of
variation for dichotomous data.
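The identity can be verified numerically. In this sketch the ten 0/1 outcomes are hypothetical; the variance is computed with divisor n (not n − 1), as in the derivation.

```python
# Hypothetical binary outcomes: 1 = "yes", 0 = "no".
outcomes = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
n = len(outcomes)
p = sum(outcomes) / n                                   # sample proportion = mean of the 0/1 data
variance_n = sum((x - p) ** 2 for x in outcomes) / n    # variance with divisor n
print(p, variance_n, p * (1 - p))
```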
Standard Error of the Sample Proportion
The standard deviation of the data is:
$$s = \sqrt{\hat{p}(1 - \hat{p})}$$
Therefore the standard error of the sample proportion is:
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
More specifically, the CLT states that the sampling distribution
of the sample proportion ($\hat{p}$) will be approximately normal when
the sample size is large. The mean and variance of the sampling
distribution are:
$$\mu_{\hat{p}} = \pi, \qquad \sigma_{\hat{p}}^2 = \frac{\pi(1 - \pi)}{n}$$
where $\pi$ is the population proportion.
What is the probability that our estimate of the population proportion will be correct
within 3%?
$$\hat{p} \pm 1.96 \times SE(\hat{p}), \quad \text{where } SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
Confidence Intervals for the Population Proportion
From the sampling distribution of the sample proportion we can create a 95%
confidence interval for the population proportion $\pi$.
This is only applicable in LARGE samples (n≥25)!!!
Notes on 95% Confidence Interval for a Proportion
Example: Suppose that n=25 newborn infants of obese women are sampled
and x=10 weigh less than 2500 grams. Create a 95% CI for the population
proportion ($\pi$).
Sample proportion:
$$\hat{p} = \frac{x}{n} = \frac{10}{25} = 0.4$$
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = \sqrt{\frac{0.4(0.6)}{25}} = 0.098$$
95% CI for the population proportion:
$$\hat{p} \pm 1.96 \times SE(\hat{p}) = 0.4 \pm 1.96 \times 0.098 = 0.4 \pm 0.192 = (0.208, 0.592)$$
We are 95% confident that this interval covers the true
population proportion of newborns of obese
mothers that weigh less than 2500 grams: intervals constructed this way cover the
true proportion in 95% of repeated samples.
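The interval for the newborn example can be reproduced with a few lines of Python. The function name `proportion_ci` is illustrative; the multiplier is taken from the standard normal distribution (about 1.96 for a 95% interval) rather than typed in by hand.

```python
import math
from statistics import NormalDist

def proportion_ci(x, n, level=0.95):
    """Large-sample (normal approximation) confidence interval for a proportion."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 when level = 0.95
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

# Newborn example from the text: x = 10 of n = 25 weigh less than 2500 g.
p_hat, se, (lower, upper) = proportion_ci(10, 25)
print(round(p_hat, 3), round(se, 3), round(lower, 3), round(upper, 3))
```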
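OC Users:
$$n_1 = 5000, \quad x_1 = 13, \quad \text{therefore } \hat{p}_1 = \frac{13}{5000} = 0.0026$$
$$\text{and } SE(\hat{p}_1) = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1}} = \sqrt{\frac{0.0026(1 - 0.0026)}{5000}} = 0.0007$$
95% CI for the population proportion:
$$\hat{p}_1 \pm 1.96 \times SE(\hat{p}_1) = 0.0026 \pm 1.96 \times 0.0007 = (0.0012, 0.0040)$$
NON-OC Users:
$$n_2 = 10{,}000, \quad x_2 = 7, \quad \text{therefore } \hat{p}_2 = \frac{7}{10{,}000} = 0.0007$$
$$\text{and } SE(\hat{p}_2) = \sqrt{\frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = \sqrt{\frac{0.0007(1 - 0.0007)}{10{,}000}} = 0.0003$$
95% CI for the population proportion:
$$\hat{p}_2 \pm 1.96 \times SE(\hat{p}_2) = 0.0007 \pm 1.96 \times 0.0003 = (0.0001, 0.0013)$$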
It can be seen that the 95% CIs for the two population proportions barely overlap,
which is a good indication that the two population MI rates are probably not the
same.
The 95% CI for the difference in population proportions:
$$(\hat{p}_1 - \hat{p}_2) \pm 1.96 \times \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
$$= (0.0026 - 0.0007) \pm 1.96 \times \sqrt{\frac{0.0026(1 - 0.0026)}{5000} + \frac{0.0007(1 - 0.0007)}{10{,}000}}$$
$$= 0.0019 \pm 1.96 \times 0.0008 = (0.0003, 0.0035)$$
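The interval for the difference can be recomputed as a sketch in Python, using the OC and non-OC counts from the text. The function name is an assumption. Because the code carries the unrounded standard error (about 0.00077 rather than 0.0008), its limits differ from the hand calculation in the fourth decimal place.

```python
import math

def difference_ci(x1, n1, x2, n2, z=1.96):
    """95% CI for the difference between two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, se, (diff - z * se, diff + z * se)

# Counts from the text: 13 of 5000 OC users, 7 of 10,000 non-OC users.
diff, se_diff, (lower, upper) = difference_ci(13, 5000, 7, 10000)
print(round(diff, 4), round(se_diff, 5), round(lower, 4), round(upper, 4))
```

The interval excludes 0, which agrees with the conclusion that the two rates are probably not the same.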
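CI for the Odds Ratio
The odds ratio is calculated from the 2x2 table:
Exposure     Cases    Controls
Exposed      a        b
Unexposed    c        d
$$OR = \frac{\text{odds of exposure for cases}}{\text{odds of exposure for controls}} = \frac{a/c}{b/d} = \frac{ad}{bc}$$
The formula for the 95% CI for the population odds ratio works on the log scale:
$$\ln(OR) \pm 1.96 \times SE(\ln(OR)), \quad \text{where } SE(\ln(OR)) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$
If the resulting interval for $\ln(OR)$ is $(A, B)$, the 95% CI for the odds ratio is $(e^A, e^B)$.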
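A minimal sketch of the odds ratio interval follows. The text gives no counts for this table, so the values a = 20, b = 10, c = 30, d = 40 are hypothetical, as is the function name; the cell labels follow the usual layout (a, b in the exposed row; c, d in the unexposed row).

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio ad/bc with a 95% CI built on the log scale."""
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of ln(OR)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log)               # back-transform the limits
    upper = math.exp(log_or + z * se_log)
    return odds_ratio, (lower, upper)

# Hypothetical counts: exposed cases, exposed controls, unexposed cases, unexposed controls.
odds_ratio, (lower, upper) = odds_ratio_ci(20, 10, 30, 40)
print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

Because the interval is symmetric on the log scale, it is not symmetric around the odds ratio itself after back-transformation.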
Tests of Significance
General Concepts
One approach is to construct a confidence interval for the population parameter;
another is to construct a statistical hypothesis test.
With statistical tests we claim that the mean μ of the population is equal to some
postulated value μ0; this claim is called the null hypothesis, or H0.
The alternative hypothesis is a second statement that contradicts H0.
Serum Cholesterol Example
Suppose we want to test whether the mean serum cholesterol level of hypertensive
smokers is equal to the mean of the general population of 20-74 year old males.
Together the null and alternative hypotheses cover all possible values of the
population mean:
$$H_0: \mu = \mu_0 = 211 \text{ mg/100 ml}$$
$$H_A: \mu \neq 211 \text{ mg/100 ml}$$
Probabilities of the possible decisions:
Decision             H0 true    H0 false
Reject H0            α          1-β
Do NOT reject H0     1-α        β
Types of Errors
Two possible ways to commit an error:
(i) Type I: Reject H0 when it is true (probability α)
(ii) Type II: Fail to reject H0 when it is false (probability β)
The goal in hypothesis testing is to keep α and β (the probabilities of type I and II
errors) as small as possible.
Usually α is fixed at some specific level, say 0.05, called the significance level of the test.
1-β is called the power of the test.
Hypothesis Testing
Want to draw a conclusion about a population parameter
In a population of women who use oral contraceptives, is the average
(expected) change in blood pressure (after-before) 0 or not?
Sometimes statisticians use the term "expected" for the population average:
μ is the expected (population) mean change in blood pressure.
We are testing both hypotheses at the same time.
Our result will allow us to either "reject H0" or "fail to reject H0".
We start by assuming the null (H0) is true, and asking:
"How likely is the result we got from our sample?"
Null hypothesis: H0: µ = 0
Alternative hypothesis: HA: µ ≠ 0
We reject H0 if the sample mean is far away from 0.
Using the p-value to Make a Decision
Recall, we specified two competing hypotheses about the underlying, true mean
blood pressure change, µ
We now need to use the p-value to choose a course of action . . . either reject
H0, or fail to reject H0
Hypothesis Testing for Proportions
The proportion of patients surviving five years after being diagnosed with lung
cancer among those who are over 40 at the time of diagnosis is 8.2%. Is it possible that
the proportion surviving in the under-40 population is 0.082 as well? For a sample of 52
persons under 40 who have been diagnosed with lung cancer the survival proportion is
0.115.
Step 1: Determine the hypotheses
H0: π = π0 = 0.082
HA: π ≠ 0.082 (two-sided alternative)
Step 2: Calculate the test statistic:
$$Z = \frac{\hat{p} - \pi_0}{\sqrt{\pi_0(1 - \pi_0)/n}} = \frac{0.115 - 0.082}{\sqrt{0.082(1 - 0.082)/52}} = 0.87$$
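The test statistic for the lung cancer example can be recomputed as below. The two-sided p-value is an addition that the text does not report for this example; it is obtained here from the standard normal distribution, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

p_hat = 0.115   # observed five-year survival proportion in the under-40 sample
pi_0 = 0.082    # proportion under the null hypothesis
n = 52          # sample size

se_null = math.sqrt(pi_0 * (1 - pi_0) / n)        # standard error computed under H0
z = (p_hat - pi_0) / se_null
p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # two-sided alternative
print(round(z, 2), round(p_value, 3))
```

Note that the standard error uses the null value 0.082, not the observed 0.115, because the test assumes H0 is true.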
Hypothesis Testing
Steps in Hypothesis Testing
1. State the null hypothesis H0 and the alternative hypothesis Ha.
2. Calculate the value of the test statistic on which the test will be based.
3. Find the P-value for the observed data.
4. State a conclusion.
One sided or 2-sided
If you do not have a specific direction firmly in mind in advance (before looking
at the data), a 2-sided alternative hypothesis should be used.
Hypothesis Examples
Does mean age of onset of a certain acute disease for school children differ
from 11.5?
Is the average cross-sectional area of the lumen of coronary arteries for men,
ages 4 to 59 less than 31.5% of the total arterial cross section?
Step 4- Making a Conclusion
We need to compare our p-value with a fixed value that we regard as decisive.
This value determines how much evidence against H0 we will require to reject H0
and we call it the significance level (α).
With a significance level set at 0.05 we are requiring that the data give evidence
against H0 so strong that it would happen no more than 5% of the time when
H0 is true.
If the p-value is as small as or smaller than α, we say that the data are statistically
significant at level α and we reject H0.
Example for the Population Mean
Do middle-aged male executives have a different average blood pressure than the
general population? The National Center for Health Statistics reports that the mean
systolic blood pressure for males 35 to 44 years of age is 128 mm Hg and the
standard deviation in this population is 15 mm Hg. The medical director of a company
looks at 72 company executives in this age group and finds that the mean systolic blood
pressure in this sample is 126.07 mm Hg. Is this evidence that the executive blood
pressure differs from the national average?
Step 1: State your hypotheses
H0: μ = μ0 = 128 mm Hg    Ha: μ ≠ 128 mm Hg
Step 2 – Calculate your test statistic
We make the unrealistic assumption that the population standard deviation is known.
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{126.07 - 128}{15/\sqrt{72}} = -1.09$$
H0: μ = 450    Ha: μ > 450
Step 2 – Calculate your test statistic
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{461 - 450}{100/\sqrt{500}} = 2.46$$
Returning to the executive blood pressure example (H0: μ = 128, Ha: μ ≠ 128, test statistic z = -1.09):
Step 3 – Calculate the p-value
The p-value is the probability that a standard normal variable Z takes a value at least 1.09 away from
zero.
P-value = 2*P(Z ≥ 1.09) = 2*(0.5-0.3621) = 0.2758
This means that 27.6% of the time an SRS of size 72 from the general male population
would have a mean blood pressure at least as far from 128 mm Hg as that of the
executive sample.
Step 4 – Make your conclusion
At a significance level of 0.05 we fail to reject H0 and conclude that the data do
not provide enough evidence that the mean blood pressure of executives is
different from 128 mm Hg.
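The four steps of the blood pressure example can be traced in code. This sketch uses the summary figures from the text; the variable names are illustrative. Because it keeps the unrounded z (about -1.092), its p-value comes out near 0.275 rather than the table-based 0.2758, which does not change the conclusion.

```python
import math
from statistics import NormalDist

x_bar = 126.07   # sample mean systolic blood pressure
mu_0 = 128       # mean under H0
sigma = 15       # population standard deviation, assumed known
n = 72
alpha = 0.05

z = (x_bar - mu_0) / (sigma / math.sqrt(n))       # step 2: test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # step 3: two-sided p-value
reject_h0 = p_value <= alpha                      # step 4: decision
print(round(z, 2), round(p_value, 3), reject_h0)
```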
The t-Distribution
Both confidence intervals and tests of significance for the mean μ of a normal
population are based on the sample mean $\bar{x}$.
The sampling distribution of $\bar{x}$ depends on σ.
σ is either known or it is estimated using the sample standard deviation s.
Setting: SRS of size n from a normally distributed population with mean μ and
standard deviation σ. This is based on the results of the CLT: $\bar{x} \sim N(\mu, \sigma^2/n)$
Test Statistics
The standardized sample mean, or the one-sample z statistic, when σ is known:
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
When we substitute the sample standard deviation s for σ we get:
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t(n - 1)$$
One-sample t-test
An SRS is drawn from a population having unknown mean μ. Test the
hypothesis H0: μ = μ0 using
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
The random variable T has a t(n-1) distribution, so the p-value for the test is:
If Ha: μ > μ0, p-value = P(T ≥ t)
If Ha: μ < μ0, p-value = P(T ≤ t)
If Ha: μ ≠ μ0, p-value = 2*P(T ≥ |t|)
These P-values are exact if the population distribution is normal and are approximately
correct for large n in other cases.
Rejection Region
From the table for the t-distribution with (n-1) degrees of freedom and the
choice of α, the rejection region is determined by:
For a one-sided test (α = 0.05) use the column corresponding to an upper tail area of 0.05.
H0 is rejected if:
t ≤ -(tabulated value) for Ha: μ < μ0
t ≥ tabulated value for Ha: μ > μ0
For a two-sided test use the column corresponding to an upper tail of 0.025.
H0 is rejected if:
t ≤ -(tabulated value) OR t ≥ tabulated value
Example for 1-sample t-test
The following data are amounts of vitamin C (mg/100g) for a random sample
of corn soy blend (the population is normally distributed).
26 31 23 22 11 22 14 31
The specifications are designed to produce a mean vitamin C content of 40 mg/100 g.
Test the null hypothesis that the mean vitamin C content of the production run from
which we got our sample conforms to these specifications.
We are told that $\bar{x} = 22.5$ and $s = 7.19$.
1. H0: μ = 40 mg/100g
Ha: μ ≠ 40 mg/100g
2. Calculate the test statistic:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{22.5 - 40}{7.19/\sqrt{8}} = -6.88$$
3. This test statistic has the t(7) distribution. We need to calculate the p-value:
2*P(T ≥ 6.88) = 0.00024 (in Excel: =2*TDIST(6.88,7,1) or =TDIST(6.88,7,2))
4. Therefore we reject the null hypothesis at an α-level of 0.05 and
conclude that the vitamin C content for this run does not meet the specifications.
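The summary statistics and the test statistic for the vitamin C data can be recomputed from the eight raw values. This sketch uses only the Python standard library, which has no t-distribution, so instead of a p-value it applies the rejection-region approach with the tabulated two-sided 0.05 critical value for 7 degrees of freedom (2.365). Variable names are illustrative.

```python
import math
from statistics import mean, stdev

vitamin_c = [26, 31, 23, 22, 11, 22, 14, 31]   # mg/100g, data from the text
mu_0 = 40                                      # specified mean content

n = len(vitamin_c)
x_bar = mean(vitamin_c)
s = stdev(vitamin_c)                           # sample SD (divisor n - 1)
t = (x_bar - mu_0) / (s / math.sqrt(n))

T_CRITICAL = 2.365   # tabulated t(7) value for an upper tail area of 0.025
reject_h0 = abs(t) >= T_CRITICAL
print(x_bar, round(s, 2), round(t, 2), reject_h0)
```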
Matched Pairs t Procedures
One common comparative design is the matched-pairs study, where subjects are
matched in pairs and are compared within each pair.
One example of matched pairs is before-and-after observations on the same subjects.
In this setting the variable that we are measuring on the subjects is a continuous
measure. We looked at the design where the outcome was dichotomous.
With a large sample size, and assuming that the null hypothesis of no difference is true, the
mean $\bar{d}$ of these differences is distributed as normal with mean and variance given
by:
$$\mu_{\bar{d}} = 0, \qquad \sigma_{\bar{d}}^2 = \frac{\sigma_d^2}{n}$$
Since we do not know the variance $\sigma_d^2$, it has to be estimated from our data by the
sample variance $s_d^2$. Therefore our test statistic becomes:
$$t = \frac{\bar{d} - 0}{s_d/\sqrt{n}} \sim t(n - 1)$$
Example: 20 teachers were tested for their understanding of French before and after a 4
week immersion program. We want to test if the program improved the teachers'
comprehension of spoken French.
We are told the following: The average difference in scores is 2.5.
The sample standard deviation of the differences in the scores is 2.893.
1. H0: μd = 0
Ha: μd > 0
2. Calculate the test statistic:
$$t = \frac{\bar{d} - 0}{s_d/\sqrt{n}} = \frac{2.5 - 0}{2.893/\sqrt{20}} = 3.86$$
3. This test statistic has the t(19) distribution. We need to calculate the p-value:
P(T ≥ 3.86) = 0.00053 (in Excel: =TDIST(3.86,19,1))
4. Therefore we reject the null hypothesis at an α-level of 0.05 and
conclude that there is strong evidence that the program improved
comprehension.
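The matched-pairs statistic for the French immersion example can be recomputed from the two summary figures given in the text. As the Python standard library has no t-distribution, this sketch compares t with the tabulated one-sided 0.05 critical value for 19 degrees of freedom (1.729) instead of computing a p-value. Variable names are illustrative.

```python
import math

d_bar = 2.5      # mean of the after-minus-before differences
s_d = 2.893      # sample standard deviation of the differences
n = 20           # number of teachers (pairs)

t = (d_bar - 0) / (s_d / math.sqrt(n))

T_CRITICAL = 1.729   # tabulated t(19) value for an upper tail area of 0.05
reject_h0 = t >= T_CRITICAL          # one-sided alternative: mean difference > 0
print(round(t, 2), reject_h0)
```

Working on the within-pair differences reduces the problem to a one-sample t-test, which is why only the mean and SD of the differences are needed.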
Two sample Problem
Population    Variable    Mean    Standard deviation
1             X1          µ1      σ1
2             X2          µ2      σ2
We have two independent samples from 2 distinct populations and the same continuous
variable is measured for both samples.
Population    Sample size    Sample mean    Sample standard deviation
1             n1             x̄1             s1
2             n2             x̄2             s2
Inference is based on 2 independent SRSs, one from each population.
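Two-sample t-test
Hypotheses for comparing two population means:
One-tailed test:
H0: (μ1 − μ2) = D0
Ha: (μ1 − μ2) < D0
OR
H0: (μ1 − μ2) = D0
Ha: (μ1 − μ2) > D0
Two-tailed test:
H0: (μ1 − μ2) = D0
Ha: (μ1 − μ2) ≠ D0
Suppose that $\bar{x}_1$ is the mean of an SRS of size n1 drawn from an $N(\mu_1, \sigma_1^2)$
population and that $\bar{x}_2$ is the mean of an SRS of size n2 drawn from an $N(\mu_2, \sigma_2^2)$
population. Then the 2-sample z statistic
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
has the N(0,1) sampling distribution.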
T procedures
If the population standard deviations are not known we estimate them by the sample
standard deviations from our two samples.
To simplify the test we will assume that the two normal population distributions have the
same standard deviations so we use a pooled standard error in the test statistic.
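$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
and
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \quad \text{with the } t(n_1 + n_2 - 2) \text{ distribution}$$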
Rejection Region
From the table for the t-distribution with (n1 + n2 - 2) degrees of freedom and the choice of
α, the rejection region is determined by:
For a one-sided test (α = 0.05) use the column corresponding to an upper tail area of 0.05 and H0
is rejected if:
t ≤ -(tabulated value) for Ha: μ1 < μ2
t ≥ tabulated value for Ha: μ1 > μ2
For a two-sided test use the column corresponding to an upper tail of 0.025 and H0 is
rejected if:
t ≤ -(tabulated value) OR t ≥ tabulated value
Example (sample means and standard deviations as used in the calculation below):
               Sample 1    Sample 2
Sample size    17          12
Sample mean    5.4         7.9
Sample SD      3.4         4.8
Test the null hypothesis that the population means are equal vs. the alternative that
they are not equal.
1. H0: μ1 - μ2 = 0
Ha: μ1 - μ2 ≠ 0
2. Calculate the test statistic:
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(17 - 1)(3.4^2) + (12 - 1)(4.8^2)}{17 + 12 - 2} = 16.24$$
and
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{(5.4 - 7.9) - 0}{\sqrt{16.24\left(\dfrac{1}{17} + \dfrac{1}{12}\right)}} = -1.645$$
3. This test statistic follows the t-distribution with 27 degrees of freedom.
P-value = 2*P(T > 1.645) = 0.112 (in Excel: =2*TDIST(1.645,27,1) or =TDIST(1.645,27,2))
The critical value for the rejection region is 2.052.
4. Therefore we fail to reject the null hypothesis at an α-level of 0.05 and
conclude that there is not enough evidence that the two population means differ.
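Example: Independent random samples from approximately normal populations produced the
results shown below. Do the data provide sufficient evidence to conclude that
(μ2 - μ1) > 10?
n1 = 15, x̄1 = 43.6, s1 = 5.47; n2 = 16, x̄2 = 53.63, s2 = 5.41
1. H0: μ2 - μ1 ≤ 10 and Ha: μ2 - μ1 > 10
2. Calculate the test statistic:
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(15 - 1)(5.47^2) + (16 - 1)(5.41^2)}{15 + 16 - 2} = 29.58$$
and
$$t = \frac{(\bar{x}_2 - \bar{x}_1) - D_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{(53.63 - 43.6) - 10}{\sqrt{29.58\left(\dfrac{1}{15} + \dfrac{1}{16}\right)}} = 0.015$$
3. With 15 + 16 - 2 = 29 degrees of freedom, the tabulated one-sided 0.05 critical value is 1.699.
Since t = 0.015 is far below it, we fail to reject H0: the data do not provide sufficient
evidence that (μ2 - μ1) > 10.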
Comparing Means
For the cholesterol example the mean reduction in each treatment group was:
0.2 1.5 0.8
Is the observed difference the result of chance variation?
We would not expect the sample means to be equal even if the population means
are identical.
To answer this we need to know the variation within the groups under
observation and the sizes of the samples.
To assess the equality of several population means we compare the variation
among the means of several groups with the variation within groups.
This method is called Analysis of Variance.
The null and alternative hypotheses for the one-way ANOVA are:
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$
$H_a$: not all of the population means are equal
Contingency table
It is a table that cross-classifies the observations from two variables. Each cell in the
table contains the counts of the combinations of the two variables.
Setting: Let X1 and X2 denote categorical variables, X1 having I levels and X2 having
J levels. There are IJ possible combinations of classifications.
When the cells contain frequencies of outcomes, the table is called a contingency table.
Paired-Matched Studies
The distinguishing characteristic of paired samples for counts is that each observation
in the first group has a corresponding observation in the second group. For this type of
data we use McNemar’s Test to evaluate hypotheses about the data.
For paired matched data with a single binary response the data can be represented by
a 2x2 table where (+,-) denote the exposed and non-exposed outcome.
                Control
                +     -
Case     +      a     b
         -      c     d
McNemar's Test
The test uses only the discordant pairs b and c. Decisions based on the standardized z-score,
$z = (b - c)/\sqrt{b + c}$, are for a one-sided alternative.
In the two-sided form, the square of the z-statistic is denoted by:
$$\chi^2 = \frac{(b - c)^2}{b + c}$$
and the test is known as McNemar's chi-square.
If the test is one-sided, z is used and the null hypothesis is rejected at the 0.05 level
when z > 1.65.
If the test is two-sided, $\chi^2$ is used and the null hypothesis is rejected at the 0.05 level
when $\chi^2 > 3.84$.
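A minimal sketch of McNemar's statistics, assuming the usual form z = (b - c)/sqrt(b + c) so that its square is the chi-square statistic above. The counts b = 25 and c = 10 are hypothetical and chosen only for illustration; they are not data from the text.

```python
import math

def mcnemar(b, c):
    """Return (z, chi2) computed from the discordant-pair counts b and c."""
    z = (b - c) / math.sqrt(b + c)
    chi2 = (b - c) ** 2 / (b + c)
    return z, chi2

# Hypothetical discordant counts (not from the text).
b, c = 25, 10
z, chi2 = mcnemar(b, c)

reject_one_sided = z > 1.65     # one-sided test at the 0.05 level
reject_two_sided = chi2 > 3.84  # two-sided test at the 0.05 level
print(f"z = {z:.3f}, chi2 = {chi2:.3f}")
print(f"one-sided reject: {reject_one_sided}, two-sided reject: {reject_two_sided}")
```

The concordant cells a and d do not enter the statistic, which is why the function takes only b and c.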
$\chi^2$ Distribution
The probabilities associated with the chi-squared distribution are in the chi-squared table.
The table is set up in the same way as the table for the t-distribution.
The chi-squared distribution with 1 df is the distribution of the square of an N(0,1)
variable.
Since the distribution takes only non-negative values, the rejection region lies entirely in the
right tail.
For a significance level of 0.05 and df=1 the rejection region for a 2-sided test is:
$\chi^2$ test statistic > 3.84.
Example of Paired-Matched Study
A study in Maryland identified 2408 white persons enumerated in an unofficial
1963 census who became widowed between 1963 and 1974. These people
were matched, one-to-one, to married persons on the basis of race, gender,
year of birth, and geography of residence. The matched pairs were followed to a
second census in 1975 and vital status was obtained.
Independent Studies
The null hypothesis of the Chi-square test is that there is no association
between the row and column variables
The alternative hypothesis of the Chi-square test is that an association exists
between these two variables. It is always a two-sided hypothesis.
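If the null hypothesis is true then the expected count in each cell is
$$e_{ij} = \frac{\text{row total} \times \text{column total}}{N}$$
where N is the total sample size.
The Chi-square test statistic is the sum, over all cells, of the squared difference between the
observed count (O, written $x_{ij}$) and the expected count (E, written $e_{ij}$) divided by the expected count:
$$\chi^2 = \sum_{i}\sum_{j} \frac{(x_{ij} - e_{ij})^2}{e_{ij}}$$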
Observed counts, with expected counts under the null hypothesis in parentheses:
                      Manufacturer
Size     A          B          C          D          TOTALS
Small    157        65         181        10         413
         (140.833)  (79.296)   (158.179)  (34.692)
Large    58         45         60         28         191
         (65.131)   (36.672)   (73.153)   (16.044)
Large values of the test statistic imply that the observed counts are not close to the
expected counts under the null hypothesis and therefore provide evidence against the null
hypothesis.
For this example the appropriate degrees of freedom (df) are: (3-1)*(4-1)=6
We can find the critical value for the rejection region:
o For $\alpha$=0.05 the critical value is 12.592
o For $\alpha$=0.01 the critical value is 16.812
We therefore reject the null hypothesis of no association at the 0.05 significance
level since the test statistic 45.81>12.592.
We conclude that there is an association between the size of car and the car
manufacturer. In other words, the size and manufacturer of a car selected by a
purchaser are not independent.
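The chi-square calculation can be sketched in Python as below. One caution: the text prints only the Small and Large rows of the table. The middle row used here is an assumption, inferred by supposing a total sample of N = 1000, which reproduces the printed expected counts and the statistic 45.81. The function name is also made up for illustration.

```python
def chi_square_independence(table):
    """Chi-square test of independence for a table of observed counts.

    Returns (chi2 statistic, degrees of freedom, expected counts).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count = row total * column total / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected

observed = [
    [157, 65, 181, 10],   # Small (from the text)
    [126, 82, 142, 46],   # middle size category: ASSUMED, inferred, not printed in the text
    [58, 45, 60, 28],     # Large (from the text)
]

chi2, df, expected = chi_square_independence(observed)
CRIT_05 = 12.592  # critical value for df = 6, alpha = 0.05 (from the text)
print(f"chi2 = {chi2:.2f}, df = {df}, reject H0: {chi2 > CRIT_05}")
```

If the assumed middle row is wrong, the statistic will change, so the output should be read as a consistency check rather than as the source's own calculation.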
Regression Analysis
Previously we were interested in testing population parameters.
If the data were binary or categorical we discussed the comparison of population
proportions.
If the data were continuous we discussed the comparison of population means.
In other studies the goal is to assess the relationships among a set of variables, for example the
relationship between a mother's weight and her newborn's weight.
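Birth weight data (x = birth weight, y = % increase in weight):
x (oz)   y (%)
112      63
111      66
107      72
119      52
92       75
80       118
81       120
84       114
118      42
106      72
103      90
94       91
The sample correlation coefficient is
$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$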
$H_0: \rho = 0$
$H_a: \rho \neq 0$
$$t = r\sqrt{\frac{n - 2}{1 - r^2}} = (-0.946)\sqrt{\frac{12 - 2}{1 - (-0.946)^2}} = -9.23$$
At $\alpha$ = 0.05 and df = 10, the critical value for the rejection region is -2.228.
The test statistic t = -9.23 is less than -2.228, so we would reject the null hypothesis and
conclude that there is a linear association between birth weight and % increase
in weight. The birth weight (x) accounts for $r^2 = (-0.946)^2 = 0.895$, or 89.5%,
of the variability in % growth rates (y).
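As a check on the arithmetic, the correlation coefficient and its t statistic can be computed directly from the twelve birth weight pairs. This is a sketch using only the standard library; the variable names are chosen for illustration.

```python
import math

# Birth weight (oz) and % increase in weight, from the table in the text.
x = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]
y = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)
# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t = r * math.sqrt((n - 2) / (1 - r ** 2))

print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}, t = {t:.2f}")
```

Because the code carries r at full precision instead of rounding it to -0.946 first, its t value can differ from the hand calculation in the second decimal place; the conclusion is unchanged.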
What is Regression?
Like correlation analysis, simple linear regression is a technique that is used to explore
the nature of a relationship between two continuous random variables.
Regression analysis allows us to investigate the change in one variable which
corresponds to a given change in the other.
Instead of just quantifying the strength of the relationship between the 2 variables we
can predict the value of one variable given a value for the other.
Assumptions for Linear Regression
1. For a specified value of x, the distribution of the y values is normal with mean
$\mu_{y|x}$ and standard deviation $\sigma_{y|x}$.
2. The relationship between $\mu_{y|x}$ and x is described by the straight line $\mu_{y|x} = \beta_0 + \beta_1 x$.
3. For any specified value of x, $\sigma_{y|x}$ does not change.
4. The outcomes of y are independent.
Scatter Plot
If you plot the mean of Y vs. X, the graph is a straight line.
The observed values of Y may be greater or less than its mean. Therefore the plot of
the observed values will not fall perfectly on the line.
A scatter diagram consists of a single point for each (x,y) pair of numbers.
Regression Coefficients
$\beta_0$ is the intercept of the regression line. It does not have any particular meaning as
a separate term in the regression model.
$\beta_1$ is the slope of the regression line. It represents the increase (or decrease if it is
negative) in the mean of Y associated with a 1 unit increase in X.
For an m-unit increase in the value of X, the corresponding increase (or decrease) in the
mean of Y is $m \beta_1$.
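The least-squares estimates are
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$
and
$$\hat{Y} = b_0 + b_1 X$$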
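The computational form of the slope formula can be sketched as follows. The text does not fit a regression line to any particular data set here, so as an illustration this example reuses the birth weight data from the correlation example; the function name `least_squares` is hypothetical.

```python
def least_squares(x, y):
    """Return (b0, b1) for the least-squares line Y-hat = b0 + b1 * X."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Computational form: b1 = (sum xy - (sum x)(sum y)/n) / (sum x^2 - (sum x)^2/n)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

# Birth weight (oz) and % increase in weight, reused for illustration.
x = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]
y = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]

b0, b1 = least_squares(x, y)
predicted_at_100 = b0 + b1 * 100  # predicted mean % increase for a 100 oz birth weight
print(f"b0 = {b0:.2f}, b1 = {b1:.3f}, prediction at x = 100: {predicted_at_100:.1f}")
```

The negative slope agrees in sign with the negative correlation found for the same data.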
Normal Distribution
Discrete Probability Distributions
Binomial distribution – the random variable can only assume 1 of 2 possible
outcomes. There are a fixed number of trials and the results of the trials are
independent.
Discrete Random Variable
A discrete random variable X has a finite number of possible values. The probability
distribution of X lists the values and their probabilities.
1. Every probability pi is a number between 0 and 1.
2. The sum of the probabilities must be 1.
Find the probabilities of any event by adding the probabilities of the particular
values that make up the event.
Example: The instructor in a large class gives 15% each of A's and D's, 30%
each of B's and C's and 10% F's. The student's grade on a 4-point scale is a
random variable X (A=4).
What is the probability that a student selected at random will have a B or better?
ANSWER: P(grade of 3 or 4) = P(X=3) + P(X=4) = 0.3 + 0.15 = 0.45
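The two rules for a discrete distribution and the event calculation can be illustrated with a short sketch. The mapping B=3, C=2, D=1, F=0 is the usual 4-point scale and is assumed here; the text states only A=4.

```python
# Probability distribution of the grade X on a 4-point scale
# (A=4 from the text; B=3, C=2, D=1, F=0 assumed).
grade_dist = {4: 0.15, 3: 0.30, 2: 0.30, 1: 0.15, 0: 0.10}

# Rule 1: every probability lies between 0 and 1.
all_valid = all(0 <= p <= 1 for p in grade_dist.values())
# Rule 2: the probabilities sum to 1 (allowing for floating-point rounding).
total = sum(grade_dist.values())

# P(B or better) = P(X = 3) + P(X = 4)
p_b_or_better = sum(p for grade, p in grade_dist.items() if grade >= 3)

print(f"valid: {all_valid}, total: {total:.2f}, P(X >= 3) = {p_b_or_better:.2f}")
```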
The probability density is a smooth idealized curve that shows the shape of the
distribution in the population
Areas in an interval under the curve represent the percent of the population in the
interval
Normal Distribution
You can tell which normal distribution you have by knowing the mean and standard
deviation.
The mean is the center and the standard deviation measures the spread (variability).
The most common continuous distribution is the normal distribution, the bell-shaped
curve.
The normal curve is unimodal and symmetric about its mean ($\mu$).
In this distribution the mean, median and mode are all identical.
The standard deviation ($\sigma$) specifies the amount of dispersion around the mean.
The two parameters $\mu$ and $\sigma$ completely define a normal curve.
When applied to 'real data', these estimates are considered approximate!
Distributions of Blood Pressure
Standard Normal Variable
It is customary to call a standard normal random variable Z.
The outcomes of the random variable Z are denoted by z.
The table in the coming slide gives the area under the curve (probabilities)
between the mean and z.
The probabilities in the table refer to the likelihood that a randomly selected
value Z is equal to or less than a given value of z and greater than 0 (the mean
of the standard normal distribution).
Calculating Probabilities
Probability calculations are always concerned with finding the probability that the
variable assumes any value in an interval between two specific points a and b.
The probability that a continuous variable assumes a value between a and b
is the area under the graph of the density between a and b.
$$Z = \frac{97 - 125}{14} = -2.0$$
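Standardization
$$F(x) = P(X \le x) = P\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = P\left(Z \le \frac{x - \mu}{\sigma}\right) = P(Z \le z)$$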
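Standardization can be sketched numerically with the error function from Python's `math` module, which gives the standard normal cumulative probability without a table. The values x = 97, mean 125 and standard deviation 14 are the ones used in the Z calculation above; the function name `normal_cdf` is chosen for illustration.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), computed by standardizing to Z."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z = (97 - 125) / 14                  # standardized value
p = normal_cdf(97, mu=125, sigma=14)  # P(X <= 97) = P(Z <= z)

print(f"z = {z:.1f}, P(X <= 97) = {p:.4f}")
```

An interval probability P(a < X <= b) follows as `normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)`.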
T-Distribution
Similar to the standard normal in that it is unimodal, bell-shaped and
symmetric.
The tails of the distribution are "thicker" than those of the standard normal.
The distribution is indexed by "degrees of freedom" (df).
The degrees of freedom measure the amount of information available in the
data set that can be used for estimating the population variance (df=n-1).
Area under the curve still equals 1.
Probabilities for the t-distribution with infinite df equal those of the standard
normal.
The table of t-distribution will give you the probability to the right of a critical
value – i.e. area in the upper tail.
We are only given the area (or probability) for a few selected critical values for
each degree of freedom.
T-Distribution Example
For a t-curve from a sample of size 15 find the area to the left of 2.145.
Answer: df=15-1=14
In the table of the t-distribution, the area to the right of 2.145 is 0.025.
Therefore the area to the left of 2.145 is:
1-0.025=0.975
Therefore the observed value of the sample proportion can be converted into a
Z-statistic, which is the number of standard errors away from the hypothesized
value $\pi_0$.
From the table for the standard normal distribution and the choice of $\alpha$ (e.g.
$\alpha$=0.05) the rejection region is determined by:
o For a one-sided test: $z \le -1.65$ for $H_a: \pi < \pi_0$
$z \ge 1.65$ for $H_a: \pi > \pi_0$
o For a two-sided test, $H_a: \pi \neq \pi_0$:
o $z \le -1.96$ OR $z \ge 1.96$
From the table for the standard normal distribution and the choice of $\alpha$ (e.g.
$\alpha$=0.01) the rejection region is determined by:
o For a one-sided test: $z \le -2.33$ for $H_a: \pi < \pi_0$
$z \ge 2.33$ for $H_a: \pi > \pi_0$
o For a two-sided test, $H_a: \pi \neq \pi_0$:
o $z \le -2.58$ OR $z \ge 2.58$
One-sample problem
A group of investigators wish to explore the relationship between the use of hair
dyes and the development of breast cancer in women. A sample of n=1000
female beauticians 40-49 years of age is identified and followed for 5 years.
After 5 years there are 20 new cases of breast cancer. It is known that breast
cancer incidence over this time period for average American women in this age
group is 0.007. Does hair dye increase the risk of breast cancer?
Step 1: State your hypotheses
o $H_0: \pi = \pi_0 = 0.007$
o $H_a: \pi > 0.007$
Step 2: Calculate your test statistic
$$z = \frac{\hat{p} - \pi_0}{\sqrt{\dfrac{\pi_0 (1 - \pi_0)}{n}}} = \frac{0.02 - 0.007}{\sqrt{\dfrac{0.007(1 - 0.007)}{1000}}} = 4.93$$
Step 3: Calculate the p-value
The p-value is the probability that a standard normal variable Z takes a value at least 4.93
standard deviations above zero: P-value = $P(Z \ge 4.93)$ < 0.001
Assuming a significance level of 0.05, the rejection region for this one-sided test would be z>1.65.
Since 4.93>1.65 the test statistic is in the rejection region.
Step 4: Make your conclusion
At a significance level of 0.05 we would reject H0 and conclude that there is enough
evidence that the incidence rate of breast cancer for beauticians is greater
than the incidence rate for the average American woman in the age group 40-49 years.
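The one-sample calculation can be reproduced with a short sketch. The function names are hypothetical; the upper-tail probability is obtained from `math.erfc` instead of a printed table.

```python
import math

def one_sample_prop_z(x, n, pi0):
    """z statistic for H0: pi = pi0, given x events in n subjects."""
    p_hat = x / n
    se = math.sqrt(pi0 * (1 - pi0) / n)  # standard error under H0
    return (p_hat - pi0) / se

def upper_tail(z):
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# 20 new cases among 1000 beauticians; reference incidence 0.007 (from the text).
z = one_sample_prop_z(20, 1000, 0.007)
p_value = upper_tail(z)

print(f"z = {z:.2f}, one-sided p-value = {p_value:.2e}, reject H0: {z > 1.65}")
```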
Comparison of Two Proportions
Setting: We have 2 independent samples of binary data (n1,x1) and (n2,x2)
the n's are adequately large and are not necessarily equal
the x's are the numbers of "positive" outcomes in the two
samples
Therefore to test the null hypothesis that the two population proportions are the
same we need to develop a test statistic that normalizes the difference between the two
sample proportions.
accident was assessed. Is there evidence of an association between the type of
drug and the responsibility?
Drug A Drug B
Child responsible 8 12
Child not responsible 31 19
Step 1: Write out your hypotheses
o $H_0: \pi_1 = \pi_2$
o $H_a: \pi_1 \neq \pi_2$
o $\pi_1$ = population proportion for drug A, $\pi_2$ = population proportion
for drug B
Step 2: Calculate the test statistic
Step 3: Find the rejection region or the p-value.
1. Rejection region: since this is a two-sided test with
$\alpha$=0.05 the rejection region is: z > 1.96 or z < -1.96.
The z-statistic = 1.674, which is not in the rejection region.
Calculate the p-value:
p-value = 2*P(Z > 1.674) = 2*(0.5 - 0.4525) = 0.095
Step 4: Make your conclusion
o At a significance level of $\alpha$=0.05 we fail to reject the null
hypothesis: there is insufficient evidence of an association between responsibility
and the type of drug used in the overdose.
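The test statistic for this example can be sketched as below, assuming the pooled two-proportion z statistic; that assumption is not spelled out in the text, but it reproduces the reported magnitude 1.674. The function name is hypothetical.

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: pi1 = pi2 (pooled form is an assumption)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # common proportion estimated under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Drug A: 8 responsible out of 8 + 31 = 39; drug B: 12 out of 12 + 19 = 31.
z = two_prop_z(8, 39, 12, 31)
# Two-sided p-value: 2 * P(Z > |z|)
p_value = math.erfc(abs(z) / math.sqrt(2.0))

print(f"z = {z:.3f}, |z| = {abs(z):.3f}, two-sided p-value = {p_value:.3f}")
print(f"reject H0 at 0.05: {abs(z) > 1.96}")
```

The sign of z depends only on which drug is labelled group 1; the two-sided test uses its magnitude. The computed p-value is slightly below the 0.095 in the text, which comes from reading the table at z = 1.67.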