
Chapter 102 Biostatistics

This document provides an overview of biostatistics. It defines statistics as the science dealing with numerical data and defines biostatistics as statistics applied to biological problems. Some key points: - Biostatisticians identify risk factors and treatments for diseases, design and analyze clinical studies, and develop statistical methods for medical data. - There are different types of data (qualitative like gender vs. quantitative like age) and variables (constant, random, population and sample). Data can also be nominal, ordinal, discrete or continuous. - Descriptive statistics summarize data through tables, graphs and numeric measures like mean, median and mode. Inferential statistics make conclusions about populations from samples using methods like hypothesis testing and estimation

Uploaded by

Yassir Ounsa
Copyright
© © All Rights Reserved

Chapter two: Biostatistics

Introduction to Biostatistics
What is Statistics?
- (singular): the science that deals with the collection, classification, analysis, and
interpretation of numerical facts or data, and that, by use of mathematical
theories of probability, imposes order and regularity on aggregates of more or
less disparate elements.
- This includes design issues as well.
- (plural): the numerical facts or data themselves. (Webster's Dictionary)
Statistics is about common sense and good design! (Campbell and Machin 1983)

Biostatistics:
Statistics applied to biological (life) problems, including medicine, public health,
and ecological and environmental problems.

Why should I study Statistics?


• A tool for research.
• Makes it easier to communicate with statisticians and biostatisticians.
• Helps in understanding the medical literature (improves literature-appraisal skills).

What Do Biostatisticians Do?


 Identify and develop treatments for disease and estimate their effects.
 Identify risk factors for diseases.
 Design, monitor, analyze, interpret, and report results of clinical studies.
 Develop statistical methodologies to address questions arising from
medical/public health data.

Statistical Analysis:
 Descriptive Statistics:
o Describe the sample.
 Inference:
o Make inferences about the population.
o Primarily performed in two ways:
 Hypothesis testing.
 Estimation (more important!!).
 Prediction.

How to properly use Biostatistics


 Develop an underlying question of interest.
 Generate a hypothesis.
• Design a study to answer the scientific question, including the sampling process.

Bwasiq 86 45
 Collect Data.
 Analyze Data.
o Descriptive statistics.
o Statistical Inference.

DATA
• The vast majority of errors in research arise from poor planning (e.g., of data
collection).
 Fancy statistical methods cannot rescue garbage data.
 Collect exact values whenever possible.

Types of Data
 …the selection of an appropriate statistical technique is determined by the
research design, hypothesis, and the data collected
 Definitions
 Variable: Characteristic or attribute that can assume different values.
 Random Variable: A variable whose values are determined by
chance.
 Population: All subjects possessing a common characteristic that is
being studied.
 Sample: A subgroup or subset of the population.

Types of data:
1- Constant 2- Variable

Types of variable:
1- Qualitative (nominal and ordinal)
2- Quantitative (continuous and discrete)
1. Qualitative data
A. Nominal data
✓ It can be classified into two or more categories, e.g., blood type, race.
✓ No meaningful order.
B. Ordinal data
✓ It can be classified into categories that have a natural ordering.
✓ E.g., very satisfied, satisfied, neutral, unsatisfied, very unsatisfied.
2. Quantitative data
A. Discrete data
✓ Each value lies at one of a few isolated points on the number scale.
✓ Countable variables, in integer form.
✓ Numbers of things, e.g., number of men, age in completed years.


B. Continuous data
✓ Each value can theoretically lie anywhere on the number scale.
✓ Measurable variables, e.g., blood pressure, weight in kg, time in hours or years.
✓ Recorded values are usually rounded, e.g., to the nearest integer.

Descriptive statistics - Examples


• In a cohort study of risk factors for coronary heart disease, the level of
cholesterol in blood was measured; we can report the mean and standard deviation.
• Blood cholesterol level: $\bar{x} \pm s = 268 \pm 49$
• In a cross-sectional study of a group of children, the gender of each subject was
recorded and the frequency and percentage of males and females were calculated;
these can be shown in graphical form.

Inferential statistics
• It uses probability theory to draw conclusions about a population from data
obtained in a sample.
• It is very hard to study a whole population; because of this, we study samples.
• Methods for estimation and hypothesis testing are important for drawing
inferences.

Inferential statistics - Examples


• In a national survey on the dangers of smoking, we cannot interview the whole
population; we can only interview a sample of it.
• To measure the prevalence of amebiasis in a population, we study a random
sample. From the prevalence in the sample, we can estimate the
prevalence of amebiasis in the population.
 Role of statisticians
 To guide the design of an experiment or survey prior to data collection.
 To analyze data using proper statistical procedures and techniques.
 To present and interpret the results to researchers and other decision makers.

Descriptive Statistics
 Use of numerical information to summarize, simplify, and present masses of
data.
 Organized and summarized for clearer presentation
 For ease of communications

 Data may come from studies of populations (often called a census study) or
samples
- A numerical summary of a population is called a parameter
- A numerical summary of a sample is called a statistic
- Often, either is loosely called a statistic

Descriptive Methods for Continuous Data
 Statistical procedures used to summarise, organise, and simplify data. This
process should be carried out in such a way that reflects overall findings
 Raw data is made more manageable
 Raw data is presented in a logical form
 Patterns can be seen from organised data
- Tables
- Graphical techniques
- Measures of Central Tendency
- Measures of Dispersion (variability).
- Coefficient of Correlation.
Describing data with tables
1) frequency table
2) relative and cumulative frequency
3) grouped frequency
4) open-ended groups
5) cross-tabulation
6) tables that are not contingency tables

1) Frequency table: for ordinal and discrete metric data

Mortality %     Tally                   Frequency (no. of ICUs)
11.2-15.1       1,1,1,1,1,1,1,1,1       9
15.2-20.1       1,1,1,1,1,1,1           7

2) Relative frequency, cumulative frequency

Relative frequency: the frequency expressed as a percentage of the total.
Cumulative frequency: the running total of the frequencies.

Score   Frequency   Cumulative frequency   Cumulative percentage
1       2           2                      14.3%
2       5           7                      50.0%
3       4           11                     78.6%
4       2           13                     92.9%
5       1           14                     100%
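
As a minimal sketch of how such a table can be built, the Python below recomputes the cumulative frequencies and percentages from the score frequencies listed above. The list `scores` and the function name `frequency_table` are illustrative assumptions, not part of the original notes.

```python
from collections import Counter

# Raw scores rebuilt from the frequency column above (2, 5, 4, 2, 1).
scores = [1] * 2 + [2] * 5 + [3] * 4 + [4] * 2 + [5] * 1


def frequency_table(values):
    """Return rows of (value, frequency, relative %, cumulative frequency, cumulative %)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    running = 0
    for value in sorted(counts):
        freq = counts[value]
        running += freq  # running total = cumulative frequency
        rows.append((value, freq, 100 * freq / total, running, 100 * running / total))
    return rows


table = frequency_table(scores)
for value, freq, rel_pct, cum, cum_pct in table:
    print(value, freq, f"{rel_pct:.1f}%", cum, f"{cum_pct:.1f}%")
```

The cumulative percentage is simply the cumulative frequency divided by the total number of observations (14 here).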

3) Grouped frequency: for continuous data.

Birth weight (g)    No. of infants
2700-2999           2
3000-3299           3

4) Open-ended group:
Used when one or two values, called outliers, are a long way from the general mass of
the data. The first or last group is then defined with ≤ or ≥.

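5) Cross-tabulation

Breast lump diagnosis by whether the woman has 2 or fewer children.
Each cell shows: count (column %) (row %).

Diagnosis     2 or fewer children: Yes   2 or fewer children: No   Total
Benign        21 (84%) (66%)             11 (73%) (34%)            32 (100%)
Malignant     4 (16%) (50%)              4 (27%) (50%)             8 (100%)
Totals        25 (100%)                  15 (100%)                 40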

6) Tables that are not contingency tables:

A table is not a contingency table when two quite separate groups of individuals are involved.

Describing data with charts


1. The pie chart
For 4-5 categories of one variable; start at 0° and keep the same order as the table.
2. The simple bar chart
Bars of the same width, with equal spaces between bars.
3. The clustered bar chart
4. The stacked bar chart
5. The dot plot
This is particularly useful with ordinal variables if the number of categories is too
large for a bar chart.

Describing data from its distributional shape:


1. Symmetric mound-shaped distributions
2. Skewed distributions

Describing data with numeric summary value


1. Numbers, percentages and proportions.
2. Summary measures of location.
3. Summary measures of dispersion.

1. numbers, percentages and proportions
 Numbers-the numerical summaries of data
 A percentage is a proportion multiplied by 100. (categorical data)
 1) Prevalence: number of existing cases in some population at a given time.
2) Incidence (inception): the number of new cases occurring per 100, or
per 1000, of the population, during some period of time.

Measures of central tendency


• Also called measures of location; each gives one number which is representative of
all the data.
• They are the mean, median, and mode.

Sample Mean: "sample average or arithmetic mean"


✓ Called the sample mean to distinguish it from the population mean.
➢ Measures of Location – Mean
For a given data set of size n: {x1, x2, x3, …, xn}

The mean of the x's will be denoted by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Example: How many hours of television do you watch in a week?
{ } in hours, n = 5

$\sum_{i=1}^{n} x_i = 60$ = sum of the data points
leading to: $\bar{x} = \frac{60}{5} = 12$ hours
➢ Summation sign "Σ"
▪ In the formula to find the mean, we use the "summation sign" Σ.
▪ This is just mathematical shorthand for "add up all of the
observations":
$$\sum_{i=1}^{n} X_i = X_1 + X_2 + X_3 + \dots + X_n$$

Geometric Mean: Example

x        ln(x)
8        2.08
5        1.61
4        1.39
12       2.48
15       2.71
7        1.95
28       3.33
Total: 79    15.55

The mean using the raw data is: $\bar{x} = \frac{79}{7} = 11.3$
While on the log scale: $\frac{\sum \ln x}{n} = \frac{15.55}{7} = 2.22$
leading to a geometric mean of 9.22 (the antilog of the mean on the log scale).

Mean from a Positively Skewed Distribution:
 When the data is positively skewed analyses are commonly done on the log scale.
 This is done to minimize the effect of extreme observations.
 Method of obtaining the mean:
 Take the log of each data value
 Calculate the mean on the log scale
 Take the antilog of the mean to return to the original scale of measurement.
✓ This is called the "GEOMETRIC" mean.
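
The three steps above (take logs, average, take the antilog) can be sketched in Python as follows. The data are the seven values from the geometric mean example; the function names are illustrative assumptions.

```python
import math

# Values from the geometric mean example in this chapter.
data = [8, 5, 4, 12, 15, 7, 28]


def arithmetic_mean(values):
    """Sum of the observations divided by their number."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Antilog of the mean of the natural logs; all values must be positive."""
    log_mean = sum(math.log(v) for v in values) / len(values)
    return math.exp(log_mean)


print(round(arithmetic_mean(data), 1))
print(round(geometric_mean(data), 2))
```

The geometric mean comes out below the arithmetic mean because the single large value (28) has less influence on the log scale.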

Advantages                                    Disadvantages
- Simple and easy                             - Affected by extreme values
- Most widely used                            - Sometimes looks ridiculous, e.g. average
- Can be used for further statistical tests     number of children = 2.7
- All values are included
- Does not need arrangement of data

Median
 Value which divides the data into two equal parts after arrangement of data into
ascending or descending order.
✓ If the number of observations in the dataset is odd, the median is the ½(n+1)th
observation.
✓ If the number of observations is even, the median is defined as the average of the
(½n)th and the (½n+1)th observations.
✓ E.g., for {8,5,4,12,15,7,28} the median is 8.
- First put the observations in order: 4,5,7,8,12,15,28
- Find the ½(n+1)th (which is the 4th) observation.
Advantages                              Disadvantages
- Not affected by extreme values        - Needs arrangement of data
- Used for growth curves and income     - Difficult to calculate from large amounts
- Can be determined graphically           of data
                                        - Not all values are represented
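
A minimal Python sketch of the median rule follows. For an even number of observations it averages the (n/2)th and (n/2 + 1)th ordered values. The function name `median` and the four-value data set are illustrative assumptions; the seven-value data set is the example above.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n+1)/2-th observation (1-based), index n//2 (0-based).
        return ordered[mid]
    # Even n: average of the (n/2)-th and (n/2 + 1)-th observations (1-based).
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([8, 5, 4, 12, 15, 7, 28]))  # odd n: the 4th ordered value
print(median([4, 5, 7, 8]))              # even n: average of 5 and 7
```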

Mode
 It is the most common value found in the dataset (fashionable value)
o Hb level of 5 pregnant women
 12, 12.5, 11, 13, 12.5 Mode = 12.5

✓ More than one mode may occur (bimodal, trimodal). Sometimes there is no mode.
Advantages: Not affected by extreme values
Disadvantages: Not all values are represented
✓ Distribution Characteristics
Mode: peak(s). Median: equal-areas point. Mean: balancing point.

4. Measures of Dispersion
✓ Measure the degree of variation or dispersion around the mean.
✓ The measurement of dispersion (or variation) plays an important role in the
methods of statistical inference.
✓ We will discuss the range, the variance, and the standard deviation.

Range:
 Difference between highest and lowest value
 Range = largest value-smallest value
o E.g: Hb level of 5 pregnant women 12, 12.5, 11, 13, 12.5
Range = 13-11 = 2
 Advantages: Easy to calculate
 Disadvantages
o It is affected by extreme values.
o Value of range is only determined by two values.
o The interpretation of the range is difficult.
o It does not provide information about other values and how dispersed
they are.

Variance and Standard Deviation


 Uses deviations from the mean to measure the variation in the dataset.

✓ The variance is obtained by squaring these deviations and dividing their sum by
one less than n: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

✓ The standard deviation is the square root of the variance: $s = \sqrt{s^2}$


➢ Variance Example
x      (xi - x̄)    (xi - x̄)²
8      0           0
5      -3          9
4      -4          16
12     4           16
15     7           49
5      -3          9
7      -1          1
Total              100

$\bar{x} = 8$, $s^2 = \frac{100}{6} = 16.67$, $SD = \sqrt{16.67} = 4.08$
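
As a check on the arithmetic, here is a short Python sketch of the sample variance and standard deviation with the n - 1 divisor, applied to the seven values of the variance example. The function names are illustrative assumptions.

```python
import math

# Values from the variance example (their mean is 8).
data = [8, 5, 4, 12, 15, 5, 7]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


print(round(sample_variance(data), 2))
print(round(sample_sd(data), 2))
```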

➢ Coefficient of Variation
o Measure of spread that is independent of the units of measurement of the
variable.
o We divide the standard deviation by the mean and express this quotient
as a percentage: $CV = \frac{s}{\bar{x}} \times 100\%$
 Coefficient of Correlation
o Measure of linear association between 2 continuous variables.
 Setting:
o two measurements are made for each observation.
o Sample consists of pairs of values and you want to determine the
association between the variables.
➢ Association Examples
o Example 1: Association between a mother's weight and the birth weight of her child.
2 measurements: mother's weight and baby's weight; both are continuous
measures.
o Example 2: Association between a risk factor and a disease.
2 measurements: disease status and risk factor status; both are dichotomous
measurements.
 Correlation Analysis
o When you have 2 continuous measurements you use correlation
analysis to determine the relationship between the variables. Through
correlation analysis you can calculate a number that relates to the
strength of the linear association.
o Scatter Plots and Association
o You can plot the 2 variables in a scatter plot (one of the types of charts
in SPSS/Excel).

The pattern of the "dots" in the plot indicates the statistical relationship between the
variables (its strength and direction).
Positive relationship – pattern goes from lower left to upper right.
Negative relationship – pattern goes from upper left to lower right.
The more the dots cluster around a straight line the stronger the linear relationship.
 Pearson Correlation Coefficient
 Interpretation:
values near 1 indicate strong positive linear relationship
values near –1 indicate strong negative linear relationship
values near 0 indicate a weak linear association

 Interpreting the correlation coefficient should be done cautiously!


 A result of 0 does not mean there is NO relationship …. It means there is
no linear association.
 There may be a perfect non-linear association.
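
To illustrate the caution above, the sketch below computes the Pearson correlation coefficient from its usual definition (sum of cross-products divided by the square root of the product of the sums of squares). The data are made up for illustration: one perfectly linear relationship and one perfect but non-linear (quadratic) relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical illustration data, not from the notes.
xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # perfect positive linear relationship
parabola = [x * x for x in xs]     # perfect non-linear relationship

print(pearson_r(xs, linear))
print(pearson_r(xs, parabola))
```

In the second case y is completely determined by x, yet r is 0, because the relationship is not linear.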

Probability theory
 It is also of interest to investigate how the information contained in a sample can
be used to infer the characteristics of the population from which it was drawn.
 The foundation for statistical inference is the theory of probability.
 Proportion – the relative size of the portion of a population with a certain
characteristic.
 Random selection – a selection where each person has an equal chance of
being selected.
The chance that a randomly selected person has a certain characteristic depends on the
size of the sub-population with that characteristic. The chance is measured by the
proportion, a number between 0 and 1, called the probability.
Proportion measures size (it is a descriptive statistic), but probability measures chance.

Sampling
We can't study the entire population, so we rely on a sample (a subgroup of the
population under investigation). By calculating probabilities we can describe what has
happened and predict what should happen in the future under the same conditions.
Probability and Random Sampling
Suppose that out of N = 100,000 persons a total of 5,500 are positive on a certain
screening test. Then the probability of a randomly selected person from the target
population having a positive test result is 5,500/100,000 = 0.055, or 5.5%.
Rationale: on an initial draw the person may or may not have a positive test. However,
when this process is repeated a large number of times, the relative
frequency of positive people will approximate 0.055.
 Cancer Screening Example
                Test result X
Disease Y       +         -          Total
+               154       225        379
-               362       23,362     23,724
Total           516       23,587     24,103
Each member of the population is characterized by two variables:
1. Test result - X
2. True disease status - Y
Marginal Probabilities
P(X=+) = probability of a positive test = 516/24,103 = 0.021
P(X=-) = probability of a negative test = 23,587/24,103 = 0.979
P(Y=+) = probability of having the disease = 379/24,103 = 0.016
P(Y=-) = probability of not having the disease = 23,724/24,103 = 0.984
Joint Probabilities
P(X=+, Y=+) = probability of a positive test and having the disease = 154/24,103 = 0.006
P(X=+, Y=-) = probability of a positive test and not having the disease = 362/24,103 = 0.015
P(X=-, Y=+) = probability of a negative test and having the disease = 225/24,103 = 0.009
P(X=-, Y=-) = probability of a negative test and not having the disease = 23,362/24,103 = 0.969

Disease Y   Test +                  Test -                  Total
+           P(X=+,Y=+) = 0.006      P(X=-,Y=+) = 0.009      P(Y=+) = 0.016
-           P(X=+,Y=-) = 0.015      P(X=-,Y=-) = 0.969      P(Y=-) = 0.984
Total       P(X=+) = 0.021          P(X=-) = 0.979
Probabilities are rounded to three decimals, so the entries may differ from the margins in the last digit.

Conditional Probabilities
P(X=+ | Y=+) = probability of a positive test given cancer is present = 154/379 = 0.406
(this is the SENSITIVITY of the test)
P(X=- | Y=-) = probability of a negative test given cancer is not present
= 23362/23724 = 0.985 (this is the SPECIFICITY of the test)
P(Y=+ | X=+) = probability cancer is present given a positive test
= 154/516 = 0.298 (this is the POSITIVE PREDICTIVITY of the test)
P(Y=- | X=-) = probability cancer is not present given a negative test
= 23362/23587 = 0.990 (this is the NEGATIVE PREDICTIVITY of the test)
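
The four conditional probabilities can be recomputed directly from the cell counts of the cancer screening table. The sketch below does this in Python; the variable names (`tp`, `fn`, `fp`, `tn`) are illustrative assumptions.

```python
# Cell counts from the cancer screening table.
tp = 154     # disease present, test positive
fn = 225     # disease present, test negative
fp = 362     # disease absent, test positive
tn = 23362   # disease absent, test negative

sensitivity = tp / (tp + fn)              # P(X=+ | Y=+)
specificity = tn / (tn + fp)              # P(X=- | Y=-)
positive_predictivity = tp / (tp + fp)    # P(Y=+ | X=+)
negative_predictivity = tn / (tn + fn)    # P(Y=- | X=-)

for name, value in [
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("positive predictivity", positive_predictivity),
    ("negative predictivity", negative_predictivity),
]:
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity condition on the true disease status (row totals), while the predictive values condition on the test result (column totals).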
 Predictive Values
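$$\text{positive predictivity} = \frac{(\text{prevalence})(\text{sensitivity})}{(\text{prevalence})(\text{sensitivity}) + (1 - \text{prevalence})(1 - \text{specificity})}$$
$$\text{negative predictivity} = \frac{(1 - \text{prevalence})(\text{specificity})}{(1 - \text{prevalence})(\text{specificity}) + (\text{prevalence})(1 - \text{sensitivity})}$$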

These formulas, which follow from Bayes' theorem, allow us to calculate the predictive values
without having the data from the 2x2 table. If a test is applied to a target population
with a low disease prevalence, the positive predictive value will be low.
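
A small Python sketch of the two predictive-value formulas follows. It takes prevalence, sensitivity, and specificity from the screening example, then repeats the calculation with a lower prevalence of 0.001. That second value is a hypothetical choice made only to show the effect of low prevalence.

```python
def positive_predictivity(prevalence, sensitivity, specificity):
    """P(disease | positive test) from prevalence, sensitivity and specificity."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


def negative_predictivity(prevalence, sensitivity, specificity):
    """P(no disease | negative test) from prevalence, sensitivity and specificity."""
    true_neg = (1 - prevalence) * specificity
    false_neg = prevalence * (1 - sensitivity)
    return true_neg / (true_neg + false_neg)


# Values from the screening example in this chapter.
prevalence = 379 / 24103
sensitivity = 154 / 379
specificity = 23362 / 23724

ppv = positive_predictivity(prevalence, sensitivity, specificity)
npv = negative_predictivity(prevalence, sensitivity, specificity)
ppv_low = positive_predictivity(0.001, sensitivity, specificity)  # hypothetical low prevalence

print(round(ppv, 3), round(npv, 3), round(ppv_low, 3))
```

With the example's own prevalence the formulas reproduce the values read off the 2x2 table; with the lower prevalence the positive predictive value drops sharply.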

Conditional Probability
The probability that an event B will happen given that we already know the outcome of
another event A. We are looking to see if the prior occurrence of A changes the
probability of B.
If P(B|A) = P(B) then we say that the two events A and B are independent and
P(A,B) = P(A)*P(B).
➢ Back to Example: P(X+|Y+) = 0.406 and P(X+) = 0.021
Therefore we can say that X+ and Y+ are not independent: knowing that cancer
is present changes the probability of the test being positive.

Relative Risk
The chance that a member of a group receiving some exposure will develop a disease
relative to the chance that a member of an unexposed group will develop the same
disease: RR = P(disease | exposed) / P(disease | unexposed).

Recall: an RR of 1.0 indicates that the probabilities of disease in the exposed and
unexposed groups are identical, i.e., an association between exposure and disease does
not exist.
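
As a minimal sketch, the relative risk can be computed as the ratio of the two disease proportions. The counts used here are borrowed from the oral contraceptive example later in this chapter (13 MIs among 5,000 users, 7 among 10,000 non-users); the function name is an illustrative assumption.

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Risk of disease in the exposed group divided by risk in the unexposed group."""
    risk_exposed = cases_exposed / n_exposed
    risk_unexposed = cases_unexposed / n_unexposed
    return risk_exposed / risk_unexposed


rr = relative_risk(13, 5000, 7, 10000)
print(round(rr, 2))
```

An RR above 1 means the disease is more likely in the exposed group; equal risks in both groups give an RR of exactly 1.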
Sampling
Sampling In Quantitative Research
1. Total Population: the total collection of units, elements or individuals that you
want to analyse.
2. Representative sample
3. Probability Sampling
4. Non-Probability Sampling
5. Sample Size

Sample
A sample is a group of units selected from a larger group (the population). Samples are
selected because the population is too large to study in its entirety.
It is important that the researcher carefully and completely defines the population, including
a description of the members to be included.

Representative sample
 A sample whose characteristics correspond to, or reflect, those of the original
population or reference population

Probability Sampling
A probability provides a quantitative description of the likely occurrence of a particular
event. A probability sampling method is any method of sampling that uses some form of
random selection. In order to have a random selection method, you must set up some
process or procedure that assures that the different units in your population have equal
probabilities of being chosen (Clark 2002: 37).

Most Common Types of Probability Sampling


 Simple Random Sampling
 Stratified Random Sampling
 Systematic Random Sampling
 Cluster Or Multistage Sampling

Simple Random Sampling


We select a group of subjects (a sample) for study from a larger group (a
population). Each individual is chosen randomly and each member of the population
has an equal chance of being included in the sample.

Stratified Random Sampling
Often there are factors which divide the population into sub-populations (groups or strata).
A stratified sample is obtained by taking samples from each stratum or sub-group of a
population.

Systematic Random Sampling


Sometimes called interval sampling; it means that there is a gap, or interval, between
each selection.
It is used when questioning people in surveys, e.g.:
- a market researcher selecting every 10th person who enters a particular store, after
selecting a person at random as a starting point;
- interviewing occupants of every 5th house in a street, after selecting a house at random
as a starting point.
For a fixed sample size, it is first necessary to know the size of the whole population from
which the sample is being selected. The appropriate sampling interval, I, is then calculated
as I = N/n, where N is the population size and n is the sample size.
If a systematic sample of 500 students were to be carried out in a university with an
enrolled population of 10,000, the sampling interval would be: I = N/n = 10,000/500 = 20
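
The university example can be sketched in Python as follows: compute the interval I = N/n, pick a random starting point within the first interval, then take every I-th unit. The student ID list, the function name, and the seed are illustrative assumptions.

```python
import random


def systematic_sample(population, n, seed=None):
    """Select n units at a fixed interval after a random start within the first interval."""
    interval = len(population) // n      # sampling interval I = N / n (rounded down)
    rng = random.Random(seed)
    start = rng.randrange(interval)      # random starting position: 0 .. interval - 1
    return [population[start + k * interval] for k in range(n)]


# Hypothetical student ID numbers 1..10,000.
students = list(range(1, 10001))
sample = systematic_sample(students, 500, seed=1)

print(len(sample), sample[:3])
```

Only the starting point is random; once it is chosen, the rest of the sample is fully determined by the interval.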

Cluster Or Multistage Sampling


It is a sampling technique where the entire population is divided into groups, or clusters,
and a random sample of these clusters is selected. It is typically used when the researcher
cannot get a complete list of the members of a population they wish to study but can get
a complete list of groups or 'clusters' of the population.
It is a cheap, easy, and economical method of data collection.

Non-Probability Sampling
 Convenience/ opportunity/accidental sampling.
 Purposive/ judgemental sampling.
 Quota sampling.
 Snowball sampling.

Sample Size
• In general, the larger the sample size (selected with the use of probability
techniques) the better. The more heterogeneous a population is on a variety of
characteristics (e.g. race, age, sexual orientation, religion), the larger the sample
needed to reflect that diversity. (Papadopoulos 2003)
• Response rates vary with the type of survey. Response rate = number of respondents /
sample size.

Estimation Techniques for Binary Data
 Numerical Methods for Binary Data
✓ A special case of categorical data is binary data, where each outcome has only 2
possible values. When outcomes are classified as belonging to one of two
possible categories, usually one of the categories is considered to be of primary
interest (e.g., presence of disease).
 Standard Deviation vs. Standard Error
Standard deviation measures the variability in the population or sample
Standard error measures the precision of a statistic—such as the sample mean or
proportion—as an estimate of the population mean or population proportion

Proportions (P)
For each individual in the study, we record a binary outcome (Yes/No;
Success/Failure) rather than a continuous measurement.
✓ Compute a sample proportion, $\hat{p}$ (pronounced "p-hat"), by dividing the observed
number of "yes" outcomes by the total sample size.

Mean of Dichotomous Data


✓ An outcome is positive (i.e., data value of 1) if the primary category is observed
and negative (data value of 0) if the other category is observed.
The proportion is defined as: $p = \frac{x}{n}$
where x is the number of positive outcomes and
n is the sample size.

This can be expressed as: $p = \frac{\sum x_i}{n}$
where $x_i$ is "1" if the ith outcome is positive and "0" otherwise.

 The sample proportion can be viewed as a special case of the sample mean
when data is coded as 0 or 1.
➢ Variance of Dichotomous Data
Consider data that is coded "0" or "1". Write out the variance $s^2$ using the shortcut
formula, but using n instead of n - 1:
$$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n}$$
Since the data is coded 0 or 1, $x_i^2 = x_i$, so
$$s^2 = \frac{\sum x_i - \frac{(\sum x_i)^2}{n}}{n} = \frac{\sum x_i}{n}\left(1 - \frac{\sum x_i}{n}\right) = p(1 - p)$$

✓ In other words, the statistic p(1-p) can be used in place of $s^2$ as a measure of
variation for dichotomous data.
➢ Standard Error of the Sample Proportion
The standard deviation of the data is:
$$s = \sqrt{\hat{p}(1 - \hat{p})}$$
Therefore the standard error of the sample proportion is:
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
More specifically, the CLT states that the sampling distribution
of the sample proportion ($\hat{p}$) will be approximately normal when
the sample size is large. The mean and variance of the sampling
distribution are:
$$\mu_{\hat{p}} = \pi, \qquad \sigma_{\hat{p}}^2 = \frac{\pi(1 - \pi)}{n}$$
where $\pi$ is the population proportion.

Example: Suppose that the true proportion of smokers in a community is known to be
in the vicinity of $\pi = 0.4$ and we want to estimate it using a sample of size n = 100. What is
the probability that our estimate will be correct within 3%?
$$\mu_{\hat{p}} = \pi = 0.4$$
$$\sigma_{\hat{p}}^2 = \frac{\pi(1 - \pi)}{n} = \frac{0.4(1 - 0.4)}{100} = 0.0024$$
or the standard error of the sample proportion is $\sigma_{\hat{p}} = \sqrt{0.0024} = 0.049$.

$$P(0.37 \le \hat{p} \le 0.43) = P\left(\frac{0.37 - 0.4}{0.049} \le z \le \frac{0.43 - 0.4}{0.049}\right) = P(-0.612 \le z \le 0.612)$$
$$= 2 \cdot P(0 \le z \le 0.612) = 2 \cdot 0.2291 = 0.4582$$
Therefore we are only correct within 3% approximately 45.8% of the time.
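
The same probability can be sketched in Python with the standard normal distribution from the standard library (`statistics.NormalDist`, available in Python 3.8 and later). The variable names are illustrative assumptions. The result is about 0.46, slightly different from the table-based 0.4582, because the printed normal table is read at z = 0.61 while the code uses the unrounded z.

```python
import math
from statistics import NormalDist

pi = 0.4        # assumed true population proportion
n = 100         # sample size
margin = 0.03   # "correct within 3%"

se = math.sqrt(pi * (1 - pi) / n)   # standard error of the sample proportion
z = margin / se                     # margin expressed in standard errors

standard_normal = NormalDist()      # mean 0, standard deviation 1
prob = standard_normal.cdf(z) - standard_normal.cdf(-z)

print(round(se, 3), round(z, 3), round(prob, 2))
```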
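Using SPSS we can calculate the probability.

The 95% confidence interval for a proportion has the form:
$$\hat{p} \pm 1.96 \cdot SE(\hat{p}), \quad \text{where } SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$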

Confidence Intervals for the Population Proportion
From the sampling distribution of the sample proportion we can create a 95%
confidence interval for the population proportion .
 This is only applicable in LARGE samples (n≥25)!!!
 Notes on 95% Confidence Interval for a Proportion
▪ Example: Suppose that n = 25 newborn infants of obese women are sampled
and x = 10 weigh less than 2500 grams. Create a 95% CI for the population
proportion ($\pi$).
$$\hat{p} = \frac{x}{n} = \frac{10}{25} = 0.4$$
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = \sqrt{\frac{0.4(0.6)}{25}} = 0.098$$
95% CI for the population proportion:
$$\hat{p} \pm 1.96 \cdot SE(\hat{p}) = 0.4 \pm 1.96 \cdot 0.098 = 0.4 \pm 0.192 = (0.208, 0.592)$$
There is a 95% chance that this interval will cover the true
population proportion of newborns from obese
mothers that weigh less than 2500 grams.
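
A minimal Python sketch of this large-sample interval, applied to the newborn example (x = 10, n = 25). The function name `proportion_ci` is an illustrative assumption.

```python
import math


def proportion_ci(x, n, z=1.96):
    """Large-sample confidence interval for a proportion: p-hat +/- z * SE(p-hat)."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


lower, upper = proportion_ci(10, 25)
print(round(lower, 3), round(upper, 3))
```

The default z = 1.96 gives a 95% interval; other confidence levels would need a different normal quantile.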

Two Sample Problem


 In many trials for interventions the comparison of proportions is based on data
from 2 independent samples.
 The process of constructing two confidence intervals separately, one from each
sample, is not efficient.
 The estimation of the difference in proportions should be done instead.
 95% CI for the Difference in Population Proportions
 The formula that should be used to calculate a confidence interval for the
difference in population proportions is:
$$(\hat{p}_1 - \hat{p}_2) \pm 1.96 \cdot SE(\hat{p}_1 - \hat{p}_2)$$
where:
$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$

A study was conducted to look at the effects of oral contraceptives (OC) on
heart disease in women 40-44 years of age. It was found that among 5,000
current OC users, 13 developed an MI over a 3-year period, while among 10,000
non-OC users, 7 developed an MI over a 3-year period. Calculate 95% CIs
for the population proportion for OC users and for non-OC users, as well as for the
difference in the two population proportions.

OC Users:
$n_1 = 5000$, $x_1 = 13$
therefore: $\hat{p}_1 = \frac{13}{5000} = 0.0026$
and $SE(\hat{p}_1) = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1}} = \sqrt{\frac{0.0026(1 - 0.0026)}{5000}} = 0.0007$
95% CI for the population proportion:
$\hat{p}_1 \pm 1.96 \cdot SE(\hat{p}_1) = 0.0026 \pm 1.96 \cdot 0.0007 = (0.0012, 0.0040)$

➢ Non-OC Users:
$n_2 = 10{,}000$, $x_2 = 7$
therefore: $\hat{p}_2 = \frac{7}{10000} = 0.0007$
and $SE(\hat{p}_2) = \sqrt{\frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = \sqrt{\frac{0.0007(1 - 0.0007)}{10000}} = 0.0003$
95% CI for the population proportion:
$\hat{p}_2 \pm 1.96 \cdot SE(\hat{p}_2) = 0.0007 \pm 1.96 \cdot 0.0003 = (0.0001, 0.0013)$

 It can be seen that the 95% CI for the two population proportions barely overlap
which is a good indication that the two population MI rates are probably not the
same.
✓ The 95% CI for the difference in population proportions:
$$(\hat{p}_1 - \hat{p}_2) \pm 1.96 \cdot \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
$$= (0.0026 - 0.0007) \pm 1.96 \cdot \sqrt{\frac{0.0026(1 - 0.0026)}{5000} + \frac{0.0007(1 - 0.0007)}{10{,}000}}$$
$$= 0.0019 \pm 1.96 \cdot 0.0008 = (0.0003, 0.0035)$$

CI for the Odds Ratio


• Unlike the sample mean and the sample proportion, we cannot use the results
of the CLT to obtain the sampling distribution of the sample odds ratio.
• The sampling distribution of the odds ratio is positively skewed; therefore it can
be "normalized" by a data transformation (as with the geometric mean).
• We need to construct the CI on the log scale and then take the antilog of the two
endpoints.

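➢ CI for the Odds Ratio
✓ The odds ratio is calculated from the following table:
Exposure      Cases    Controls
Exposed       a        b
Unexposed     c        d
$$OR = \frac{\text{odds of exposure for cases}}{\text{odds of exposure for controls}} = \frac{a/c}{b/d} = \frac{ad}{bc}$$
The formula for the 95% CI for the population odds ratio:
$$\ln(OR) \pm 1.96 \cdot SE(\ln(OR)), \quad \text{where } SE(\ln(OR)) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$
If this gives the interval (A, B) on the log scale, the 95% CI for the odds ratio is $(e^A, e^B)$.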
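
A minimal Python sketch of this procedure follows: compute the odds ratio, build the interval on the log scale, then take the antilog of both endpoints. The cell counts (a = 20, b = 10, c = 10, d = 20) are hypothetical values chosen only for illustration, and the function name is an assumption.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a CI built on the log scale and transformed back."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_lower = math.log(odds_ratio) - z * se_log
    log_upper = math.log(odds_ratio) + z * se_log
    return odds_ratio, math.exp(log_lower), math.exp(log_upper)


# Hypothetical case-control counts: exposed cases, exposed controls,
# unexposed cases, unexposed controls.
odds_ratio, lower, upper = odds_ratio_ci(20, 10, 10, 20)
print(odds_ratio, round(lower, 2), round(upper, 2))
```

The resulting interval is not symmetric around the odds ratio on the original scale; it is symmetric only on the log scale.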

Tests of Significance
General Concepts
 One approach is to construct a confidence interval for the population parameter;
another is to construct a statistical hypothesis test.
With statistical tests we claim that the mean $\mu$ of the population is equal to some
postulated value $\mu_0$; this claim is called the null hypothesis, or H0.
✓ The alternative hypothesis is a second statement that contradicts H0.
✓ Serum Cholesterol Example
✓ Suppose we want to test whether the mean serum cholesterol level of hypertensive
smokers is equal to the mean of the general population of 20-74 year old males.
✓ Together the null and alternative hypotheses cover all possible values of the
population mean:
$H_0: \mu = \mu_0 = 211$ mg/100ml
$H_A: \mu \ne 211$ mg/100ml

Hypothesis Test Results


Result of test     H0 is true     HA is true
Reject H0          α              1-β
Not reject H0      1-α            β

Types of Errors
Two possible ways to commit an error:
✓ (i) Type I: Reject H0 when it is true (α)
✓ (ii) Type II: Fail to reject H0 when it is false (β)

The goal in hypothesis testing is to keep α and β (the probabilities of type I and II
errors) as small as possible.
✓ Usually α is fixed at some specific level, say 0.05: the significance level of the test.
✓ 1-β is called the power of the test.
➢ Hypothesis Testing
▪ We want to draw a conclusion about a population parameter.
▪ In a population of women who use oral contraceptives, is the average
(expected) change in blood pressure (after-before) 0 or not?
▪ Sometimes statisticians use the term "expected" for the population average; µ is
the expected (population) mean change in blood pressure.

Hypothesis Testing
We are testing both hypotheses at the same time
• Our result will allow us to either "reject H0" or "fail to reject H0"
• We start by assuming the null (H0) is true, and asking:
• "How likely is the result we got from our sample?"

Hypothesis Testing
• Null hypothesis: H0: μ = 0
• Alternative hypothesis: HA: μ ≠ 0
• We reject H0 if the sample mean is far away from 0

The Null Hypothesis, H0


• Typically represents the hypothesis that there is "no association" or "no difference"
• It represents the current "state of knowledge" (i.e., no conclusive research exists)

The Alternative Hypothesis HA (or H1)


Typically represents what you are trying to prove. For example, there is an association between blood pressure and oral contraceptive use.

P-values - what do they mean?


The probability of obtaining a mean (from our sample) as extreme or more extreme than the observed sample mean, given that the null hypothesis is true, is called the p-value of the test, or simply p.
If the p-value is "sufficiently small" we reject the null hypothesis. In most applications 0.05 is chosen as the cut-point and we call this value α, the "significance level".
• Therefore, when the null hypothesis is true, we reject it incorrectly 5% of the time.

Using the p-value to Make a Decision
 Recall, we specified two competing hypotheses about the underlying, true mean
blood pressure change, µ
 We now need to use the p-value to choose a course of action . . . either reject
H0, or fail to reject H0

Serum Cholesterol Example


To conduct a test of hypothesis we use our knowledge of the sampling distribution of the mean. According to the CLT, the test statistic is
$Z = \dfrac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$
• For a given sample we can calculate the test statistic Z. The standard deviation of this distribution is assumed to be σ = 46 mg/100ml.
• Test the following null and alternative hypotheses:
H0: μ = μ0 = 211 mg/100ml   Ha: μ ≠ 211 mg/100ml
• using the information from a sample of 30 with mean serum cholesterol level of 217 mg/100ml.
To test the hypothesis we have to compute the test statistic:
$Z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \dfrac{217 - 211}{46 / \sqrt{30}} = 0.7144$
• If the null hypothesis is true, this statistic is the outcome of a standard normal random variable and we can find the area to the right of z = 0.7144.
• P(Z > 0.7144) = 0.237
• Thus the area in the two tails of the standard normal distribution sums to 0.474, and this is the p-value of the test. Since 0.474 > 0.05, we FAIL to reject H0.
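The serum cholesterol calculation can be checked with a short Python sketch. The function name `one_sample_z_test` is an illustrative choice; the standard library's `NormalDist` stands in for the normal table.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Return (z, two-sided p-value) for H0: mu = mu0 with known sigma."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

# Values from the serum cholesterol example.
z, p = one_sample_z_test(sample_mean=217, mu0=211, sigma=46, n=30)
print(f"z = {z:.4f}, p-value = {p:.3f}")
```

The p-value may differ from the tabulated 0.474 in the third decimal place because the table rounds the tail area.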

Z tests for a population mean


• To test the hypothesis H0: μ = μ0 based on a SRS of size n from a population with unknown mean μ and known standard deviation σ, compute the test statistic:
$Z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
In terms of a standard normal random variable Z, the p-value for a test of H0 against:
Ha: μ > μ0 is P(Z ≥ z)
Ha: μ < μ0 is P(Z ≤ z)
Ha: μ ≠ μ0 is 2*P(Z ≥ |z|)

Hypothesis Testing for Proportions
The proportion of patients surviving five years after being diagnosed with lung
cancer among those who are over 40 at the time of diagnosis is 8.2%. Is it possible that
the proportion surviving in the under-40 population is 0.082 as well? For a sample of 52
persons under 40 who have been diagnosed with lung cancer the survival proportion is
0.115.
Step 1: Determine the hypotheses
H0: π = π0 = 0.082
HA: π ≠ 0.082 (two-sided alternative)
Step 2: Calculate the test statistic:
$Z = \dfrac{\hat{p} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}} = \dfrac{0.115 - 0.082}{\sqrt{0.082 (1 - 0.082) / 52}} = 0.87$
p-value = 2 * P(Z ≥ 0.87) = 2 * [1 - NORMDIST(0.87, 0, 1, 1)] = 2 * 0.193 = 0.386
Step 3: Conclusion
Since 0.386 is greater than the level of significance 0.05 we do not reject the
null hypothesis.
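The three steps above can be sketched in Python as follows. The function name `proportion_z_test` is hypothetical, and `NormalDist().cdf` plays the role of the spreadsheet function NORMDIST(z, 0, 1, 1).

```python
import math
from statistics import NormalDist

def proportion_z_test(p_hat, pi0, n):
    """Return (z, two-sided p-value) for H0: pi = pi0."""
    # The standard error uses the null value pi0, not the sample proportion.
    se = math.sqrt(pi0 * (1 - pi0) / n)
    z = (p_hat - pi0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Lung cancer survival example: 52 patients under 40, sample proportion 0.115.
z, p_value = proportion_z_test(p_hat=0.115, pi0=0.082, n=52)
reject_h0 = p_value < 0.05
print(f"z = {z:.2f}, p-value = {p_value:.3f}, reject H0: {reject_h0}")
```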
 Two-sided Significance Tests and CIs
Consider the following hypotheses:
H0: μ = μ0   HA: μ ≠ μ0
If μ0 is not included in the 95% confidence interval for μ, H0 should be rejected at the 0.05 level.

Hypothesis Testing
Steps in Hypothesis Testing
1. State the null hypothesis H0 and the alternative hypothesis Ha.
2. Calculate the value of the test statistic on which the test will be based.
3. Find the P-value for the observed data.
4. State a conclusion.

Step 1 – State the Hypotheses


The test is designed to assess the strength of the evidence against H0; Ha is the statement we will accept if the evidence enables us to reject H0.
Usually the null hypothesis is a statement of "no effect" or "no difference".
Ha is the statement of what we hope or suspect is true.
Hypotheses always refer to some population or model and are always stated in terms
of population parameters.

One sided or 2-sided
 If you do not have a specific direction firmly in mind in advance (before looking
at the data), a 2-sided alternative hypothesis should be used.
 Hypothesis Examples
 Does mean age of onset of a certain acute disease for school children differ
from 11.5?
 Is the average cross-sectional area of the lumen of coronary arteries for men,
ages 4 to 59 less than 31.5% of the total arterial cross section?

Step 2 – Calculate the Test Statistic


 The test is based on a statistic that estimates the parameter that appears in the
hypothesis.
• When H0 is true we would expect the sample statistic to take a value near the parameter value specified by H0.
 The alternative hypothesis determines which directions count against H0.
 A test statistic measures compatibility between the null hypothesis and the
sample data.
 We calculate the test statistic assuming that the null hypothesis is true.
 When we have a one-sided hypothesis we use the value of the parameter in H0
that is closest to Ha to calculate our test statistic.
• We will assume that the data are normally distributed and that we know the value of the population standard deviation (when testing the population mean μ).
• To calculate the test statistic we need to create a z-statistic.
• For a hypothesis regarding the population mean we would use:
$z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
• For a hypothesis regarding the population proportion we would use:
$z = \dfrac{\hat{p} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}}$
Step 3 – Calculating the P-value
 A test of significance assesses the evidence against the null hypothesis in terms
of probability.
 The p-value is the probability, computed assuming H0 is true, that the test
statistic would take a value as extreme or more extreme than that actually
observed.
 The smaller the p-value the more evidence against H0.

Step 4- Making a Conclusion
• We need to compare our p-value with a fixed value that we regard as decisive. This value determines how much evidence against H0 we will require to reject H0 and we call it the significance level (α).
• With a significance level set at 0.05 we are requiring that the data give evidence against H0 so strong that it would happen no more than 5% of the time when H0 is true.
• If the p-value is as small as or smaller than α, we say that the data are statistically significant at level α and we reject H0.
 Example for the Population Mean

Do middle aged male executives have different average blood pressure than the general population? The National Center for Health Statistics reports that the mean systolic blood pressure for males 35 to 44 years of age is 128 mmHg and the standard deviation in this population is 15 mmHg. The medical director of a company looks at 72 company executives in this age group and finds that the mean systolic blood pressure in this sample is 126.07 mmHg. Is this evidence that the executive blood pressure differs from the national average?
Step 1: State your hypotheses
H0: μ = μ0 = 128 mmHg   Ha: μ ≠ 128 mmHg
Step 2 – Calculate your test statistic
We make the unrealistic assumption that the population standard deviation is known.
$z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \dfrac{126.07 - 128}{15 / \sqrt{72}} = -1.09$
Step 3 – Calculate the p-value
The probability that a standard normal variable Z takes a value at least 1.09 away from zero:
P-value = 2*P(Z ≥ 1.09) = 2*(0.5 - 0.3621) = 0.2758
This means that 27.6% of the time a SRS of size 72 from the general male population would have a mean blood pressure at least as far from 128 mmHg as that of the executive sample.
Step 4 – Make your conclusion
At a significance level of 0.05 we fail to reject H0 and conclude that the data do not provide enough evidence that the mean blood pressure of executives is different from 128 mmHg.
 Example: One-sided Alternative for the Population Mean
In a discussion of SAT scores, someone comments: "Because only a minority of high school students take the test, the scores overestimate the ability of typical high school seniors. The mean SAT math score is about 475 but I think if all seniors took the test, the mean score would be no more than 450". You gave the test to a SRS of 500 seniors and these students had a mean score of 461. Is this good evidence that the mean for all seniors is more than 450?
One-sided Alternative for the Population Mean
Step 1: State your hypotheses
H0: μ = 450   Ha: μ > 450
Step 2 – Calculate your test statistic
We assume that the population standard deviation is known and that it is 100.
$z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \dfrac{461 - 450}{100 / \sqrt{500}} = 2.46$
Step 3 – Calculate the p-value
• The probability that a standard normal variable Z takes a value of at least 2.46 (upper tail only, because the alternative is one-sided).
• P-value = P(Z ≥ 2.46) = 0.5 – 0.4931 = 0.0069
• Step 4 – Make your conclusion
• At a significance level of 0.05 we reject H0 and conclude that the data provide enough evidence that the mean SAT math score for high school seniors is higher than 450.
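For a one-sided alternative only one tail of the normal curve counts. A minimal Python sketch of the SAT calculation, with variable names chosen for illustration:

```python
import math
from statistics import NormalDist

# Values from the SAT example; sigma = 100 is the assumed known standard deviation.
sample_mean, mu0, sigma, n = 461, 450, 100, 500

z = (sample_mean - mu0) / (sigma / math.sqrt(n))
# Ha: mu > 450, so the p-value is the upper-tail area only.
p_upper = 1 - NormalDist().cdf(z)
print(f"z = {z:.2f}, one-sided p-value = {p_upper:.4f}")
```

Doubling `p_upper` would give the two-sided p-value, which is not what this alternative calls for.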
 Example for the Population Proportion
The French naturalist Count Buffon once tossed a coin 4040 times and obtained 2048 heads. The sample proportion is p̂ = 0.5069. If Buffon's coin was balanced, the probability of obtaining heads on any toss is 0.5.
Do the data provide evidence that the coin was not balanced?
• Step 1: State your hypotheses
H0: π = π0 = 0.5   Ha: π ≠ 0.5
• Step 2 – Calculate your test statistic
$z = \dfrac{\hat{p} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}} = \dfrac{0.5069 - 0.5}{\sqrt{0.5 (1 - 0.5) / 4040}} = 0.88$
• Step 3 – Calculate the p-value
• The probability that a standard normal variable Z takes a value at least 0.88 away from zero.
• P-value = 2*P(Z ≥ 0.88) = 2*(0.5 - 0.3106) = 0.3788
• This means a proportion at least as far from 0.5 as that observed would occur approximately 38% of the time if the coin were balanced.
• Step 4 – Make your conclusion
• At a significance level of 0.05 we fail to reject H0 and conclude that the data do not provide enough evidence that the coin was not fair.


The t-Distribution
Both confidence intervals and tests of significance for the mean μ of a normal population are based on the sample mean x̄.
• The sampling distribution of x̄ depends on σ.
• σ is either known or it is estimated using the sample standard deviation s.
• Setting: SRS of size n from a normally distributed population with mean μ and standard deviation σ. This is based on the results of the CLT: $\bar{x} \sim N(\mu, \sigma^2 / n)$
• Test Statistics
The standardized sample mean, or the one-sample z statistic, when σ is known:
$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1)$
When we substitute the sample standard deviation we get:
$t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}} \sim t(n-1)$

One-sample t-test
• An SRS is drawn from a population having unknown mean μ. To test the hypothesis H0: μ = μ0, compute:
$t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$
• The random variable T has a t(n-1) distribution, so the p-value for the test is:
If Ha: μ > μ0, p-value = P(T ≥ t)
If Ha: μ < μ0, p-value = P(T ≤ t)
If Ha: μ ≠ μ0, p-value = 2*P(T ≥ |t|)
These P-values are exact if the population distribution is normal and are approximately correct for large n in other cases.

Rejection Region
• From the table for the t-distribution with (n-1) degrees of freedom and the choice of α, the rejection region is determined as follows (shown for α = 0.05):
For a one-sided test use the column corresponding to an upper tail area of 0.05:
• t ≤ -(tabulated value) for Ha: μ < μ0
• t ≥ tabulated value for Ha: μ > μ0
For a two-sided test use the column corresponding to an upper tail area of 0.025:
• t ≤ -(tabulated value) OR t ≥ tabulated value
 Example for 1-sample t-test
• The following data are amounts of vitamin C (mg/100g) for a random sample of corn soy blend (population is normally distributed).
26 31 23 22 11 22 14 31
The specifications are designed to produce a mean vitamin C content of 40 mg/100 g. Test the null hypothesis that the mean vitamin C content of the production run from which we got our sample conforms to these specifications.
We are told that x̄ = 22.5 and s = 7.19.
1. H0: μ = 40 mg/100g
   Ha: μ ≠ 40 mg/100g
2. Calculate the test statistic:
$t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}} = \dfrac{22.5 - 40}{7.19 / \sqrt{8}} = -6.88$
3. This test statistic has the t(7) distribution. Need to calculate the p-value:
2*P(T ≥ 6.88) = 2*TDIST(6.88,7,1) = TDIST(6.88,7,2) = 0.00024
4. Therefore we reject the null hypothesis based on an α-level of 0.05 and conclude that the vitamin C content for this run does not meet the specifications.
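The summary statistics and the t statistic for the vitamin C data can be reproduced with the Python standard library. This sketch uses the rejection-region approach; the critical value 2.365 is the tabulated upper 0.025 point of t(7), entered by hand because the standard library has no t-distribution.

```python
import math
import statistics

vitamin_c = [26, 31, 23, 22, 11, 22, 14, 31]  # mg/100g, from the example
mu0 = 40

n = len(vitamin_c)
x_bar = statistics.mean(vitamin_c)
s = statistics.stdev(vitamin_c)  # sample standard deviation (n - 1 in the denominator)
t_stat = (x_bar - mu0) / (s / math.sqrt(n))

T_CRIT_7DF = 2.365  # tabulated value for df = 7, two-sided alpha = 0.05
reject_h0 = abs(t_stat) >= T_CRIT_7DF
print(f"mean = {x_bar}, s = {s:.2f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```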

Matched Pairs t Procedures
One common comparative design is the matched-pairs study where subjects are
matched in pairs and are compared within each pair.
One example of matched pairs is before-and-after observations on the same subjects.
In this setting the variable that we are measuring on the subjects is a continuous measure. (The matched design with a dichotomous outcome is handled by McNemar's test.)
With large sample size and assuming that the null hypothesis of no difference is true, the mean d̄ of these differences is distributed as normal with mean and variance given by:
$\mu_{\bar{d}} = 0, \qquad \sigma^2_{\bar{d}} = \dfrac{\sigma_d^2}{n}$
Since we do not know the variance, it has to be estimated from our data by the sample variance. Therefore our test statistic becomes:
$t = \dfrac{\bar{d} - 0}{s_d / \sqrt{n}} \sim t(n-1)$
• 20 teachers were tested for their understanding of French before and after a 4 week immersion program. We want to test if the program improved the teachers' comprehension of spoken French.
• We are told the following: the average difference in scores is 2.5, and the sample standard deviation of the differences in the scores is 2.893.
1. H0: μd = 0
   Ha: μd > 0
2. Calculate the test statistic:
$t = \dfrac{\bar{d} - 0}{s_d / \sqrt{n}} = \dfrac{2.5 - 0}{2.893 / \sqrt{20}} = 3.86$
3. This test statistic has the t(19) distribution. Need to calculate the p-value:
P(T ≥ 3.86) = tdist(3.86,19,1) = 0.00053
4. Therefore we reject the null hypothesis based on an α-level of 0.05 and conclude that there is strong evidence that the program improved comprehension.
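Since only summary statistics are given for the teachers, the paired t statistic can be sketched directly from them. The function name is hypothetical, and 1.729 is the tabulated upper 0.05 point of t(19), typed in by hand.

```python
import math

def paired_t_from_summary(mean_diff, sd_diff, n):
    """t statistic for H0: mu_d = 0 from the mean and SD of the paired differences."""
    return (mean_diff - 0) / (sd_diff / math.sqrt(n))

t_paired = paired_t_from_summary(mean_diff=2.5, sd_diff=2.893, n=20)

T_CRIT_19DF_ONE_SIDED = 1.729  # tabulated value for df = 19, upper tail area 0.05
reject_h0 = t_paired >= T_CRIT_19DF_ONE_SIDED
print(f"t = {t_paired:.2f}, reject H0: {reject_h0}")
```

With raw before and after scores, the same function would be applied to the mean and standard deviation of the per-subject differences.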

Comparing Two Means


A common goal of inference is to compare the responses in two groups. Each group is
considered to be a sample from a distinct population. The responses in each group are
independent of those in the other group. The sample sizes in the two groups need not
be the same.

Two sample Problem
Population   Variable   Mean   Standard deviation
1            X1         µ1     σ1
2            X2         µ2     σ2
We have two independent samples from 2 distinct populations and the same continuous variable is measured for both samples.
Population   Sample size   Sample mean   Sample standard deviation
1            n1            x̄1            s1
2            n2            x̄2            s2
Inference is based on 2 independent SRS, one from each population.

2-sample z-test (population Standard Deviations are known)


• We use x̄1 - x̄2 to estimate μ1 - μ2
• The sampling distribution of x̄1 - x̄2 has:
mean: $\mu_1 - \mu_2$   variance: $\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}$
If the two population distributions are both normal, then the distribution of x̄1 - x̄2 is also normal.

Two-sample t-test
Hypotheses for comparing two population means:
One-tailed test:
H0: (μ1 - μ2) = D0   Ha: (μ1 - μ2) < D0
OR
H0: (μ1 - μ2) = D0   Ha: (μ1 - μ2) > D0
Two-tailed test:
H0: (μ1 - μ2) = D0   Ha: (μ1 - μ2) ≠ D0

Suppose that x̄1 is the mean of a SRS of size n1 drawn from an N(μ1, σ1²) population and that x̄2 is the mean of a SRS of size n2 drawn from an N(μ2, σ2²) population. Then the 2-sample z statistic
$z = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
has the N(0,1) sampling distribution.

T procedures
If the population standard deviations are not known we estimate them by the sample
standard deviations from our two samples.
To simplify the test we will assume that the two normal population distributions have the
same standard deviations so we use a pooled standard error in the test statistic.

$s_p^2 = \dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$
and
$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}}$ with t(n1 + n2 - 2) distribution

Rejection Region
From the table for the t-distribution with (n1 + n2 - 2) degrees of freedom and the choice of α, the rejection region is determined as follows (shown for α = 0.05):
For a one-sided test use the column corresponding to an upper tail area of 0.05, and H0 is rejected if:
• t ≤ -(tabulated value) for Ha: μ1 < μ2
• t ≥ tabulated value for Ha: μ1 > μ2
For a two-sided test use the column corresponding to an upper tail area of 0.025, and H0 is rejected if:
• t ≤ -(tabulated value) OR t ≥ tabulated value

2-sample T-test Example


Independent random samples selected from two normal populations produced the sample means and standard deviations shown in the table:
                             Sample 1   Sample 2
Sample size                  17         12
Mean                         5.4        7.9
Sample standard deviation    3.4        4.8
Test the null hypothesis that the population means are equal vs. the alternative that they are not equal.
1. H0: μ1 - μ2 = 0
   Ha: μ1 - μ2 ≠ 0
2. Calculate the test statistic:
$s_p^2 = \dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} = \dfrac{(17 - 1)(3.4^2) + (12 - 1)(4.8^2)}{17 + 12 - 2} = 16.24$
and
$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}} = \dfrac{(5.4 - 7.9) - 0}{\sqrt{16.24 \left( \dfrac{1}{17} + \dfrac{1}{12} \right)}} = -1.645$
This test statistic follows the t-distribution with 27 degrees of freedom.
3. P-value = 2*P(T > 1.645) = 2*TDIST(1.645,27,1) = TDIST(1.645,27,2) = 0.112
- critical value for rejection region = 2.052
4. Therefore we fail to reject the null hypothesis based on an α-level of 0.05 and conclude that the data do not provide enough evidence that the two population means are different.
Independent random samples from approximately normal populations produced the results shown below. Do the data provide sufficient evidence to conclude that (μ2 - μ1) > 10?
x̄1 = 43.6, x̄2 = 53.63, s1 = 5.47, s2 = 5.41 (the sample sizes used in the calculation are n1 = 15 and n2 = 16)
1. H0: μ2 - μ1 ≤ 10 and Ha: μ2 - μ1 > 10
2. Calculate the test statistic:
$s_p^2 = \dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} = \dfrac{(15 - 1)(5.47^2) + (16 - 1)(5.41^2)}{15 + 16 - 2} = 29.58$
and
$t = \dfrac{(\bar{x}_2 - \bar{x}_1) - D_0}{\sqrt{s_p^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}} = \dfrac{(53.63 - 43.6) - 10}{\sqrt{29.58 \left( \dfrac{1}{15} + \dfrac{1}{16} \right)}} = 0.015$
This test statistic follows the t-distribution with 29 degrees of freedom.
3. P-value = P(T > 0.015) = TDIST(0.015,29,1) = 0.49
- critical value for rejection region = 1.699
4. Therefore we fail to reject the null hypothesis based on an α-level of 0.05 and conclude that the data do not provide sufficient evidence that μ2 - μ1 exceeds 10.
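Both worked examples use the same pooled-variance formulas, so they can be checked with one small helper. The function name `pooled_t` and its argument order are illustrative assumptions.

```python
import math

def pooled_t(n1, mean1, sd1, n2, mean2, sd2, d0=0.0):
    """Pooled two-sample t statistic for H0: mu1 - mu2 = d0.

    Returns (pooled variance, t statistic, degrees of freedom).
    Assumes the two populations share a common standard deviation.
    """
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    t = ((mean1 - mean2) - d0) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return sp2, t, df

# First example: two-sided test of mu1 - mu2 = 0.
sp2_a, t_a, df_a = pooled_t(17, 5.4, 3.4, 12, 7.9, 4.8)

# Second example: tests mu2 - mu1 against 10, so sample 2 is passed first.
sp2_b, t_b, df_b = pooled_t(16, 53.63, 5.41, 15, 43.6, 5.47, d0=10)

print(f"example 1: sp2 = {sp2_a:.2f}, t = {t_a:.3f}, df = {df_a}")
print(f"example 2: sp2 = {sp2_b:.2f}, t = {t_b:.3f}, df = {df_b}")
```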

Examples of Data for ANOVA


 A medical researcher wants to compare the effectiveness of 3 different
treatments to lower the cholesterol of patients with high blood cholesterol levels.
He assigns 60 individuals at random to the 3 treatments and records the
reduction in cholesterol for each patient.

Comparing Means
 For the cholesterol example the mean reduction in each treatment group was:
0.2 1.5 0.8
 Is the observed difference the result of chance variation?
• We would not expect the sample means to be equal even if the population means are identical.
 To answer this we need to know the variation within the groups under
observation and the sizes of the samples.
 To assess the equality of several population means we compare the variation
among the means of several groups with the variation within groups.
 This method is called Analysis of Variance.

• The null and alternative hypotheses for the one-way ANOVA are:
H0: μ1 = μ2 = μ3 = ... = μk
Ha: not all of the population means are equal

Contingency table
It is a table that cross-classifies the observations from two variables. Each cell in the
table contains the counts of the combinations of the two variables.
Setting: Let X1 and X2 denote categorical variables, X1 having I levels and X2 having
J levels. There are IJ possible combinations of classifications.
When the cells contain frequencies of outcomes, the table is called a contingency table.
Paired-Matched Studies
The distinguishing characteristic of paired samples for counts is that each observation
in the first group has a corresponding observation in the second group. For this type of
data we use McNemar’s Test to evaluate hypotheses about the data.
For paired matched data with a single binary response the data can be represented by a 2x2 table where (+,-) denote the exposed and non-exposed outcome.
                 Control
                 +      -
Case      +      a      b
          -      c      d
a - number of pairs with 2 exposed members
b - number of pairs where the case is exposed and the control is unexposed
c - number of pairs where the case is unexposed and the control is exposed
d - number of pairs with 2 unexposed members
Goal: We want to compare the incidence of exposure among the cases versus the controls. The parts of the data showing no difference, "a" and "d" (the concordant pairs), contribute nothing as evidence in the comparison.
• McNemar's Test
• Null hypothesis: the exposure is not associated with the disease.
• Alternative hypothesis: there is an association between the exposure and the disease.
o This is a 2-sided alternative hypothesis
o Under the null hypothesis we expect:
• b = c OR
• b/(b+c) = 0.5
• Special case of the one-sample problem for a proportion, with p̂ = b/(b+c), π0 = 1/2 and n = b + c:
$z = \dfrac{\hat{p} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}} = \dfrac{\dfrac{b}{b+c} - \dfrac{1}{2}}{\sqrt{\dfrac{1}{2} \left( 1 - \dfrac{1}{2} \right) \Big/ (b+c)}} = \dfrac{\dfrac{2b - b - c}{2(b+c)}}{\sqrt{\dfrac{1}{4(b+c)}}} = \dfrac{b - c}{\sqrt{b + c}}$

• McNemar's Test
Decisions based on the standardized z-score are for a one-sided alternative.
In the two-sided form, the square of the z-statistic is denoted by:
$\chi^2 = \dfrac{(b - c)^2}{b + c}$
and the test is known as McNemar's chi-square.
If the test is one-sided, z is used and the null hypothesis is rejected at the 0.05 level when z > 1.65
If the test is two-sided, χ² is used and the null hypothesis is rejected at the 0.05 level when χ² > 3.84

X2 Distribution
 The probabilities associated with the Chi-Squared Distribution are in Chi table.
• The table is set up in the same way as for the t-distribution.
 The chi-squared distribution with 1 df is the same as the square of the N(0,1)
distribution.
 Since the distribution only takes on positive values all the probability is in the
right-tail.
 For a significance level of 0.05 and df=1 the rejection region for a 2-sided test is:
 X2 test statistic > 3.84.
 Example of Paired-Matched Study
 A study in Maryland identified 2408 white persons enumerated in an unofficial
1963 census who became widowed between 1963 and 1974. These people
were matched, one-to-one, to married persons on the basis of race, gender,
year of birth, and geography of residence. The matched pairs were followed to a
second census in 1975 and vital status was obtained.

                     Married Men
Widowed Men          Dead     Alive
Dead                 2        292
Alive                210      700

H0: There is no association between marital status and survival


Ha: There is an association between marital status and survival
• McNemar's chi-square test statistic:
$\chi^2 = \dfrac{(b - c)^2}{b + c} = \dfrac{(292 - 210)^2}{292 + 210} = 13.39$
The null hypothesis of no association (equal mortality) should be rejected at the 0.05 level, since the test statistic 13.39 is greater than 3.84.
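McNemar's statistic uses only the two discordant cells, which makes it easy to check in code. A minimal sketch with the widowhood counts; the variable names are illustrative.

```python
import math

# Discordant pairs from the widowhood example.
b = 292  # widowed man dead, married man alive
c = 210  # widowed man alive, married man dead

z_mcnemar = (b - c) / math.sqrt(b + c)
chi2_mcnemar = (b - c) ** 2 / (b + c)

CHI2_CRIT_1DF = 3.84  # tabulated critical value, df = 1, alpha = 0.05
reject_h0 = chi2_mcnemar > CHI2_CRIT_1DF
print(f"z = {z_mcnemar:.2f}, chi-square = {chi2_mcnemar:.2f}, reject H0: {reject_h0}")
```

The concordant cells (2 and 700 here) never enter the calculation, in line with the remark that they contribute no evidence.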

Independent Studies
 The null hypothesis of the Chi-square test is that there is no association
between the row and column variables
 The alternative hypothesis of the Chi-square test is that an association exists
between these two variables. It is always a two-sided hypothesis.
• If the null hypothesis is true, then the expected count in each cell is
expected count = (row total * column total) / N (total sample)
• The Chi-square test statistic is the sum, over all cells, of the squared difference between the observed count (O) and the expected count (E), divided by the expected count:
$\chi^2 = \sum \dfrac{(x_{ij} - e_{ij})^2}{e_{ij}}$
• If there is no association between the variables, the observed counts will be close to the expected counts and the test statistic will be small.

Distribution of Chi-Square Statistic


The statistic follows a Chi-Square distribution determined by the degrees of freedom
(df) where df =
(#rows-1)(#col – 1)
 A large statistic will result in a small p-value
 Chi-Squared Example
 Suppose a manufacturer is interested in determining the relationship between
the size and manufacturer of newly purchased automobiles. 1000 recent buyers
of American-made cars are randomly selected and each purchase is classified
with respect to the size and manufacturer of the automobile.
                Manufacturer
                A      B      C      D      TOTALS
Small           157    65     181    10     413
Intermediate    126    82     142    46     396
Large           58     45     60     28     191
TOTALS          341    192    383    84     1000

Notation for the observed counts:
                Manufacturer
                A      B      C      D      TOTALS
Small           x11    x12    x13    x14    x1+
Intermediate    x21    x22    x23    x24    x2+
Large           x31    x32    x33    x34    x3+
TOTALS          x+1    x+2    x+3    x+4    n

Expected counts, for example e11 = (x1+ * x+1)/n:
                Manufacturer
                A      B      C      D
Small           e11    e12    e13    e14
Intermediate    e21    e22    e23    e24
Large           e31    e32    e33    e34

Observed counts with expected counts in parentheses:
                Manufacturer
                A               B              C               D              TOTALS
Small           157 (140.833)   65 (79.296)    181 (158.179)   10 (34.692)    413
Intermediate    126 (135.036)   82 (76.032)    142 (151.668)   46 (33.264)    396
Large           58 (65.131)     45 (36.672)    60 (73.153)     28 (16.044)    191
TOTALS          341             192            383             84             1000

 H0: There is no association between size of car and manufacturer


 Ha: There is an association between size of car and manufacturer
Chi-square:
$\chi^2 = \sum \dfrac{(x_{ij} - e_{ij})^2}{e_{ij}} = \dfrac{(157 - 140.833)^2}{140.833} + \dots + \dfrac{(28 - 16.044)^2}{16.044} = 45.81$

Large values of the test statistic imply that the observed counts are not close to the expected counts under the null hypothesis, and are therefore evidence against the null hypothesis.
For this example the appropriate degrees of freedom (df) are: (3-1)*(4-1) = 6
We can find the critical value for the rejection region:
o For α = 0.05 the critical value is 12.592
o For α = 0.01 the critical value is 16.812
• We therefore reject the null hypothesis of no association at the 0.05 significance level since 45.81 > 12.592.
• We conclude that there is an association between the size of car and the car manufacturer. In other words, the size and manufacturer of a car selected by a purchaser are not independent events.
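The expected counts and the chi-square statistic for the automobile table can be recomputed with plain Python lists. This is a sketch, not a general-purpose routine; the critical value is the tabulated 12.592 quoted above.

```python
# Observed counts: rows = size (small, intermediate, large), columns = manufacturer A-D.
observed = [
    [157, 65, 181, 10],
    [126, 82, 142, 46],
    [58, 45, 60, 28],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n_total = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n_total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

CHI2_CRIT_6DF = 12.592  # tabulated critical value, df = 6, alpha = 0.05
reject_h0 = chi2 > CHI2_CRIT_6DF
print(f"chi-square = {chi2:.2f}, df = {df}, reject H0: {reject_h0}")
```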

Regression Analysis
Previously we were interested in testing population parameters.
If the data were binary or categorical we discussed the comparison of population proportions.
If the data were continuous we discussed the comparison of population means.
In other studies the goal is to assess the relationships among a set of variables, for example the relationship between a mother's weight and her newborn's weight.
Birth weight data:
x (oz)   y (%)
112      63
111      66
107      72
119      52
92       75
80       118
81       120
84       114
118      42
106      72
103      90
94       91
The Pearson correlation coefficient is:
$r = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$

Pearson Correlation Results


Tests For The Pearson Correlation Coefficient
It is often of interest to test for independence between two continuous variables under investigation.
Step 1: H0: ρ = 0
        Ha: ρ ≠ 0
Step 2: Test statistic
$t = r \sqrt{\dfrac{n - 2}{1 - r^2}} \sim t_{n-2}$
Step 3: Calculate the p-value
p-value = 2 * P(T ≥ |t|) = tdist(|t|, df, 2)
OR use Appendix C to find the critical value for the rejection region
Step 4: Make your conclusion

H0: ρ = 0
Ha: ρ ≠ 0
$t = r \sqrt{\dfrac{n - 2}{1 - r^2}} = (-0.946) \sqrt{\dfrac{12 - 2}{1 - (-0.946)^2}} = -9.23$
At α = 0.05 and df = 10, the critical values for the rejection region are ±2.228. The test statistic t is ≤ -2.228, so we reject the null hypothesis and conclude that there is a linear association between birth weight and % increase in weight. The birth weight (x) accounts for r² = (-0.946)² = 0.895, or 89.5%, of the variability in % growth rates (y).

p-value = tdist(9.23, 10, 2) < 0.0001
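The correlation coefficient and its t statistic can be recomputed from the twelve (x, y) pairs of the birth weight data. A minimal sketch with illustrative variable names; working at full precision gives a t value slightly different from one computed with r rounded to three decimals.

```python
import math

birth_weight = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]  # x, in oz
pct_increase = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]      # y, in %

n = len(birth_weight)
x_bar = sum(birth_weight) / n
y_bar = sum(pct_increase) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(birth_weight, pct_increase))
s_xx = sum((x - x_bar) ** 2 for x in birth_weight)
s_yy = sum((y - y_bar) ** 2 for y in pct_increase)

r = s_xy / math.sqrt(s_xx * s_yy)
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

T_CRIT_10DF = 2.228  # tabulated value, df = 10, two-sided alpha = 0.05
reject_h0 = abs(t_stat) >= T_CRIT_10DF
print(f"r = {r:.3f}, r^2 = {r * r:.3f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```

The negative sign of r reflects that heavier newborns in this sample show smaller percentage increases in weight.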

What is Regression?
Like correlation analysis, simple linear regression is a technique that is used to explore
the nature of a relationship between two continuous random variables.
Regression analysis allows us to investigate the change in one variable which
corresponds to a given change in the other.
Instead of just quantifying the strength of the relationship between the 2 variables we
can predict the value of one variable given a value for the other.

Components of Regression Analysis


Dependent or response variable (Y) – a variable to be predicted from or explained by other variables.
It needs to be a continuous measurement.
It is assumed to be normally distributed.
Independent or explanatory variables (X1, X2, …, Xk) – the variables used to predict the dependent variable.
The model is formulated to express the mean of the normal distribution for the dependent variable as a function of potential independent variables under investigation.

Linear Regression Model


The regression model describes the mean of the normally distributed dependent variable Y as a function of the independent variable X.
$y_i = \beta_0 + \beta_1 x_i + \varepsilon$
where y_i = value of the response variable
β0 and β1 are the two unknown parameters
x_i = value of the independent variable
ε = random error term that is distributed N(0, σ²).

Assumptions for Linear Regression
1. For a specified value of x, the distribution of the y values is normal with mean μ_{y|x} and standard deviation σ_{y|x}.
2. The relationship between μ_{y|x} and x is described by the straight line $\mu_{y|x} = \beta_0 + \beta_1 x$
3. For any specified value of x, σ_{y|x} does not change.
4. The outcomes of y are independent.

Scatter Plot
If you plot the mean of Y vs. X, the graph is a straight line.
The observed values of Y may be greater or less than its mean. Therefore the plot of
the observed values will not fall perfectly on the line.
A scatter diagram consists of a single point for each (x,y) pair of numbers.

Regression Coefficients
β0 is the intercept of the regression line. It does not have any particular meaning as a separate term in the regression model.
β1 is the slope of the regression line. It represents the increase (or decrease if it is negative) in the mean of Y associated with a 1 unit increase in X.
For an m unit increase in the value of X, the corresponding increase (or decrease) in the mean of Y is m * β1.

Least Squares Estimation


We use a method called Least Squares to obtain the "best" estimate of the regression coefficients.
$b_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$
$b_0 = \bar{y} - b_1 \bar{x}$ and $\hat{Y} = b_0 + b_1 X$
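As an illustration, the least squares formulas can be applied to the birth weight data from the correlation example. The sketch below computes the slope both ways (deviation form and computational form) to show they agree; the prediction at x = 100 oz is a made-up query point.

```python
birth_weight = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]  # x, in oz
pct_increase = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]      # y, in %

n = len(birth_weight)
x_bar = sum(birth_weight) / n
y_bar = sum(pct_increase) / n

# Deviation form of the slope.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(birth_weight, pct_increase)) / sum(
    (x - x_bar) ** 2 for x in birth_weight
)

# Computational form of the slope; algebraically identical to b1.
sum_x, sum_y = sum(birth_weight), sum(pct_increase)
sum_xy = sum(x * y for x, y in zip(birth_weight, pct_increase))
sum_x2 = sum(x * x for x in birth_weight)
b1_alt = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x**2 / n)

b0 = y_bar - b1 * x_bar

def predict(x):
    """Fitted mean of Y at a given x."""
    return b0 + b1 * x

print(f"b1 = {b1:.3f}, b0 = {b0:.2f}, predicted y at x = 100: {predict(100):.1f}")
```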

Normal Distribution
Discrete Probability Distributions
Binomial distribution: each trial can result in only 1 of 2 possible
outcomes. There is a fixed number of trials and the results of the trials are
independent.
Discrete Random Variable
A discrete random variable X has a finite number of possible values. The probability
distribution of X lists the values and their probabilities.
1. Every probability $p_i$ is a number between 0 and 1.

2. The sum of the probabilities must be 1.
• Find the probability of any event by adding the probabilities of the particular
values that make up the event.
• Example: The instructor in a large class gives 15% each of A's and D's, 30%
each of B's and C's and 10% F's. The student's grade on a 4-point scale is a
random variable X (A=4).
What is the probability that a student selected at random will have a B or better?
ANSWER: P(grade of 3 or 4) = P(X=3) + P(X=4) = 0.3 + 0.15 = 0.45

Continuous Probability Distributions


• Between two values of a continuous random variable we can always find a third.
• A histogram is used to represent a discrete probability distribution and a smooth
curve called the probability density is used to represent a continuous probability
distribution.

The Histogram and the Probability Density

The probability density is a smooth idealized curve that shows the shape of the
distribution in the population.
Areas in an interval under the curve represent the percent of the population in the
interval.

Normal Distribution
You can tell which normal distribution you have by knowing the mean and standard
deviation.
The mean is the center and the standard deviation measures the spread (variability).
• The most common continuous distribution is the normal distribution, the
bell-shaped curve.
• The normal curve is unimodal and symmetric about its mean ($\mu$).
• In this distribution the mean, median and mode are all identical.
• The standard deviation ($\sigma$) specifies the amount of dispersion around the mean.
• The two parameters $\mu$ and $\sigma$ completely define a normal curve.

Normal Distribution - Notes


• The total area enclosed by the normal distribution curve is 1.0 and the
cumulative probabilities are given by:
$F(x) = P(X \le x)$
• Calculating cumulative probabilities from the normal distribution (area under the
curve) is a numeric problem and no easy formula exists.
• There are tables and Excel functions to calculate these probabilities.
• The tables are for normally distributed random variables with mean = 0 and
variance = 1 ($\mu = 0$ and $\sigma^2 = 1$) - STANDARD NORMAL VARIABLE
• When applied to 'real data', these estimates are considered approximate!
[Figure: Distributions of Blood Pressure]

Standard Normal Variable
• It is customary to call a standard normal random variable Z.
• The outcomes of the random variable Z are denoted by z.
• The table in the coming slide gives the area under the curve (probabilities)
between the mean and z.
• The probabilities in the table refer to the likelihood that a randomly selected
value Z is equal to or less than a given value of z and greater than 0 (the mean
of the standard normal distribution).
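
As an alternative to a printed table, the same areas can be computed numerically. The sketch below is one possible approach, not the method used in the chapter: it relies on the error function from Python's standard `math` module, and the function names are chosen here for illustration.

```python
import math


def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def area_mean_to_z(z):
    """Area under the standard normal curve between the mean (0) and z >= 0."""
    return standard_normal_cdf(z) - 0.5


for z in (1.0, 1.96, 2.0):
    print(f"z = {z}: P(Z <= z) = {standard_normal_cdf(z):.4f}, "
          f"area from 0 to z = {area_mean_to_z(z):.4f}")
```

The second function mirrors a table that reports the area between the mean and z; adding 0.5 recovers the cumulative probability.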

Calculating Probabilities
• Probability calculations are always concerned with finding the probability that the
variable assumes any value in an interval between two specific points a and b.
• The probability that a continuous variable assumes a value between a and b
is the area under the graph of the density between a and b.
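
An interval probability is the difference of two cumulative probabilities. A minimal sketch, assuming Python 3.8 or later for `statistics.NormalDist`; the helper name is invented for this example.

```python
from statistics import NormalDist


def interval_probability(dist, a, b):
    """P(a <= X <= b): area under the density of dist between a and b."""
    return dist.cdf(b) - dist.cdf(a)


z = NormalDist(mu=0.0, sigma=1.0)  # standard normal
print(f"P(-1 <= Z <= 1)       = {interval_probability(z, -1.0, 1.0):.4f}")
print(f"P(-1.96 <= Z <= 1.96) = {interval_probability(z, -1.96, 1.96):.4f}")
```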

Standard Normal Scores


Standard Score: $Z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$

"Z" is normal with mean 0 and standard deviation of 1.


A standard score of:
• Z = 1: The observation lies one SD above the mean
• Z = 2: The observation lies two SD above the mean
• Z = -1: The observation lies one SD below the mean
• Z = -2: The observation lies two SD below the mean

Example: Male Blood Pressure, mean = 125, s = 14 mmHg

BP = 167 mmHg: $Z = \frac{167 - 125}{14} = 3.0$
BP = 97 mmHg: $Z = \frac{97 - 125}{14} = -2.0$

• Thus, it is a way of quickly assessing how "unusual" an observation is.

• Example: Suppose the mean BP is 125 mmHg and the standard deviation is 14
mmHg.
o Is 167 mmHg an unusually high measure?
o If we know Z = 3.0, does that help us?
• But if the distribution is not normal, we may not be able to use the Z-score approach.

Standardization
$F(x) = P(X \le x) = P\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = P\left(Z \le \frac{x - \mu}{\sigma}\right) = P(Z \le z)$
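
Standardization can be illustrated with the blood pressure figures used earlier (mean 125 mmHg, SD 14 mmHg). The sketch below assumes blood pressure is normally distributed, which is an assumption for illustration; the function names are likewise invented here.

```python
import math


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def normal_cdf(x, mean, sd):
    """F(x) = P(X <= x) for X ~ N(mean, sd^2), computed as P(Z <= z)."""
    z = z_score(x, mean, sd)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


MEAN_BP, SD_BP = 125.0, 14.0  # values from the blood pressure example

for bp in (167.0, 97.0):
    z = z_score(bp, MEAN_BP, SD_BP)
    prob = normal_cdf(bp, MEAN_BP, SD_BP)
    print(f"BP = {bp:.0f} mmHg: Z = {z:+.1f}, P(X <= BP) = {prob:.4f}")
```

Under the normality assumption, a reading with Z = 3.0 lies above almost all of the population, which is what makes it "unusual".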
T-Distribution
• Similar to the standard normal in that it is unimodal, bell-shaped and
symmetric.
• The tails of the distribution are "thicker" than those of the standard normal.
• The distribution is indexed by "degrees of freedom" (df).
• The degrees of freedom measure the amount of information available in the
data set that can be used for estimating the population variance (df = n-1).
• Area under the curve still equals 1.
• Probabilities for the t-distribution with infinite df equal those of the standard
normal.
• The table of the t-distribution gives the probability to the right of a critical
value, i.e. the area in the upper tail.
• We are only given the area (or probability) for a few selected critical values for
each degree of freedom.

T-Distribution Example
For a t-curve from a sample of size 15, find the area to the left of 2.145.
Answer: df = 15-1 = 14
In the table of the t-distribution, the area to the right of 2.145 is 0.025.
Therefore the area to the left of 2.145 is:
1-0.025 = 0.975
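
The table lookup can be cross-checked by integrating the t density numerically. This is a rough sketch using only the Python standard library (Simpson's rule with an arbitrarily chosen 2000 steps); it is not how printed tables are produced, and the function names are invented for illustration.

```python
import math


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large df.
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """P(T <= t), using symmetry about 0 and Simpson's rule on [0, |t|]."""
    if t == 0:
        return 0.5
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area between 0 and |t|
    return 0.5 + area if t > 0 else 0.5 - area


n = 15
df = n - 1
left = t_cdf(2.145, df)
print(f"df = {df}: area to the left of 2.145 = {left:.3f}, "
      f"to the right = {1 - left:.3f}")
```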

Significance Tests for Categorical Data


One Sample Problem
• Type of Data: we have binary data (n, x), with n being a large sample size and x
the number of positive outcomes among the n observations.
• Hypothesis: $H_0: \pi = \pi_0$
o where $\pi_0$ is a fixed and known number between 0 and 1.
o $\pi_0$ is a standardized or reference figure (e.g. the national smoking rate).
o The sample statistic that we will use to test this hypothesis is the sample
proportion, and we will invoke the results of the Central Limit Theorem (CLT).
• Recall the CLT: with a large sample size, and assuming that $H_0$ is true, the sample
proportion ($\hat{p}$) is approximately normal with mean and standard error given by
$\mu_p = \pi_0$
$\sigma_p = \sqrt{\frac{\pi_0 (1 - \pi_0)}{n}}$
so that
$z = \frac{\hat{p} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}}$
• Therefore the observed value of the sample proportion can be converted into a
Z-statistic, which is the number of standard errors away from the hypothesized
value $\pi_0$.
• From the table for the standard normal distribution and the choice of $\alpha$ (e.g.
$\alpha = 0.05$), the rejection region is determined by:
o For a one-sided test: $z \le -1.65$ for $H_a: \pi < \pi_0$
  $z \ge 1.65$ for $H_a: \pi > \pi_0$
o For a two-sided test, or $H_a: \pi \ne \pi_0$:
  $z \le -1.96$ OR $z \ge 1.96$
• From the table for the standard normal distribution and the choice of $\alpha$ (e.g.
$\alpha = 0.01$), the rejection region is determined by:
o For a one-sided test: $z \le -2.33$ for $H_a: \pi < \pi_0$
  $z \ge 2.33$ for $H_a: \pi > \pi_0$
o For a two-sided test, or $H_a: \pi \ne \pi_0$:
  $z \le -2.58$ OR $z \ge 2.58$
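
Critical values for a chosen significance level can also be computed rather than read from a table. A small sketch, assuming Python 3.8 or later for `statistics.NormalDist`; the helper name is invented for illustration, and printed tables usually round these values to two decimals.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def critical_values(alpha):
    """Return (one_sided, two_sided) upper critical values for a z-test."""
    one_sided = std_normal.inv_cdf(1 - alpha)
    two_sided = std_normal.inv_cdf(1 - alpha / 2)
    return one_sided, two_sided


for alpha in (0.05, 0.01):
    one, two = critical_values(alpha)
    print(f"alpha = {alpha}: one-sided z = {one:.3f}, two-sided z = +/-{two:.3f}")
```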

One-sample problem
• A group of investigators wishes to explore the relationship between the use of hair
dyes and the development of breast cancer in women. A sample of n=1000
female beauticians 40-49 years of age is identified and followed for 5 years.
After 5 years there are 20 new cases of breast cancer. It is known that breast
cancer incidence over this time period for average American women in this age
group is 0.007. Does hair dye increase the risk of breast cancer?
• Step 1: State your hypotheses
o $H_0: \pi = \pi_0 = 0.007$
o $H_a: \pi > 0.007$
• Step 2: Calculate your test statistic
$\hat{p} = 20/1000 = 0.02$
$z = \frac{\hat{p} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0)/n}} = \frac{0.02 - 0.007}{\sqrt{0.007 (1 - 0.007)/1000}} = 4.93$
• Step 3: Calculate the p-value
The p-value is the probability that a standard normal variable Z takes a value at least
4.93 s.d. above zero: P-value = $P(Z \ge 4.93)$ < 0.001
Assuming a significance level of 0.05, the rejection region for this one-sided test is z ≥ 1.65.
Since 4.93 > 1.65, the test statistic is in the rejection region.
• Step 4: Make your conclusion
At a significance level of 0.05 we reject $H_0$ and conclude that there is enough
evidence that the incidence rate of breast cancer for beauticians is greater
than the incidence rate for the average American woman in the age group 40-49 years.
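
The arithmetic of this example can be reproduced as follows. This is a minimal sketch using only the `math` module; the function names are invented for illustration, and the upper-tail p-value reflects the one-sided alternative.

```python
import math


def one_sample_proportion_z(x, n, pi0):
    """z-statistic for H0: pi = pi0, using the standard error under H0."""
    p_hat = x / n
    se = math.sqrt(pi0 * (1 - pi0) / n)
    return (p_hat - pi0) / se


def upper_tail_p_value(z):
    """One-sided p-value P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# 20 new cases among 1000 beauticians; reference incidence 0.007.
z = one_sample_proportion_z(x=20, n=1000, pi0=0.007)
p_value = upper_tail_p_value(z)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.1e}")
```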

Comparison of Two Proportions
• Setting: We have 2 independent samples of binary data $(n_1, x_1)$ and $(n_2, x_2)$
o the n's are adequately large and are not necessarily equal
o the x's are the numbers of "positive" outcomes in the two samples

Consider the hypothesis:

$H_0: \pi_1 = \pi_2$, or equivalently $\pi_1 - \pi_2 = 0$ (equality of the two population proportions).
The first step is to decide if the alternative hypothesis should be one-sided or two-sided.
o One-sided: $H_a: \pi_2 > \pi_1$ OR $H_a: \pi_2 < \pi_1$
o Two-sided: $H_a: \pi_1 \ne \pi_2$
• Recall the CI formulation for the difference in proportions:
$(\hat{p}_1 - \hat{p}_2) \pm z^* \, SE_d$, with $SE_d = \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}$
o where $z^*$ is a value from the standard normal density curve.

• Therefore, to test the null hypothesis that the two population proportions are the
same, we need to develop a test statistic that normalizes the difference D between
the two sample proportions.
• To do this we need to determine the standard error of D:
$\sigma_D = \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}$
• Under the null hypothesis this becomes:
$\sigma_D = \sqrt{\frac{p(1 - p)}{n_1} + \frac{p(1 - p)}{n_2}} = \sqrt{p(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$
• We estimate the common value of p by the overall proportion of successes in the
two samples:
$p = \frac{x_1 + x_2}{n_1 + n_2}$
• To estimate $\sigma_D$ under the null hypothesis we use p from the above expression.
• Step 2: Calculate the test statistic:
$z = \frac{\hat{p}_2 - \hat{p}_1}{\sqrt{p(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
where p = the "pooled" proportion, an estimate of the common proportion under $H_0$.
• Step 3: Refer to the table for the standard normal distribution to select a
cut-point. If the choice of $\alpha$ is 0.05, the rejection region is determined by:
o For the one-sided alternative $H_a: \pi_2 > \pi_1$, z > 1.65
o For the one-sided alternative $H_a: \pi_2 < \pi_1$, z < -1.65
o For the two-sided alternative $H_a: \pi_1 \ne \pi_2$, z < -1.96 or z > 1.96
• Step 4: Make your conclusion
Two-sided Alternative - Example
• An investigation was made into fatal poisonings of children by two drugs which
were among the leading causes of such deaths. In each case, an inquiry was
made as to how the child received the fatal overdose and responsibility for the
accident was assessed. Is there evidence of an association between the type of
drug and the responsibility?

                         Drug A   Drug B
Child responsible             8       12
Child not responsible        31       19

• Step 1: Write out your hypotheses
o $H_0: \pi_1 = \pi_2$
o $H_a: \pi_1 \ne \pi_2$
o $\pi_1$ = population proportion (child responsible) for drug A;
$\pi_2$ = population proportion (child responsible) for drug B
• Step 2: Calculate the test statistic
o $\hat{p}_1 = 8/39 = 0.205$, $\hat{p}_2 = 12/31 = 0.387$, pooled p = 20/70 = 0.286
o $z = \frac{0.387 - 0.205}{\sqrt{0.286 (1 - 0.286)(1/39 + 1/31)}} = 1.674$
• Step 3: Find the rejection region or the p-value
o Rejection region: since this is a two-sided test with $\alpha = 0.05$, the rejection
region is z > 1.96 or z < -1.96.
o The z-statistic = 1.674, which is not in the rejection region.
o Calculate the p-value: p-value = 2*P(Z > 1.674) = 2*(0.5 - 0.4525) = 0.095
• Step 4: Make your conclusion
o At a significance level of $\alpha = 0.05$ we fail to reject the null
hypothesis: there is not enough evidence of an association between responsibility
and the type of drug used in the overdose.
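
The two-proportion test for the poisoning data can be sketched as below. The function names are invented for illustration, and the code uses the exact normal tail area, so its p-value may differ slightly from one read off a two-decimal table.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """z-statistic for H0: pi1 = pi2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Drug A: 8 of 39 children responsible; Drug B: 12 of 31.
z = two_proportion_z(x1=8, n1=39, x2=12, n2=31)
p_value = two_sided_p_value(z)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.3f}")
```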
