Research Methodology and Biostatistics Unit II Part I
Unit II – Biostatistics: Definition, application, sample size,
importance of sample size, factors influencing sample size,
dropouts, statistical tests of significance, types of
significance tests, parametric tests (Student's t-test,
ANOVA, correlation coefficient, regression), non-parametric
tests (Wilcoxon rank tests, analysis of variance,
correlation, chi-square test), null hypothesis, P values,
degrees of freedom, interpretation of P values.
Statistics
• What is statistics?
A collection of techniques for extracting
information from data, and for ensuring
that the data collected contain the desired
information.
Basic Concepts of Biostatistics
• Data
• Variable
• Population
• Sample
Basic Definitions and Concepts
Variables
• Quantitative (e.g. blood pressure)
  • Discontinuous (discrete): e.g. number of TB patients in a hospital
  • Continuous: e.g. percentage of haemoglobin
• Qualitative (e.g. colour of skin)
Types of Variables
• Qualitative (categorical) variables: e.g. colour of skin
• Quantitative variables: e.g. blood sugar level
• Quantitative continuous: decimals are allowed, e.g. blood pressure, height, weight
• Quantitative discrete: integers only, e.g. number of anesthetic shots, number of
hospital admissions, blood cell count
Qualitative observation
Tuberculoid 151
Indeterminate 18
Borderline 12
Total 181
9.5-9.9 4
10.0-10.4 8
Total 13
Differential counts from the blood of a person, classified according to the number of eosinophils
1 20
2 20
Total 51
Variables
• Independent variables
  • Precede dependent variables in time
  • Are often manipulated by the researcher
  • The treatment or intervention that is used in a study
• Dependent variables
  • What is measured as an outcome in a study
  • Values depend on the independent variable
• Example:
If we wish to compare the bioavailabilities of various dosage forms, the
dependent variable would be AUC (area under the concentration–time curve),
and the independent variable would be the dosage form.
Sample Size
• The number (n) of observations taken from a population, from which
statistical inferences about the whole population are made.
• Sample:
“A small portion of the population which truly represents the population
with respect to the study characteristics.”
• Need for sample size:
Biological data are highly variable.
Sample size is a crucial element in the planning of any research project.
It allows economy in terms of personnel, equipment, time and related
aspects, but not at the cost of the desired precision, confidence and power.
Considerations in sample size calculation
• Why is it important?
It is an integral part of quantitative research.
It ensures the validity, accuracy, reliability, and scientific and
ethical integrity of research.
• Three main concepts to be considered:
Estimation (depends on several components).
Justification (in the light of budgetary or biological considerations).
Adjustments (accounting for potential dropouts or the effect of covariates).
Importance of Sample Size calculation
Scientific reasons
Ethical reasons
Economic reasons
• I-Scientific Reasons
In a trial with negative results and a sufficient sample size, the
result is conclusive (the treatment has no effect; there is no difference).
In a trial with negative results and insufficient power
(insufficient sample size), one may mistakenly conclude that the
treatment under study made no difference (false conclusion).
• II-Ethical Reasons
[Figure: sampling error. Population parameter = 20 years. Sample 1: error = 1 year. Sample 2: statistic = 24 years, error = 4 years. Sample 3: statistic = 26 years, error = 6 years.]
Key Terms in Sampling
• Sampling error (continued)
Sampling error is neither a measurement error nor a systematic bias in
the sample; it is the error that depends on the representativeness of
the sample.
The smaller the sampling error, the greater the precision of the sample.
A representative sample depends upon:
Sampling error: a function of sample size.
Non-sampling error: systematic error arising from study design,
correctness of execution of sampling, and non-response.
Sample Size Determination - quantitative data
• The mean pulse rate of a population is believed to be 70 per minute with a
standard deviation of 8 beats. Calculate the minimum sample size needed to
verify this if the allowable error is (i) E = ±1 beat at 5% risk and
(ii) E = ±2 beats at 5% risk.
• Solution:
(i) $n = \frac{4\sigma^2}{E^2} = \frac{4 \times 8 \times 8}{1 \times 1} = 256$
(ii) $n = \frac{4\sigma^2}{E^2} = \frac{4 \times 8 \times 8}{2 \times 2} = 64$
If E is smaller, n must be larger, i.e. the larger the sample size, the smaller
the error.
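The calculation above can be sketched in Python. This is a minimal illustration, assuming the rule n = 4σ²/E² used on this slide, where the factor 4 approximates z² (1.96² ≈ 4) at 5% risk. The function name `sample_size_for_mean` is hypothetical.

```python
import math


def sample_size_for_mean(sigma, allowable_error):
    """Minimum n from n = 4 * sigma^2 / E^2 (5% risk, z rounded to 2)."""
    n = 4 * sigma ** 2 / allowable_error ** 2
    # Remove floating-point noise first, then round up to a whole subject.
    return math.ceil(round(n, 6))


# Pulse rate example: sigma = 8 beats, E = 1 beat and E = 2 beats.
print(sample_size_for_mean(8, 1))
print(sample_size_for_mean(8, 2))
```

Halving the allowable error quadruples the required sample size, because E enters the formula squared.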
To Solve
• The mean systolic blood pressure of students in one college was found to be
120 with an SD of 10. Calculate the minimum sample size needed to verify
the result if the allowable error is 2 at 5% risk.
• Solution:
• $n = \frac{4\sigma^2}{E^2} = \frac{4 \times 10 \times 10}{2 \times 2} = 100$
Sample Size Determination - qualitative data
• Incidence rate in the last influenza epidemic was found to be 50 per
thousand (5%) of the population exposed. What should be the size of
sample to find incidence rate in the current epidemic if allowable error is
0.005 and 0.01?
• $n = \frac{4pq}{E^2}$, where p = 0.05 and q = 1 − p = 1 − 0.05 = 0.95
If E = 0.005:
$n = \frac{4pq}{E^2} = \frac{4(0.05)(0.95)}{(0.005)^2} = 7600$
If E = 0.01:
$n = \frac{4pq}{E^2} = \frac{4(0.05)(0.95)}{(0.01)^2} = 1900$
So the larger the permissible error, the smaller the sample size
required, for both types of data.
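The same arithmetic for qualitative data can be sketched as follows. This is an illustrative sketch only, assuming the rule n = 4pq/E² with p and E expressed as fractions (not percentages). The function name `sample_size_for_proportion` is hypothetical.

```python
import math


def sample_size_for_proportion(p, allowable_error):
    """Minimum n from n = 4 * p * q / E^2, with q = 1 - p (5% risk)."""
    q = 1 - p
    n = 4 * p * q / allowable_error ** 2
    # Remove floating-point noise first, then round up to a whole subject.
    return math.ceil(round(n, 6))


# Influenza example: p = 0.05 with E = 0.005 and E = 0.01.
print(sample_size_for_proportion(0.05, 0.005))
print(sample_size_for_proportion(0.05, 0.01))
```

Rounding up matters when the result is not a whole number, since a fraction of a subject cannot be sampled.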
To Solve
• Hookworm prevalence rate was 30% before the specific treatment and adoption of
other measures. Calculate the size of the sample required to find the prevalence
rate now if allowable error is 0.03 and 0.06.
• Solution:
• If E = 0.03:
$n = \frac{4pq}{E^2} = \frac{4(0.3)(0.7)}{(0.03)^2} = 933.3 \approx 934$
• If E = 0.06:
$n = \frac{4pq}{E^2} = \frac{4(0.3)(0.7)}{(0.06)^2} = 233.3 \approx 234$
Thus, if we allow only a small error, the required sample size is much larger
than when the allowable error is increased.
Sampling techniques
I. Random Sampling / probability sampling
• Simple Random Sampling :
Lottery Method.
Table of random Numbers
• Systematic sampling
• Stratified sampling
• Multistage sampling
• Cluster sampling
• Multiphase sampling
SAMPLING TECHNIQUES - Simple Random Sampling:
369 495
428 572
565 169
969 786
385 094
Sampling Techniques - Systematic Sampling
• Systematic Sampling:
$K = \frac{\text{Total population}}{\text{Sample size desired}}$
For example, if a 10% sample is to be taken out of one thousand patients:
$K = \frac{1000}{10\% \text{ of } 1000} = \frac{1000}{100} = 10$
• One random number is found by pulling out one card after shuffling,
out of 10 cards serially numbered 1 to 10. Supposing it is 6, then the
sample will consist of units numbered 6, 6 + 10 = 16, 16 + 10 = 26,
26 + 10 = 36 and so on, i.e. examine every 10th house starting from the
6th house.
Sampling Techniques - Systematic Sampling
• To assess the incidence of influenza in one epidemic in a large city like
Bombay: if a 20% sample is to be taken out of 100 cases, what has to be done
and how is it solved?
• Solution:
$K = \frac{100}{20\% \text{ of } 100} = \frac{100}{20} = 5$
Examine every 5th case starting with a random number such as 2; the
subsequent numbers will be 2 + 5 = 7, 7 + 5 = 12, 12 + 5 = 17, 17 + 5 = 22
and so on.
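The systematic sampling procedure can be sketched in a few lines of Python. This is a minimal sketch, assuming units are numbered 1 to N and that the population size is an exact multiple of the sample size. The function name `systematic_sample` is hypothetical, and the starting numbers are fixed here only to match the worked examples (in practice the start is drawn at random between 1 and K).

```python
def systematic_sample(population_size, sample_size, start):
    """Return the 1-based unit numbers chosen by systematic sampling."""
    k = population_size // sample_size  # sampling interval K
    if not 1 <= start <= k:
        raise ValueError("start must lie between 1 and K")
    # Take the starting unit, then every K-th unit after it.
    return list(range(start, population_size + 1, k))


# 10% sample of 1000 patients, random start 6.
patients = systematic_sample(1000, 100, start=6)
print(patients[:4], len(patients))

# 20% sample of 100 cases, random start 2.
cases = systematic_sample(100, 20, start=2)
print(cases[:5], len(cases))
```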
Sampling Techniques (continued)
• Stratified Sampling
This method is followed when the population is not homogeneous.
The population under study is first divided into homogeneous groups or
classes called strata, and the sample is drawn from each stratum at random
in proportion to its size.
• Multistage Sampling
This method is employed in large country surveys.
In the first stage, districts are chosen at random in all the states,
followed in turn by random selections of talukas, villages and units,
e.g. for a hookworm survey in a district, choose 10% of the villages in the
talukas and then examine the stools of all persons in every 10th house.
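Drawing a stratified sample "in proportion to its size" can be illustrated with a short sketch. Everything below is an assumption made for illustration: the strata names and sizes are invented, and the helper names `proportional_allocation` and `stratified_sample` are hypothetical.

```python
import random


def proportional_allocation(strata_sizes, total_sample):
    """Split the total sample across strata in proportion to stratum size."""
    population = sum(strata_sizes.values())
    # Note: rounding can make the parts differ slightly from total_sample.
    return {name: round(total_sample * size / population)
            for name, size in strata_sizes.items()}


def stratified_sample(strata_sizes, total_sample, seed=None):
    """Draw a simple random sample of unit numbers within each stratum."""
    rng = random.Random(seed)
    allocation = proportional_allocation(strata_sizes, total_sample)
    return {name: sorted(rng.sample(range(1, strata_sizes[name] + 1), k))
            for name, k in allocation.items()}


# Hypothetical population of 1000 split into three strata.
strata = {"urban": 500, "semi-urban": 300, "rural": 200}
allocation = proportional_allocation(strata, 100)
sample = stratified_sample(strata, 100, seed=1)
print(allocation)
```

Each stratum contributes the same fraction of its members (here 10%), so the sample mirrors the composition of the population.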
• Cluster Sampling:
• A cluster is a randomly selected group.
E.g., as per the module approved by WHO, it
is most often used to evaluate vaccination
coverage in the Expanded Programme of
Immunization (EPI) and Universal
Immunization Programme (UIP), where
only 210 children in the age group 12–23
months, 7 from each cluster (i.e. 30
clusters), are to be examined.
• Multiphase Sampling:
In this method, part of the information is
collected from the whole sample and part
from the subsample.
Sampling techniques
II. Non-random sampling / non-probability sampling
• Convenience sampling
• Purposive sampling / judgment sampling
• Quota sampling
• Snowball sampling
Purpose of sampling
• Complete coverage may not be possible
• High degree of accuracy
• Valid and comparable results can be obtained in a short period of time
• Less demanding in terms of the requirements of the investigation
• Economical
• Quality control
• Draw inference about the universe
• Generalization
• Random error: error that occurs by chance.
• Sources:
sample variability,
subject-to-subject differences and measurement errors.
It can be reduced by:
averaging,
increasing the sample size,
repeating the experiment.
• Systematic error: deviations not due to chance alone.
Several factors, e.g. patient selection criteria, may contribute.
It can be reduced by good study design and conduct of the experiment.
• Precision: the degree to which a variable has the same value when
measured several times. It is a function of random error (the sample
represents the population).
• Accuracy: the degree to which a variable actually represents the true
value. It is a function of systematic error (bias is absent from the sample).
• Power (1 − β): the probability that the test will correctly identify a
significant difference, effect or association in the sample should one
exist in the population. Power increases with sample size: the larger the
sample, the greater the power of the study to detect a significant
difference, effect or association.
• Effect size: is a measure of the strength of the relationship between two
variables in a population. It is the magnitude of the effect under the
alternative hypothesis. The bigger the size of the effect in the
population, the easier it will be to find.
• Confidence Interval: A confidence interval is a range of values,
computed from sample data, that is expected to contain the population
parameter in a stated proportion of repeated samples. That proportion is
the confidence level. Any confidence level can be chosen, the most
common being 95% or 99%.
• Null hypothesis: It states that there is no difference among groups or no
association between the predictor and the outcome variable. This
hypothesis needs to be tested.
• Alternative hypothesis: It contradicts the null hypothesis. The
alternative hypothesis cannot be tested directly; it is accepted by
exclusion if the test of significance rejects the null hypothesis. There are
two types: one-tailed (one-sided) or two-tailed (two-sided).
• Type I (α) error: It occurs if an investigator rejects a null hypothesis that
is actually true in the population. The probability of making an α error is
called the level of significance and is conventionally set at 0.05 (5%).
Choosing a smaller α requires a larger sample size.
• Type II (β) error: It occurs if the investigator fails to reject a null
hypothesis that is actually false in the population.
• Type I error or alpha (false positive): rejecting the null when it is true.
• Type II error or beta (false negative): failing to reject the null when it is false.
α
• The probability of committing a type I error (rejecting the null when it is actually
true) is called α (alpha); another name is the level of statistical significance.
• An α level of 0.05 sets 5% as the maximum chance of incorrectly rejecting the
null hypothesis.
β
• The probability of making a type II error (failing to reject the null hypothesis when
it is actually false) is called β (beta).
• The quantity (1 − β) is called power, the ability to detect a difference of a given
size.
• If β is set at 0.10, we are willing to accept a 10% chance of missing an association of
a given effect size.
• This represents a power of 90% (there is a 90% chance of finding an association of
that size).
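The link between α, β, power and sample size can be illustrated numerically. The sketch below is not taken from the slides: it assumes a two-sided one-sample z-test with known σ, and the true difference `delta`, the value of σ and the function name `z_test_power` are hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(delta, sigma, n, alpha=0.05):
    """Power (1 - beta) of a two-sided one-sample z-test with known sigma."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    shift = abs(delta) * sqrt(n) / sigma        # true difference in SE units
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Assumed true difference of 1 beat with sigma = 8 beats.
for n in (64, 256, 1024):
    power = z_test_power(delta=1, sigma=8, n=n)
    print(n, round(power, 3), round(1 - power, 3))  # n, power, beta
```

With everything else fixed, power rises and β falls as n grows. When the true difference is zero, the rejection probability equals α.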
DETERMINATION OF SIZE OF SAMPLE
• Nature of universe
• Nature of study
• Type of sampling
• Standard of accuracy and acceptable confidence level
• Availability of finance
• Other considerations
Approaches for determining the size of the sample
• To specify the precision of estimation desired and then to determine the
sample size necessary to ensure it.
• To use Bayesian statistics to weigh the cost of additional information
against the expected value of the additional information.
DETERMINATION OF SAMPLE SIZE THROUGH THE APPROACH BASED ON PRECISION
RATE AND CONFIDENCE LEVEL
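Acceptable Error
Infinite Population
• The confidence interval for µ is $\bar{X} \pm z \frac{\sigma_p}{\sqrt{n}}$.
Taking the precision as equal to ‘e’: $e = z \frac{\sigma_p}{\sqrt{n}}$, hence $n = \frac{z^2 \sigma_p^2}{e^2}$.
Finite Population
• In case of finite population the confidence interval for µ is given by
the formula: $\bar{X} \pm z \frac{\sigma_p}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$
• In the confidence interval for µ, the precision is taken as equal to ‘e’:
$e = z \frac{\sigma_p}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$
Determining ‘n’
$n = \frac{z^2 N \sigma_p^2}{(N - 1) e^2 + z^2 \sigma_p^2}$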
where
N = size of population
n = size of sample
e = acceptable error (the precision)
σp = standard deviation of population
z = standard variate at a given confidence level.
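A short sketch of the precision-based approach, using the symbols defined above. It assumes the standard formulas n = z²σp²/e² for an infinite population and n = z²Nσp²/((N − 1)e² + z²σp²) for a finite one. The function names are hypothetical, and the functions return the unrounded value so that the rounding choice stays visible.

```python
def n_mean_infinite(z, sigma_p, e):
    """Sample size for estimating a mean, infinite population."""
    return (z ** 2) * (sigma_p ** 2) / (e ** 2)


def n_mean_finite(z, sigma_p, e, N):
    """Sample size for estimating a mean, finite population of size N."""
    numerator = (z ** 2) * N * (sigma_p ** 2)
    denominator = (N - 1) * (e ** 2) + (z ** 2) * (sigma_p ** 2)
    return numerator / denominator


# Illustrative inputs: N = 5000, sigma_p = 2, e = 0.8, z = 2.57 (99% level).
finite = n_mean_finite(z=2.57, sigma_p=2, e=0.8, N=5000)
infinite = n_mean_infinite(z=2.57, sigma_p=2, e=0.8)
print(round(finite, 2), round(infinite, 2))
```

When N is large relative to n, the finite-population correction changes the answer very little, which is why the two values are close.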
Solve
• Determine the size of the sample for estimating the true weight of the
cereal containers for the universe with N = 5000 on the basis of the
following information:
(1) the variance of weight = 4 ounces on the basis of past records.
(2) the estimate should be within 0.8 ounces of the true average weight
with 99% probability.
Will there be a change in the size of the sample if we assume an infinite
population in the given case? If so, explain by how much.
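Answer
• N = 5000;
• σp = 2 ounces (since the variance of weight = 4 ounces);
• e = 0.8 ounces (since the estimate should be within 0.8 ounces of the true
average weight);
• z = 2.57 (as per the table of area under normal curve for the given
confidence level of 99%).
For the finite population:
$n = \frac{z^2 N \sigma_p^2}{(N - 1) e^2 + z^2 \sigma_p^2} = \frac{(2.57)^2 (5000) (2)^2}{(4999)(0.8)^2 + (2.57)^2 (2)^2} = \frac{132098}{3225.7796} \approx 40.95 \approx 41$
Hence, the sample size (n) = 41 for the given precision and confidence level in the above
question with a finite population.
If, as per the question, the population is assumed to be infinite, then:
$n = \frac{z^2 \sigma_p^2}{e^2} = \frac{(2.57)^2 (2)^2}{(0.8)^2} = \frac{26.4196}{0.64} \approx 41.28 \approx 41$
Thus, in the given case the sample size remains the same even if we assume an infinite population.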
Homework
• A hospital administrator wishes to estimate the mean weight of babies
born in her hospital. How large a sample of birth records should be taken
if she wants a 99 percent confidence interval that is 1 pound wide?
Assume that a reasonable estimate of σ is 1 pound. What sample size is
required if the confidence coefficient is lowered to .95?
• A physician would like to know the mean fasting blood glucose value
(milligrams per 100 ml) of patients seen in a diabetes clinic over the past
10 years. Determine the number of records the physician should examine
in order to obtain a 90 percent confidence interval for µ if the desired
width of the interval is 6 units and a pilot sample yields a variance of 60.
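(b) Sample size when estimating a percentage
or proportion:
If infinite population: $n = \frac{z^2 p q}{e^2}$, where p is the expected
proportion, q = 1 − p, e is the acceptable error and z is the standard
variate at the given confidence level.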
To Solve
• Suppose a certain hotel management is interested in determining the
percentage of the hotel’s guests who stay for more than 3 days. The
reservation manager wants to be 95 per cent confident that the percentage
has been estimated to be within ± 3% of the true value. What is the most
conservative sample size needed for this problem?
Solution:
• Population is infinite;
• e = .03 (since the estimate should be within 3% of the true value);
• z = 1.96 (as per table of area under normal curve for the given confidence level of 95%).
• As we want the most conservative sample size we shall take the value of p = .5 and q = .5.
• Using all this information, we can determine the sample size for the given problem as under:
$n = \frac{z^2 p q}{e^2} = \frac{(1.96)^2 (0.5)(0.5)}{(0.03)^2} = \frac{0.9604}{0.0009} \approx 1067.11$
• Thus, the most conservative sample size needed for the problem is = 1067.
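The proportion calculation can be checked with a short sketch, assuming the infinite-population formula n = z²pq/e². The function name `n_proportion_infinite` is hypothetical, and the value is returned unrounded so the reader can see what is being rounded.

```python
def n_proportion_infinite(z, p, e):
    """Sample size for estimating a proportion, infinite population."""
    q = 1 - p
    return (z ** 2) * p * q / (e ** 2)


# Hotel example: 95% confidence (z = 1.96), e = 0.03, most conservative p = 0.5.
n_hotel = n_proportion_infinite(z=1.96, p=0.5, e=0.03)
print(round(n_hotel, 2))

# p = 0.5 maximises p * q, so any other p gives a smaller n.
print(n_proportion_infinite(1.96, 0.2, 0.03) < n_hotel)
```

Taking p = 0.5 is the conservative choice because the product pq is largest there, so the resulting n is sufficient whatever the true proportion is.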
To solve 2
• A survey is being planned to determine what proportion of families in a
certain area are medically indigent. It is believed that the proportion
cannot be greater than .35. A 95 percent confidence interval is desired with
e = .05. What size sample of families should be selected?
Homework
• An epidemiologist wishes to know what proportion of adults living in a
large metropolitan area have subtype hepatitis B virus. Determine the
sample size that would be required to estimate the true proportion to
within .03 with 95 percent confidence. In a similar metropolitan area the
proportion of adults with the characteristic is reported to be .20. If data
from another metropolitan area were not available and a pilot sample
could not be drawn, what sample size would be required?