Biostatistics Module
All content following this page was uploaded by Avijit Hazra on 30 September 2016.
Key Words: Effect size, power, sample size, Type 1 error, Type 2 error
a subset of the population could answer the research question.

Studies that cover entire populations go by the generic term “census.” In India, we have a decennial census, held once every 10 years. This is only a demographic and socioeconomic census that aims to capture data on a limited range of demographic, social, and economic indicators. However, each and every Indian citizen has to be covered. This makes it a huge exercise and necessitates that the Government of India maintain an elaborate machinery called the Office of the Registrar General and Census Commissioner under the Ministry of Home Affairs (popularly called the Census Bureau). This census, though it aims to capture only limited data, is such an elaborate affair that, by the time all the data have been collected, collated, processed, and analyzed, it is almost time for the next census. Hence, what will the Indian Government do if it requires some quick answers, such as the immunization coverage in a particular district or the malnutrition prevalence in a particular region? For this, the government has to maintain another machinery called the National Sample Survey Organization (NSSO), now known as the National Sample Survey Office, under the Ministry of Statistics. The NSSO conducts periodic surveys, not of the entire country’s population but of a sample of randomly selected “NSSO blocks,” to provide answers that are generally available in a 2–3 year timeframe. It routinely collects data on several socioeconomic, health, industrial, agricultural, and price indicators.

Most biomedical researchers will never have the luxury of conducting a census but will have to depend on a population subset, called the sample, to seek reasonably valid answers to their research questions. Look at Figure 1, where the ellipse represents a population. Suppose a researcher draws a sample represented by the first circle (Sample 1) to answer a research question. Another researcher may be trying to answer the same research question using another sample (Sample 2) represented by the second circle. Obviously, the two circles vary in their size (radius) and location (center). The situation of the first researcher is akin to a group studying the situation in a particular region of India, while the second may be that of a group with better manpower and funding who are trying to cover all the regions of the country. It stands to reason that we will accept the results of the second group as representative of the entire country. We might also accept the results of the first group as representative of the entire country if there is not much reason to suspect large regional differences. However, it is unlikely that we will accept the results of a third group, who have sampled only a tiny circle, say of a single town (Sample 3), as representing the entire country. Thus, whenever we work with samples rather than populations, it is important to ensure that the sample is optimally sized and adequately representative of the entire population. It does not matter what type of study we are doing: in a clinical trial, laboratory experiment, field survey, or quality control exercise, these two issues, sample size and sampling, are of paramount importance. If we do not get these right, our sample results are not generalizable to the population we intended to study, and all our efforts will be in vain.

“How much is enough?” is often a question that plagues researchers and clinicians alike. While sample size calculations immediately bring to mind complex formulas, the aim of this module is not to present a pantheon of fearsome formulas, but rather to familiarize the reader with the principles through a few representative formulas with solved examples. Estimation of the minimum sample size required for any study can have technical variations, but the concepts underlying most methods are similar. These concepts are important as they enable researchers to draw strong (valid and robust) conclusions with a minimum number of research participants. It is also important to remember that whatever the formulas used, small differences in selected parameters (described below) can lead to large differences in the calculated sample size. Thus, any sample size calculation, however carefully done, will always remain approximate. In most studies, there is a primary question that the researcher wishes to answer. Sample size calculations are based on this primary objective. Finally, before locking the sample size to work with, one must take into account available funding, manpower, logistics, and, most important of all in clinical studies, the ethics of subjecting human participants to potentially harmful interventions.
needed to detect a clinically relevant treatment effect. As a general rule, the greater the variability in the outcome variable, the larger the sample size needed to assess whether the observed effect (that seen when the study is completed) is a true effect.

Here, we will discuss principles of sample size calculation for two-group randomized controlled trials (RCTs). The calculation of sample size for RCTs requires that an investigator specify certain factors outlined below. Broadly, as we have seen earlier in this series, data can be categorized as numerical (quantitative) and categorical (qualitative) data. For the former, information on the mean responses in the two groups, µ1 and µ2, is required, as are the standard deviations (SDs) or a common (pooled) SD for the two groups. For categorical data, information on proportions of successes in the two groups, p1 and p2, is needed. Such information is usually obtained from published literature or a pilot study, or is at times “guesstimated.” The other two key components are the Type 1 (alpha) and Type 2 (beta) error probabilities. Apart from this, an understanding of whether the data are normally distributed (following the Gaussian curve) or otherwise is required. Moreover, an understanding of the null hypothesis is useful at this stage. The null hypothesis states that the two groups being compared are not different, while the alternate hypothesis is that the two groups are actually different.

The four elements that enter into sample size calculation formulas:
• Effect size (d or δ): The size of the effect that is clinically worthwhile to detect, that is, the smallest difference that is clinically meaningful. For numerical data, this is the difference between µ1 and µ2, the anticipated outcome means in the two groups, while for categorical data, it is represented by p1 and p2, the proportions of successful outcomes in the two groups. The effect size represents a clinically meaningful difference in the sense that it may make the physician change his practice. As stated earlier, choice of this clinically meaningful difference can be based on existing literature or a pilot study. In the case of numerical data, the ratio of effect size to SD has been called the “standardized difference” or the “standardized effect.”
• Type 1 error: The probability of falsely rejecting a null hypothesis when it is true (α). This represents the false positive error and can be regarded as the probability of finding a difference between two groups where none exists. It is also called the regulator’s error, as regulatory decision making takes place based on results of these comparisons. Note that the α error is akin to the significance level of a study. Conventionally, the value of α is set at 5% or lower. The lower the value, the larger would be the sample size.
• Type 2 error: The probability of failing to reject a false null hypothesis (β). This represents the false negative error and is the probability of not finding a difference between the two groups studied when one actually exists. It is also called the investigator’s error. The value of β is conventionally set at 20% or lower. The lower the value, the larger would be the sample size. Since β error is the inability to detect a difference, it follows that (1 − β) is the ability to detect a difference should one actually exist, and this is referred to as the “power” of a study. A study must have at least 80% power to detect a difference. A β value of 10% will confer 90% power to the study. Note again that the relationship between α and β is reciprocal. If one tries to lower α, the value of β will go up, unless one expands the sample size. The only way to achieve zero α and zero β errors is to work with an infinite sample size, which is not possible. Selection of α and β error values will in turn lead to selection of the standard normal deviates, the Zα and Zβ values, that are actually entered into the formulas [Table 1]. The formulas incorporate a factor (Zα + Zβ)², which has been referred to as the power index.
• Standard deviation (SD) of the outcome measure of interest in the underlying population (SD or σ): This is the variability or spread associated with quantitative data. The larger the variability, the larger would be the sample size required to attain a given power at the chosen level of significance. In many cases, although the SD is not exactly known, one has a rough estimate, and the sample size may be calculated based on the maximum variance that is likely. If the variance in the observed data is smaller, the study will attain higher power. If the SD is completely unknown, the solution is to conduct a pilot study to obtain an estimate of the variance of the outcome measure.

Although it is often assumed that study groups would have the same variance, this is not always the case. If the variance of the outcome measure in question varies widely between the different groups, a “pooled” SD value has been used. This can be calculated for two groups as:

$$SD_{pooled} = \sqrt{\frac{(n_1 - 1)\,SD_1^2 + (n_2 - 1)\,SD_2^2}{n_1 + n_2 - 2}}$$

where n1 and n2 are the sizes of the two groups and SD1 and SD2 are their respective SDs.

Table 1: Commonly used standard normal deviate values used in sample size calculations

Deviate   Direction of testing   α or β       Value
Zα        Two-sided              α = 0.05     Zα = 1.960
Zα        Two-sided              α = 0.025    Zα = 2.241
Zα        Two-sided              α = 0.01     Zα = 2.576
Zα        One-sided              α = 0.05     Zα = 1.645
Zα        One-sided              α = 0.025    Zα = 1.960
Zα        One-sided              α = 0.01     Zα = 2.326
Zβ                               β = 0.20     Zβ = 0.842
Zβ                               β = 0.10     Zβ = 1.282
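The Z values in Table 1 need not be looked up; they can be recovered from the standard normal distribution. The following is a minimal sketch, assuming Python 3.8 or later and its standard-library `statistics.NormalDist`. The function names `z_alpha`, `z_beta`, and `power_index` are illustrative and do not come from the article.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, SD 1)
_STD_NORMAL = NormalDist()


def z_alpha(alpha: float, two_sided: bool = True) -> float:
    """Standard normal deviate for a Type 1 error probability alpha.

    For a two-sided test, alpha is split equally between the two tails.
    """
    tail = alpha / 2 if two_sided else alpha
    return _STD_NORMAL.inv_cdf(1 - tail)


def z_beta(beta: float) -> float:
    """Standard normal deviate for a Type 2 error probability beta (power = 1 - beta)."""
    return _STD_NORMAL.inv_cdf(1 - beta)


def power_index(alpha: float, beta: float, two_sided: bool = True) -> float:
    """The factor (Z_alpha + Z_beta)^2 that enters the sample size formulas."""
    return (z_alpha(alpha, two_sided) + z_beta(beta)) ** 2


if __name__ == "__main__":
    print(round(z_alpha(0.05), 3))                   # two-sided, alpha = 0.05
    print(round(z_alpha(0.05, two_sided=False), 3))  # one-sided, alpha = 0.05
    print(round(z_beta(0.10), 3))                    # 90% power
    print(round(power_index(0.05, 0.10), 2))         # power index at 5% and 90%
```

Because the power index is squared, moving from 80% to 90% power at the same α raises it by roughly a third, which is why the choice of β has such a visible effect on the calculated sample size.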
survival value of 50% in the angioplasty and 65% in the bypass group. At 5% significance and 90% power, how many patients would be needed to detect a difference between the two groups?

In this example, we have:
• Z value (two-sided) related to the probability of falsely rejecting a true null hypothesis (α) = 0.05: Zα = 1.96
• Z value related to the probability of failing to reject a false null hypothesis (β) = 0.10, that is, 90% power: Zβ = 1.282
• Proportion of success in one group (p1) = 0.65
• Proportion of success in the other group (p2) = 0.50.

Therefore, δ will be (0.65 − 0.50)/√(0.575 × 0.425) = 0.15/0.49434 = 0.30343.

Substituting in the formula, we get:

$$N = \frac{4 \times (1.96 + 1.282)^2}{(0.30343)^2} = 456.63$$

Thus, rounding up to the next even number, 458 patients overall or 229 per group are needed (assuming groups to be of equal size) to conduct the study.

Note that an alternative version of the formula given above is:

$$N = 2 \times (Z_\alpha + Z_\beta)^2 \times \frac{p_1(1 - p_1) + p_2(1 - p_2)}{(p_1 - p_2)^2}$$

In this version, we can directly accommodate the two proportions.

Alternatives to General Formulas
Beyond working with formulas by hand, sample size estimations can also be done using tables based on these general formulas, quick formulas, graphical methods (nomograms), and software.

Quick formulas
Various simplified versions of the general formulas have been proposed to enable rapid sample size calculations for standard situations. For instance, Lehr’s formula can be used for quick calculation of sample size for comparison of two equal-sized groups for power of 80% and a two-sided significance level of 0.05.

The required size of each equal-sized group is 16/(standardized difference)². For power of 90%, the numerator becomes 21.

However, if the standardized difference is small, this formula tends to overestimate sample size in comparison to the general formula presented above.

Recall that standardized difference is the ratio of effect size to SD. For the unpaired t-test, the standardized difference is calculated simply as δ/σ. Jacob Cohen, in his 1988 book (Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed., Hillsdale, NJ: Lawrence Erlbaum Associates; 1988), dealt elaborately with effect sizes and standardized differences and proposed thumb rules for these. They have been widely used in psychology, but many feel that predetermined standardized differences are too restrictive.

Nomograms
Nomograms offer a graphical alternative method of sample size calculation. They are cleverly designed on the basis of general formulas. Altman’s nomogram, devised by Doug Altman and published in his 1991 book (Altman DG. Practical statistics for medical research. London: Chapman and Hall, 1991), is a popular graphical method that enables sample size estimations for paired and unpaired t-tests and the Chi-square test. You can easily download a copy from the internet and try it. To use the nomogram, we first need to translate our required difference into a standardized difference.

Various other nomograms have been devised. For studies of diagnostic tests, Malhotra and Indrayan have proposed a convenient nomogram for sample size based on anticipated sensitivity/specificity and estimated prevalence (Malhotra RK, Indrayan A. A simple nomogram for sample size for estimating sensitivity and specificity of medical tests. Indian J Ophthalmol 2010;58:519‑52). The Schoenfeld and Richter nomograms give sample size for detecting difference in median survival between two treatment groups in survival analysis (Schoenfeld DA, Richter JR. Nomograms for calculating the number of patients needed for a clinical trial with survival as an endpoint. Biometrics 1982;38:163‑70). There are other nomograms for other study designs.

Software
An understanding of the principles helps in inputting the desired information into software and quickly getting the calculations done. Many software packages are available that provide sample size calculation routines. Some come as part of larger statistical packages, while some are standalone power and sample size software. Power and Sample Size Calculation (PS) is free software (current version 3.1.2; 2014) developed by the Department of Biostatistics, Vanderbilt University, USA, that provides sample size routines for both interventional and observational studies. It can be downloaded (http://biostat.mc.vanderbilt.edu/wiki/main/powersamplesize), installed, and used without restrictions. Power Analysis and Sample Size Software (PASS) is a comprehensive commercial package developed by NCSS, a company based in Kaysville, Utah, USA. It is regarded as an industry standard and provides sample size calculations for over 650 statistical tests and confidence interval (CI) scenarios (http://www.ncss.com/software/pass/). nMaster (current version 2.0; 2011), developed by the Department of Biostatistics, Christian Medical College, Vellore, India, is a very affordable
package that also provides a large number of sample size routines required for most academic studies. It is important to remember, however, that sample size calculations are prone to errors and, even with software, it is usually good practice to get these calculations verified by someone experienced in the field before embarking upon the study.

Adjustments to Calculated Sample Size
Once a “raw” sample size is calculated, various adjustments may be needed to accommodate variations in study objectives and to keep an adequate safety margin for potential attrition. Thus, adjustments may be needed for multiple outcomes, unequal group sizes, dropouts, planned subgroup analysis, cluster sampling, and so on. For instance, it is important that we calculate sample size on the basis of a single primary outcome. If we have multiple primary outcomes, there is no way out other than to calculate sample sizes separately for each of these outcomes and then work with the largest so that the rest are covered as well. Other common adjustments are discussed below.

Adjusting for Unequal Sized Groups
The methods described above assume that the comparison is to be made across two equal-sized groups. However, this may not be the case in practice, for example in an observational study or in an RCT with unequal randomization. In this case, it is possible to adjust the numbers to reflect this inequality. The first step is to calculate the total (across both groups) sample size assuming that the groups are equal sized. This total sample size (N) can then be adjusted according to the actual ratio of the two groups (k), with the revised total sample size (N') being:

$$N' = \frac{N(1 + k)^2}{4k}$$

For instance, consider that a placebo-controlled trial requires a total sample size of 120 if the two groups are of equal size. If it is decided that twice as many subjects would be randomized to the active treatment group as to the placebo group, the new sample size N' would be:

$$N' = \frac{120 \times (1 + 2)^2}{4 \times 2} = \frac{120 \times 9}{8} = 135$$

Thus, 135/3 = 45 patients need to be allocated to the placebo group and 135 × 2/3 = 90 to the active treatment group.

Adjusting for Consent Refusals, Withdrawals, and Missing Data
Any sample size calculation is based on the total number of participants who are needed in the final analysis. In practice, eligible subjects will not always be willing to take part or to continue, and it will be necessary to approach more subjects than are needed in the first instance. In addition, even in the very best designed and conducted studies, it is unusual to finish with a data set in which complete data are available for every subject. Subjects may fail to turn up or refuse to be examined, or their samples may be lost. In studies involving long follow-up, there will always be a substantial degree of attrition. It is therefore often necessary to estimate the number of subjects that need to be approached to achieve the final desired sample size.

The adjustment may be done as follows. If a total of N subjects is required in the final study, but a proportion (q) are expected to refuse to participate or to drop out before the study ends, then the total number of subjects (N'') who would have to be approached at the outset to ensure that the final sample size is achieved is:

$$N'' = \frac{N}{1 - q}$$

Thus, if 135 is the estimated total sample size required and a maximum of 20% of recruited subjects are expected to drop out before the study ends, the recruitment target would be:

$$N'' = \frac{135}{1 - 0.2} = \frac{135}{0.8} = 168.75$$

Thus, 169 eligible subjects would need to be recruited in order that at least 135 subjects complete the study.

Using the formula 100/(100 − x), where x represents the estimated maximum dropout fraction in percentage, one can easily derive the following set of correction factors [Table 2] that may be applied to the total sample size to derive the screening or recruitment target sample size.

As with other aspects of sample size calculations, the proportion of eligible subjects who will refuse to participate or who will drop out will be unknown at the outset of the study. However, good estimates will often be possible using information from similar studies in comparable populations or from an appropriate pilot study. Note that it is particularly important to account for dropouts in the budgeting of studies in which initial recruitment costs or treatment costs are likely to be high.

Table 2: Correction factors (to total sample size) for different dropout fractions

Estimated maximum        Total sample size
dropout fraction (%)     to be multiplied by
 5                       1.05
10                       1.11
15                       1.18
20                       1.25
25                       1.33
30                       1.43
35                       1.54
40                       1.67
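The worked examples in the preceding sections (the two-proportion calculation, Lehr’s quick formula, the adjustment for unequal groups, and the inflation for dropouts) can be chained in a short script. This is a minimal sketch, assuming Python 3.8 or later; all function names are hypothetical and chosen only for illustration. It uses exact standard normal deviates rather than the rounded values of Table 1, so the raw total differs slightly in the decimals from the hand calculation.

```python
import math
from statistics import NormalDist


def power_index(alpha: float, beta: float) -> float:
    """(Z_alpha + Z_beta)^2 for a two-sided alpha."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(1 - beta)
    return (z_a + z_b) ** 2


def total_n_two_proportions(p1: float, p2: float,
                            alpha: float = 0.05, beta: float = 0.10) -> float:
    """Total N (both groups combined) from N = 4 (Za + Zb)^2 / delta^2,
    where delta is the standardized difference between the proportions."""
    p_bar = (p1 + p2) / 2
    delta = abs(p1 - p2) / math.sqrt(p_bar * (1 - p_bar))
    return 4 * power_index(alpha, beta) / delta ** 2


def lehr_per_group(standardized_difference: float, power: float = 0.80) -> float:
    """Lehr's quick formula: size of EACH group, two-sided alpha = 0.05.
    Only the two numerators quoted in the text (16 and 21) are supported."""
    numerators = {0.80: 16, 0.90: 21}
    if power not in numerators:
        raise ValueError("Lehr's formula is given only for 80% or 90% power")
    return numerators[power] / standardized_difference ** 2


def adjust_for_unequal_groups(n_total: float, k: float) -> float:
    """Revised total N' = N (1 + k)^2 / (4k) for an allocation ratio k."""
    return n_total * (1 + k) ** 2 / (4 * k)


def adjust_for_dropout(n_total: float, q: float) -> float:
    """Recruitment target N'' = N / (1 - q) for an expected dropout proportion q."""
    return n_total / (1 - q)


if __name__ == "__main__":
    # Angioplasty versus bypass example: p1 = 0.65, p2 = 0.50, 5% alpha, 90% power
    raw_total = total_n_two_proportions(0.65, 0.50)
    print("Raw total N:", round(raw_total, 2))

    # Placebo-controlled trial example: N = 120, 2:1 allocation
    unequal_total = adjust_for_unequal_groups(120, 2)
    print("Total N with 2:1 allocation:", unequal_total)

    # Dropout example: 135 completers needed, 20% expected dropout
    target = adjust_for_dropout(135, 0.20)
    print("Recruitment target:", math.ceil(target))
```

Each adjustment is kept as a separate function so that it can be applied, or skipped, in the order a particular study requires. The final figure is rounded up only once, at the end, since rounding at every step would inflate the target unnecessarily.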
Other Considerations in Sample Size Determination
There are some additional issues in sample size determination that are not often discussed, but nevertheless may be relevant in certain situations and therefore merit attention.

First, the above approaches to sample size have assumed that a simple random sample is the sampling design. More complex designs, like stratified random samples or cluster samples, must take into account the variances of subpopulations, strata, or clusters before an estimate of the variability in the population as a whole can be made.

A second consideration is the extent of the analysis that is planned to be performed. If descriptive summaries and simple inferential statistics are planned, then almost any sample size is good enough. On the other hand, larger samples are required if multivariate analyses such as multiple regression, analysis of covariance, logistic regression, log-linear analysis, and Cox’s proportional hazards analysis are planned for more rigorous assessment of the combined effect of multiple variables. Methods are still evolving to estimate optimum sample size for such multivariate techniques. In studies with “adaptive design,” the sample size may keep changing as the study progresses, based on results obtained. These calculations are quite complex.

Finally, an adjustment in the sample size may be needed to accommodate a comparative analysis of subgroups. There are various suggestions but no hard and fast recommendations toward this. Even if the total sample is relatively large, skewed distribution of the variables in question can result in erroneous conclusions on subgroup analysis. The safest approach is to plan all subgroup analyses beforehand and ensure that the likely subgroup sample size would have adequate power to demonstrate a clinically important effect on comparison of subgroups.

The sample size for surveys requires different considerations. This is highlighted in Box 1.

Precision versus Effect Size
Frequently, studies are designed to yield an estimate of some parameter of interest, for example, the mean of a continuous variable, a proportion, or a difference in proportions. A difference in proportions can be measured as a raw difference, a relative risk, an odds ratio, and in other ways. In each case, however, the precision of the final estimate can be measured by the width of the 95% CI. The larger the sample size used to calculate the final CI, the narrower the CI will be. Therefore, the desired final CI width can be used, instead of the desired effect size, to determine the appropriate sample size for a study. Since the predicted precision or CI width depends mostly on the variance of the data and much less on the effect size, a study can be planned to yield a given precision or CI width without choosing a likely effect size.

When designing a study to yield a CI with a given width, the general technique is to choose a sample size so that the “average” resulting CI will have the given width. It is important to realize that, given such a sample size, there is a 50-50 chance that the final width will be narrower or wider than the desired width, even if the estimate used for the variance of the outcome data is accurate. Alternatively, one can choose a sample size so that there is a defined probability (e.g., 0.80 or 0.95) that the final CI width is below a given value. This calculation is more analogous to a typical power calculation, but is more complex and not generally part of currently available sample size software.

Post hoc Power Analysis
A power calculation needs to be done before a study is initiated to determine the appropriate sample size. A power calculation can always be done once the study data have been generated, but it is not good, and rather unfair, statistical practice to tweak parameters of the a priori power analysis on the basis of post hoc power analysis.

How can one interpret data from a negative study in which no power calculation was initially performed? Although tempting, performing a post hoc power analysis to estimate the effect size that could have been found with the actual sample size and with a given power is inappropriate and should not be done. The correct approach to such data is to calculate the 95% CI for the outcome of interest, based on the final data, and use this interval to guide interpretation of the study results. Incidentally, a CI width-based power and sample size calculation can also be done before initiating the study. This is routinely done in cohort, case–control, and other epidemiological studies.

What if There is No Choice About Sample Size
Finally, before we close this chapter, let us address this common dilemma. Often, a study has a limited budget, and this curtails the possibility of a “comfortably” large sample. It is hard to argue with a budget. It is equally unwelcome to give up (the aptly called “terminator” approach) and say that the study cannot be done. The practical alternative is to realize that, by tweaking aspects of the sample size calculation, it may still be possible to execute the study within the constraints of a restricted budget and therefore the imposed sample size. It may be possible to raise the effect
size, while still keeping it within a clinically plausible range. Perhaps a better instrument can be found that will reduce the variability in the measurements. It may even be feasible to make modifications to the study design (e.g., judicious stratification) that will further reduce the variance of the estimator. As an alternative, we may consider networking with other sites and investigators willing to carry out the same study in collaboration (the “Spiderman” approach). As a last resort, we forget about the limited sample size and concentrate instead on doing the study well (the Nike-like “just do it” approach). In this era of systematic reviews and meta-analysis, even if our relatively small study cannot achieve sufficient statistical power, as part of a sequence of studies, it may still add enough muscle to arrive at a pooled conclusion and contribute to the body of evidence-based medicine.

Conclusion
Determining the appropriate sample size for a study is a fundamental aspect of ethical research. Performing a valid sample size calculation requires estimates of the permissible Type 1 error rate (α) and power, variability in the data, the effect size sought, as well as a planned method of analysis. Although the concepts underlying power analysis and sample size determination are relatively simple, the large number of different study designs and analysis methods results in a bewildering number of different sample size formulas. Therefore, the use of power and sample size software is helpful and is gaining widespread usage.
Financial support and sponsorship
Nil.

Conflicts of interest
There are no conflicts of interest.

Further Reading
1. Julious SA. Sample sizes for clinical trials with normal data. Stat Med 2004;23:1921-86.
2. Devane D, Begley CM, Clarke M. How many do I need? Basic principles of sample size estimation. J Adv Nursing 2004;47:297-302.
3. Karlsson J, Engebretsen L, Dainty K, ISAKOS scientific committee. Considerations on sample size and power calculations in randomized clinical trials. Arthroscopy 2003;19:997-9.
4. Statistical power and sample size. In: Cleophas TJ, Zwinderman AH, Cleophas TF, Cleophas EP. Statistics applied to clinical trials. 4th ed. Amsterdam: Springer; 2009. p. 81-97.
5. Calculation of required sample size. In: Kirkwood BR, Sterne JA. Essential medical statistics. 2nd ed. Oxford: Blackwell Science; 2003. p. 413-28.