Sample Size Calculation
Since the era of evidence-based medicine, it has become a matter of course to use statistics to create objective evidence in clinical research. As an extension of this, it has become essential in clinical research to calculate the correct sample size to demonstrate a clinically significant difference before starting the study. Also, because sample size calculation methods vary from study design to study design, there is no formula for sample size calculation that applies to all designs. It is very important for us to understand this. In this review, each sample size calculation method suitable for various study designs was introduced using the R program (R Foundation for Statistical Computing). In order for clinical researchers to directly utilize it according to future research, we presented practice codes, output results, and interpretation of results for each situation.

Keywords: Sample size, Effect size, Continuous outcome, Categorical outcome

Received February 20, 2023
Revised March 12, 2023
Accepted March 12, 2023

Corresponding author
Hae In Bang
Department of Laboratory Medicine, Soonchunhyang University Seoul Hospital, 59 Daesagwan-ro, Yongsan-gu, Seoul 04401, Korea
E-mail: [email protected]
ORCID: https://orcid.org/0000-0001-7854-3011
Youngho Park
Department of Big Data Application, College of Smart Interdisciplinary Engineering, Hannam University, 70 Hannamro, Daedeok-gu, Daejeon 34430, Korea
E-mail: [email protected]
ORCID: https://orcid.org/0000-0002-7096-3967

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright © The Korean Society of Endo-Laparoscopic & Robotic Surgery.
relevant treatment effect. The fundamental reason for calculating the number of subjects in the study can be divided into the following three categories [1,2].

Economic reasons

In clinical studies, if the sample is not large enough, statistical significance may not be found even if an important relationship or difference exists. In other words, the study may lack the power to detect the effect and therefore cannot be concluded successfully. Conversely, when a study is based on a very large sample, small effect differences may be considered statistically significant and lead to clinical misjudgment. Either way, the research may not be successful, and the result is a waste of money, time, and resources.

Ethical reasons

Oversized studies are likely to include more subjects than the study actually needs, exposing unnecessarily many subjects to potentially harmful or futile treatments. Similarly, in undersized studies, ethical issues may arise in that subjects are unnecessarily exposed to a study that has a low chance of success.

Scientific reasons

If a negative result is obtained after conducting a study, it is necessary to consider whether the sample size of the study was sufficient or insufficient. First, if the study was conducted with sufficient sample size, it can be interpreted that there is no clinically significant effect. However, if the study was conducted with insufficient sample size, clinically meaningful effects that truly exist may be missed. Notice that not being able to reject the null hypothesis does not mean that it is true; it means that we do not have enough evidence to reject it.
Additionally, calculating sample size at the study design stage, when receiving ethics committee approval, has become a requirement rather than an option. As a result, calculating the optimal sample size is an important process that must be done at the design stage before a study is conducted in order to ensure the validity, accuracy, reliability, and scientific and ethical integrity of the study.

COMPONENTS OF SAMPLE SIZE CALCULATION

Appropriate sample size usually depends on the statistical hypotheses made with the study’s primary outcome and the study design parameters. The six basic statistical concepts that must be considered when estimating the sample size are as follows.

Study design

There are various research designs [3] in clinical research, but the most commonly used is the parallel design. A crossover design [4,5] can be used in studies where it is difficult to recruit enough subjects. A crossover design requires fewer subjects than a parallel design, but it is complex and must satisfy several conditions, so an appropriate study design should be selected according to the purpose of the study.
Parallel design. Group A receives only treatment A and group B receives only treatment B.
Crossover design. One group receives treatment A first and then treatment B, and the other group receives treatment B and then treatment A. Therefore, it is important to have an appropriate wash-out period at the time of treatment change.

Null and alternative hypotheses testing

When establishing statistical hypotheses in research, two hypotheses are always required: the null hypothesis (H0) and the alternative hypothesis (H1 or Hα). The two hypotheses must be mutually exclusive statements. The null hypothesis (H0) usually states the opposite of what the researcher claims in the study and is set up to be rejected; that is, it includes ‘no difference.’ Conversely, the alternative hypothesis (H1) is the statement in which the researcher proposes a potential outcome, and it includes ‘there is a difference.’ There are also different types of hypothesis testing problems, depending on the purpose of the study. As shown in Table 1, the hypotheses are established according to whether the test is for equality, equivalence, superiority, or non-inferiority. Let μS = mean of standard treatment, μT = mean of new treatment, δ = the minimum clinically important difference, and δNI = the non-inferiority margin.
Test for equality. To determine whether a clinically meaningful difference or effect exists (δ = 0).
Test for equivalence. To demonstrate the difference between the new treatment and standard treatment has no clinical impor-

Table 1. Types of hypothesis testing
Test for           Null hypothesis (H0)    Alternative hypothesis (H1)
Equality           μT – μS = 0             μT – μS ≠ 0
Equivalence        |μT – μS| ≥ δ           |μT – μS| < δ
Superiority        μT – μS ≤ δ             μT – μS > δ
Non-inferiority    μT – μS ≤ –δNI          μT – μS > –δNI
www.e-jmis.org
12 Suyeon Park et al.
Interim analysis

In confirmatory trials, interim analyses, whether planned or unplanned, are sometimes performed. Because the false positive rate increases with the number of interim analyses, the inflation of the type I error should be taken into account when calculating the number of subjects.

Sample size for survival time

In survival analysis, the outcome variable is the time until a specific event such as death occurs; for each subject, whether or not the event occurred and the time from the start of the clinical trial to the occurrence of the event (or censoring) are used as outcome variables. In particular, the power of survival analysis is a function of the number of events and generally increases with a shorter recruitment period (T0) and a longer total follow-up period (T). Let λ1 and λ2 be the hazard rates of the two groups; the formula for calculating the number of subjects is [10]:

$$n = \frac{\left(Z_{1-\alpha/2} + Z_{1-\beta}\right)^{2}\left[\phi(\lambda_1) + \phi(\lambda_2)\right]}{(\lambda_1 - \lambda_2)^{2}},$$

where

$$\phi(\lambda) = \frac{\lambda^{2}}{1 - e^{-\lambda T}} \quad \text{or} \quad \phi(\lambda) = \frac{\lambda^{2}}{1 - \left[e^{-\lambda (T - T_0)} - e^{-\lambda T}\right] / (\lambda T_0)},$$

and Z_p denotes the p-th quantile of the standard normal distribution.

However, for studies with relatively low event rates and high censoring, the following sample size formula using only event rates can be used:

$$n = \frac{2\left(Z_{1-\alpha/2} + Z_{1-\beta}\right)^{2}}{\left[\ln(\lambda_1 / \lambda_2)\right]^{2}}.$$

HOW TO CALCULATE THE SAMPLE SIZE?

Using the 17 tests in Table 3, which are widely used in research, we show examples using R version 4.1.2 (R Foundation for Statistical Computing; ‘pwr’, ‘exact2x2’, and ‘WebPower’ [11] packages), a freely available program. When using R, you first need to install the package that contains the function you want to use. After loading the package with the ‘library()’ function, you can call that function. More details are explained through the examples below.
All examples assume a parallel group design. A two-tailed test with a significance level of 0.05 and a power of 80% was established. The dropout rate is different for each research field, but
here we will unify it at 20%. For nonparametric tests on continuous variables, as a rule of thumb [12], calculate the sample size required for parametric tests and add 15%. Effect size can be defined as ‘a standardized measure of the magnitude of the mean difference or relationship between study groups’ [13]. In other words, an index that divides the effect size by its dispersion (standard deviation, etc.) is not affected by the measurement unit and can be used regardless of the unit, and is called an ‘effect size index’ or ‘standardized effect size.’ Cohen introduced intuitive benchmarks of small, medium, and large effect sizes for easy understanding [14]. However, since the value presented by Cohen may vary depending on the population or distribution of the variable, there may be limitations in using it as an absolute value. When estimating the number of subjects, effect sizes (such as Cohen’s d, r, or the relative ratio) should be calculated using parameter information (mean difference [MD] and standard deviation [SD]) found in the literature relevant to each primary outcome and entered as arguments to the function. Additionally, whether an effect size should be interpreted as small, medium, or large may depend on the analysis method. We use the guidelines mentioned by Cohen [14] and Sawilowsky [15] and use the medium effect size for each test in the examples below.

CONTINUOUS OUTCOME

When the primary outcome considered in the study is continuous data, the number of samples can be calculated using the ‘pwr’ package. You can compare the mean of a single group, two groups, or three or more groups, and Cohen’s d and f are used as the effect size. When applying this to your study, parameters can be taken from a previous or pilot study and the effect size computed using the effect size calculation formulas below.

Practice 1

The pwr.t.test() function (Supplementary data 1, Table 1) can be utilized with the ‘type’ argument for (1) one-sample t test (type = "one.sample"), (2) two-sample t test (type = "two.sample"), or (3) paired t test (type = "paired"). Cohen’s d is used as the effect size, and the size definitions [14,15] are as follows: very small (d = 0.01), small (d = 0.2), medium (d = 0.5), large (d = 0.8), very large (d = 1.2), and huge (d = 2). In our example, we will use the medium effect size (d = 0.5).

One-sample t test (Table 3, no. 1)

> install.packages("pwr")
> library(pwr)
> pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "one.sample", alternative = "two.sided")

Effect size calculation:

$$\text{Cohen's } d = \frac{\mu_1 - \mu_0}{SD}$$

where μ0 = mean under H0
μ1 = mean under H1
SD = SD under H0

Assuming a significance level of 0.05 and a power of 80% in a two-tailed test, the minimum number of subjects required to demonstrate statistical significance is 34 when the effect size d = 0.5. Considering the dropout rate of 20%, a total of 43 subjects are required.

Two-sample t test (Table 3, no. 2)

> library(pwr)
> pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "two.sample", alternative = "two.sided")

Effect size calculation [16]:

$$\text{Cohen's } d \text{ for Welch test} = \frac{\mu_1 - \mu_2}{SD_{pool}}$$

where μ1 = mean of group 1
μ2 = mean of group 2
SD1 = SD of group 1
SD2 = SD of group 2

$$SD_{pool} = \sqrt{\left(SD_1^{2} + SD_2^{2}\right)/2}$$

Assuming a significance level of 0.05 and a power of 80% in a two-tailed test, the minimum number of subjects required for each group to demonstrate statistical significance is 64 when the effect size d = 0.5. Considering a dropout rate of 20%, 80 subjects are required for each group, for a total of 160 subjects.
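As a cross-check of the one-sample and two-sample t test examples, the sketch below repeats the calculation with `power.t.test()` from base R's ‘stats’ package (an alternative to ‘pwr’, not the article's code). The helper `adjust_for_dropout()` and the summary statistics used to compute Cohen's d are hypothetical; rounding up after the dropout adjustment is an assumption made here to stay on the conservative side.

```r
# Hypothetical helper (not from the article): inflate a sample size for
# dropout and round up so the target is still met after losses.
adjust_for_dropout <- function(n, dropout = 0.2) {
  ceiling(n / (1 - dropout))
}

# With sd = 1, delta is the standardized difference (Cohen's d = 0.5).
n_one <- ceiling(power.t.test(delta = 0.5, sd = 1, sig.level = 0.05,
                              power = 0.8, type = "one.sample",
                              alternative = "two.sided")$n)
n_two <- ceiling(power.t.test(delta = 0.5, sd = 1, sig.level = 0.05,
                              power = 0.8, type = "two.sample",
                              alternative = "two.sided")$n)

# Cohen's d from summary statistics (all four values below are made up).
mean_1 <- 12.0; mean_2 <- 10.5
sd_1 <- 3.2;    sd_2 <- 2.8
sd_pool <- sqrt((sd_1^2 + sd_2^2) / 2)
d <- (mean_1 - mean_2) / sd_pool

cat("one-sample n:", n_one,
    "| after dropout:", adjust_for_dropout(n_one), "\n")
cat("two-sample n per group:", n_two,
    "| after dropout:", adjust_for_dropout(n_two), "\n")
cat("Cohen's d from the made-up summary statistics:", round(d, 3), "\n")
```

In practice the value of `d` computed from a pilot or published study would replace the benchmark 0.5 passed to the power function.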
Paired t test (Table 3, no. 3)

> library(pwr)
> pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "paired", alternative = "two.sided")

Effect size calculation [16]:
In the case of paired samples, if there is a correlation coefficient (r) between the variables before and after, SDpool can be calculated as

$$SD_{pool} = \frac{\sqrt{SD_1^{2} + SD_2^{2} - 2r\,SD_1 SD_2}}{\sqrt{2(1 - r)}}.$$

Assuming a significance level of 0.05 and a power of 80% in a two-tailed test, the minimum number of pairs required to demonstrate statistical significance is 34 when the effect size d = 0.5. Considering the dropout rate of 20%, a total of 43 pairs are required.

One-sample Wilcoxon test (Table 3, no. 5)

A total of 43 was calculated by the one-sample t test; adding 15% (43 × 1.15 = 49.45, rounded up) gives a total of 50.

One-way analysis of variance (ANOVA) (Table 3, no. 4)

> library(pwr)
> pwr.anova.test(k = 3, f = 0.25, sig.level = 0.05, power = 0.8)

Effect size calculation:

$$\text{Cohen's } f = \sqrt{\frac{\sum_{i=1}^{k} p_i\,(\mu_i - \mu)^{2}}{\sigma^{2}}}$$

where p_i = n_i/N is the proportion of subjects in group i, μ_i = mean of group i, μ = grand mean, σ² = within-group variance, and k = number of groups.

Assume a significance level of 0.05 and a power of 80%. When there are three comparison groups and the effect size is 0.25, the number of subjects calculated is 53 in each group. Considering a dropout rate of 20%, a total of 198 subjects are required, which is calculated as 66 per group.

Kruskal-Wallis test (Table 3, no. 8)

By one-way ANOVA, 66 subjects were calculated for each group; adding 15% to each group (66 × 1.15 = 75.9, rounded up to 76) gives a total of 228.
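The one-way ANOVA figure and the +15% rule of thumb for nonparametric tests can be reproduced with base R alone. The sketch below is an assumption-laden illustration rather than the article's code: `anova_power()` and `add_nonparametric_margin()` are hypothetical helper names, power is computed from the noncentral F distribution with noncentrality k·n·f² for a balanced design, and the 15% margin is applied to the dropout-adjusted figures of 43 and 66 and rounded up.

```r
# Power of a balanced one-way ANOVA with n subjects per group, k groups and
# Cohen's f, using the noncentral F distribution (noncentrality = k * n * f^2).
anova_power <- function(n, k, f, sig.level = 0.05) {
  df1 <- k - 1
  df2 <- k * (n - 1)
  crit <- qf(1 - sig.level, df1, df2)
  pf(crit, df1, df2, ncp = k * n * f^2, lower.tail = FALSE)
}

# Smallest per-group sample size that reaches 80% power for k = 3, f = 0.25
n_per_group <- 2
while (anova_power(n_per_group, k = 3, f = 0.25) < 0.8) {
  n_per_group <- n_per_group + 1
}

# Rule-of-thumb inflation for a nonparametric test (+15%), rounded up
add_nonparametric_margin <- function(n, margin = 0.15) {
  ceiling(n * (1 + margin))
}

wilcoxon_total <- add_nonparametric_margin(43)   # from the one-sample t test
kruskal_per_group <- add_nonparametric_margin(66) # from the one-way ANOVA

cat("ANOVA n per group:", n_per_group, "\n")
cat("Wilcoxon total:", wilcoxon_total, "\n")
cat("Kruskal-Wallis per group:", kruskal_per_group,
    "| total for 3 groups:", 3 * kruskal_per_group, "\n")
```

Searching over integer n avoids having to round a fractional solution afterwards, which is one reasonable design choice among several.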
One-sample proportion test (Table 3, no. 9)

> library(pwr)
> pwr.p.test(h = 0.5, sig.level = 0.05, power = 0.8, alternative = "two.sided")

Chi-square test (Table 3, no. 11)

> library(pwr)
> pwr.chisq.test(w = 0.3, df = (2 - 1) * (3 - 1), sig.level = 0.05, power = 0.8)

Effect size calculation:

$$w = \sqrt{\sum_{i=1}^{m} \frac{\left(p_{0i} - p_{1i}\right)^{2}}{p_{0i}}}$$

where p0i and p1i are the proportions in cell i under H0 and H1, respectively, and m is the number of cells.

Assuming that the event rate of the control group was 0.2 and that of the treatment group was 0.8, the allocation ratio of each group was set at 1:1. If a two-sided test is performed with a significance level of 0.05 and a power of 80%, 12 samples are calculated for each group. Considering a dropout rate of 20%, 15 subjects are required for each group, for a total of 30 subjects.

McNemar test (Table 3, no. 13)

> library(exact2x2)
> ss2x2(p0 = .2, p1 = .8, n1.over.n0 = 1, sig.level = 0.05, power = .8, approx = TRUE, print.steps = FALSE, pair = TRUE)

Assuming that the event rate of the matched control group was 0.2 and that of the matched case (or treatment) group was 0.8, the allocation ratio of each group was set at 1:1. If a two-sided test is performed with a significance level of 0.05 and a power of 80%, 13 samples are calculated for each group. Considering a dropout rate of 20%, 16 subjects are required for each group, for a total of 32 subjects.

Practice 6

The pwr.r.test() function (Supplementary data 1, Table 5) can be used in correlation analysis. The correlation coefficient (r) is used as a measure of effect size. Cohen suggests that r values of 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes, respectively. We will use a medium effect size of 0.3.

Assuming a significance level of 0.05 and a power of 80% in a two-tailed test, the minimum number of subjects required to demonstrate statistical significance is 84 for an effect size of r = 0.3. Considering a dropout rate of 20%, 105 subjects are required.

GENERALIZED LINEAR MODEL

Generalized linear models [18] have been formulated as a way to incorporate a variety of other statistical models, including linear regression, logistic regression, and Poisson regression. We will use the ‘pwr’ package for linear regression and the ‘WebPower’ package for logistic/Poisson regression.

Practice 7

The pwr.f2.test() function (Supplementary data 1, Table 6) can be used for multiple linear regression analysis. We will use Cohen’s f² as the effect size, computed from the R² value used as a measure of goodness of fit in regression analysis (Cohen’s f² = R²/(1 – R²)). The ‘u’ is the number of predictors (or risk factors) considered in the analysis, and the ‘v’ is n (the total number of subjects) – u – 1. That is, if you pass only the value of u to the function, the value of v is calculated, and this value is used to calculate the total number of subjects (n ≥ v + u + 1). Cohen suggests that f² values of 0.02, 0.15, and 0.35 represent small, medium, and large effect sizes. We will use a medium effect size of 0.15 and u = 3.

Similarly, we assumed a significance level of 0.05 and a power of 80%. Considering the three risk factors (u = 3), if the effect size is 0.15, v = 73. Finally, a total of 77 (73 + 3 + 1) subjects are calculated, and considering a dropout rate of 20%, 96 people should be recruited.
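The relationship between u, v, and the total sample size in Practice 7 can be sketched with base R's noncentral F distribution. This is an illustration under stated assumptions rather than the article's code: `regression_power()` is a hypothetical helper, the noncentrality parameter is taken as f² × (u + v + 1), and the R² value of 0.13 used to show the f² conversion is made up.

```r
# Power of the overall F test in multiple linear regression for Cohen's f2,
# u predictors and v error degrees of freedom.
# Assumed noncentrality parameter: f2 * (u + v + 1), i.e. f2 * n.
regression_power <- function(v, u, f2, sig.level = 0.05) {
  crit <- qf(1 - sig.level, u, v)
  pf(crit, u, v, ncp = f2 * (u + v + 1), lower.tail = FALSE)
}

# Converting a (made-up) R^2 from a previous study into Cohen's f2
r_squared <- 0.13
f2_from_r2 <- r_squared / (1 - r_squared)

# Smallest integer v reaching 80% power for a medium effect and 3 predictors
u <- 3
f2 <- 0.15
v <- 1
while (regression_power(v, u, f2) < 0.8) {
  v <- v + 1
}
n_total <- v + u + 1

cat("f2 from R^2 = 0.13:", round(f2_from_r2, 3), "\n")
cat("v:", v, "| total n before dropout:", n_total, "\n")
```

The dropout adjustment is left out here so that the sketch stays focused on how v translates into n.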
Logistic regression [19] (Table 3, no. 16)

> install.packages("WebPower", dependencies = TRUE)
> library(WebPower)
> wp.logistic(p0 = 0.15, p1 = 0.1, alpha = 0.05, power = 0.8, family = "normal", parameter = c(0, 1))

If the predictor (X) is a continuous variable, family = "normal" can be used and the ‘parameter’ is left at its default. The values of p0 and p1 can be derived from the 1-SD range of X: you can set p1 to the probability of being in range and p0 to the probability of being out of range. In this example, p0 = 0.15 and p1 = 0.1 were used. Similarly, we assumed a significance level of 0.05 and a power of 80%. The minimum number of samples satisfying these conditions is 299, and a total of 374 is required considering the dropout rate of 20%.

Poisson regression [20] (Table 3, no. 17)

> library(WebPower)
> wp.poisson(exp0 = 1, exp1 = 1.2, alpha = 0.05, power = 0.8, family = "Bernoulli", parameter = 0.5)

If the predictor (X) is a binary variable, family = "Bernoulli" can be used and the ‘parameter’ is left at its default value. For exp0, a base rate of 1 under the null hypothesis was used, and for exp1, an expected relative risk of 1.2 was set as the relative increment of the event rate. Similarly, we assumed a significance level of 0.05 and a power of 80%. The minimum number of samples satisfying these conditions is 866, and a total of 1083 is required considering the dropout rate of 20%.

CONCLUSIONS

In conclusion, sample size calculation plays the most important role in the research design process before starting the study. In particular, since randomized controlled trials, which are frequently conducted in clinical settings, are directly related to cost issues, the number of samples must be carefully calculated. However, although there are various references related to sample size calculation, it can be difficult to correctly use a method suitable for your own study. Of course, it would be better to seek expert advice for more complex studies, but we hope that this article will help researchers calculate the right number of subjects for their own research.

NOTES

Authors’ contributions
Conceptualization: YHK, HIB, YP
Data curation: SP, YHK, YHK
Formal analysis: SP, HIB
Investigation: SP, HIB
Methodology: SP, YHK
Project administration: YHK, YP
Visualization: HIB
Writing–Original Draft: SP, HIB
Writing–Review & Editing: All authors

Conflict of interest
All authors have no conflicts of interest to declare.

Funding/support
This work was supported by the Soonchunhyang University Research Fund.

ORCID
Suyeon Park, https://orcid.org/0000-0002-6391-557X
Yeong-Haw Kim, https://orcid.org/0000-0002-8068-3678
Hae In Bang, https://orcid.org/0000-0001-7854-3011
Youngho Park, https://orcid.org/0000-0002-7096-3967

Supplementary materials
Supplementary data 1–3 can be found via https://doi.org/10.7602/jmis.2023.26.1.9.

REFERENCES

1. Altman DG. Statistics and ethics in medical research: III How large a sample? Br Med J 1980;281:1336-1338.
2. Moher D, Dulberg CS, Wells GA. Statistical power, sample size, and their reporting in randomized controlled trials. JAMA 1994;272:122-124.
3. Foulkes M. Study designs, objectives, and hypotheses [Internet]. Johns Hopkins Bloomberg School of Public Health; 2008 [cited 2023 Feb 20]. Available from: https://docplayer.net/38128249-Study-designs-objectives-and-hypotheses-mary-foulkes-phd-johns-hopkins-