Writing Simple Functions in R: Bootstrapping Example
Sep 2024
Table of contents
1 Aim and Scope
1.1 Aim
1.2 Target Audience
1.3 The Sample Functions are for Learning
1.4 Style
2 Defining Functions
2.1 Defining a Simple Function
2.2 Defining a More Complicated Function
2.3 Local Variables
2.4 Default Value for an Argument
3 Useful Statements in R
3.1 if and else
3.1.1 if without else
3.1.2 if … else if … else if … else
3.1.3 if and NA
3.2 for … in … and while
4 Examples
4.1 Nonparametric Bootstrapping Confidence Intervals
4.1.1 Pearson’s r
5 Optional Topics
5.1 Style
5.2 Pass-By-Value
5.3 Dotdotdot
6 Final Remarks
7 Further References
Note that, although the focus is on writing simple functions, functions not covered in previous documents
may also be introduced if necessary.
1.4 Style
In this and other documents, I will use my own personal style. Feel free to use whatever style you like in
your own work. See the section on style in R as a Language Part 1.
2 Defining Functions
2.1 Defining a Simple Function
You have already learned about calling a function and specifying its arguments. Now we are going to
define a very simple function that
• receives two numbers;
• adds them together;
• returns the result.
This is an example:
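A minimal sketch of such a function, consistent with the description that follows (the name my_addition, the arguments x and y, and the body x + y all come from the text; the layout is only one possible style):

```r
# A function with two arguments, x and y, in this order
my_addition <- function(x, y) {
    # The body: add the two values.
    # This is the last line, so its result is returned.
    x + y
  }
```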
Recall that a function is an object. Therefore, we will assign it to a name, my_addition by <-. A function
definition starts with, well, function.
After function, there must be a pair of parentheses, ( and ). Between them are the arguments of this
function. They are either named, or are an unknown number of arguments denoted by ... (dotdotdot).
Dotdotdot is useful in some cases but will be introduced later when we need it. In the example above,
the arguments are x and y, in this order. Arguments are separated by commas.
Note that the order is important because if users do not name the arguments, the values will be assigned
based on this order.
After the parentheses, the body of a function must be enclosed between a pair of curly brackets, { and },
unless it is one simple expression after the parentheses (see footnote 1). We write code as usual between
the brackets to define the body of a function.
In the example above, the body is x + y. This is what we will do to add two variables, x and y.
Try and see what will happen by calling this function and then learn how it works:
Footnote 1: For example, my_addition <- function(x, y) x + y is acceptable.
my_addition(x = 1, y = 2)
my_addition(30, 5)
It should do what we expect. With just two arguments, we do not have to use names.
How about supplying two “names” of objects? Try this:
a <- 15
b <- 21
my_addition(a, b)
a <- 15
my_addition(a, 3 * 7)
Now you should have confirmed that it works. Let us see how it works.
When we call the function by my_addition(1, 2), 1 is assigned to x, and 2 is assigned to y. R will then
run the code in the body of my_addition() using these values (see footnote 2).
A function finishes its operation normally either because it reaches the end of its last line or because it
calls return() (introduced later) to return something. If it finishes its last line, the output of this line will be
returned.
In my_addition(), the last line is x + y. Therefore, the output of x + y is automatically returned.
2.2 Defining a More Complicated Function
This function has only one argument, x. The function first finds the minimum using min(), and assigns it
to x_min. It then finds the maximum using max(), and assigns it to x_max. It then creates a named vector
from these two numbers. This is the last line and so the result of this line will be returned.
Let’s try it:
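A minimal sketch of the function described above, followed by a test call. The names my_range, x_min, and x_max come from the text; the element names min and max and the test data are my assumptions for illustration:

```r
# Sketch of a function that finds the range of a numeric vector
my_range <- function(x) {
    x_min <- min(x)   # smallest value
    x_max <- max(x)   # largest value
    # Last line: a named vector, which is returned
    c(min = x_min, max = x_max)
  }

# Hypothetical test data
my_range(c(1, 4, 2, 10, 5))
```

The call should return a vector with two named elements, the minimum (1) and the maximum (10).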
Good! We wrote our own function to find the range! This function has something new. New variables,
x_min and x_max, are created inside the body. This leads to the next topic …
Footnote 2: The arguments actually will be evaluated only when they are used. This is called lazy evaluation. This is not covered here.
Footnote 3: There is a base function, range(), for doing this. This example is for learning about writing functions by finding the range
ourselves.
2.3 Local Variables
What is the result? Will x_min and x_max be changed? Try this:
First, we find that my_range gives the correct result. x_min and x_max we created before calling my_range
do not affect its operation.
Second, x_min and x_max we created, interestingly, are not affected, even if variables with the same
names are used in the function (x_min <- min(x) and x_max <- max(x)).
This introduces the idea of local variables. x_min and x_max, created by <- inside the function, are local.
They are created inside the function and so are different from what exists “out there” (see footnote 4). This
behavior is useful because we do not need to worry about overwriting variables that exist in the environment
calling it (see footnote 5). These local variables will disappear after the function ends.
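The behavior described in this section can be checked with a short sketch. The definition of my_range() is an assumed reconstruction from the description in the text, and the values 1000 and -1000 are arbitrary numbers chosen for illustration:

```r
# Assumed definition of my_range(), following the description in the text
my_range <- function(x) {
    x_min <- min(x)   # local to the function
    x_max <- max(x)   # local to the function
    c(min = x_min, max = x_max)
  }

# Create variables with the same names outside the function
x_min <- 1000
x_max <- -1000

# The function uses its own local x_min and x_max
my_range(c(1, 4, 2, 10, 5))

# The variables outside the function are unchanged
x_min
x_max
```

After the call, x_min and x_max outside the function should still be 1000 and -1000.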
2.4 Default Value for an Argument
First, we add an argument, my.na.rm. We set the default value of my.na.rm to FALSE. If my.na.rm is
provided, then the provided value will be used. If not, then my.na.rm = FALSE. In the calls to min() and max(),
we set their na.rm argument to the value of our argument, my.na.rm.
Let’s see how it works by trying this:
Now users can decide how to handle missing values, and this function also has a default way to handle
them if users do not specify what to do.
Footnote 4: Technically, in the parent frame.
Footnote 5: A function can overwrite variables outside it, by using <<-. However, this should be avoided. Use this only if there are no other
solutions.
Setting the default values of arguments makes a function easier to use, if the default values are the
values users usually want.
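A minimal sketch of a version of my_range() with a default value, assuming the argument name my.na.rm used in the text (the test data are hypothetical):

```r
# my.na.rm has a default value, FALSE
my_range <- function(x, my.na.rm = FALSE) {
    # Pass our argument to na.rm of min() and max()
    x_min <- min(x, na.rm = my.na.rm)
    x_max <- max(x, na.rm = my.na.rm)
    c(min = x_min, max = x_max)
  }

y <- c(1, 4, NA, 10, 5)

# my.na.rm not provided: the default, FALSE, is used
my_range(y)

# my.na.rm provided: missing values are removed first
my_range(y, my.na.rm = TRUE)
```

With the default, both elements should be NA because y has a missing value. With my.na.rm = TRUE, the minimum and maximum of the remaining values are returned.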
3 Useful Statements in R
This section introduces a few statements useful for writing R functions. Note that all these statements
can also be used in R scripts, not just in a function.
3.1 if and else
Note that we set the default value of alpha to .05, the usual maximum level of significance.
Let’s test this function:
is_sig(pvalue = .04)
# We can omit the name
is_sig(.06)
# We set another level of significance
is_sig(.04, alpha = .01)
It should work. This function uses if ( ... ) { ... } else { ... }. After if is the condition
enclosed in a pair of parentheses. This condition should be a one-element logical vector, or an expression
that will result in a one-element logical vector. In is_sig, pvalue < alpha should result in TRUE or FALSE
(though NA is possible).
If the condition is TRUE, then the expression inside the next pair of curly brackets will be run. If FALSE,
then the expression in the pair of curly brackets after else will be run. (The case of NA will be covered
later.)
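A minimal sketch of a function with this structure. The name is_sig, the arguments pvalue and alpha, and the default of .05 come from the text; the returned strings "sig." and "n.s." are assumed from the output shown later in this section:

```r
is_sig <- function(pvalue, alpha = .05) {
    if (pvalue < alpha) {
        # Run if the condition is TRUE
        "sig."
      } else {
        # Run if the condition is FALSE
        "n.s."
      }
  }

is_sig(pvalue = .04)
is_sig(.06)
is_sig(.04, alpha = .01)
```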
NOTE: Be careful when writing a condition. The version we used above can result in an error if (pvalue <
alpha) does not return one single logical value. Try this:
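One way to see the problem is to supply two p-values at once. This is only a sketch: is_sig() is an assumed reconstruction, and tryCatch() is used here just to capture the message instead of stopping the script. In recent versions of R this is an error; in older versions it was only a warning.

```r
# Assumed definition of is_sig()
is_sig <- function(pvalue, alpha = .05) {
    if (pvalue < alpha) {
        "sig."
      } else {
        "n.s."
      }
  }

# Two p-values: the condition has two elements, not one
res <- tryCatch(is_sig(c(.04, .06)),
                error = function(e) conditionMessage(e),
                warning = function(w) conditionMessage(w))
res
```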
This example also introduces a new function, return(). This is used to tell the function to end and return
the argument of return immediately. Because the if ... else structure already covers all possibilities,
and return() is used in all possibilities, the line print("This will never be printed"), although
being the last line, will never be run.
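A sketch of a function of this kind. The name is_sig_return is hypothetical, and the returned strings are assumed from the output shown elsewhere in this section:

```r
is_sig_return <- function(pvalue, alpha = .05) {
    if (pvalue < alpha) {
        # End the function and return "sig." immediately
        return("sig.")
      } else {
        # End the function and return "n.s." immediately
        return("n.s.")
      }
    # Never reached: return() has been called in all possibilities
    print("This will never be printed")
  }

is_sig_return(.04)
is_sig_return(.06)
```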
3.1.1 if without else
This introduces the idea of testing an argument. The level of significance should not be zero or less (p
< 0?) and should not be one or higher (p < 1?). Therefore, before checking the p value, we check
alpha first. If either (alpha <= 0) or (alpha >= 1) is TRUE, then the line with stop() will be run. There is
no need for else because we only need to check whether a condition is met. If not, then we can proceed
as usual.
NOTE: || (and &&) are usually used in the condition of if because they work on single logical values.
This example also uses a new function, stop(). This function is commonly used in a function. It,
obviously, “stops” a function. But it does not just stop the function. It will “raise” an error, and the argument
is the error message.
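A sketch of a function that tests alpha in this way. The name is_sig2 comes from the text; the wording of the error message is my assumption, and tryCatch() is used only to show the message without stopping the script:

```r
is_sig2 <- function(pvalue, alpha = .05) {
    # Test the argument first. No else is needed.
    if ((alpha <= 0) || (alpha >= 1)) {
        # Hypothetical error message
        stop("alpha must be larger than 0 and less than 1.")
      }
    if (pvalue < alpha) {
        return("sig.")
      } else {
        return("n.s.")
      }
  }

# A valid alpha: works as before
is_sig2(.04)

# An invalid alpha: an error is raised
msg <- tryCatch(is_sig2(.04, alpha = 1.5),
                error = function(e) conditionMessage(e))
msg
```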
Let’s try this version, is_sig2 (see footnote 6):
[1] "sig."
[1] "n.s."
[1] "sig."
Instead of stopping the function and raising an error, we can also return NA, that is, replace the call to
stop by return(NA). However, sometimes we may prefer raising an error in this case because NA can
also be interpreted as missing, for example, p-value is NA (although an error will actually occur if pvalue
is NA, for a reason described later).
Certainly, we can also apply a similar test to pvalue, which should range from 0 to 1. I will leave it as an
exercise for you.
Footnote 6: The error messages may be printed outside the margin. I cannot yet solve this problem. formatR and tidy do not work as
some suggested.
3.1.2 if … else if … else if … else
pstar(.06)
pstar(.04)
pstar(.009)
pstar(.00000001)
Note that, whenever a condition is met, the code inside the next curly brackets will be run, and all
remaining conditions will not be checked. If p < .001, then p is also < .01 and < .05. Therefore, we need
to check p < .001 first.
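A sketch of a function with this structure. The name pstar and the cutoffs come from the text; the returned symbols ("***", "**", "*", and "n.s.") are my assumptions:

```r
pstar <- function(p) {
    # The strictest condition is checked first
    if (p < .001) {
        return("***")
      } else if (p < .01) {
        return("**")
      } else if (p < .05) {
        return("*")
      } else {
        # None of the conditions above is met
        return("n.s.")
      }
  }

pstar(.06)
pstar(.04)
pstar(.009)
pstar(.00000001)
```

If p < .05 were checked first, a p-value of .00000001 would meet this first condition and the remaining conditions would never be checked.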
Having many conditions can be difficult to read. If appropriate, we can consider using switch(). We can
also simply remove else and else if.
pstar2(.06)
pstar2(.04)
pstar2(.009)
pstar2(.00000001)
This version uses if only. If the condition of an if block is not met, then R will proceed to the line after
this block.
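A sketch of a version that uses if only. The name pstar2 comes from the text; the returned symbols are my assumptions. It works because return() ends the function immediately:

```r
pstar2 <- function(p) {
    if (p < .001) {
        return("***")
      }
    # Reached only if p is not less than .001
    if (p < .01) {
        return("**")
      }
    # Reached only if p is not less than .01
    if (p < .05) {
        return("*")
      }
    # Reached only if none of the conditions is met
    "n.s."
  }

pstar2(.06)
pstar2(.04)
pstar2(.009)
pstar2(.00000001)
```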
Which version to use depends on the context and personal preference. Using else or else if may
make the code look organized. However, sometimes it can be more difficult to read than just having a
sequence of if blocks, especially when we have a lot of lines inside an if block.
3.1.3 if and NA
Note that, when testing the condition, NA is neither FALSE nor TRUE. It will result in an error. Therefore,
the following call will result in an error:
is_sig2(pvalue = NA)
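The error can be examined without stopping a script by using tryCatch(). The sketch below uses an assumed reconstruction of is_sig2(). It also shows one possible way to handle NA, checking it with is.na() first; the function is_sig_na is hypothetical and is only one of several reasonable designs:

```r
# Assumed definition of is_sig2() (error message is hypothetical)
is_sig2 <- function(pvalue, alpha = .05) {
    if ((alpha <= 0) || (alpha >= 1)) {
        stop("alpha must be larger than 0 and less than 1.")
      }
    if (pvalue < alpha) {
        return("sig.")
      } else {
        return("n.s.")
      }
  }

# NA < .05 is NA, and if () cannot use NA as a condition
msg <- tryCatch(is_sig2(pvalue = NA),
                error = function(e) conditionMessage(e))
msg

# A possible guard: check for NA before the comparison
is_sig_na <- function(pvalue, alpha = .05) {
    if (is.na(pvalue)) {
        return(NA_character_)
      }
    is_sig2(pvalue, alpha = alpha)
  }

is_sig_na(NA)
is_sig_na(.04)
```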
4 Examples
4.1 Nonparametric Bootstrapping Confidence Intervals
(This section assumes that you have learned about nonparametric bootstrapping, including its pros and
cons.)
R comes with a package boot that can do nonparametric bootstrapping. More and more packages can
form nonparametric bootstrapping confidence intervals (e.g., lavaan can do this for parameter estimates
in structural equation modeling, and psych::alpha() can do this for Cronbach’s alpha). Nevertheless,
there may be cases in which such a function has not yet been developed (or it has but you could not find
it). Even if there is such a function, it is still a good practice to learn how to write a function to do this.
The boot() function in the boot package does not compute the statistic. It requires users to supply the
function to compute the statistic. Its job is to draw the bootstrap samples, compute the statistic, and
return them to the users.
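To make this division of labor concrete, here is a sketch in base R of the three steps (draw a bootstrap sample, compute the statistic, store it), using a for loop. This is not how boot() is implemented; it only illustrates the idea. The data are simulated and the names sim_dat, n_boot, and boot_est are hypothetical:

```r
# Simulated data for illustration only
set.seed(1234)
n <- 100
x <- rnorm(n)
y <- .5 * x + rnorm(n)
sim_dat <- data.frame(x = x, y = y)

n_boot <- 2000
boot_est <- numeric(n_boot)   # storage for the bootstrap estimates
for (i in seq_len(n_boot)) {
    # Draw row numbers with replacement: one bootstrap sample
    idx <- sample.int(n, size = n, replace = TRUE)
    # Compute the statistic in this bootstrap sample
    boot_est[i] <- cor(sim_dat$x[idx], sim_dat$y[idx])
  }

# A simple percentile interval from the bootstrap estimates
ci <- quantile(boot_est, probs = c(.025, .975))
ci
```

boot() takes care of the loop and the sampling for us; what we need to write is the part that computes the statistic from the sampled rows.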
4.1.1 Pearson’s r
Let us consider a practical scenario: forming a nonparametric bootstrapping confidence interval for a
Pearson’s r.
We already know that psych::cor.ci() can do this. Let’s try to do it using our own function.
Let’s try to compute the correlation first, using a dataset similar to the one used in the handout
SSGC_8802_Correlation_in_R, but with 100 cases:
library(readxl)
dat <- read_excel("correlation_example_100_cases.xlsx")
cor(dat)
as.vector(cor(dat))
as.vector(cor(dat))[c(2, 3, 6)]
my_r(data = dat)
my_r(data = dat)
I renamed dat to dat0, just to make it obvious that the correlations are computed on the sampled rows
of dat.
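A sketch of a function of this form, which takes the data and the row numbers (index) and returns the three unique correlations. The names my_r, data, index, and dat0 come from the text. The data below are simulated, and the third variable name, Stress, is hypothetical:

```r
my_r <- function(data, index) {
    # Keep only the sampled rows
    dat0 <- data[index, ]
    # Elements 2, 3, and 6 of a 3 x 3 correlation matrix
    # are the three unique correlations
    as.vector(cor(dat0))[c(2, 3, 6)]
  }

# Simulated data for illustration only
set.seed(5678)
sim_dat <- data.frame(SelfEsteem = rnorm(100),
                      Happiness = rnorm(100),
                      Stress = rnorm(100))

# Correlations based on the first five rows
my_r(data = sim_dat,
     index = 1:5)
cor(sim_dat[1:5, ])
```

boot() expects the function assigned to statistic to accept the data as the first argument and the row numbers as the second argument, which is why index is needed.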
Let’s test the function again:
my_r(data = dat,
index = 1:5)
cor(dat[1:5, ])
library(boot)
set.seed(23456)
boot_r <- boot(data = dat,
statistic = my_r,
R = 2000)
set.seed() is used to set the seed for the random number generator, to make the results reproducible.
The output, boot_r, stores the results from the 2000 bootstrap samples. We can use plot to examine
the distribution of the 2000 bootstrap Pearson’s rs.
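The stored results can also be inspected directly. In the output of boot(), t0 holds the statistic computed in the original data and t holds the bootstrap estimates, one row per bootstrap sample. The sketch below uses simulated data, a hypothetical third variable (Stress), and fewer bootstrap samples (500) just to run quickly:

```r
library(boot)

# Assumed definition of my_r(), following the text
my_r <- function(data, index) {
    dat0 <- data[index, ]
    as.vector(cor(dat0))[c(2, 3, 6)]
  }

# Simulated data for illustration only
set.seed(5678)
sim_dat <- data.frame(SelfEsteem = rnorm(100),
                      Happiness = rnorm(100),
                      Stress = rnorm(100))

set.seed(23456)
boot_out <- boot(data = sim_dat,
                 statistic = my_r,
                 R = 500)

# The three correlations in the original data
boot_out$t0
# One row per bootstrap sample, one column per correlation
dim(boot_out$t)
# The first few bootstrap estimates of the first correlation
head(boot_out$t[, 1])
```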
Note that our function my_r() returns three correlations for each sample. Therefore, we need to add
index to indicate the statistic we need. Let’s add index = 1 to plot the 2000 bootstrap correlations
between self-esteem and happiness:
plot(boot_r,
index = 1)
[Figure: output of plot(boot_r, index = 1). Panel title: Histogram of t; axis labels: Density and t*.]
plot(boot_r,
index = 2)
[Figure: output of plot(boot_r, index = 2). Panel title: Histogram of t; axis labels: Density and t*.]
plot(boot_r,
index = 3)
[Figure: output of plot(boot_r, index = 3). Panel title: Histogram of t; axis labels: Density and t*.]
To get the bootstrap confidence interval, we can use boot::boot.ci(). In this example, we will only
use the percentile bootstrap confidence interval. Therefore, we set type to "perc" (percentile). The level
of confidence is 95%, or .95. Therefore, we set conf to .95. (See help("boot.ci") for further details.)
Note that we also need to add index in this case because we computed three correlations:
boot.ci(boot_r,
index = 1,
conf = .95,
type = "perc")
CALL :
boot.ci(boot.out = boot_r, conf = 0.95, type = "perc", index = 1)
Intervals :
Level Percentile
95% ( 0.3440, 0.6856 )
Calculations and Intervals on Original Scale
In this example, the nonparametric bootstrap percentile 95% confidence interval of the Pearson’s r
between self-esteem and happiness is 0.3440 to 0.6856.
We can compare the results with psych::cor.ci():
library(psych)
set.seed(23456)
cor_ci_out <- cor.ci(dat,
n.iter = 2000,
plot = FALSE)
print(cor_ci_out,
digits = 4)
5 Optional Topics
5.1 Style
Some align the closing curly bracket with the first line of the definition:
I indent the closing brackets too, as in this document, simply because this is consistent with how we
indent lines in Python. I prefer a (personal) style that is similar across languages.
Some use four whitespace characters for indentation:
Using four whitespace characters is a common practice in programming. I use two whitespace characters
simply because I usually work on a small screen or window.
Some write one argument per line:
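The three alternatives mentioned in this section could look like the following sketches. The function names add_a, add_b, and add_c are hypothetical; all three functions do the same thing and differ only in layout:

```r
# Closing curly bracket aligned with the first line of the definition
add_a <- function(x, y) {
  x + y
}

# Four whitespace characters for indentation
add_b <- function(x, y) {
    x + y
}

# One argument per line
add_c <- function(x,
                  y) {
  x + y
}
```

R does not care about the layout. Style matters only to the people reading the code.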
5.2 Pass-By-Value
R functions use pass-by-value in handling argument values. Therefore, a function normally will not
change the sources of its arguments, although it can return a modified version of its arguments.
For example:
[1] 100
x_origin
[1] 10
Even though we set x to x_origin and then x is changed inside the function, x_origin is not changed.
It is because it is the value of x_origin that is passed to x, not x_origin itself.
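A sketch that reproduces this behavior. The names demo_pass_by_value and x_origin come from the text; the body of the function (multiplying x by 10) is my assumption, chosen only to match the output shown above:

```r
demo_pass_by_value <- function(x) {
    # x is changed, but only inside the function
    x <- x * 10
    x
  }

x_origin <- 10
# The function returns the modified value
demo_pass_by_value(x = x_origin)
# x_origin itself is not changed
x_origin
```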
Certainly, we can update x_origin to the result of demo_pass_by_value(), but this is just a reassignment,
not a consequence of demo_pass_by_value():
x_origin <- 10
x_origin <- demo_pass_by_value(x = x_origin)
x_origin
[1] 100
5.3 Dotdotdot
The argument ... is sometimes used by one function to pass arguments to another function. You may
notice that boot() has this argument (see help("boot")). This section illustrates how ... can be used
to do bootstrapping.
In doing bootstrapping, the function used to compute the target statistic may have its own arguments.
boot() collects these arguments using ..., and passes them to the function assigned to statistic.
We can use this feature to revise my_r() such that we can form the bootstrapping confidence interval
for any two variables we want:
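A sketch of such a function. The name my_r_any2 and the arguments x and y come from the calls shown in this section; the body is my assumption. The data are simulated for illustration, and the third variable name, Stress, is hypothetical:

```r
my_r_any2 <- function(data, index, x, y) {
    # Keep only the sampled rows
    dat0 <- data[index, ]
    # x and y are the names of the two variables
    cor(dat0[[x]], dat0[[y]])
  }

# Simulated data for illustration only
set.seed(5678)
sim_dat <- data.frame(SelfEsteem = rnorm(100),
                      Happiness = rnorm(100),
                      Stress = rnorm(100))

# The correlation in the first 50 rows
my_r_any2(data = sim_dat,
          index = 1:50,
          x = "SelfEsteem",
          y = "Happiness")
# Check against cor()
cor(sim_dat[1:50, c("SelfEsteem", "Happiness")])
```

When this function is used in boot(), x and y are not arguments of boot(). They are collected by ... and passed to my_r_any2().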
my_r_any2(data = dat,
index = 1:50,
x = "SelfEsteem",
y = "Happiness")
[1] 0.4033987
SelfEsteem Happiness
SelfEsteem 1.0000000 0.4033987
Happiness 0.4033987 1.0000000
We can try this version again. There is no need for index in boot.ci() for the new version because it
computes only one correlation:
set.seed(23456)
boot_r <- boot(data = dat,
statistic = my_r,
R = 2000)
boot.ci(boot_r,
index = 1,
type = "perc")
CALL :
boot.ci(boot.out = boot_r, type = "perc", index = 1)
Intervals :
Level Percentile
95% ( 0.3440, 0.6856 )
Calculations and Intervals on Original Scale
set.seed(23456)
boot_r2 <- boot(data = dat,
statistic = my_r_any2,
R = 2000,
x = "SelfEsteem",
y = "Happiness")
boot.ci(boot_r2,
type = "perc")
CALL :
boot.ci(boot.out = boot_r2, type = "perc")
Intervals :
Level Percentile
95% ( 0.3440, 0.6856 )
Calculations and Intervals on Original Scale
You can see that the two confidence intervals are identical (see footnote 7).
In your own research, whether you will use this technique depends on how flexible you want the function
to be.
• If you are pretty sure that you only need a bootstrap CI for a very specific scenario (e.g., only the
statistic for a set of variables computed in a specific way), then there is no need to use these additional
arguments.
• However, if you think you may need to adjust the analysis, such as trying other variables or options
for the analysis (e.g., using another measure of correlation), then you may want to write a more
general function.
6 Final Remarks
There are a lot of issues about functions not covered here. I myself also still have a lot to learn. The goal
of this document is not to make you a programmer (I am also not a programmer). The goal is to help you
to know how writing functions can help us to do analysis in our research. We have learned how writing
functions can make it easier to do several tasks again and again for different models or variables. We
have also learned how we can write a function to compute something that we need. This will be useful
if you learn about some new statistic or measure that you want to use but is not yet available in an existing
function.
You will definitely encounter some problems when you try to write your own functions. Learn what you
need when using R in your research. Certainly, if you have some experience in programming, or maybe
you are already an experienced programmer, you can consider reading some books on programming in
R to learn more about the technical details in R.
7 Further References
In the book by Fox and Weisberg (2019) on doing regression analysis using R, they also have a chapter
on programming in R (Chapter 10), aimed at researchers. You can see if this chapter is suitable for you:
• Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Sage.
URL https://socialsciences.mcmaster.ca/jfox/Books/Companion/index.html. (UM library has
Footnote 7: I call set.seed() before each call to boot(), and use the same seed. We usually do not do this. However, these two versions
are applied to the same dataset. Therefore, to make the results comparable, we will want these two bootstrapping analyses to have
the same bootstrap samples. This can be done by using the same number in set.seed() right before calling boot().