Applied Epidemiology Using R

Tomás J. Aragón

Preface
I wrote this book to introduce R, a language and environment for statistical computing and graphics, to epidemiologists and health data analysts conducting epidemiologic studies. From my experience in public health practice, sometimes even formally trained epidemiologists lack the breadth of analytic skills required at health departments where resources are very limited. Recent graduates come prepared with a solid foundation in epidemiological and statistical concepts and principles, and they are ready to run a multivariable analysis (which is not a bad thing; we are grateful for highly trained staff). However, what is sometimes lacking is the practical knowledge, skills, and abilities to collect and process data from multiple sources (e.g., census data, reportable diseases, death and birth registries) and to adequately implement new methods they did not learn in school. One approach to implementing new methods is to look for the "commands" among their favorite statistical packages (or to buy a new software program). If the commands do not exist, then the method may not be implemented. In a sense, they are looking for a custom-made solution that makes their work quick and easy.
In contrast to custom-made tools or software packages, R is a suite of basic tools for statistical programming, analysis, and graphics. One will not find a "command" for a large number of analytic procedures one may want to execute. Instead, R is more like a set of high quality carpentry tools (hammer, saw, nails, and measuring tape) for tackling an infinite number of analytic problems, including those for which custom-made tools are not readily available or affordable. I like to think of R as a set of extensible tools to implement one's analysis plan, regardless of simplicity or complexity. With practice, one not only learns to apply new methods, but one also develops a depth of understanding that sharpens one's intuition and insight. With understanding comes clarity, focused problem-solving, and confidence.
This book is divided into two parts. First, I cover how to process, manipulate, and operate on data in R. Most books cover this material briefly or leave it for an appendix. I decided to dedicate a significant amount of space to this topic with the assumption that the average epidemiologist is not familiar with R and that a good grounding in the basics will make the later chapters
more understandable. Second, I cover basic epidemiology topics addressed in
most books but we infuse R to demonstrate concepts and to exercise your
intuition. Readers may notice a heavier emphasis on descriptive epidemiol-
ogy which is what is more commonly used at health departments, at least
as a first step. In this section we do cover regression methods and graphical
displays. I have also included “how to” chapters on a diversity of topics that
come up in public health. My goal is not to be comprehensive in each topic
but to demonstrate how R can be used to implement a diversity of methods
relevant to public health epidemiology and evidence-based practice.
To help us spread the word, this book is available on the World Wide Web
(http://www.medepi.com). I do not want financial or geographical barriers
to limit access to this material. I am only presenting what I have learned from
the generosity of others. My hope is that more and more epidemiologists will
embrace R for epidemiological applications, or at least, include it in their
toolbox.
1.1 What is R?
• Full-function calculator
• Extensible statistical package
• High-quality graphics tool
• Multi-use programming language
Some may find R challenging to learn if they are not familiar with statistical programming. R was created by statistical programmers and is more often used by analysts comfortable with matrix algebra and programming. However, even for those unfamiliar with matrix algebra, there are many analyses one can accomplish in R without using any advanced mathematics, which would be difficult in other programs. The ability to easily manipulate data in R will allow one to conduct good descriptive epidemiology, life table methods, graphical displays, and exploration of epidemiologic concepts. R allows one to work with data in whatever form it comes.
1 Read my recommendations for mostly free and open source software (FOSS) at medepi.com.
R is available for many computer platforms, including Mac OS, Linux, Microsoft Windows, and others. R comes as source code or a binary file. Source code needs to be compiled into an executable program for our computer. Those not familiar with compiling source code (and that's most of us) just install the binary program. We assume most readers will be using R in the Mac OS or MS Windows environment. Listed here are useful R links:
Use R by typing commands at the R command line prompt (>) and pressing
Enter on our keyboard. This is how to use R interactively. Alternatively, from
the R command line, we can also execute a list of R commands that we have
saved in a text file (more on this later). Here is an example of using R as a
calculator:
> 8*4
[1] 32
> (4 + 6)^3
[1] 1000
Use the scan function to enter data at the command line. At the end of each
line press the Enter key to move to the next line. Pressing the Enter key
twice completes the data entry.
> quantity <- scan()
1: 34 56 22
4:
Read 3 items
> price <- scan()
1: 19.95 14.95 10.99
4:
Read 3 items
> total <- quantity*price
> cbind(quantity, price, total)
quantity price total
[1,] 34 19.95 678.30
[2,] 56 14.95 837.20
[3,] 22 10.99 241.78
The default installation of R does not have packages that specifically implement epidemiologic applications; however, many of the statistical tools that epidemiologists use are readily available, including statistical models such as unconditional logistic regression, conditional logistic regression, Poisson regression, Cox proportional hazards regression, and much more.
To meet the specific needs of public health epidemiologists and health
data analysts, I maintain a freely available suite of Epidemiology Tools: the
epitools R package can be directly installed from within R.
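For example, from the R command line:

> install.packages("epitools") #install once from CRAN
> library(epitools) #attach the package for this session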
For example, to evaluate the association of consuming jello with a diarrheal
illness after attending a church supper in 1940, we can use the epitab function
from the epitools package. In the R code that follows, the # symbol precedes
comments that are not evaluated by R.
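Here is a minimal sketch of such an analysis (the 2 × 2 cell counts below are made-up placeholders for illustration, not the actual church supper data):

> library(epitools)
> # hypothetical counts: rows = jello exposure, columns = illness
> jello.tab <- matrix(c(25, 29, 22, 21), 2, 2,
+ dimnames = list(Jello = c("No", "Yes"), Ill = c("No", "Yes")))
> epitab(jello.tab, method = "oddsratio") #odds ratio with 95% CI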
The best way to learn R is to use it! Use it as our calculator! Use it as our spreadsheet! Finally, read these notes sitting at a computer and use R interactively (this works best sitting in a cafe that brews great coffee and plays good music). In this book, when we display R code, it appears as if we are typing the code directly at the R command line:
> x <- matrix(1:9, nrow = 3, ncol = 3)
> x
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Sometimes the R code appears as it would in a text editor (e.g., Notepad)
before it has been evaluated by R. When a text editor version of R code is
displayed it appears without the command line prompt and without output:
x <- matrix(1:9, nrow = 3, ncol = 3)
x
When the R code displayed exceeds the page width it will continue on the
next line but indented. Here’s an example:
agegrps <- c("Age < 1", "Age 1 to 4", "Age 5 to 14", "Age 15
to 24", "Age 25 to 44", "Age 45 to 64", "Age 64+")
Although we encourage you initially to use R interactively by typing expressions at the command line, as a general rule it is much better to type our code into a text editor. We save our code with a convenient file name such as job01.R.2 For convenience, R comes with its own text editor. For Windows, under File, select New script to open an empty text file. For Mac OS, under File, select New Document. As before, save this text file with a .R extension. Within R's text editor, we can highlight and run selected commands.

2 The .R extension, although not necessary, is useful when searching for R command files.
The code in our text editor can be run in the following ways:
• Highlight and run selected commands in the R editor;
• Paste the code directly into R at the command line;
• Run the file in batch mode from the R command line using source("job01.R").
Open R and start using it as our calculator. The most common math operators are displayed in Table 1.1. From now on, make R our default calculator! Study the examples and spend a few minutes experimenting with R as a calculator. Use parentheses as needed to group operations. Use the keyboard Up-arrow to recall what we previously entered at the command line prompt.
The results of an evaluated expression can be assigned a name and recalled at a later time. We refer to these variables as data objects. We use the assignment operator (<-) to name an evaluable expression and save it as a data object.
> xx <- "hello, what’s your name"
> xx
[1] "hello, what’s your name"
Wrapping the assignment expression in parentheses makes the assignment
and displays the data object value(s).
> yy <- 5^3 #assignment; no display
> (yy <- 5^3) #assignment; displays evaluation
[1] 125
In this book, we might use (yy <- 5^3) to display the value of yy and save
space on the page. In practice, this is more common:
> yy <- 5^3
> yy
[1] 125
Multiple assignments work and are read from right to left:
> aa <- bb <- 99
> aa; bb
[1] 99
[1] 99
Finally, the equal sign (=) can be used for assignments, although I prefer the <- symbol:
> ages = c(34, 45, 67)
> ages
[1] 34 45 67
The reason I prefer <- for assigning object names in the workspace is because later we use = for assigning values to function arguments. For example,
> x <- 20:25 #object name assignment
> x
[1] 20 21 22 23 24 25
> sample(x = 1:10, size = 5) #argument assignments
[1] 9 6 3 2 5
> x
[1] 20 21 22 23 24 25
The first x is an object name assignment in the workspace, which persists during the R session. The second x is a function argument assignment, which is only recognized locally in the function and only for the duration of the function execution. For clarity, it is better to keep these types of assignments separate in our mind by using different assignment symbols.
Study these previous examples and spend a few minutes using the assignment operator to create and call data objects. Try to use descriptive names if possible. For example, suppose we have data on age categories; we might name the data agecat, age.cat, or age_cat.3 These are all acceptable.
When we start R, we open a workspace. The first time we use R, the workspace is empty. Every time we create a data object, it is in the workspace. If a data object with the same name already exists, the old data object will be overwritten without warning, so be careful. To list the objects in your workspace, use the ls or objects functions:
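For example (assuming yy is the only object in the workspace):

> yy <- 5^3
> ls()
[1] "yy"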
R has many available functions. When we open R, several packages are attached by default. Each package has its own suite of functions. To display the list of attached packages, use the search function.
> search() # Linux
[1] ".GlobalEnv" "package:stats" "package:graphics"
3 In older versions of R, the underscore symbol (_) could be used for assignments, but this is no longer permitted. The "_" symbol can be used to make your object name more readable and is consistent with other software.
4 In some operating systems, file names that begin with a period (.) are hidden files and are not displayed by default. You may need to change the viewing option to see the file.
R has extensive help capabilities. From the main menu, select Help to get you started (Figure 1.1 on the next page). The Frequently Asked Questions (FAQ) and R manuals are available from this menu. The R functions (text)..., Html help, Search help..., and Apropos... selections are also available from the command line.
From the command line, we have several options. Suppose you are interested in learning about R's help capabilities.
> ?help #opens help page for ’help’ function
> help() #opens help page for ’help’ function
> help(help) #opens help page for ’help’ function
> help.start() #starts HTML help page
> help.search("help") #searches help system for "help"
> apropos("help") #displays ’help’ objects in search list
> apropos("help")
[1] "help" "help.request" "help.search" "help.start"
To learn about available data sets, use the data function:
> data() #displays available data sets
> try(data(package = "survival")) #lists survival pkg data sets
> help(pbc, package = "survival") #displays pbc data help page
Figure 1.1 on the facing page shows that an R Graphical User Interface (GUI) main menu will have a Help selection.
Fig. 1.1 Select Help from the main menu in the MS Windows R GUI.
Maybe.6 A good text editor will make your programming and data processing easier and more efficient. A text editor is a program for, you guessed it, editing text! The functionality we look for in a text editor is the following:
1. Toggle between wrapped and unwrapped text
2. Block cutting and pasting (also called column editing)
3. Easy macro programming
4. Search and replace using regular expressions
5. Ability to import large datasets for editing
When we are programming, we want our text to wrap so we can read all of our code. When we import a data set that is wider than the screen, we do not want the data set to wrap: we want it to appear in its tabular format. Column editing allows us to cut and paste columns of text at will. A macro is
5 http://www.rstudio.org/
6 If your only goal is to learn R, then RStudio is more than sufficient.
Fig. 1.2 RStudio—An integrated development environment for R that runs in Linux, Mac
OS, or MS Windows. In this Figure, RStudio is running in MS Windows.
just a way for the text editor to learn a set of keystrokes (including search and
replace) that can be executed as needed. Searching using regular expressions
means searching for text based on relative attributes. For example, suppose
you want to find all words that begin with “b,” end with “g,” have any number
of letters in between but not “r” and “f.” Regular expression searching makes
this a trivial task. These are powerful features; once we use them regularly, we will wonder how we ever got along without them.
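For instance, here is a sketch of that search in R using grep with a regular expression (the sample words are made up):

> words <- c("big", "brag", "bog", "bring", "bug")
> grep("^b[^rf]*g$", words, value = TRUE) #no "r" or "f" inside
[1] "big" "bog" "bug"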
If we do not want to install a text editing program then we can use the
default text editor that comes with our computer operating system (gedit
in Ubuntu Linux, TextEdit in Mac OS, Notepad in Windows). However, it
is much better to install a text editor that works with R. My favorite text
editor is the free and open source GNU Emacs. GNU Emacs can be extended
with the “Emacs Speaks Statistics” (ESS) package. For more information on
Emacs and ESS pre-installed for Windows, visit http://ess.r-project.
org/. For the Mac OS, I recommend GNU Emacs for Mac OS7 (Figure 1.3
on the facing page) or Aquamacs.8
7 http://emacsformacosx.com/
8 http://aquamacs.org/
To the novice user, R may seem complicated and difficult to learn. In fact, for its immense power and versatility, R is easier to learn and deploy compared to other statistical software (e.g., SAS, Stata, SPSS). This is because R was built
from the ground up to be an efficient and intuitive programming language and
environment. If one understands the logic and structure of R, then learning
proceeds quickly. Just like a spoken language, once we know its rules of
grammar, syntax, and pronunciation, and can write legible sentences, we can
figure out how to communicate almost anything. Before the next chapter,
we want to describe the “forest”: the logic and structure of working with R
objects and epidemiologic data.
For our purposes, there are only five types of data objects in R9 and five types of actions we take on these objects (Table 1.4 on the next page). That's it! No more, no less. You will learn to create, name, index (subset), replace components of, and operate on these data objects using a systematic, comprehensive approach.

9 The sixth type of R object is a function. Functions can create, manipulate, operate on, and store data; however, we will use functions primarily to execute a series of R "commands" and not as primary data objects.
Table 1.4 Types of actions taken on R data objects and where to find examples

Action        Vector      Matrix      Array       List        Data Frame
Creating      Table 2.4   Table 2.11  Table 2.18  Table 2.24  Table 2.30
              (p. 34)     (p. 51)     (p. 66)     (p. 80)     (p. 89)
Naming        Table 2.5   Table 2.12  Table 2.19  Table 2.25  Table 2.31
              (p. 38)     (p. 55)     (p. 69)     (p. 81)     (p. 90)
Indexing      Table 2.6   Table 2.13  Table 2.20  Table 2.26  Table 2.32
              (p. 39)     (p. 57)     (p. 70)     (p. 81)     (p. 91)
Replacing     Table 2.7   Table 2.14  Table 2.21  Table 2.27  Table 2.33
              (p. 41)     (p. 58)     (p. 70)     (p. 84)     (p. 94)
Operating on  Table 2.8   Table 2.15  Table 2.22  Table 2.28  Table 2.34
              (p. 42)     (p. 59)     (p. 71)     (p. 84)     (p. 95)
              Table 2.9
              (p. 46)
As you learn about each new data object type, it will reinforce and extend what you learned previously.
A vector is a collection of elements (usually numbers):
> x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12
A matrix is a 2-dimensional representation of a vector:
> y <- matrix(x, nrow = 2)
> y
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 3 5 7 9 11
[2,] 2 4 6 8 10 12
An array is a 3 or more dimensional representation of a vector:

> z <- array(x, dim = c(2, 3, 2))
> z
, , 1

[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6

, , 2

[,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12
Problems
1.1. If you have not done so already, install R on your personal computer. What is the R workspace file on your operating system? What is the file path to your R workspace file? What is the name of this workspace file?
1.2. By default, which R packages come already loaded? What are the file
paths to the default R packages?
1.3. List all the objects in the current workspace. If there are none, create some data objects. Using one expression, remove all the objects in the current workspace.
1.4. One inch equals 2.54 centimeters. Correct the following R code and
create a conversion table.
inches <- 1:12
centimeters <- inches/2.54
cbind(inches, centimeters)
1.6. For the Celsius temperatures 0, 5, 10, 15, 20, 25, ..., 80, 85, 90, 95, 100,
construct a conversion table that displays the corresponding Fahrenheit tem-
peratures. Hint: to create the sequence of Celsius temperatures use seq(0,
100, 5).
1.7. BMI is a reliable indicator of total body fat, which is related to the risk of disease and death. The score is valid for both men and women, but it does have some limitations: it may overestimate body fat in athletes and others who have a muscular build, and it may underestimate body fat in older persons and others who have lost muscle mass.

Body Mass Index (BMI) is calculated from your weight in kilograms and height in meters:

\[ \mathrm{BMI} = \frac{kg}{m^2} \]
1 kg ≈ 2.2 lb
1 m ≈ 3.3 ft
Calculate your BMI (don’t report it to us).
1.8. Using Table 1.1 on page 8, explain in words, and use R to illustrate, the
difference between modulus and integer divide.
1.9. The equation \( y = \log_b(x) \) is equivalent to \( x = b^y \).
In R, the log function is to the base e. Implement the following R code
and study the graph:
curve(log(x), 0, 6)
abline(v = c(1, exp(1)), h = c(0, 1), lty = 2)
What kind of generalizations can you make about the natural logarithm and
its base—the number e?
1.10. Risk (R) is a probability bounded between 0 and 1. Odds is the following transformation of R:

\[ \mathrm{Odds} = \frac{R}{1-R} \]
Use the following code to plot the odds:
curve(x/(1-x), 0, 1)
Now, use the following code to plot the log(odds):
curve(log(x/(1-x)), 0, 1)
What kind of generalizations can you make about the log(odds) as a trans-
formation of risk?
1.11. Use the data in Table 1.6 on the next page. Assume one is HIV-negative. If the probability of infection per act is p, then the probability of not getting infected per act is \( (1-p) \). The probability of not getting infected after 2 consecutive acts is \( (1-p)^2 \), and after 3 consecutive acts is \( (1-p)^3 \). Therefore, the probability of not getting infected after n consecutive acts is \( (1-p)^n \), and the probability of getting infected after n consecutive acts is \( 1-(1-p)^n \). For each non-blood transfusion transmission probability (per-act risk) in Table 1.6, calculate the cumulative risk of being infected after one year (365 days) if one carries out the same act once daily for one year with an HIV-infected partner. Do these cumulative risks make intuitive sense? Why or why not?

Table 1.6 Estimated per-act risk (transmission probability) for acquisition of HIV, by exposure route to an infected source. Source: CDC [1]
1.12. The source function in R is used to “source” (read in) ASCII text files.
Take a group of R commands that worked from a previous problem above
and paste them into an ASCII text file and save it with the name job01.R.
Then, from the R command line, source the file. Here is how it looked on my
Linux computer running R:
> source("/home/tja/Documents/courses/ph251d/jobs/job01.R")
Describe what happened. Now, set echo option to TRUE.
> source("/home/tja/Documents/courses/ph251d/jobs/job01.R", echo = TRUE)
Describe what happened. To improve your understanding read the help file
on the source function.
1.13. Now run the source again (without and with echo = TRUE) but each
time create a log file using the sink function. Create two log files: job01.log1a
and job01.log1b.
> sink("/home/tja/Documents/courses/ph251d/jobs/job01.log1a")
> source("/home/tja/Documents/courses/ph251d/jobs/job01.R")
> sink() #closes connection
>
> sink("/home/tja/Documents/courses/ph251d/jobs/job01.log1b")
> source("/home/tja/Documents/courses/ph251d/jobs/job01.R", echo = TRUE)
> sink() #closes connection
1.14. Create a new job file (job02.R) with the following code:
n <- 365
per.act.risk <- c(0.5, 1, 5, 6.5, 10, 30, 50, 67)/10000
risks <- 1-(1-per.act.risk)^n
show(risks)
Source this file at the R command line and describe what happens.
> array(1:8, dim = c(2, 2, 2))
, , 1

[,1] [,2]
[1,] 1 3
[2,] 2 4

, , 2

[,1] [,2]
[1,] 5 7
[2,] 6 8

1 In other programming languages, vectors are either row vectors or column vectors. R vectors have no such orientation: a vector has no dimensions until we assign them.
If we try to include elements of different modes in an atomic data object, R
will coerce the data object into a single mode based on the following hierarchy:
character > numeric > logical. In other words, if an atomic data object
contains any character element, all the elements are coerced to character.
> c("hello", 4.56, FALSE)
[1] "hello" "4.56" "FALSE"
> c(4.56, FALSE)
[1] 4.56 0.00
A recursive data object can contain one or more data objects where each
object can be of any mode. Lists, data frames, and functions are recursive
data objects. Lists and data frames are of mode list, and functions are of
mode function (Table 2.1 on the next page).
A list is a collection of data objects without any restrictions:
> x <- c(1, 2, 3)
> y <- c("Male", "Female", "Male")
> z <- matrix(1:4, 2, 2)
> mylist <- list(x, y, z)
> mylist
[[1]]
[1] 1 2 3
[[2]]
[1] "Male" "Female" "Male"
[[3]]
[,1] [,2]
[1,] 1 3
[2,] 2 4
A data frame is a list with a 2-dimensional (tabular) structure. Epidemiologists are very experienced working with data frames, where each row usually represents data collected on an individual subject (also called records or observations) and columns represent fields for each type of data collected (also called variables).
> subjno <- c(1, 2, 3, 4)
> age <- c(34, 56, 45, 23)
> sex <- c("Male", "Male", "Female", "Male")
> dat <- data.frame(subjno, age, sex) #assemble into a data frame
Summarized in Table 2.1 are the key attributes of atomic and recursive data
objects. Data objects can also have class attributes. Class attributes are just
a way of letting R know that an object is “special,” allowing R to use specific
methods designed for that class of objects (e.g., print, plot, and summary
methods). The class function displays the class if it exists. For our purposes,
we do not need to know any more about classes.
Frequently, we need to assess the structure of data objects. We already know that all data objects have a mode and a length attribute. For example, let's explore the infert data set that comes with R. The infert data come from a matched case-control study evaluating the occurrence of female infertility after spontaneous and induced abortion.

At this point we know that the data object named "infert" is a list of length 8. To get more detailed information about the structure of infert, use the str function (str is short for "structure").
> str(infert)
‘data.frame’: 248 obs. of 8 variables:
$ education : Factor w/ 3 levels "0-5yrs",..: 1 1 1 1 ...
$ age : num 26 42 39 34 35 36 23 32 21 28 ...
$ parity : num 6 1 6 4 3 4 1 2 1 2 ...
$ induced : num 1 1 2 2 1 2 0 0 0 0 ...
$ case : num 1 1 1 1 1 1 1 1 1 1 ...
$ spontaneous : num 2 0 0 0 1 1 0 0 1 0 ...
$ stratum : int 1 2 3 4 5 6 7 8 9 10 ...
Great! This is better. We now know that infert is a data frame with 248 observations and 8 variables. The variable names and data types are displayed along with their first few values. In this case, we now have sufficient information to start manipulating and analyzing the infert data set.
Additionally, we can extract more detailed structural information that becomes useful when we want to extract data from an object for further manipulation or analysis (Table 2.2 on the next page). We will see extensive use of this when we start programming in R.
To get practice calling data from the command line, enter data() to display the available data sets in R. Then enter data(dataset_name) to load a data set. Study the examples in Table 2.2 on the following page and spend a few minutes exploring the structure of the data sets we have loaded. To display detailed information about a specific data set, use ?dataset_name at the command prompt (e.g., ?infert).
A vector is a collection of like elements (i.e., the elements all have the same mode). There are many ways to create vectors (see Table 2.4). The most common way of creating a vector is using the concatenation function c:
> #numeric
> chol <- c(136, 219, 176, 214, 164)
> chol
[1] 136 219 176 214 164
> #character
> fname <- c("Mateo", "Mark", "Luke", "Juan", "Jaime")
> fname
[1] "Mateo" "Mark" "Luke" "Juan" "Jaime"
A single digit is also a vector; that is, a vector of length = 1. Let’s confirm
this.
> 5
[1] 5
> is.vector(5)
[1] TRUE
In R, we use relational and logical operators (Table 2.3 on page 33) to conduct Boolean queries. Boolean operations are a methodological workhorse of data analysis. For example, suppose we have a vector of female movie stars and a corresponding vector of their ages (as of January 16, 2004), and we want to select a subset of the actors based on age criteria.
> movie.stars
[1] "Rebecca De Mornay" "Elisabeth Shue" "Amanda Peet"
[4] "Jennifer Lopez" "Winona Ryder" "Catherine Zeta Jones"
[7] "Reese Witherspoon"
> ms.ages
[1] 42 40 32 33 32 34 27
Let’s select the actors who are in their 30s. This is done using logical
vectors that are created by using relational operators (<, >, <=, >=, ==, !=).
Study the following example:
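> #select actors in their 30s using a logical vector
> movie.stars[ms.ages >= 30 & ms.ages < 40]
[1] "Amanda Peet" "Jennifer Lopez" "Winona Ryder"
[4] "Catherine Zeta Jones"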
To summarize, Table 2.4 lists common ways of creating vectors:

Table 2.4 Common ways of creating vectors

Function    Description               Example(s)
c           Concatenate elements      xx <- c(1, 2, 3); yy <- c("a", "b", "c")
                                      xx; yy
:           Generate integer          1:10
            sequence                  10:(-4)
seq         Generate sequence of      seq(1, 5, by = 0.5)
            numbers                   seq(1, 5, length = 3)
                                      zz <- c("a", "b", "c")
                                      seq(along = zz)
rep         Replicate argument        rep("Juan Nieve", 3)
                                      rep(1:3, 4)
                                      rep(1:3, 3:1)
which       Integer vector from       age <- c(8, NA, 7, 4)
            Boolean operation         which(age<5 | age>=8)
paste       Paste elements, creating  paste(c("A", "B", "C"), 1:3)
            a character string        paste(c("A", "B", "C"), 1:3, sep="")
[row#, ]    Indexing a matrix         xx <- matrix(1:8, nrow = 2, ncol = 4)
or          returns a vector          xx[2,]
[ ,col#]                              xx[,3]
sample      Sample from a vector      sample(c("H","T"), 20, replace = TRUE)
runif       Generate random           runif(n = 10, min = 0, max = 1)
rnorm       numbers from a            rnorm(10, mean = 50, sd = 19)
rbinom      probability               rbinom(n = 10, size = 20, p = 0.5)
rpois       distribution              rpois(n = 10, lambda = 15)
as.vector   Coerce data objects       mx <- matrix(1:4, nrow = 2, ncol = 2)
            into a vector             mx
                                      as.vector(mx)
vector      Create vector of          vector("character", 5)
            specified mode and        vector("numeric", 5)
            length                    vector("logical", 5)
character   Create vector of          character(5)
numeric     specified type            numeric(5)
logical                               logical(5)
For example, we can plot the normal probability density function

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[ \frac{-(x-\mu)^2}{2\sigma^2} \right] \]

Fig. 2.1 The normal density curve f(x) plotted against x (left), and a density histogram of rnorm(500) simulated standard normal values (right).
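Here is a minimal sketch of the code for the left panel (the range and step of x are assumptions):

> mu <- 0; sigma <- 1
> x <- seq(-4, 4, by = 0.1)
> fx <- (1/sqrt(2*pi*sigma^2))*exp(-(x - mu)^2/(2*sigma^2))
> plot(x, fx, type = "l", lwd = 2)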
For every value of x we calculated f(x), represented by the numeric vector fx. We then used the plot function to plot x vs. f(x). The optional argument type="l" produces a "line" and lwd=2 doubles the line width. For comparison, we also plotted a density histogram of 500 standard normal variates that were simulated using the rnorm function (Figure 2.1 on the preceding page).3
The rep function is used to replicate its arguments. Study the examples
that follow:
> rep(5, 2) #repeat 5 2 times
[1] 5 5
> rep(1:2, 5) # repeat 1:2 5 times
[1] 1 2 1 2 1 2 1 2 1 2
> rep(1:2, c(5, 5)) # repeat 1 5 times; repeat 2 5 times
[1] 1 1 1 1 1 2 2 2 2 2
> rep(1:2, rep(5, 2)) # equivalent to previous
[1] 1 1 1 1 1 2 2 2 2 2
> rep(1:5, 5:1) # repeat 1 5 times, repeat 2 4 times, etc
[1] 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5
The paste function pastes character strings:
> fname <- c("John", "Kenneth", "Sander")
> lname <- c("Snow", "Rothman", "Greenland")
> paste(fname, lname)
[1] "John Snow" "Kenneth Rothman" "Sander Greenland"
> paste("var", 1:7, sep="")
[1] "var1" "var2" "var3" "var4" "var5" "var6" "var7"
Indexing (subsetting) an object often results in a vector. To preserve the dimensionality of the original object, use the drop option.
> x <- matrix(1:8, 2, 4)
> x
[,1] [,2] [,3] [,4]
[1,] 1 3 5 7
[2,] 2 4 6 8
> x[2,] #index 2nd row
[1] 2 4 6 8
> x[2, , drop = FALSE] #index 2nd row; keep object structure
[,1] [,2] [,3] [,4]
[1,] 2 4 6 8
Up to now we have generated vectors of known numbers or character
strings. On occasion we need to generate random numbers or draw a sample
from a collection of elements. First, sampling from a vector returns a vector:
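For example (the output values vary from run to run):

> sample(1:10, size = 3) #returns a vector of length 3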
The first way of naming vector elements is when the vector is created:
> x <- c(chol = 234, sbp = 148, dbp = 78, age = 54)
> x
chol sbp dbp age
234 148 78 54
The second way is to create a character vector of names and then assign that
vector to the numeric vector using the names function:
> z <- c(234, 148, 78, 54)
> z
[1] 234 148 78 54
> names(z) <- c("chol", "sbp", "dbp", "age")
> z
chol sbp dbp age
234 148 78 54
> x[c(2, 4)] #include 2nd and 4th elements
sbp age
148 54
> x[-c(2, 4)] #exclude 2nd and 4th element
chol dbp
234 78
A logical vector indexes the positions that corresponds to the TRUEs. Here
is an example:
> x<=100 | x>200
chol sbp dbp age
TRUE FALSE TRUE TRUE
> x[x<=100 | x>200]
chol dbp age
234 78 54
Any expression that evaluates to a valid vector of integers, names, or
logicals can be used to index a vector.
> (samp1 <- sample(1:4, 8, replace = TRUE))
[1] 1 3 3 3 1 3 4 1
> x[samp1]
chol dbp dbp dbp chol dbp age chol
234 78 78 78 234 78 54 234
> (samp2 <- sample(names(x), 8, replace = TRUE))
[1] "dbp" "sbp" "sbp" "dbp" "dbp" "age" "dbp" "sbp"
> x[samp2]
dbp sbp sbp dbp dbp age dbp sbp
78 148 148 78 78 54 78 148
Notice that when we indexed by position or name we indexed the same position repeatedly. This will not work with logical vectors. In the example that follows, NA means "not available."
> (samp3 <- sample(c(TRUE, FALSE), 8, replace = TRUE))
[1] FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
> x[samp3]
dbp <NA> <NA> <NA> <NA>
78 NA NA NA NA
We have already seen that a vector can be indexed based on the characteristics of another vector.
> kid <- c("Tomasito", "Irene", "Luisito", "Angelita", "Tomas")
> age <- c(8, NA, 7, 4, NA)
> age<=7 # produces logical vector
[1] FALSE NA TRUE TRUE NA
> kid[age<=7] # index ’kid’ using ’age’
[1] NA "Luisito" "Angelita" NA
> kid[!is.na(age)] # remove missing values
[1] "Tomasito" "Luisito" "Angelita"
> kid[age<=7 & !is.na(age)]
[1] "Luisito" "Angelita"
A Boolean operation that returns a logical vector contains TRUE values where
the condition is true. To identify the position of each TRUE value we use the
which function. For example, using the same data above:
> which(age<=7) # which positions meet condition
[1] 3 4
> kid[which(age<=7)]
[1] "Luisito" "Angelita"
Notice that it was unnecessary to remove the missing values.
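For example, using the age vector from above, we can replace elements to create age categories (the labels and cutpoint here are assumptions):

> agecat <- age #copy the numeric vector
> agecat[age <= 7 & !is.na(age)] <- "young"
> agecat[age > 7 & !is.na(age)] <- "old"
> agecat
[1] "old" NA "young" "young" NA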
First, we made a copy of the numeric vector age and named it agecat. Then,
we replaced elements of agecat with character strings for each age category,
creating a character vector.
For practice study the examples in Table 2.7 on the preceding page and
spend a few minutes replacing vector elements.
First, we focus on operating on single numeric vectors (Table 2.8). This also gives us the opportunity to see how common mathematical notation is translated into simple R code.

To sum the elements of a numeric vector x of length n, \( \sum_{i=1}^{n} x_i \), use the sum function:
> # generate and sum 100 random standard normal values
> x <- rnorm(100)
> sum(x)
[1] -0.34744
To calculate a cumulative sum of a numeric vector x of length n, \( \sum_{i=1}^{k} x_i \) for \( k = 1, \ldots, n \), use the cumsum function, which returns a vector:

> # generate sequence of 2's and calculate cumulative sum
> x <- rep(2, 10)
> x
[1] 2 2 2 2 2 2 2 2 2 2
> cumsum(x)
[1] 2 4 6 8 10 12 14 16 18 20
To multiply the elements of a numeric vector x of length n, \( \prod_{i=1}^{n} x_i \), use the prod function:
> x <- c(1, 2, 3, 4, 5, 6, 7, 8)
> prod(x)
[1] 40320
To calculate the cumulative product of a numeric vector x of length n, \( \prod_{i=1}^{k} x_i \) for \( k = 1, \ldots, n \), use the cumprod function:
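For example:

> cumprod(c(1, 2, 3, 4, 5))
[1] 1 2 6 24 120

To calculate the standard deviation of a numeric vector x, use the sd function, which is equivalent to taking the square root of the variance: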
> sqrt(var(x))
[1] 0.9525654
> sd(x)
[1] 0.9525654
To sort a numeric or character vector use the sort function.
> ages <- c(8, 4, 7)
> sort(ages)
[1] 4 7 8
However, to sort one vector based on the ordering of another vector use the
order function.
> ages <- c(8, 4, 7)
> subjects <- c("Tomas", "Angela", "Luis")
> subjects[order(ages)]
[1] "Angela" "Luis" "Toms"
> # ’order’ returns positional integers for sorting
> order(ages)
[1] 2 3 1
Notice that the order function does not return the data values, but rather the indexing integers for sorting the vector ages (or another vector). For example, order(ages) returned the integer vector c(2, 3, 1), which means "move the 2nd element (age = 4) to the first position, move the 3rd element (age = 7) to the second position, and move the 1st element (age = 8) to the third position." Verify that sort(ages) and ages[order(ages)] are equivalent.
To sort a vector in reverse order combine the rev and sort functions.
> x <- c(12, 3, 14, 3, 5, 1)
> sort(x)
[1] 1 3 3 5 12 14
> rev(sort(x))
[1] 14 12 5 3 3 1
In contrast to the sort function, the rank function gives each element of a
vector a rank score but does not sort the vector.
> x <- c(12, 3, 14, 3, 5, 1)
> rank(x)
[1] 5.0 2.5 6.0 2.5 4.0 1.0
The median of a numeric vector is that value which puts 50% of the values below and 50% of the values above; in other words, the 50th percentile (or 0.5 quantile). For example, the median of c(4, 3, 1, 2, 5) is 3. For a vector of even length, the two middle values are averaged: the median of c(4, 3, 1, 2) is 2.5. To get the median value of a numeric vector, use the median or quantile function.
> ages <- c(23, 45, 67, 33, 20, 77)
> median(ages)
[1] 39
> quantile(ages, 0.5)
50%
39
To return the minimum value of a vector use the min function; for the max-
imum value use the max function. To get both the minimum and maximum
values use the range function.
> ages <- c(23, 45, 67, 33, 20, 77)
> min(ages)
[1] 20
> sort(ages)[1] # equivalent
[1] 20
> max(ages)
[1] 77
> sort(ages)[length(ages)] # equivalent
[1] 77
> range(ages)
[1] 20 77
> c(min(ages), max(ages)) # equivalent
[1] 20 77
To sample from a vector of length n, with each element having a default sampling probability of 1/n, use the sample function. Sampling can be done with or without replacement (the default is without replacement). If the sample size is greater than the length of the vector, then sampling must occur with replacement.
> coin <- c("H", "T")
> sample(coin, size = 10, replace = TRUE)
[1] "H" "H" "T" "T" "T" "H" "H" "H" "H" "T"
> sample(1:100, 15)
[1] 9 24 53 11 15 63 52 73 54 84 82 66 65 20 67
Next, we review selected functions that work with one or more vectors. Some of these functions manipulate vectors and others facilitate numerical operations.
In addition to creating vectors, the c function can be used to append
vectors.
> x <- 6:10
> y <- 20:24
> c(x, y)
[1] 6 7 8 9 10 20 21 22 23 24
The append function also appends vectors; however, one can specify at which
position.
> append(x, y)
[1] 6 7 8 9 10 20 21 22 23 24
> append(x, y, after = 2)
[1] 6 7 20 21 22 23 24 8 9 10
In contrast, the cbind and rbind functions concatenate vectors into a
matrix. During the outbreak of severe acute respiratory syndrome (SARS)
in 2003, a patient with SARS potentially exposed 111 passengers on board
an airline flight. Of the 23 passengers that sat “close” to the index case, 8
developed SARS; among the 88 passengers that did not sit “close” to the
index case, only 10 developed SARS [2]. Now, we can bind 2 vectors to create
a 2 × 2 table (matrix).
> case <- c("exposed" = 8, "unexposed" = 10)
> noncase <- c("exposed" = 15, "unexposed" = 78)
> cbind(case, noncase)
case noncase
exposed 8 15
unexposed 10 78
> rbind(case, noncase)
exposed unexposed
case 8 10
noncase 15 78
For the example that follows, let’s recreate the SARS data as two character
vectors.
> outcome <- c(rep("case", 8+10), rep("noncase", 15+78))
> tmp <- c("exposed", "unexposed")
> exposure <- c(rep(tmp, c(8, 10)), rep(tmp, c(15, 78)))
> cbind(exposure,outcome)[1:4,] # display 4 rows
exposure outcome
[1,] "exposed" "case"
[2,] "exposed" "case"
[3,] "exposed" "case"
[4,] "exposed" "case"
Now, use the table function to cross-tabulate one or more vectors.
> table(outcome, exposure)
exposure
outcome exposed unexposed
case 8 10
noncase 15 78
The ftable function creates a flat contingency table from one or more
vectors.
> ftable(outcome, exposure)
exposure exposed unexposed
outcome
case 8 10
noncase 15 78
This will come in handy later when we want to display a 3 or more dimensional table as a "flat" 2-dimensional table.
The outer function applies a function to every combination of elements
from two vectors. For example, create a multiplication table for the numbers
1 to 5.
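A minimal sketch:

> outer(1:5, 1:5, "*")
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 2 4 6 8 10
[3,] 3 6 9 12 15
[4,] 4 8 12 16 20
[5,] 5 10 15 20 25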
Table 2.10 Deaths among subjects who received tolbutamide and placebo in the University Group Diabetes Program (1970)

           Tolbutamide  Placebo
Deaths              30       21
Survivors          174      184
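Here is one way to enter Table 2.10 and compute the risks (a minimal sketch; the variable names follow the surrounding text):

dat <- matrix(c(30, 174, 21, 184), 2, 2)
rownames(dat) <- c("Deaths", "Survivors")
colnames(dat) <- c("Tolbutamide", "Placebo")
risks <- dat["Deaths", ]/apply(dat, 2, sum) #deaths / group totals
risk.ratio <- risks/risks["Placebo"] #placebo as reference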
We calculated the risks of death for each treatment group. We got the numerator by indexing the dat matrix using the row name "Deaths". The numerator is a vector containing the deaths for each group, and the denominator is the total number of subjects in each group. We calculated the risk ratios using the placebo group as the reference.
odds <- risks/(1-risks)
odds.ratio <- odds/odds[2] #odds ratio
Using the definition of the odds, we calculated the odds of death for each
treatment group. Then we calculated the odds ratios using the placebo group
as the reference.
dat
rbind(risks, risk.ratio, odds, odds.ratio)
Finally, we display the dat table we created. We also created a table of results
by row binding the vectors using the rbind function.
In the sections that follow we will cover the necessary concepts to make
the previous analysis routine.
There are several ways to create matrices (Table 2.11 on the next page). In
general, we create or use matrices in the following ways:
• Contingency tables (cross tabulations)
• Spreadsheet calculations and display
• Collecting results into tabular form
• Results of 2-variable equations
In the previous section we used the matrix function to create the 2 × 2 table
for the UGDP clinical trial:
> dat <- matrix(c(30, 174, 21, 184), 2, 2)
> rownames(dat) <- c("Deaths", "Survivors")
> colnames(dat) <- c("Tolbutamide", "Placebo")
> dat
Tolbutamide Placebo
Deaths 30 21
Survivors 174 184
Alternatively, we can create a 2-way contingency table using the table function with fields from a data set:

> dat2 <- read.csv("http://www.medepi.net/data/ugdp.txt")
> table(dat2$Status, dat2$Treatment)

Placebo Tolbutamide
Deaths 21 30
Survivors 184 174
Alternatively, the xtabs function cross tabulates using a formula interface.
An advantage of this function is that the field names are included.
> xtabs(~Status + Treatment, data = dat2)
Treatment
Status Placebo Tolbutamide
Deaths 21 30
Survivors 184 174
The xtabs function can also stratify by a third field, such as age group:

> xtabs(~Status + Treatment + Agegrp, data = dat2)
, , Agegrp = <55

Treatment
Status Placebo Tolbutamide
Deaths 5 8
Survivors 115 98
, , Agegrp = 55+
Treatment
Status Placebo Tolbutamide
Deaths 16 22
Survivors 69 76
\( R(0, t) = 1 - e^{-\lambda t} \)
A 2-way contingency table from the table or xtabs functions does not have
margin totals. However, we can construct a numeric matrix that includes the
totals. Using the UGDP data again,
> dat2 <- read.table("http://www.medepi.net/data/ugdp.txt",
+ header = TRUE, sep=",")
> tab2 <- xtabs(~Status + Treatment, data = dat2)
> rowt <- tab2[,1] + tab2[,2]
> tab2a <- cbind(tab2, Total = rowt)
> colt <- tab2a[1,] + tab2a[2,]
> tab2b <- rbind(tab2a, Total = colt)
> tab2b
Placebo Tolbutamide Total
Deaths 21 30 51
Survivors 184 174 358
Total 205 204 409
This table (tab2b) is primarily for display purposes.
For example, consider the 2-variable equation

\[ z = xy \]

and suppose x = {1, 2, 3, 4, 5} and y = {6, 7, 8, 9, 10}. Here's the long way to create a matrix for this equation:
> x <- 1:5; y <- 6:10
> z <- matrix(NA, 5, 5) #create empty matrix of missing values
> for(i in 1:5){
+ for(j in 1:5){
+ z[i, j] <- x[i]*y[j]
+ }
+ }
> rownames(z) <- x; colnames(z) <- y
> z
6 7 8 9 10
1 6 7 8 9 10
2 12 14 16 18 20
3 18 21 24 27 30
4 24 28 32 36 40
5 30 35 40 45 50
Okay, but the outer function is much better for this task:
> x <- 1:5; y <- 6:10
> z <- outer(x, y, "*")
> rownames(z) <- x; colnames(z) <- y
> z
6 7 8 9 10
1 6 7 8 9 10
2 12 14 16 18 20
3 18 21 24 27 30
4 24 28 32 36 40
5 30 35 40 45 50
In fact, the outer function can be used to calculate the “surface” for any
2-variable equation (more on this later).
A matrix can include field names for its rows and columns. For example, here is the UGDP table with the field names Outcome and Treatment:

> tab
Treatment
Outcome Tolbutamide Placebo
Deaths 30 21
Survivors 174 184
If a matrix does not have field names, we can add them after the fact, but
we must use the names and dimnames functions together. Having field names
is necessary if the row and column names are not self-explanatory, as this
example illustrates.
> y <- matrix(c(30, 174, 21, 184), 2, 2)
> rownames(y) <- c("Yes", "No")
> colnames(y) <- c("Yes", "No")
> y #labels not informative
Yes No
Yes 30 21
No 174 184
> #add field names
> names(dimnames(y)) <- c("Death", "Tolbutamide")
> y
Tolbutamide
Death Yes No
Yes 30 21
No 174 184
Study and test the examples in Table 2.12 on the previous page.
Logical expressions can also be used to index a matrix. For example, using a matrix dat with columns age, chol, and sbp:

> dat[,"sbp"]<140
[1] TRUE FALSE FALSE TRUE TRUE
> tmp <- dat[,"age"]<60 & dat[,"sbp"]<140
> tmp
[1] TRUE FALSE FALSE TRUE FALSE
> dat[tmp,]
age chol sbp
[1,] 45 145 124
[2,] 44 144 134
Notice that the tmp logical vector is the intersection of the logical vectors
separated by the logical operator &.
> tab
Treatment
Outcome Tolbutamide Placebo
Deaths 30 21
Survivors 174 184
We can transpose the matrix using the t function.
> t(tab)
Outcome
Treatment Deaths Survivors
Tolbutamide 30 174
Placebo 21 184
We can reverse the order of the rows and/or columns.
> tab[2:1,] #reverse rows
Treatment
Outcome Tolbutamide Placebo
Survivors 174 184
Deaths 30 21
> tab[,2:1] #reverse columns
Treatment
Outcome Placebo Tolbutamide
Deaths 21 30
Survivors 184 174
> tab[2:1,2:1] #reverse rows and columns
Treatment
Outcome Placebo Tolbutamide
Survivors 184 174
Deaths 21 30
> tab
Treatment
Outcome Tolbutamide Placebo
Deaths 30 21
Survivors 174 184
> tab2 <- cbind(tab, Total=rowSums(tab))
> rbind(tab2, Total=colSums(tab2))
Tolbutamide Placebo Total
Deaths 30 21 51
Survivors 174 184 358
Total 204 205 409
For convenience, the addmargins function calculates and displays the marginal totals with the original data in one step.
> addmargins(tab)
Treatment
Outcome Tolbutamide Placebo Sum
Deaths 30 21 51
Survivors 174 184 358
Sum 204 205 409
The power of the apply function comes from our ability to pass many
functions (including our own) to it. For practice, combine the apply function
with functions from Table 2.8 on page 42 to conduct operations on rows and
columns of a matrix.
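For example, to get the row totals of the UGDP table:

> apply(tab, 1, sum)
Deaths Survivors
51 358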
The sweep function is another important and versatile function for conducting operations across rows or columns of a matrix. This function "sweeps" (operates on) a row or column of a matrix using some function and a value (usually derived from the row or column values). To understand this, we consider an example involving a single vector. For a given integer vector x, to convert the values of x into proportions involves two steps:
> x <- c(1, 2, 3, 4, 5)
> sumx <- sum(x) #Step 1: summation
> propx <- x/sumx #Step 2: division (the "sweep")
> propx
[1] 0.066667 0.133333 0.200000 0.266667 0.333333
To apply this equivalent operation across rows or columns of a matrix requires
the sweep function.
For example, to calculate the row and column distributions of a 2-way
table we combine the apply (step 1) and the sweep (step 2) functions:
> tab
Treatment
Outcome Tolbutamide Placebo
Deaths 30 21
Survivors 174 184
> rtot <- apply(tab, 1, sum) #row totals
> tab.rowdist <- sweep(tab, 1, rtot, "/")
> tab.rowdist
Treatment
Outcome Tolbutamide Placebo
Deaths 0.58824 0.41176
Survivors 0.48603 0.51397
> ctot <- apply(tab, 2, sum) #column totals
> tab.coldist <- sweep(tab, 2, ctot, "/")
> tab.coldist
Treatment
Outcome Tolbutamide Placebo
Deaths 0.14706 0.10244
Survivors 0.85294 0.89756
Because R is a true programming language, these can be combined into single
steps:
> sweep(tab, 1, apply(tab, 1, sum), "/") #row distribution
Treatment
Outcome Tolbutamide Placebo
Deaths 0.58824 0.41176
Survivors 0.48603 0.51397
> sweep(tab, 2, apply(tab, 2, sum), "/") #column distribution
Treatment
Outcome Tolbutamide Placebo
Deaths 0.14706 0.10244
Survivors 0.85294 0.89756
For convenience, R provides prop.table. However, this function just uses
the apply and sweep functions.
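For example, this reproduces the row distribution calculated above:

> prop.table(tab, 1) #row distribution
Treatment
Outcome Tolbutamide Placebo
Deaths 0.58824 0.41176
Survivors 0.48603 0.51397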
Table 2.16 Deaths among subjects who received tolbutamide and placebo in the University Group Diabetes Program (1970), stratifying by age

                 Age<55               Age≥55              Combined
           Tolbutamide Placebo  Tolbutamide Placebo  Tolbutamide Placebo
Deaths               8       5           22      16           30      21
Survivors           98     115           76      69          174     184
Total              106     120           98      85          204     205
The randomized clinical trial shown previously, comparing the number of deaths among diabetic subjects who received tolbutamide vs. placebo, is now also stratified by age group (Table 2.16). This is a 3-dimensional array: outcome status vs. treatment status vs. age group. Let's see how we can represent these data in R.
> tdat <- c(8, 98, 5, 115, 22, 76, 16, 69)
> tdat <- array(tdat, c(2, 2, 2))
> dimnames(tdat) <- list(Outcome=c("Deaths", "Survivors"),
+ Treatment=c("Tolbutamide", "Placebo"),
+ "Age group"=c("Age<55", "Age>=55"))
> tdat
, , Age group = Age<55
Treatment
Outcome Tolbutamide Placebo
Deaths 8 5
Survivors 98 115
, , Age group = Age>=55

Treatment
Outcome Tolbutamide Placebo
Deaths 22 16
Survivors 76 69
R displays the first stratum (tdat[,,1]) then the second stratum (tdat[,,2]).
Our goal now is to understand how to generate and operate on these types of
arrays. Before we can do this we need to thoroughly understand the structure
of arrays.
Let's study a 4-dimensional array. Displayed in Table 2.17 on the following page are the year 2000 population estimates for Alameda and San Francisco Counties by age, ethnicity, and sex. The first dimension is age category, the second dimension is ethnicity, the third dimension is sex, and the fourth dimension is county. Learning how to visualize this 4-dimensional structure in R will enable us to visualize arrays of any number of dimensions.
Table 2.17 Example of 4-dimensional array: Year 2000 population estimates by age, ethnicity, sex, and county

                                    Ethnicity
County/Sex     Age      White  AfrAmer  AsianPI  Latino  Multirace  AmerInd
Alameda
  Female       <=19     58160    31765    40653   49738      10120      839
               20–44   112326    44437    72923   58553       7658     1401
               45–64    82205    24948    33236   18534       2922      822
               65+      49762    12834    16004    7548       1014      246
  Male         <=19     61446    32277    42922   53097      10102      828
               20–44   115745    36976    69053   69233       6795     1263
               45–64    81332    20737    29841   17402       2506      687
               65+      33994     8087    11855    5416        711      156
San Francisco
  Female       <=19     14355     6986    23265   13251       2940      173
               20–44    85766    10284    52479   23458       3656      526
               45–64    35617     6890    31478    9184       1144      282
               65+      27215     5172    23044    5773        554      121
  Male         <=19     14881     6959    24541   14480       2851      165
               20–44   105798    11111    48379   31605       3766      782
               45–64    43694     7352    26404    8674       1220      354
               65+      20072     3329    17190    3428        450       76
Fig. 2.2 Schematic representation of a 4-dimensional array (Year 2000 population esti-
mates by age, race, sex, and county)
In R, arrays are most often produced with the array, table, or xtabs func-
tions (Table 2.18). As in the previous example, the array function works
much like the matrix function except the array function can specify 1 or
more dimensions, and the matrix function only works with 2 dimensions.
> array(1, dim = 1)
[1] 1
> array(1, dim = c(1, 1))
[,1]
[1,] 1
> array(1, dim = c(1, 1, 1))
, , 1
[,1]
[1,] 1
The table function cross tabulates 2 or more categorical vectors: character
vectors or factors. In R, categorical data are represented as factors (more
on this later). In contrast, using a formula interface, the xtabs function
cross tabulates 2 or more factors from a data frame. Additionally, the xtabs
function includes field names (which is highly preferred). For illustration, we
will cross tabulate character vectors.
, , = 55+

Placebo Tolbutamide
Death 16 22
Survivor 69 76

, , = <55

Placebo Tolbutamide
Death 5 8
Survivor 115 98
The xtabs function will not work on character vectors.
By default, R converts character fields into factors. With factors, both the
table and xtabs functions cross tabulate the fields.
> #read in data and convert character vectors to factors
> udat2 <- read.csv("http://www.medepi.net/data/ugdp.txt")
> str(udat2)
’data.frame’: 409 obs. of 3 variables:
$ Status : Factor w/ 2 levels "Death","Survivor": 1 1 1 1 ...
$ Treatment: Factor w/ 2 levels "Placebo","Tolbutamide": 2 2 ...
$ Agegrp : Factor w/ 2 levels "55+","<55": 2 2 2 2 2 2 2 ...
> table(udat2$Status, udat2$Treatment, udat2$Agegrp)
, , = 55+
Placebo Tolbutamide
Death 16 22
Survivor 69 76
, , = <55
Placebo Tolbutamide
Death 5 8
Survivor 115 98
> xtabs(~Status + Treatment + Agegrp, data = udat2)
, , Agegrp = 55+
Treatment
Status Placebo Tolbutamide
Death 16 22
Survivor 69 76
, , Agegrp = <55
Treatment
Status Placebo Tolbutamide
Death 5 8
Survivor 115 98
Notice that the xtabs function above included the field names. Field names
can be added manually with the table functions:
> table(Outcome = udat2$Status, Therapy = udat2$Treatment,
+ Age = udat2$Agegrp)
, , Age = 55+
Therapy
Outcome Placebo Tolbutamide
Death 16 22
Survivor 69 76
, , Age = <55
Therapy
Outcome Placebo Tolbutamide
Death 5 8
Survivor 115 98
Recall that the ftable function creates a flat contingency table from categorical vectors. The as.table function converts the flat contingency table back into a multidimensional array.

> ftab <- ftable(udat2$Agegrp, udat2$Treatment, udat2$Status)
> ftab
                Death Survivor

<55 Placebo         5      115
    Tolbutamide     8       98
55+ Placebo        16       69
    Tolbutamide    22       76
> as.table(ftab)
, , = Death

    Placebo Tolbutamide
<55       5           8
55+      16          22

, , = Survivor

    Placebo Tolbutamide
<55     115          98
55+      69          76
With the exception of the aperm function, operating on an array (Table 2.22)
is an extension of operating on a matrix (Table 2.15 on page 59). Consider
Table 2.23 Example of 3-dimensional array with marginal totals: Primary and secondary syphilis morbidity by age, race, and sex, United States, 1989

                          Ethnicity
Age (years)   Sex      White  Black  Other  Total
≤14           Male         2     31      7     40
              Female      14    165     11    190
              Total       16    196     18    230
15-19         Male        88   1412    210   1710
              Female     253   2257    158   2668
              Total      341   3669    368   4378
20-24         Male       407   4059    654   5120
              Female     475   4503    307   5285
              Total      882   8562    961  10405
25-29         Male       550   4121    633   5304
              Female     433   3590    283   4306
              Total      983   7711    916   9610
30-34         Male       564   4453    520   5537
              Female     316   2628    167   3111
              Total      880   7081    687   8648
35-44         Male       654   3858    492   5004
              Female     243   1505    149   1897
              Total      897   5363    641   6901
45-54         Male       323   1619    202   2144
              Female      55    392     40    487
              Total      378   2011    242   2631
55+           Male       216    823    108   1147
              Female      24     92     15    131
              Total      240    915    123   1278
the number of primary and secondary syphilis cases in the United States, 1989, stratified by sex, ethnicity, and age (Table 2.23). This table contains the marginal and joint distributions of cases. Let's read in the original data and reproduce the table results.
> sdat3 <- read.csv("http://www.medepi.net/data/syphilis89c.txt")
> str(sdat3)
‘data.frame’: 44081 obs. of 3 variables:
$ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 ...
$ Race: Factor w/ 3 levels "White","Black",..: 1 1 1 1 1 1 ...
> sdat <- xtabs(~Sex + Race + Age, data = sdat3)
> sdat
, , Age = <=14

Race
Sex White Black Other
Male 2 31 7
Female 14 165 11
, , Age = 15-19
Race
Sex White Black Other
Male 88 1412 210
Female 253 2257 158
, , Age = 20-24
Race
Sex White Black Other
Male 407 4059 654
Female 475 4503 307
, , Age = 25-29
Race
Sex White Black Other
Male 550 4121 633
Female 433 3590 283
, , Age = 30-34
Race
Sex White Black Other
Male 564 4453 520
Female 316 2628 167
, , Age = 35-44
Race
Sex White Black Other
Male 654 3858 492
Female 243 1505 149
, , Age = 45-54
Race
Sex White Black Other
Male 323 1619 202
Female 55 392 40
, , Age = 55+
Race
Sex White Black Other
Male 216 823 108
Female 24 92 15
To get marginal totals for one dimension, use the apply function and
specify the dimension for stratifying the results.
> sum(sdat) #total
[1] 44081
> apply(X = sdat, MARGIN = 1, FUN = sum) #by sex
Male Female
26006 18075
> apply(sdat, 2, sum) #by race
White Black Other
4617 35508 3956
> apply(sdat, 3, sum) #by age
<=14 15-19 20-24 25-29 30-34 35-44 45-54 55+
230 4378 10405 9610 8648 6901 2631 1278
To get the joint marginal totals for 2 or more dimensions, use the apply function and specify the dimensions for stratifying the results. This means that the function that is passed to apply is applied across the other, non-stratified dimensions.
> apply(sdat, c(1, 2), sum) #by sex and race
Race
Sex White Black Other
Male 2804 20376 2826
Female 1813 15132 1130
> apply(sdat, c(1, 3), sum) #by sex and age
Age
Sex <=14 15-19 20-24 25-29 30-34 35-44 45-54 55+
Male 40 1710 5120 5304 5537 5004 2144 1147
Female 190 2668 5285 4306 3111 1897 487 131
> apply(sdat, c(3, 2), sum) #by age and race
Race
Age White Black Other
<=14 16 196 18
15-19 341 3669 368
20-24 882 8562 961
25-29 983 7711 916
30-34 880 7081 687
35-44 897 5363 641
45-54 378 2011 242
55+ 240 915 123
In R, arrays are displayed by the 1st and 2nd dimensions, stratified by the
remaining dimensions. To change the order of the dimensions, and hence the
display, use the aperm function. For example, the syphilis case data is most
efficiently displayed when it is stratified by race, age, and sex:
> sdat.ras <- aperm(sdat, c(2, 3, 1))
> sdat.ras
, , Sex = Male
Age
Race <=14 15-19 20-24 25-29 30-34 35-44 45-54 55+
White 2 88 407 550 564 654 323 216
Black 31 1412 4059 4121 4453 3858 1619 823
Other 7 210 654 633 520 492 202 108
, , Sex = Female
Age
Race <=14 15-19 20-24 25-29 30-34 35-44 45-54 55+
White 14 253 475 433 316 243 55 24
Black 165 2257 4503 3590 2628 1505 392 92
Other 11 158 307 283 167 149 40 15
Another method for changing the display of an array is to convert it into
a flat contingency table using the ftable function. For example, to display
Table 2.23 on page 72 as a flat contingency table in R (but without the
marginal totals), we use the following code:
> sdat.asr <- aperm(sdat, c(3,1,2)) #rearrange to age, sex, race
> ftable(sdat.asr) #convert to 2-D flat table
Race White Black Other
Age Sex
<=14 Male 2 31 7
Female 14 165 11
15-19 Male 88 1412 210
Female 253 2257 158
20-24 Male 407 4059 654
Female 475 4503 307
25-29 Male 550 4121 633
Female 433 3590 283
30-34 Male 564 4453 520
Female 316 2628 167
35-44 Male 654 3858 492
Female 243 1505 149
45-54 Male 323 1619 202
Female 55 392 40
55+ Male 216 823 108
Female 24 92 15
This ftable object can be treated as a matrix, but it cannot be transposed.
Notice that we can combine the ftable with addmargins:
> ftable(addmargins(sdat.asr))
Race Black Other White Sum
Age Sex
15-19 Female 2257 158 253 2668
Male 1412 210 88 1710
Sum 3669 368 341 4378
20-24 Female 4503 307 475 5285
Male 4059 654 407 5120
Sum 8562 961 882 10405
25-29 Female 3590 283 433 4306
Male 4121 633 550 5304
Sum 7711 916 983 9610
30-34 Female 2628 167 316 3111
Male 4453 520 564 5537
Sum 7081 687 880 8648
35-44 Female 1505 149 243 1897
Male 3858 492 654 5004
Sum 5363 641 897 6901
45-54 Female 392 40 55 487
Male 1619 202 323 2144
Sum 2011 242 378 2631
<=14 Female 165 11 14 190
Male 31 7 2 40
Sum 196 18 16 230
55+ Female 92 15 24 131
Up to now, we have been working with atomic data objects (vector, ma-
trix, array). In contrast, lists, data frames, and functions are recursive data
objects. Recursive data objects have more flexibility in combining diverse
data objects into one object. A list provides the most flexibility. Think of a
list object as a collection of “bins” that can contain any R object (see Fig-
ure 2.5.1 on the following page). Lists are very useful for collecting results
Fig. 2.4 Schematic representation of a list of length four. The first bin [1] contains a
smiling face [[1]], the second bin [2] contains a flower [[2]], the third bin [3] contains
a lightning bolt [[3]], and the fourth bin [[4]] contains a heart [[4]]. When indexing a
list object, single brackets [·] indexes the bin, and double brackets [[·]] indexes the bin
contents. If the bin has a name, then $name also indexes the contents.
of an analysis or a function into one data object where all its contents are
readily accessible by indexing.
For example, using the UGDP clinical trial data, suppose we perform
Fisher’s exact test for testing the null hypothesis of independence of rows
and columns in a contingency table with fixed marginals.
> udat <- read.csv("http://www.medepi.net/data/ugdp.txt")
> tab <- table(udat$Status, udat$Treatment)[,2:1]
> tab
Tolbutamide Placebo
Death 30 21
Survivor 174 184
> ftab <- fisher.test(tab)
> ftab
Fisher's Exact Test for Count Data
data: tab
p-value = 0.1813
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.80138 2.88729
sample estimates:
odds ratio
1.5091
The default display only shows partial results. The complete results are stored
in the object ftab. Let's evaluate the structure of ftab and extract some
results:
> str(ftab)
List of 7
 $ p.value    : num 0.181
 $ conf.int   : num [1:2] 0.801 2.887
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num 1.51
  ..- attr(*, "names")= chr "odds ratio"
 $ null.value : Named num 1
  ..- attr(*, "names")= chr "odds ratio"
 $ alternative: chr "two.sided"
 $ method     : chr "Fisher's Exact Test for Count Data"
 $ data.name  : chr "tab"
 - attr(*, "class")= chr "htest"
> ftab$p.value #extract the p-value by name
[1] 0.18126
> ftab$conf.int #extract the confidence interval
[1] 0.80138 2.88729
attr(,"conf.level")
[1] 0.95
To create a list directly, use the list function. A list is a convenient method
to save results in our customized functions. For example, here’s a function to
calculate an odds ratio from a 2 × 2 table:
orcalc <- function(x){
or <- (x[1,1]*x[2,2])/(x[1,2]*x[2,1])
pval <- fisher.test(x)$p.value
list(data = x, odds.ratio = or, p.value = pval)
}
The orcalc function has been loaded in R, and now we run the function on
the UGDP data.
> tab #display 2x2 table
Tolbutamide Placebo
Death 30 21
Survivor 174 184
> orcalc(tab) #run function
$data
Tolbutamide Placebo
Death 30 21
Survivor 174 184
$odds.ratio
[1] 1.5107
$p.value
[1] 0.18126
For additional practice, study and implement the examples in Table 2.24.
If list components (bins) are unnamed, we can index the list by bin position
with single or double brackets. Single brackets [·] index one or more bins,
and double brackets [[·]] index the contents of a single bin only.
> mylist1 <- list(1:5, matrix(1:4,2,2), c("Juan Nieve", "Guillermo Farro"))
> mylist1[c(1, 3)] #index bins 1 and 3
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] "Juan Nieve" "Guillermo Farro"
Applying the summary function to a model object creates a list object with
more detailed results. This too has a default display, or we can index list
components by name.
> summod1 <- summary(mod1)
> summod1 #default display of more detailed results
Call:
coxph(formula = Surv(rep(1, 248), case) ~ spontaneous +
induced + strata(stratum), data = infert, method = "exact")
n= 248
coef exp(coef) se(coef) z p
spontaneous 1.99 7.29 0.352 5.63 1.8e-08
induced 1.41 4.09 0.361 3.91 9.4e-05
Because lists can have complex structural components, there are not many
operations we will want to do on lists. When we want to apply a function
to each component (bin) of a list, we use the lapply or sapply function.
These functions are identical except that sapply “simplifies” the final result,
if possible.
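For example, here is a small sketch applying the length function to each bin of a hypothetical list:
> mylist0 <- list(a = 1:5, b = c(10, 20), c = c(TRUE, FALSE, TRUE))
> lapply(mylist0, length) #returns a list
$a
[1] 5
$b
[1] 2
$c
[1] 3
> sapply(mylist0, length) #simplifies to a named vector
a b c
5 2 3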
The do.call function applies a function to the entire list, using each
component as an argument. For example, consider a list where each bin con-
tains a vector and we want to cbind the vectors.
> mylist <- list(vec1=1:5, vec2=6:10, vec3=11:15)
> cbind(mylist) #will not work
mylist
vec1 Integer,5
vec2 Integer,5
vec3 Integer,5
> do.call(cbind, mylist) #works
vec1 vec2 vec3
[1,] 1 6 11
[2,] 2 7 12
[3,] 3 8 13
[4,] 4 9 14
[5,] 5 10 15
For additional practice, study and implement the examples in Table 2.28
on the facing page.
Epidemiologists are familiar with tabular data sets where each row is a record
and each column is a field. A record can be data collected on individuals or
groups. We usually refer to the field name as a variable (e.g., age, gender,
ethnicity). Fields can contain numeric or character data. In R, these types of
data sets are handled by data frames. Each column of a data frame is usually
either a factor or numeric vector, although it can have complex, character,
or logical vectors. Data frames have the functionality of matrices and lists.
For example, consider the infert data set, from a matched case-control
study published in 1976 that evaluated whether infertility was associated
with prior spontaneous or induced abortions. Let's examine its structure:
> data(infert)
> str(infert)
‘data.frame’: 248 obs. of 8 variables:
$ education : Factor w/ 3 levels "0-5yrs",..: 1 1 ...
$ age : num NA 45 NA 23 35 36 23 32 21 28 ...
$ parity : num 6 1 6 4 3 4 1 2 1 2 ...
$ induced : num 1 1 2 2 1 2 0 0 0 0 ...
$ case : num 1 1 1 1 1 1 1 1 1 1 ...
The fields are obviously vectors. Let’s explore a few of these vectors to see
what we can learn about their structure in R.
> #age variable
> infert$age
[1] 26 42 39 34 35 36 23 32 21 28 29 37 31 29 31 27 30 26
...
[235] 25 32 25 31 38 26 31 31 25 31 34 35 29 23
> mode(infert$age)
[1] "numeric"
> class(infert$age)
[1] "numeric"
> #stratum variable
> mode(infert$stratum)
[1] "numeric"
> class(infert$stratum)
[1] "integer"
> #education variable
> mode(infert$education)
[1] "numeric"
> class(infert$education)
[1] "factor"
What have we learned so far? In the infert data frame, age is a vector
of mode “numeric” and class “numeric,” stratum is a vector of mode “nu-
meric” and class “integer,” and education is a vector of mode “numeric”
and class “factor.” The numeric vectors are straightforward and easy to un-
derstand. However, a factor, R’s representation of categorical data, is a bit
more complicated.
Contrary to intuition, a factor is a numeric vector, not a character vector,
although it may have been created from a character vector (shown later). To
see the “true” education factor use the unclass function:
> z <- unclass(infert$education)
> z
[1] 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
...
[244] 3 3 3 3 3
attr(,"levels")
[1] "0-5yrs" "6-11yrs" "12+ yrs"
> mode(z)
[1] "numeric"
> class(z)
[1] "integer"
Now let’s create a factor from a character vector and then unclass it:
> cointoss <- sample(c("Head","Tail"), 100, replace = TRUE)
> cointoss
[1] "Tail" "Head" "Head" "Tail" "Tail" "Tail" "Head"
...
[99] "Tail" "Head"
> fct <- factor(cointoss)
> fct
[1] Tail Head Head Tail Tail Tail Head Head Head Tail Head
...
[100] Head
Levels: Head Tail
> unclass(fct)
[1] 2 1 1 2 2 2 1 1 1 2 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 2 2
[28] 1 2 2 1 1 2 1 2 2 1 1 1 1 1 2 2 1 1 2 2 2 1 1 2 2 2 1
[55] 1 1 1 1 2 1 1 2 2 2 1 1 2 2 2 2 1 2 2 1 1 1 2 1 1 2 2
[82] 1 1 2 1 1 1 2 1 1 1 1 2 1 1 1 2 1 2 1
attr(,"levels")
[1] "Head" "Tail"
Notice that we can still recover the original character vector using the
as.character function:
> as.character(cointoss)
[1] "Tail" "Head" "Head" "Tail" "Tail" "Tail" "Head"
...
[99] "Tail" "Head"
> as.character(fct)
[1] "Tail" "Head" "Head" "Tail" "Tail" "Tail" "Head"
...
[99] "Tail" "Head"
Okay, let’s create an ordered factor; that is, levels of a categorical vari-
able that have natural ordering. For this we set ordered=TRUE in the factor
function:
> samp <- sample(c("Low","Medium","High"), 100, replace=TRUE)
> ofac1 <- factor(samp, ordered=T)
> ofac1
[1] Low Medium High Medium Medium Medium Medium
...
[99] High High
Levels: High < Low < Medium
> table(ofac1) #levels and labels not in natural order
ofac1
High Low Medium
43 25 32
However, notice that the ordering was done alphabetically, which is not
what we want. To change this, use the levels option in the factor
function:
> ofac2 <- factor(samp, levels=c("Low","Medium","High"), ordered=T)
> ofac2
[1] Low Medium High Medium Medium Medium Medium
...
[99] High High
Levels: Low < Medium < High
> table(ofac2)
ofac2
Low Medium High
25 32 43
Great — this is exactly what we want! For review, Table 2.29 on the next
page summarizes the variable types in epidemiology and how they are repre-
sented in R. Factors (unordered and ordered) are used to represent nominal
and ordinal categorical data. The infert data set contains nominal factors
and the esoph data set contains ordinal factors.
Table 2.29 Variable types in epidemiologic data and their representations in R data
frames

Variable type   Examples in data      Mode     Class           Examples in R1
Numeric
  Continuous    3.45, 2/3             numeric  numeric         infert$age
  Discrete      1, 2, 3, 4, ...       numeric  integer         infert$stratum
Categorical
  Nominal       male vs. female       numeric  factor          infert$education
  Ordinal       low < medium < high   numeric  ordered factor  esoph$agegp
1. First load data: data(infert); data(esoph)
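For instance, we can confirm these representations at the command prompt:
> data(infert); data(esoph)
> class(infert$education)
[1] "factor"
> class(esoph$agegp)
[1] "ordered" "factor"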
Factors can also be created directly from vectors as described in the previous
section.
Everything that applies to naming list components (Table 2.25 on page 81)
also applies to naming data frame components (Table 2.31). In general, we
may be interested in renaming variables (fields) or row names of a data
frame, or renaming the levels (possible values) for a given factor (categorical
variable). For example, consider the Oswego data set.
> odat <- read.table("http://www.medepi.net/data/oswego.txt",
+ sep="", header=TRUE, na.strings=".")
> odat[1:5,1:8] #Display partial data frame
id age sex meal.time ill onset.date onset.time baked.ham
1 2 52 F 8:00 PM Y 4/19 12:30 AM Y
2 3 65 M 6:30 PM Y 4/19 12:30 AM Y
3 4 59 F 6:30 PM Y 4/19 12:30 AM Y
4 6 63 F 7:30 PM Y 4/18 10:30 PM Y
5 7 70 M 7:30 PM Y 4/18 10:30 PM Y
> names(odat)[3] <- "Gender" #Rename 'sex' to 'Gender'
> table(odat$Gender) #Display 'Gender' distribution
F M
44 31
> levels(odat$Gender) #Display 'Gender' levels
[1] "F" "M"
> #Replace 'Gender' level labels
> levels(odat$Gender) <- c("Female", "Male")
> levels(odat$Gender) #Display new 'Gender' levels
[1] "Female" "Male"
> table(odat$Gender) #Confirm distribution is same
Female Male
44 31
> odat[1:5,1:8] #Display partial data frame
id age Gender meal.time ill onset.date onset.time baked.ham
1 2 52 Female 8:00 PM Y 4/19 12:30 AM Y
2 3 65 Male 6:30 PM Y 4/19 12:30 AM Y
3 4 59 Female 6:30 PM Y 4/19 12:30 AM Y
4 6 63 Female 7:30 PM Y 4/18 10:30 PM Y
5 7 70 Male 7:30 PM Y 4/18 10:30 PM Y
On occasion, we might be interested in renaming the row names. Currently,
the Oswego data set has default integer values from 1 to 75 as the row names.
> row.names(odat)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
[12] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22"
[23] "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33"
[34] "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44"
[45] "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55"
[56] "56" "57" "58" "59" "60" "61" "62" "63" "64" "65" "66"
[67] "67" "68" "69" "70" "71" "72" "73" "74" "75"
We can change the row names by assigning a new vector (numeric values are
coerced to character).
> row.names(odat) <- sample(101:199, size=nrow(odat))
> odat[1:5,1:7]
id age Gender meal.time ill onset.date onset.time
123 2 52 Female 8:00 PM Y 4/19 12:30 AM
145 3 65 Male 6:30 PM Y 4/19 12:30 AM
173 4 59 Female 6:30 PM Y 4/19 12:30 AM
138 6 63 Female 7:30 PM Y 4/18 10:30 PM
146 7 70 Male 7:30 PM Y 4/18 10:30 PM
With data frames, as with all R data objects, anything that can be indexed
can be replaced. We already saw some examples of replacing names. For
practice, study and implement the examples in Table 2.33.
For the next examples we use capop, a data frame of California population
projections by county, year, sex, age, and ethnicity:
> capop[1:5,1:8]
County Year Sex Age White Hispanic Asian Pacific.Islander
1 59 2000 F 0 75619 115911 20879 741
2 59 2000 F 1 76211 113706 20424 765
3 59 2000 F 2 76701 114177 21044 806
4 59 2000 F 3 78551 116733 21920 817
5 59 2000 F 4 82314 119995 22760 884
Now, suppose we want to assess the range of the numeric fields. If we treat
the data frame as a list, either lapply or sapply works:
> sapply(capop[-3], range)
County Year Age White Hispanic Asian Pacific.Islander
[1,] 59 2000 0 497 110 76 1
[2,] 59 2050 100 148246 277168 46861 1890
Black American.Indian Multirace
[1,] 57 0 5
[2,] 26983 8181 17493
However, if we treat the data frame as a matrix, apply also works:
> apply(capop[,-3], 2, range)
County Year Age White Hispanic Asian Pacific.Islander
[1,] 59 2000 0 497 110 76 1
[2,] 59 2050 100 148246 277168 46861 1890
Black American.Indian Multirace
[1,] 57 0 5
[2,] 26983 8181 17493
For another example, in the capop data frame, we notice that the variable
Age goes from 0 to 100 by 1-year intervals. It will be useful to aggregate ethnic-
specific population estimates into larger age categories. More specifically, we
want to calculate the sum of ethnic-specific population estimates (6 fields)
stratified by age category, sex, and year (3 fields). We will create a new 7-
level age category field commonly used by the National Center for Health
Statistics. Naturally, we use the aggregate function:
> capop <- read.csv("http://www.dof.ca.gov/HTML/DEMOGRAP/Data/RaceEthnic/Population-00-50/documents/California.txt")
> to.keep <- c("White", "Hispanic", "Asian", "Pacific.Islander",
+ "Black", "American.Indian", "Multirace")
> age.nchs7 <- c(0, 1, 5, 15, 25, 45, 65, 101)
> capop$agecat7 <- cut(capop$Age, breaks = age.nchs7, right=FALSE)
> capop7 <- aggregate(capop[,to.keep], by = list(Age=capop$agecat7,
+ Sex=capop$Sex, Year=capop$Year), FUN = sum)
> levels(capop7$Age)[7] <- "65+"
> capop7[1:14, 1:6]
Age Sex Year White Hispanic Asian
1 [0,1) F 2000 75619 115911 20879
2 [1,5) F 2000 313777 464611 86148
3 [5,15) F 2000 924930 1124573 241047
4 [15,25) F 2000 868767 946948 272846
5 [25,45) F 2000 2360250 1742366 667956
6 [45,65) F 2000 2102090 735062 445039
7 65+ F 2000 1471842 279865 208566
8 [0,1) M 2000 79680 121585 21965
9 [1,5) M 2000 331193 484068 91373
10 [5,15) M 2000 979233 1175384 257574
11 [15,25) M 2000 925355 1080868 279314
12 [25,45) M 2000 2465194 1921896 614608
13 [45,65) M 2000 2074833 687549 384011
14 65+ M 2000 1075226 202299 154966
> mode(capop7)
[1] "list"
> class(capop7)
[1] "data.frame"
> str(capop7)
‘data.frame’: 714 obs. of 10 variables:
$ Age : Factor w/ 7 levels "[0,1)",..: 1 2 6 ...
$ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 ...
$ Year : Factor w/ 51 levels "2000",..: 1 1 1 ...
$ White : int 75619 313777 924930 868767 ...
$ Hispanic : int 115911 464611 1124573 946948 ...
$ Asian : int 20879 86148 241047 272846 ...
$ Pacific.Islander: int 741 3272 9741 9629 19085 9898 ...
$ Black : int 14629 65065 195533 158923 ...
$ American.Indian : int 1022 4980 15271 14301 30960 ...
$ Multirace : int 10731 34524 78716 56735 82449 ...
>
> mode(orcalc)
[1] "function"
> class(orcalc)
[1] "function"
> str(orcalc)
function (x)
- attr(*, "source")= chr [1:5] "function(x){" ...
> orcalc
function(x){
or <- (x[1,1]*x[2,2])/(x[1,2]*x[2,1])
pval <- fisher.test(x)$p.value
list(data = x, odds.ratio = or, p.value = pval)
}
Objects created in the workspace are available during the R session. Upon
closing the R session, R asks whether to save the workspace. To save the
objects without exiting an R session, use the save.image function:
> save.image()
The save.image function is actually a special case of the save function:
save(list = ls(all = TRUE), file = ".RData")
The save function saves an R object as an external file. This file can be
loaded using the load function.
> x <- 1:5
> x
[1] 1 2 3 4 5
> save(x, file="/home/tja/temp/x")
> rm(x)
> x
Error: object "x" not found
> load(file="/home/tja/temp/x")
> x
[1] 1 2 3 4 5

Table 2.36 Functions for querying and coercing R data objects

Query         Coercion
is.numeric    as.numeric
is.integer    as.integer
is.character  as.character
is.logical    as.logical
is.function   as.function
is.null       as.null
is.na         n/a
is.nan        n/a
is.finite     n/a
is.infinite   n/a
Table 2.36 provides more functions for conducting specific object queries
and for coercing one object type into another. For example, a vector is not a
matrix.
> is.matrix(1:3)
[1] FALSE
However, a vector can be coerced into a matrix.
> as.matrix(1:3)
[,1]
[1,] 1
[2,] 2
[3,] 3
> is.matrix(as.matrix(1:3))
[1] TRUE
A common use would be to coerce a factor into a character vector.
> sex <- factor(c("M", "M", "M", "M", "F", "F", "F", "F"))
> sex
[1] M M M M F F F F
Levels: F M
> unclass(sex) #does not coerce into character vector
[1] 2 2 2 2 1 1 1 1
attr(,"levels")
[1] "F" "M"
> as.character(sex) #yes, works
[1] "M" "M" "M" "M" "F" "F" "F" "F"
In R, missing values are represented by the value NA (“not available”). The
is.na function evaluates an object and returns a logical vector indicating
which positions contain NA. The !is.na version returns positions that do not
contain NA.
> x <- c(12, 34, NA, 56, 89)
> is.na(x)
[1] FALSE FALSE TRUE FALSE FALSE
> !is.na(x)
[1] TRUE TRUE FALSE TRUE TRUE
We can use is.na to replace missing values.
> x[is.na(x)] <- 999
> x
[1] 12 34 999 56 89
In R, NaN represents “not a number” and Inf represents an infinite value.
Therefore, we can use is.nan and is.infinite to assess which positions
contain NaN and Inf, respectively.
> x <- c(0, 3, 0, -6)
> y <- c(4, 0, 0, 0)
> z <- x/y
> z
[1] 0 Inf NaN -Inf
> is.nan(z)
[1] FALSE FALSE TRUE FALSE
> is.infinite(z)
[1] FALSE TRUE FALSE TRUE
Our workspace is like a desktop that contains the “objects” (data and tools)
we use to conduct our work. Use the getwd function to list the file path to
the workspace file .RData.
> getwd()
[1] "/home/tja/Data/R/home"
Use the setwd function to set up a new workspace location. A new .RData
file will automatically be created there.
setwd("/home/tja/Data/R/newproject")
This is one method to manage multiple workspaces for one’s projects.
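For example, here is a minimal sketch of switching between two hypothetical project directories:
setwd("/home/tja/Data/R/project1") #work on project 1
save.image()                       #writes project1/.RData
setwd("/home/tja/Data/R/project2") #switch projects
load(".RData")                     #load project 2 objects, if the file exists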
Use the search function to list the packages, environments, or data frames
attached and available.
> search() # Linux
[1] ".GlobalEnv" "package:stats" "package:graphics"
[4] "package:grDevices" "package:utils" "package:datasets"
[7] "package:methods" "Autoloads" "package:base"
The global environment .GlobalEnv is our workspace. The searchpaths
function lists the full paths:
> searchpaths()
[1] ".GlobalEnv" "/usr/lib/R/library/stats"
[3] "/usr/lib/R/library/graphics" "/usr/lib/R/library/grDevices"
[5] "/usr/lib/R/library/utils" "/usr/lib/R/library/datasets"
[7] "/usr/lib/R/library/methods" "Autoloads"
[9] "/usr/lib/R/library/base"
Table 2.37 Risk of Death in a 20-year Period Among Women in Whickham, England,
According to Smoking Status at the Beginning of the Period

               Smoking
Vital Status   Yes   No
Dead           139   230
Alive          443   502

Table 2.38 Risk of Death in a 20-year Period Among Women in Whickham, England,
According to Smoking Status at the Beginning of the Period

               Smoking
Vital Status   Yes   No    Total
Dead           139   230    369
Alive          443   502    945
Total          582   732   1314

Table 2.39 Risk Ratio and Odds Ratio of Death in a 20-year Period Among Women in
Whickham, England, According to Smoking Status at the Beginning of the Period

               Smoking
               Yes    No
Risk           0.24   0.31
Risk Ratio     0.76   1.00
Odds           0.31   0.46
Odds Ratio     0.68   1.00
Problems
2.6. Starting with the 2 × 2 matrix object we created previously, using only
the apply, cbind, rbind, names, and dimnames functions, recreate Table 2.38.
2.7. Using the 2 × 2 data from Table 2.37 on the preceding page, use the
sweep and apply functions to calculate marginal and joint distributions.
2.8. Using the data from the previous problems, recreate Table 2.39 on the
previous page and interpret the results.
2.9. Read in the Whickham, England data using the R code below.
wdat = read.table("http://www.medepi.net/data/whickham-engl.txt",
sep = ",", header = TRUE)
str(wdat)
xtabs(~Vital.Status + Age + Smoking, data = wdat)
Stratified by age category, calculate the risk of death comparing smokers to
nonsmokers. Show your results. What is your interpretation?
2.10. Use the read.table function to read in the syphilis data available
at http://www.medepi.net/data/syphilis89c.txt. Evaluate structure of
data frame. Do not attach std data frame (yet). Create a 3-dimensional
array using both the table or xtabs function. Now attach the std data
frame using the attach function. Create the same 3-dimensional array using
both the table or xtabs function.
2.11. Use the apply function to get marginal totals for the syphilis 3-
dimensional array.
2.12. Use the sweep and apply functions to get marginal and joint distribu-
tions for a 3-D array.
2.13. Review and read in the group-level, tabular data set of primary and
secondary syphilis cases in the United States in 1989 available at
http://www.medepi.net/data/syphilis89b.txt. Use the rep function on the
data frame fields to recreate the individual-level data frame with over 40,000
observations.
header = T)
#8 Aggregate
bapop2 <- aggregate(bapop[,racen],
list(Agecat = bapop$Agecat, Sex = bapop$Sex,
County = bapop$County), sum)
bapop2
There are many ways of getting our data into R for analysis. In the section
that follows we review how to enter the University Group Diabetes Program
data (Table 3.1) as well as the original data from a comma-delimited text
file. We will use the following approaches:
• Entering data at the command prompt
• Importing data from a file
• Importing data using a URL
We review four methods. For Methods 1 and 2, data are entered directly at
the command prompt. Method 3 uses the same R expressions and data as
Methods 1 and 2, but they are entered into a text editor, saved as a text
file with a .R extension (e.g., job02.R), and then executed from the command
prompt using the source function. Alternatively, the R expressions and data
can be copied and pasted into R. For Method 4 we use R's spreadsheet
editor (least preferred).

Table 3.1 Deaths among subjects who received tolbutamide and placebo in the University
Group Diabetes Program (1970), stratifying by age

           Age<55                Age≥55                Combined
           Tolbutamide Placebo   Tolbutamide Placebo   Tolbutamide Placebo
Deaths           8        5          22        16          30       21
Survivors       98      115          76        69         174      184
3.1.1.1 Method 1
For review, a convenient way to enter data at the command prompt is to use
the c function:
> udat <- array(c(8, 98, 5, 115, 22, 76, 16, 69), dim = c(2, 2, 2))
> udat
, , 1
[,1] [,2]
[1,] 8 5
[2,] 98 115
, , 2
[,1] [,2]
[1,] 22 16
[2,] 76 69
>
> #enter simple data frame
> subjname <- c("Pedro", "Paulo", "Maria")
> subjno <- 1:length(subjname)
> age <- c(34, 56, 56)
> sex <- c("Male", "Male", "Female")
> dat <- data.frame(subjno, subjname, age, sex); dat
subjno subjname age sex
1 1 Pedro 34 Male
2 2 Paulo 56 Male
3 3 Maria 56 Female
>
> #enter a simple function
> odds.ratio <- function(aa, bb, cc, dd){ aa*dd / (bb*cc)}
> odds.ratio(30, 174, 21, 184)
[1] 1.510673
3.1.1.2 Method 2
Method 2 is identical to Method 1, except that one uses the scan function. It
does not matter if we enter the numbers on different lines; they will still form
a single vector. Remember that we must press the Enter key twice after we
have entered the last number.
> udat.tot <- scan()
1: 30 174
3: 21 184
5:
Read 4 items
> udat.tot
[1] 30 174 21 184
To read in a matrix at the command prompt combine the matrix and
scan functions. Again, it does not matter on what lines we enter the data, as
long as they are in the correct order, because the matrix function reads data
column-wise.
> udat.tot <- matrix(scan(), 2, 2)
1: 30 174 21 184
5:
Read 4 items
> udat.tot
[,1] [,2]
[1,] 30 21
[2,] 174 184
Here is the UGDP data again, now as an array with named dimensions:
> dimnames(udat) <- list(Vital.Status = c("Dead", "Survived"),
+ Treatment = c("Tolbutamide", "Placebo"),
+ Age.Group = c("<55", "55+"))
> udat
, , Age.Group = <55
Treatment
Vital.Status Tolbutamide Placebo
Dead 8 5
Survived 98 115
, , Age.Group = 55+
Treatment
Vital.Status Tolbutamide Placebo
Dead 22 16
Survived 76 69
To enter a list at the command prompt, use the list function. Here is an
unnamed list:
> dat <- list(3:4, c("John Paul", "Jane Doe"), c(84.5, 34.5),
+ c("Male", "Female"), c(FALSE, TRUE))
> dat
[[1]]
[1] 3 4
[[2]]
[1] "John Paul" "Jane Doe"
[[3]]
[1] 84.5 34.5
[[4]]
[1] "Male" "Female"
[[5]]
[1] FALSE TRUE
> str(dat)
List of 5
$ : int [1:2] 3 4
$ : chr [1:2] "John Paul" "Jane Doe"
$ : num [1:2] 84.5 34.5
$ : chr [1:2] "Male" "Female"
$ : logi [1:2] FALSE TRUE
Naming the components makes the list contents accessible by name:
> dat <- list(id = 3:4, name = c("John Paul", "Jane Doe"),
+ age = c(84.5, 34.5), sex = c("Male", "Female"),
+ dead = c(FALSE, TRUE))
> dat
$id
[1] 3 4
$name
[1] "John Paul" "Jane Doe"
$age
[1] 84.5 34.5
$sex
[1] "Male" "Female"
$dead
[1] FALSE TRUE
> str(dat)
List of 5
$ id : int [1:2] 3 4
$ name: chr [1:2] "John Paul" "Jane Doe"
$ age : num [1:2] 84.5 34.5
$ sex : chr [1:2] "Male" "Female"
$ dead: logi [1:2] FALSE TRUE
3.1.1.3 Method 3
Method 3 uses the same R expressions and data as Methods 1 and 2, but
they are entered into a text editor, saved as an ASCII text file with a .R
extension (e.g., job01.R), and then executed from the command prompt using
the source function. Alternatively, the R expressions and data can be copied
and pasted into R.
For example, the following expressions are in a text editor and saved to a
file named job01.R.
x <- 1:10
x
One can copy and paste this code into R at the command prompt.
> x <- 1:10
> x
[1] 1 2 3 4 5 6 7 8 9 10
However, if we execute the code using the source function, it will only display
to the screen those objects that are printed using the show or print function.
Here is the text editor code again, but including show.
x <- 1:10
show(x)
Now, source job01.R using source at the command prompt.
> source("/home/tja/Documents/Rproj/job01.R")
[1] 1 2 3 4 5 6 7 8 9 10
In general, we highly recommend using a text editor for all our work. The
program file (e.g., job01.R) created with the text editor facilitates document-
ing our code, reviewing our code, debugging our code, replicating our analytic
steps, and auditing by external reviewers.
3.1.1.4 Method 4
Method 4 uses R's spreadsheet editor. This is not a preferred method because
we like the original data to be in a text editor or read in from a data file. We
will be using the data.entry and edit functions. The data.entry function
allows editing of an existing object and automatically saving the changes to
the original object name. In contrast, the edit function allows editing of an
existing object but it will not save the changes to the original object name; we
must explicitly assign it to an object name (even if it is the original name).
To enter a vector we need to initialize a vector and then use the data.entry
function (Figure 3.1).
> x <- numeric(10) #Initialize vector with zeros
> x
[1] 0 0 0 0 0 0 0 0 0 0
> data.entry(x) #Enter numbers, then close window
> x
[1] 1 2 3 4 5 6 7 8 9 10
However, the edit function applied to a vector does not open a spread-
sheet. Try the edit function and see what happens.
xnew <- edit(numeric(10)) #Edit number, then close window
To enter data into a spreadsheet matrix, first initialize a matrix and then
use the data.entry or edit function. Notice that the editor added default
column names. However, to add our own column names just click on the
column heading with our mouse pointer (unfortunately we cannot give row
names).
> xnew <- matrix(numeric(4),2,2)
> data.entry(xnew)
> xnew <- edit(xnew) #equivalent
>
> #open spreadsheet editor in one step
> xnew <- edit(matrix(numeric(4),2,2))
> xnew
col1 col2
[1,] 11 33
[2,] 22 44
Arrays and nontabular lists cannot be entered using a spreadsheet editor.
Hence, we begin to see the limitations of the spreadsheet-type approach to
data entry. One type of list, the data frame, can be entered using the edit
function.
To enter a data frame use the edit function. However, we do not need
to initialize a data frame (unlike with a matrix). Again, click on the column
headings to enter column names.
> df <- edit(data.frame()) #Spreadsheet screen not shown
> df
mykids age
1 Tomasito 7
2 Luisito 6
3 Angelita 3
When using the edit function to create a new data frame we must assign
it an object name to save the data frame. Later we will see that when we
edit an existing data object we can use the edit or fix function. The fix
function differs in that fix(data object ) saves our edits directly back to
data object without the need to make a new assignment.
mypower <- function(x, n){x^n}
fix(mypower) # Edits saved to ’mypower’ object
mypower <- edit(mypower) #equivalent
3.1.2.1 Reading text data files
In this section we review how to read the following types of text data files:
• Comma-separated variable (csv) data file (± headers and ± row names)
• Fixed width formatted data file (± headers and ± row names)
Here is the University Group Diabetes Program randomized clinical trial
text data file that is comma-delimited, and includes row names and a header
(ugdp.txt).4 The header is the first line that contains the column (field)
names. The row names are in the first column, starting on the second line,
and uniquely identify each row. Notice that the row names do not have a
column name associated with them. A data file can come with row names,
a header, neither, or both. Our preference is to work with data files that
have a header and data values that are self-explanatory. Even without a data
dictionary one can still make sense out of this data set.
4 Available at http://www.medepi.net/data/ugdp.txt
Status,Treatment,Agegrp
1,Dead,Tolbutamide,<55
2,Dead,Tolbutamide,<55
...
408,Survived,Placebo,55+
409,Survived,Placebo,55+
Notice that the header row has 3 items, and the second row has 4 items. This
is because the row names start in the second row and have no column name.
This data file can be read in using the read.table function, and R figures
out that the first column contains the row names.
> ud <- read.table("http://www.medepi.net/data/ugdp.txt",
+ header = TRUE, sep = ",")
> head(ud) #displays 1st 6 lines
Status Treatment Agegrp
1 Dead Tolbutamide <55
2 Dead Tolbutamide <55
3 Dead Tolbutamide <55
4 Dead Tolbutamide <55
5 Dead Tolbutamide <55
6 Dead Tolbutamide <55
Here is the same data file as it would appear without row names and
without a header (ugdp2.txt).
Dead,Tolbutamide,<55
Dead,Tolbutamide,<55
...
Survived,Placebo,55+
Survived,Placebo,55+
This data file can be read in using the read.table function. By default, it
adds row names (1, 2, 3, . . . ).
> cnames <- c("Status", "Treatment", "Agegrp")
> udat2 <- read.table("http://www.medepi.net/data/ugdp2.txt",
+ header = FALSE, sep = ",", col.names = cnames)
> head(udat2)
Status Treatment Agegrp
1 Dead Tolbutamide <55
2 Dead Tolbutamide <55
3 Dead Tolbutamide <55
4 Dead Tolbutamide <55
5 Dead Tolbutamide <55
6 Dead Tolbutamide <55
Here is the same data file as it might appear as a fixed width formatted file. In this
file, columns 1 to 8 are for field #1, columns 9 to 19 are for field #2, and
columns 20 to 22 are for field #3. This type of data file is more compact.
One needs a data dictionary to know which columns contain which fields.
Dead Tolbutamide<55
Dead Tolbutamide<55
...
SurvivedPlacebo 55+
SurvivedPlacebo 55+
This data file would be read in using the read.fwf function. Because the
field widths are fixed, we must strip the white space using the strip.white
option.
> cnames <- c("Status", "Treatment", "Agegrp")
> udat3 <- read.fwf("http://www.medepi.net/data/ugdp3.txt",
+ width = c(8, 11, 3), col.names = cnames, strip.white = TRUE)
> head(udat3)
Status Treatment Agegrp
1 Dead Tolbutamide <55
2 Dead Tolbutamide <55
3 Dead Tolbutamide <55
4 Dead Tolbutamide <55
5 Dead Tolbutamide <55
6 Dead Tolbutamide <55
Finally, here is the same data file as it might appear as a fixed width
formatted file but with numeric codes (ugdp4.txt). In this file, column 1 is
for field #1, column 2 is for field #2, and column 3 is for field #3. This type
of text data file is the most compact; however, one needs a data dictionary
to make sense of all the 1s and 2s.
121
121
...
212
212
Here is how this data file would be read in using the read.fwf function.
> cnames <- c("Status", "Treatment", "Agegrp")
> udat4 <- read.fwf("http://www.medepi.net/data/ugdp4.txt",
+ width = c(1, 1, 1), col.names = cnames)
> head(udat4)
Status Treatment Agegrp
1 1 2 1
2 1 2 1
3 1 2 1
4 1 2 1
5 1 2 1
6 1 2 1
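Rather than working with the raw codes, we can convert each field into a factor with informative labels. Here is a sketch, inferring the coding from the data file (Status: 1 = Dead, 2 = Survived; Treatment: 1 = Placebo, 2 = Tolbutamide; Agegrp: 1 = <55, 2 = 55+):
> udat4$Status <- factor(udat4$Status, levels = 1:2,
+ labels = c("Dead", "Survived"))
> udat4$Treatment <- factor(udat4$Treatment, levels = 1:2,
+ labels = c("Placebo", "Tolbutamide"))
> udat4$Agegrp <- factor(udat4$Agegrp, levels = 1:2,
+ labels = c("<55", "55+"))
> head(udat4, 3)
Status Treatment Agegrp
1 Dead Tolbutamide <55
2 Dead Tolbutamide <55
3 Dead Tolbutamide <55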
R has other functions for reading text data files (read.csv, read.csv2,
read.delim, read.delim2). In general, read.table is the function used most
commonly for reading in data files.
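For instance, because ugdp.txt is comma-delimited with a header, read.csv (which simply calls read.table with header = TRUE and sep = ",") reads it in one step; ud2 is just an illustrative name:
> ud2 <- read.csv("http://www.medepi.net/data/ugdp.txt")
> head(ud2, 3)
Status Treatment Agegrp
1 Dead Tolbutamide <55
2 Dead Tolbutamide <55
3 Dead Tolbutamide <55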
3.1.2.2 Reading data from a binary format (e.g., Stata, Epi Info)
To read data that comes in a binary or proprietary format, load the foreign
package using the library function. To review the available functions in the
foreign package try help(package = foreign). For example, here we read
in the infert data set, which is also available as a Stata data file.6
> library(foreign)
> idat <- read.dta("c:/.../data/infert.dta")
> head(idat)[,1:8]
id education age parity induced case spontaneous stratum
1 1 0 26 6 1 1 2 1
2 2 0 42 1 1 1 0 2
3 3 0 39 6 2 1 0 3
4 4 0 34 4 2 1 0 4
5 5 1 35 3 1 1 1 5
6 6 1 36 4 2 1 1 6
As we have already seen, text data files can be read directly off a web server
into R using the read.table function. Here we load the Western Collabora-
tive Group Study data directly off a web server.
> wdat <- read.table("http://www.medepi.net/data/wcgs.txt",
+ header = TRUE, sep = ",")
> str(wdat)
‘data.frame’: 3154 obs. of 14 variables:
$ id : int 2001 2002 2003 2004 2005 2006 2007 2010 ...
$ age0 : int 49 42 42 41 59 44 44 40 43 42 ...
$ height0: int 73 70 69 68 70 72 72 71 72 70 ...
$ weight0: int 150 160 160 152 150 204 164 150 190 175 ...
$ sbp0 : int 110 154 110 124 144 150 130 138 146 132 ...
$ dbp0 : int 76 84 78 78 86 90 84 60 76 90 ...
$ chol0 : int 225 177 181 132 255 182 155 140 149 325 ...
6 Available at http://www.medepi.net/data/infert.dta
In the ideal setting, our data has already been checked, errors corrected, and
ready to be analyzed. Post-collection data editing can be minimized by good
design and data collection. However, we may still need to make corrections
or changes in data values.
For small data sets, it may be convenient to edit the data in our favorite
text editor. Key-recording macros, and search and replace tools can be very
useful and efficient. Figure 3.2 on the following page displays West Nile virus
(WNV) infection surveillance data. This file is a comma-delimited data file
with a header.
For vectors and matrices we can use the data.entry function to edit these data
object elements. For data frames and functions use the edit or fix functions.
Remember that changes made with the edit function are not saved unless we
assign it to the original or new object name. In contrast, changes made with
the fix function are saved back to the original data object name. Therefore,
be careful when we use the fix function because we may unintentionally
overwrite data.
Now let’s read in the WNV surveillance raw data as a data frame. Then,
using the fix function, we will edit the first three records where the value for
the syndrome variable is “Unk” and change it to NA for missing (Figure 3.3
on page 121). We will also change “.” to NA.
Fig. 3.2 Editing West Nile virus human surveillance data in text editor. Source: California
Department of Health Services, 2004
Fig. 3.3 Using the fix function to edit the WNV surveillance data frame. Unfortunately,
this approach does not facilitate documenting our edits. Source: California Department of
Health Services, 2004
> wd <- read.table("http://www.medepi.net/data/wnv/wnv2004raw.txt",
+ header = TRUE, sep = ",",
+ na.strings = c("Unknown", "."))
> wd[c(128, 129, 133),] #verify change
id county age sex syndrome date.onset date.tested death
128 128 Los Angeles 81 M <NA> 07/28/2004 08/11/2004 <NA>
129 129 Riverside 44 F <NA> 07/25/2004 08/11/2004 <NA>
133 133 Los Angeles 36 M <NA> 08/04/2004 08/11/2004 No
How do we make these and other changes after the data set has been read
into R? Although using R’s spreadsheet function is convenient, we do not
recommend it because manual editing is inefficient, our work cannot be repli-
cated and audited, and documentation is poor. Instead use R’s vectorized
approach. Let’s look at the distribution of responses for each variable to as-
sess what needs to be “cleaned up,” in addition to converting missing values
to NA.
> wd <- read.table("http://www.medepi.net/data/wnv/wnv2004raw.txt",
+ header = TRUE, sep = ",", as.is = TRUE)
> str(wd)
‘data.frame’: 779 obs. of 8 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ county : chr "San Bernardino" "San Bernardino" ...
$ age : chr "40" "64" "19" "12" ...
...
> lapply(wd, table)
$id
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
...
768 769 770 771 772 773 774 775 776 777 778 779 780 781
1 1 1 1 1 1 1 1 1 1 1 1 1 1
$county
$age
. 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25 26
6 1 1 1 3 2 3 3 1 4 6 5 1 4 2 3 6 8 3 9
...
82 83 84 85 86 87 88 89 9 91 93 94
10 5 6 4 2 2 1 6 1 4 1 1
$sex
. F M
2 294 483
$syndrome
$date.onset
$date.tested
$death
. No Yes
66 686 27
What did we learn? First, there are 779 observations and 781 id’s; there-
fore, 3 observations were removed from the original data set. Second, we see
that the variables age, sex, syndrome, and death have missing values that
need to be converted to NAs. This can be done one field at a time, or for the
whole data frame in one step. Here is the R code.
#individually
wd$age[wd$age=="."] <- NA
wd$sex[wd$sex=="."] <- NA
wd$syndrome[wd$syndrome=="Unknown"] <- NA
wd$death[wd$death=="."] <- NA
#or globally
wd[wd=="." | wd=="Unknown"] <- NA
After running the above code, let’s evaluate one variable to verify the
missing values were converted to NAs.
> table(wd$death)
No Yes
686 27
> table(wd$death, exclude=NULL)
No Yes <NA>
686 27 66
We also notice that the entry for one of the counties, San Luis Obispo,
was misspelled (Sn Luis Obispo). We can use replacement to make the cor-
rections:
> wd$county[wd$county=="Sn Luis Obispo"] <- "San Luis Obispo"
For this section, please load the well-known Oswego foodborne illness data set:
> odat <- read.table("http://www.medepi.net/data/oswego.txt",
+ header = TRUE, as.is = TRUE, sep = "")
> str(odat)
‘data.frame’: 75 obs. of 21 variables:
$ id : int 2 3 4 6 7 8 9 10 14 16 ...
$ age : int 52 65 59 63 70 40 15 33 10 32 ...
$ sex : chr "F" "M" "F" "F" ...
$ meal.time : chr "8:00 PM" "6:30 PM" "6:30 PM" ...
$ ill : chr "Y" "Y" "Y" "Y" ...
$ onset.date : chr "4/19" "4/19" "4/19" "4/18" ...
$ onset.time : chr "12:30 AM" "12:30 AM" ...
$ baked.ham : chr "Y" "Y" "Y" "Y" ...
...
$ vanilla.ice.cream : chr "Y" "Y" "Y" "Y" ...
$ chocolate.ice.cream: chr "N" "Y" "Y" "N" ...
$ fruit.salad : chr "N" "N" "N" "N" ...
3.4.1 Indexing
Now, we will practice indexing rows from this data frame. First, we create a
new data set that contains only cases. To index the rows with cases we need
to generate a logical vector that is TRUE for every value of odat$ill that “is
equivalent to” "Y". For “is equivalent to” we use the == relational operator.
> cases <- odat$ill=="Y"
> cases
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
...
[73] FALSE FALSE FALSE
> odat.ca <- odat[cases, ]
> odat.ca[, 1:8]
id age sex meal.time ill onset.date onset.time baked.ham
1 2 52 F 8:00 PM Y 4/19 12:30 AM Y
2 3 65 M 6:30 PM Y 4/19 12:30 AM Y
3 4 59 F 6:30 PM Y 4/19 12:30 AM Y
4 6 63 F 7:30 PM Y 4/18 10:30 PM Y
...
43 71 60 M 7:30 PM Y 4/19 1:00 AM N
44 72 18 F 7:30 PM Y 4/19 12:00 AM Y
45 74 52 M <NA> Y 4/19 2:15 AM Y
3.4.2 Subsetting
Subsetting a data frame using the subset function is equivalent to using log-
ical vectors to index the data frame. In general, we prefer indexing because it
is generalizable to indexing any R data object. However, the subset function
is a convenient alternative for data frames. Let's create a data set with
female cases, with age below the median, who ate vanilla ice cream.
> odat.fcv <- subset(odat, subset = {ill=="Y" & sex=="F" &
+ vanilla.ice.cream=="Y" & age < median(odat$age)},
+ select = c(id:onset.date, vanilla.ice.cream))
> odat.fcv
id age sex meal.time ill onset.date vanilla.ice.cream
8 10 33 F 7:00 PM Y 4/18 Y
10 16 32 F . Y 4/19 Y
13 20 33 F . Y 4/18 Y
14 21 13 F 10:00 PM Y 4/19 Y
18 27 15 F 10:00 PM Y 4/19 Y
23 36 35 F . Y 4/18 Y
31 48 20 F 7:00 PM Y 4/19 Y
37 58 12 F 10:00 PM Y 4/19 Y
40 65 17 F 10:00 PM Y 4/19 Y
41 66 8 F . Y 4/19 Y
42 70 21 F . Y 4/19 Y
44 72 18 F 7:30 PM Y 4/19 Y
In the subset function, the first argument is the data frame object name,
the second argument (also called subset) evaluates to a logical vector, and
the third argument (called select) specifies the fields to keep. In the second
argument,
subset = {...}
the curly brackets are included for convenience to group the logical and re-
lational operations. In the select argument, using the : operator, we can
specify a range of fields to keep.
Now, reload the Oswego data set to recover the original odat$age field. We
are going to create a new field with the following seven age categories (in
years): < 1, 1 to 4, 5 to 14, 15 to 24, 25 to 44, 45 to 64, and 65+. We will
demonstrate this using several methods, starting with the cut function:
> agecat <- cut(odat$age, breaks = c(0, 1, 5, 15, 25, 45, 65, 100))
> agecat
[1] (45,65] (45,65] (45,65] (45,65] (65,100] (25,45]
...
[73] (15,25] (25,45] (5,15]
Levels: (0,1] (1,5] (5,15] (15,25] (25,45] ... (65,100]
Note that the cut function generated a factor with 7 levels for each interval.
The notation (15, 25] means that the interval is open on the left boundary
(> 15) and closed on the right boundary (≤ 25). However, for age categories,
it makes more sense to have age boundaries closed on the left and open on
the right: [a, b). To change this, we set the option right = FALSE:
> agecat <- cut(odat$age, breaks = c(0, 1, 5, 15, 25, 45,
+ 65, 100), right = FALSE)
> agecat
[1] [45,65) [65,100) [45,65) [45,65) [65,100) [25,45)
...
[73] [15,25) [25,45) [5,15)
Levels: [0,1) [1,5) [5,15) [15,25) [25,45) ... [65,100)
> table(agecat)
agecat
[0,1) [1,5) [5,15) [15,25) [25,45) [45,65) [65,100)
0 1 14 13 18 20 9
Okay, this looks good, but we can add labels since our readers may not be
familiar with open and closed interval notation [a, b).
> agelabs <- c("<1", "1-4", "5-14", "15-24", "25-44", "45-64",
+ "65+")
> agecat <- cut(odat$age, breaks = c(0, 1, 5, 15, 25, 45,
+ 65, 100), right = FALSE, labels = agelabs)
> agecat
[1] 45-64 65+ 45-64 45-64 65+ 25-44 15-24
...
[71] 5-14 5-14 15-24 25-44 5-14
Levels: <1 1-4 5-14 15-24 25-44 45-64 65+
> table(agecat, case = odat$ill)
case
agecat N Y
<1 0 0
1-4 0 1
5-14 8 6
15-24 5 8
25-44 8 10
45-64 5 15
65+ 3 6
In the previous example the categorical variable was a numeric vector (1,
2, 3, 4, 5, 6, 7) that was converted to a factor and provided labels (“<1”,
“1 to 4”, “5 to 14”, . . . ). In fact, categorical variables are often represented
by integers (for example, 0 = no, 1 = yes; or 0 = non-case, 1 = case) and
provided labels. Often, ASCII text data files contain integer codes that require
a data dictionary to convert these integers into categorical variables in a
statistical package. In R, keeping track of integer codes for categorical vari-
ables is unnecessary. Therefore, re-coding the underlying integer codes is also
unnecessary; however, if we feel the need to do so, here’s how.
> # Create categorical variable
> ethlabs <- c("White", "Black", "Latino", "Asian")
> ethnicity <- factor(sample(ethlabs, 100, replace = TRUE),
+ levels = ethlabs)
> table(ethnicity)
ethnicity
White Black Latino Asian
23 31 28 18
In R, we can re-order and re-label at the same time using the levels
function and assigning to it a list.
> table(ethnicity)
ethnicity
White Black Latino Asian
23 31 28 18
> ethnicity3 <- ethnicity
> levels(ethnicity3) <- list(Hispanic = "Latino", Asian = "Asian",
+ Caucasian = "White", "African American" = "Black")
> table(ethnicity3)
ethnicity3
Hispanic Asian Caucasian African American
28 18 23 31
The list function is necessary to assure the re-ordering. To re-order without
re-labeling just do the following:
> table(ethnicity)
ethnicity
White Black Latino Asian
23 31 28 18
> ethnicity4 <- ethnicity
> levels(ethnicity4) <- list(Latino = "Latino", Asian = "Asian",
+ White = "White", Black = "Black")
> table(ethnicity4)
ethnicity4
Latino Asian White Black
28 18 23 31
In R, we can sort the factor levels by using the factor function in one of
two ways:
> table(ethnicity)
ethnicity
White Black Latino Asian
23 31 28 18
> ethnicity5a <- factor(ethnicity, sort(levels(ethnicity)))
> table(ethnicity5a)
ethnicity5a
Asian Black Latino White
18 31 28 23
> ethnicity5b <- factor(as.character(ethnicity))
> table(ethnicity5b)
ethnicity5b
Asian Black Latino White
18 31 28 23
In the first example, we assigned to the levels argument the sorted level
names. In the second example, we started from scratch by coercing the orig-
inal factor into a character vector which is then ordered alphabetically by
default.
The first level of a factor is the reference level for some statistical models
(e.g., logistic regression). To set a different reference level use the relevel
function.
> levels(ethnicity)
[1] "White" "Black" "Latino" "Asian"
> ethnicity6 <- relevel(ethnicity, ref = "Asian")
> levels(ethnicity6)
[1] "Asian" "White" "Black" "Latino"
As we can see, there is tremendous flexibility in dealing with factors with-
out the need to “re-code” categorical variables. This approach facilitates re-
viewing our work and minimizes errors.
In general, R’s strength is not data management but rather data analysis.
Because R can access and operate on multiple objects in the workspace it is
generally not necessary to merge data objects into one data object in order
to conduct analyses. On occasion, it may be necessary to merge two data
frames into one data frame.
Data frames that contain data on individual subjects are generally of two
types: (1) each row contains data collected on one and only one individual,
or (2) multiple rows contain repeated measurements on individuals. The lat-
ter approach is more efficient at storing data. For example, here are two
approaches to collecting multiple telephone numbers for two individuals.
> tab1
name wphone fphone mphone
1 Tomas Aragon 643-4935 643-2926 847-9139
2 Wayne Enanoria 643-4934 <NA> <NA>
>
> tab2
name telphone teletype
1 Tomas Aragon 643-4935 Work
2 Tomas Aragon 643-2926 Fax
3 Tomas Aragon 847-9139 Mobile
4 Wayne Enanoria 643-4934 Work
The first approach is represented by tab1, and the second approach by
tab2.11 Data is more efficiently stored in tab2, and adding new types of
11 This approach is the basis for designing and implementing relational databases. A
relational database consists of multiple tables linked by an indexing field.
telephone numbers only requires assigning a new value (e.g., Pager) to the
teletype field.
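When merging is necessary, the merge function joins two data frames on a common indexing field. Here is a minimal sketch with hypothetical data:
> subjects <- data.frame(id = 1:3, age = c(34, 56, 45))
> exams <- data.frame(id = c(1, 1, 2, 3), sbp = c(110, 124, 138, 130))
> merge(subjects, exams, by = "id") #one row per exam, ages carried over
id age sbp
1 1 34 110
2 1 34 124
3 2 56 138
4 3 45 130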
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 2 4 6 8 10
[3,] 3 6 9 12 15
[4,] 4 8 12 16 20
[5,] 5 10 15 20 25
We can add code to the source file to “sink” selected results to an output
file using the sink or capture.output functions. Consider our edited source
file:
i <- 1:5
x <- outer(i, i, "*")
sink("/home/tja/Documents/wip/epir/r/chap03.log")
cat("Here are the results of the outer function", fill=TRUE)
show(x)
sink()
Here we run source from the R command prompt:
> source("/home/tja/Documents/wip/epir/r/chap03.R")
>
Nothing was printed to the console because sink sent it to the output file
(chap03.log). Here are the contents of chap03.log:
Here are the results of the outer function
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 2 4 6 8 10
[3,] 3 6 9 12 15
[4,] 4 8 12 16 20
[5,] 5 10 15 20 25
The first sink opened a connection and created the output file (chap03.log).
The cat and show functions printed results to the output file. The second
sink closed the connection.
Alternatively, as before, if we use the echo = TRUE option in the source
function, everything is either printed to the console or output file. The sink
connection determines what is printed to the output file. Here is the edited
source file (chap03.R):
i <- 1:5
x <- outer(i, i, "*")
sink("/home/tja/Documents/wip/epir/r/chap03.log")
# Here are the results of the outer function
x
sink()
Here we run source from the R command prompt:
> source("/home/tja/Documents/wip/epir/r/chap03.R", echo=T)
> i <- 1:5
> x <- outer(i, i, "*")
> sink("/home/tja/Documents/wip/epir/r/chap03.log")
>
Nothing was printed to the console after the first sink because it was printed
to the output file (chap03.log). Here are the contents of chap03.log:
> # Here are the results of the outer function
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 2 4 6 8 10
[3,] 3 6 9 12 15
[4,] 4 8 12 16 20
[5,] 5 10 15 20 25
> sink()
The sink and capture.output functions accomplish the same task: sending
results to an output file. The sink function works in pairs: opening and
closing the connection to the output file. In contrast, the capture.output
function appears once and only prints the last object to the output file. Here
is the edited source file (chap03.R) using capture.output instead of sink:
i <- 1:5
x <- outer(i, i, "*")
capture.output(
{
# Here are the results of the outer function
x
}, file = "/home/tja/Documents/wip/epir/r/chap03.log")
Here we run source from the R command prompt:
> source("/home/tja/Documents/wip/epir/r/chap03.R", echo=TRUE)
> i <- 1:5
> x <- outer(i, i, "*")
> capture.output(
+ {
+ # Here are the results of the outer function
+ x
+ }, file = "/home/tja/Documents/wip/epir/r/chap03.log")
And here are the contents of chap03.log:
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 2 4 6 8 10
[3,] 3 6 9 12 15
[4,] 4 8 12 16 20
[5,] 5 10 15 20 25
In R, missing values are represented by NA, but not all NAs represent missing
values — some are just “not available.” NAs can appear in any data object.
The NA can represent a true missing value, or it can result from an operation
to which a value is “not available.” Here are three vectors that contain true
missing values.
x <- c(2, 4, NA, 5); x
y <- c("M", NA, "M", "F"); y
z <- c(F, NA, F, T); z
However, elementary numerical operations on objects that contain NA return
a single NA (“not available”). In this instance, R is saying “An answer is ‘not
available’ until you tell R what to do with the NAs in the data object.” To
remove NAs for a calculation specify the na.rm (“NA remove”) option.
> sum(x) # answer not available
[1] NA
> mean(x) # answer not available
[1] NA
> sum(x, na.rm = TRUE) # better
[1] 11
> mean(x, na.rm = TRUE) # better
[1] 3.666667
Here are more examples where NA means an answer is not available. Inappropriate
coercion returns NA; for example, as.numeric("five") returns NA with a
warning. And codes such as "Unknown" or -999, often used to represent
missing values in raw data, can be replaced globally with NA:
> z2
fname age
1 Tom 56
2 Unknown 34
3 Jerry -999
> # Global replacement
> z2[z2=="Unknown" | z2==-999] <- NA
> z2
fname age
1 Tom 56
2 <NA> 34
3 Jerry NA
When importing ASCII data files using the read.table function, use the
na.strings option to specify what characters are to be converted to NA.
The default setting is na.strings="NA". Blank fields are also considered to
be missing values in logical, integer, numeric, and complex fields. For example,
suppose the data set contains 999, 888, and . to represent missing values,
then import the data like this:
mydat <- read.table("dataset.txt", na.strings = c(999, 888, "."))
If a number, say 999, represents a missing value in one field but a valid value
in another field, then import the data using the as.is=TRUE option. Then
replace the missing values in the data frame one field at a time, and convert
categorical fields to factors.
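Here is a minimal sketch of this approach, assuming a hypothetical data set in which 999 is a missing value code in the age field only:
mydat <- read.table("dataset.txt", header = TRUE, as.is = TRUE)
mydat$age[mydat$age == 999] <- NA #replace only in the age field
mydat$sex <- factor(mydat$sex)    #then convert categorical fields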
There are several functions for working with NA values in data frames. First,
the na.fail function tests whether a data frame contains any NA values,
returning an error message if it contains NAs.
> name <- c("Jose", "Ana", "Roberto", "Isabel", "Jen")
> gender <- c("M", "F", "M", NA, "F")
> age <- c(34, NA, 22, 18, 34)
> df <- data.frame(name, gender, age)
> df
name gender age
1 Jose M 34
2 Ana F NA
3 Roberto M 22
4 Isabel <NA> 18
5 Jen F 34
> na.fail(df) #error because df contains NAs
Error in na.fail.default(df) : missing values in object
> df$gender
[1] M F M <NA> F
Levels: F M
> xtabs(~gender, data = df)
gender
F M
2 2
> df$gender.na <- factor(df$gender, exclude = NULL)
> xtabs(~gender.na, data = df)
gender.na
F M <NA>
2 2 1
Using the original data frame (which can contain NAs), we can index subjects
with ages less than 25.
> df$age # age field
[1] 34 NA 22 18 34
> df[df$age<25, ] # index ages < 25
name gender age
NA <NA> <NA> NA
3 Roberto M 22
4 Isabel <NA> 18
The row that corresponds to the age that is missing (NA) has been converted
to NAs (“not available”) by R. To remove this uninformative row we use the
is.na function.
> df[df$age<25 & !is.na(df$age), ]
name gender age
3 Roberto M 22
4 Isabel <NA> 18
This differs from the na.omit, na.exclude, and complete.cases functions
that remove all missing values from the data frame first.
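For example, here is a quick sketch using the original three fields of df:
> na.omit(df[, c("name", "gender", "age")]) #drops rows 2 and 4
name gender age
1 Jose M 34
3 Roberto M 22
5 Jen F 34
> complete.cases(df[, c("name", "gender", "age")])
[1] TRUE FALSE TRUE FALSE TRUE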
By default, NAs are not tabulated in tables produced by the table and xtabs
functions. The table function can tabulate character vectors and factors.
The xtabs function only works with fields in a data frame. To tabulate NAs
in character vectors using the table function, set the exclude option to
NULL in the table function.
> df$gender.chr <- as.character(df$gender)
> table(df$gender.chr) #by default, NAs are not tabulated
F M
2 2
> table(df$gender.chr, exclude = NULL)
F M <NA>
2 2 1
However, this will not work with factors: we must change the factor levels
first.
> table(df$gender) #does not tabulate NAs
F M
2 2
> table(df$gender, exclude = NULL) #does not work
F M
2 2
> df$gender.na <- factor(df$gender, exclude = NULL) #works
> table(df$gender.na)
F M <NA>
2 2 1
Finally, whereas the exclude option works on character vectors tabulated
with the table function, it does not work on character vectors or factors
tabulated with the xtabs function. In a data frame, we must convert the
character vector to a factor (setting the exclude option to NULL); then the
xtabs function tabulates the NA values.
> xtabs(~gender, data=df, exclude=NULL) # does not work
gender
F M
2 2
> xtabs(~gender.chr, data=df, exclude=NULL) # still does not work
gender.chr
F M
2 2
> df$gender.na <- factor(df$gender, exclude = NULL) #works
> xtabs(~gender.na, data = df)
gender.na
F M <NA>
2 2 1
Statistical models, for example the glm function for generalized linear models,
have default NA behaviors that can be reset locally using the na.action
option in the glm function, or reset globally using the na.action option
setting in the options function.
> options("na.action") # display global setting
$na.action
[1] "na.omit"
12 If na.omit removes cases, the row numbers of the cases form the “na.action” attribute
of the result, of class “omit”. na.exclude differs from na.omit only in the class of the
“na.action” attribute of the result, which is “exclude”. See help for more details.
Fig. 3.4 Displayed are functions to convert calendar date and time data into R date-time
classes (as.Date, strptime, as.POSIXlt, as.POSIXct), and the format function converts
date-time objects into character dates, days, weeks, months, times, etc.
Let’s start with simple date calculations. The as.Date function in R converts
calendar dates (e.g., 11/2/1949) into Date objects, which are numeric vectors
of class Date. The numeric information is the number of days since January
1, 1970—also called Julian dates. However, because calendar date data can
come in a variety of formats, we need to specify the format so that as.Date
does the correct conversion. Study the following analysis carefully.
> bdays <- c("11/2/1959", "1/1/1970")
> bdays
[1] "11/2/1959" "1/1/1970"
> #convert to Julian dates
> bdays.julian <- as.Date(bdays, format = "%m/%d/%Y")
> bdays.julian
[1] "1959-11-02" "1970-01-01"
Although this looks like a character vectors, it is not: it is class “Date” and
mode “numeric”.
> #display Julian dates
> as.numeric(bdays.julian)
[1] -3713 0
> #calculate age as of today's date
> date.today <- Sys.Date()
> date.today
[1] "2005-09-25"
> age <- (date.today - bdays.julian)/365.25
> age
Time differences of 45.89733, 35.73169 days
> #the display of 'days' is not correct
> #truncate number to get "age"
> age2 <- trunc(as.numeric(age))
> age2
[1] 45 35
> #create data frame
> bd <- data.frame(Birthday = bdays, Standard = bdays.julian,
+ Julian = as.numeric(bdays.julian), Age = age2)
> bd
Birthday Standard Julian Age
1 11/2/1959 1959-11-02 -3713 45
2 1/1/1970 1970-01-01 0 35
To summarize, as.Date converted the character vector of calendar dates
into Julian dates (days since 1970-01-01), which are displayed in a standard
format (yyyy-mm-dd). The Julian dates can be used in numerical calculations.
To see the Julian dates, use the as.numeric or julian function. Because the calen-
dar dates to be converted can come in a diversity of formats (e.g., November
2, 1959; 11-02-59; 11-02-1959; 02Nov59), one must specify the format op-
tion in as.Date. Below are selected format options; for a complete list see
help(strptime).
"%a" Abbreviated weekday name.
"%A" Full weekday name.
"%b" Abbreviated month name.
"%B" Full month name.
"%d" Day of the month as decimal number (01-31)
"%j" Day of year as decimal number (001-366).
"%m" Month as decimal number (01-12).
"%U" Week of the year as decimal number (00-53) using the
first Sunday as day 1 of week 1.
"%w" Weekday as decimal number (0-6, Sunday is 0).
"%W" Week of the year as decimal number (00-53) using the
first Monday as day 1 of week 1.
"%y" Year without century (00-99). If you use this on input,
which century you get is system-specific. So don’t!
Often values up to 69 (or 68) are prefixed by 20 and
70-99 by 19.
"%Y" Year with century.
Here are some examples of converting dates with different formats:
> as.Date("November 2, 1959", format = "%B %d, %Y")
[1] "1959-11-02"
> as.Date("11/2/1959", format = "%m/%d/%Y")
[1] "1959-11-02"
> #caution using 2-digit year
> as.Date("11/2/59", format = "%m/%d/%y")
[1] "2059-11-02"
> as.Date("02Nov1959", format = "%d%b%Y")
[1] "1959-11-02"
> #caution using 2-digit year
> as.Date("02Nov59", format = "%d%b%y")
[1] "2059-11-02"
> #standard format does not require format option
> as.Date("1959-11-02")
[1] "1959-11-02"
Notice how Julian dates can be used like any integer:
> as.Date("2004-01-15"):as.Date("2004-01-23")
[1] 12432 12433 12434 12435 12436 12437 12438 12439 12440
> seq(as.Date("2004-01-15"), as.Date("2004-01-18"), by = 1)
[1] "2004-01-15" "2004-01-16" "2004-01-17" "2004-01-18"
So far we have worked with calendar dates; however, we also need to be able to
work with times of the day. Whereas as.Date only works with calendar dates,
the strptime function will accept data in the form of calendar dates and
times of the day (HH:MM:SS, where H = hour, M = minutes, S = seconds).
For example, let's look at the Oswego foodborne illness outbreak that occurred
in 1940. The source of the outbreak was attributed to the church supper that
was served on April 18, 1940. The food was available for consumption from 6
pm to 11 pm. The onset of symptoms occurred on April 18th and 19th. The
meal consumption times and the illness onset times were recorded.
> odat <- read.table("http://www.medepi.net/data/oswego.txt",
+ sep = "", header = TRUE, as.is = TRUE)
> str(odat)
‘data.frame’: 75 obs. of 21 variables:
$ id : int 2 3 4 6 7 8 9 10 14 16 ...
$ age : int 52 65 59 63 70 40 15 33 10 32 ...
$ sex : chr "F" "M" "F" "F" ...
$ meal.time : chr "8:00 PM" "6:30 PM" "7:30 PM" ...
$ ill : chr "Y" "Y" "Y" "Y" ...
$ onset.date : chr "4/19" "4/19" "4/19" "4/18" ...
$ onset.time : chr "12:30 AM" "10:30 PM" ...
...
$ vanilla.ice.cream : chr "Y" "Y" "Y" "Y" ...
$ chocolate.ice.cream: chr "N" "Y" "Y" "N" ...
$ fruit.salad : chr "N" "N" "N" "N" ...
To calculate the incubation period, for ill individuals, we need to subtract
the meal consumption times (occurring on 4/18) from the illness onset times
(occurring on 4/18 and 4/19). Therefore, we need two date-time objects to
do this arithmetic. First, let’s create a date-time object for the meal times:
> # look at existing data for meals
> odat$meal.time[1:5]
[1] "8:00 PM" "6:30 PM" "6:30 PM" "7:30 PM" "7:30 PM"
> # create character vector with meal date and time
> mdt <- paste("4/18/1940", odat$meal.time)
> mdt[1:4]
[1] "4/18/1940 8:00 PM" "4/18/1940 6:30 PM"
[3] "4/18/1940 6:30 PM" "4/18/1940 7:30 PM"
> # convert into standard date and time
> meal.dt <- strptime(mdt, format = "%m/%d/%Y %I:%M %p")
> meal.dt[1:4]
[1] "1940-04-18 20:00:00" "1940-04-18 18:30:00"
[3] "1940-04-18 18:30:00" "1940-04-18 19:30:00"
> # look at existing data for illness onset
> odat$onset.date[1:4]
[1] "4/19" "4/19" "4/19" "4/18"
> odat$onset.time[1:4]
[1] "12:30 AM" "12:30 AM" "12:30 AM" "10:30 PM"
> # create vector with onset date and time
> odt <- paste(paste(odat$onset.date, "/1940", sep=""),
+ odat$onset.time)
> odt[1:4]
[1] "4/19/1940 12:30 AM" "4/19/1940 12:30 AM"
[3] "4/19/1940 12:30 AM" "4/18/1940 10:30 PM"
> # convert into standard date and time
> onset.dt <- strptime(odt, "%m/%d/%Y %I:%M %p")
> onset.dt[1:4]
[1] "1940-04-19 00:30:00" "1940-04-19 00:30:00"
[3] "1940-04-19 00:30:00" "1940-04-18 22:30:00"
> # calculate incubation period
> incub.period <- onset.dt - meal.dt
> incub.period
Time differences of 4.5, 6.0, 6.0, 3.0, 3.0, 6.5, 3.0, 4.0,
6.5, NA, NA, NA, NA, 3.0, NA, NA, NA, 3.0, NA, NA,
...
NA, NA, NA, NA, NA, NA, NA hours
> mean(incub.period, na.rm = T)
Time difference of 4.295455 hours
> median(incub.period, na.rm = T)
Error in Summary.difftime(..., na.rm = na.rm) :
sum not defined for "difftime" objects
> # try ’as.numeric’ on ’incub.period’
> median(as.numeric(incub.period), na.rm = T)
[1] 4
To summarize, we used strptime to convert the meal consumption dates
and times and the illness onset dates and times into date-time objects (meal.dt
and onset.dt) that can be used to calculate the incubation periods by simple
subtraction (assigned the name incub.period).
Notice that incub.period is an atomic object of class difftime:
> str(incub.period)
Class ’difftime’ atomic [1:75] 4.5 6 6 3 3 6.5 3 4 NA ...
..- attr(*, "tzone")= chr ""
..- attr(*, "units")= chr "hours"
This is why we had trouble calculating the median (the median method was
not defined for difftime objects in the version of R used here). We got around
this problem by coercion using as.numeric:
> as.numeric(incub.period)
[1] 4.5 6.0 6.0 3.0 3.0 6.5 3.0 4.0 6.5 NA NA NA NA 3.0
...
The strptime function returns its results as a POSIXlt object. The POSIXlt
list contains the date-time data in human-readable form, in the following
named vectors:
’sec’ 0-61: seconds
’min’ 0-59: minutes
’hour’ 0-23: hours
’mday’ 1-31: day of the month
’mon’ 0-11: months after the first of the year.
’year’ Years since 1900.
’wday’ 0-6 day of the week, starting on Sunday.
’yday’ 0-365: day of the year.
'isdst' Daylight Saving Time flag. Positive if in force,
zero if not, negative if unknown.
Let’s examine the onset.dt object we created from the Oswego data.
> is.list(onset.dt)
[1] TRUE
> names(onset.dt)
[1] "sec" "min" "hour" "mday" "mon" "year" "wday"
[8] "yday" "isdst"
> onset.dt$min
[1] 30 30 30 30 30 0 0 0 0 30 30 15 0 0 0 45 45 0
...
> onset.dt$hour
[1] 0 0 0 22 22 2 1 23 2 10 0 22 22 1 23 21 21 1
...
> onset.dt$mday
[1] 19 19 19 18 18 19 19 18 19 19 19 18 18 19 18 18 18 19
...
> onset.dt$mon
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
...
> onset.dt$year
[1] 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
...
> onset.dt$wday
[1] 5 5 5 4 4 5 5 4 5 5 5 4 4 5 4 4 4 5
...
> onset.dt$yday
[1] 109 109 109 108 108 109 109 108 109 109 109 108 108 109
...
The POSIXlt list contains useful date-time information; however, it is not
in a convenient form for storing in a data frame. Using as.POSIXct we can
convert it to a "continuous time" (POSIXct) object that contains the number
of seconds since 1970-01-01 00:00:00; conversely, as.POSIXlt coerces a
date-time object to POSIXlt.13
13 For more information visit the Portable Application Standards Committee site at
http://www.pasc.org/
> onset.dt.ct <- as.POSIXct(onset.dt)
> onset.dt.ct[1:5]
[1] "1940-04-19 00:30:00 Pacific Daylight Time"
[2] "1940-04-19 00:30:00 Pacific Daylight Time"
[3] "1940-04-19 00:30:00 Pacific Daylight Time"
[4] "1940-04-18 22:30:00 Pacific Daylight Time"
[5] "1940-04-18 22:30:00 Pacific Daylight Time"
> as.numeric(onset.dt.ct[1:5])
[1] -937326600 -937326600 -937326600 -937333800 -937333800
The chron and survival packages have customized functions for dealing
with dates. The survival package comes with the default R installation;
chron is available from CRAN. To learn more about date and time classes
read R News, Volume 4/1, June 2004.14
14 http://cran.r-project.org/doc/Rnews
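For example, using the infert data frame distributed with R, a default write.table call along these lines (the file name is illustrative) writes quoted character fields and row names, producing the listing that follows:
write.table(infert, "infert.dat")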
"1" "0-5yrs" 26 6 1 1 2 1 3
"2" "0-5yrs" 42 1 1 1 0 2 1
"3" "0-5yrs" 39 6 2 1 0 3 4
"4" "0-5yrs" 34 4 2 1 0 4 2
"5" "6-11yrs" 35 3 1 1 1 5 32
...
Because row.names=TRUE by default, the number of field names in the header
(row 1) will be one less than the number of fields in the data rows (starting
with row 2). The default row names are a character vector of integers. The
following code:
write.table(infert, "infert.dat", sep=",", row.names=FALSE)
produces a comma-delimited ASCII text file without row names:
"education","age","parity","induced","case", ...
"0-5yrs",26,6,1,1,2,1,3
"0-5yrs",42,1,1,1,0,2,1
"0-5yrs",39,6,2,1,0,3,4
"0-5yrs",34,4,2,1,0,4,2
"6-11yrs",35,3,1,1,1,5,32
...
Note that the write.csv function produces a comma-delimited data file by
default.
To read the raw data back into R, we would use the scan function. For
example, if the data had been written to data.txt, then the following code
reads the data back into R:
> matrix(scan("data.txt"), ncol=6, byrow=TRUE)
Read 12 items
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 66 54 24 12 4 5
[2,] 33 27 12 6 2 3
Of course, all the labeling was lost.
Data objects can also be exported as R code in an ASCII text file using the
dump and dput functions. This has advantages for complex R objects (e.g.,
arrays, lists) that do not have simple tabular structures, and the R code
makes the raw data human legible.
The dump function exports multiple objects as R code as the next example
illustrates:
> infert.tab1 <- xtabs(~case+parity,data=infert)
> infert.tab2 <- xtabs(~education+parity+case,data=infert)
> infert.tab1 #display matrix
parity
case 1 2 3 4 5 6
0 66 54 24 12 4 5
1 33 27 12 6 2 3
> infert.tab2 #display array
, , case = 0
parity
education 1 2 3 4 5 6
0-5yrs 2 0 0 2 0 4
6-11yrs 28 28 14 8 2 0
12+ yrs 36 26 10 2 2 1
, , case = 1
parity
education 1 2 3 4 5 6
0-5yrs 1 0 0 1 0 2
6-11yrs 14 14 7 4 1 0
12+ yrs 18 13 5 1 1 1
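To export both tables to an ASCII text file as R code, pass a character vector of object names to dump (the file name is illustrative):
> dump(c("infert.tab1", "infert.tab2"), file = "infert-tabs.R")
Sourcing the file (source("infert-tabs.R")) recreates both objects.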
The dput function is similar to the dump function except that the object name
is not written. By default, the dput function prints to the screen:
> dput(infert.tab1)
structure(c(66, 33, 54, 27, 24, 12, 12, 6, 4, 2, 5, 3),
.Dim = c(2L, 6L), .Dimnames = structure(list(case = c("0", "1"),
parity = c("1", "2", "3", "4", "5", "6")), .Names = c("case",
"parity")), class = c("xtabs", "table"), call = quote(xtabs(
formula = ~case + parity, data = infert)))
To export to an ASCII text file, give a new file name as the second argument,
similar to dump. To get back the R code use the dget function:
> dput(infert.tab1, "infert_tab1.R")
> dget("infert_tab1.R")
parity
case 1 2 3 4 5 6
0 66 54 24 12 4 5
1 33 27 12 6 2 3
The foreign package contains functions for exporting R data frames to non-R
ASCII text and binary files. The write.foreign function writes two ASCII
text files: the first file is the data file, and the second file is the code file
for reading the data file. The code file contains either SPSS, Stata, or SAS
programming code. The write.dbf function writes a data frame to a binary
DBF file, which can be read back into R using the read.dbf function. Finally,
the write.dta function writes a data frame to a binary Stata file, which can
be read back into R using the read.dta function.
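For example, a call along these lines (file names illustrative) exports a data frame along with a SAS program for reading it:
> library(foreign)
> write.foreign(infert, datafile = "infert.dat",
+   codefile = "infert.sas", package = "SAS")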
The search pattern is built up from specifying one character at a time. For
example, the pattern "x" looks for the letter x in a text string. Next, consider
a character vector of text strings. We can use the grep function to search for
a pattern in this data vector.
> vec1 <- c("x", "xa bc", "abc", "ax bc", "ab xc", "ab cx")
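For example, grep returns the integer positions of the elements that contain the pattern; here, every element containing the letter x:
> grep("x", vec1)
[1] 1 2 4 5 6
With the option value = TRUE, grep returns the matching elements themselves rather than their positions.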
3.11.3 Concatenation
3.11.4 Repetition
Regular expressions (so far: single characters, character classes, and concate-
nations) can be qualified by whether a pattern can repeat (Table 3.7 on the
next page). For example, the pattern "^f.+t$" matches single, isolated words
that begin with the letter f and end with the letter t (e.g., "fat" or "fast",
but not "ft").
3.11.5 Alternation
Two or more regular expressions (so far: single characters, character classes,
concatenations, and repetitions) may be joined by the infix operator |. The
resulting regular expression can match the pattern of any subexpression. For
example, the World Health Organization (WHO) Global Burden of Disease
(GBD) Study used International Classification of Diseases, 10th Revision
(ICD-10) codes (ref). The GBD Study ICD-10 codes for hepatitis B are the
following:
B16, B16.0, B16.1, B16.2, B16.3, B16.4, B16.5, B16.7, B16.8, B16.9, B17, B17.0,
B17.2, B17.8, B18, B18.0, B18.1, B18.8, B18.9
Notice that B16 and B16.0 are not the same ICD-10 code! The GBD Study
methods were used to study causes of death in San Francisco, California
(ref). Underlying causes of death were obtained from the State of California,
Center for Health Statistics. The ICD-10 code field did not have periods so
that the hepatitis B codes were the following.
B16, B160, B161, B162, B163, B164, B165, B167, B168, B169, B17, B170, B172,
B178, B18, B180, B181, B188, B189
"^B16[0-9]?$|^B17[0,2,8]?$|^B18[0,1,8,9]?$"
This regular expression matches ^B16[0-9]?$ or ^B17[0,2,8]?$ or ^B18[0,1,8,9]?$.
Similar to the first and third pattern, the second regular expression, ^B17[0,2,8]?$,
matches B17, B170, B172, or B178 as isolated text strings.
To see how this works, we can match each subexpression individually and
then as an alternation:
> hepb <- c("B16", "B160", "B161", "B162", "B163", "B164",
+ "B165", "B167", "B168", "B169", "B17", "B170",
+ "B172", "B178", "B18", "B180", "B181", "B188",
+ "B189")
> grep("^B16[0-9]?$", hepb) #match 1st subexpression
[1] 1 2 3 4 5 6 7 8 9 10
> grep("^B17[0,2,8]?$", hepb) #match 2nd subexpression
[1] 11 12 13 14
> grep("^B18[0,1,8,9]?$", hepb) #match 3rd subexpression
[1] 15 16 17 18 19
> #match any subexpression
> grep("^B16[0-9]?$|^B17[0,2,8]?$|^B18[0,1,8,9]?$", hepb)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
A natural use for these pattern matches is for indexing and replacement.
We illustrate this using the 2nd subexpression.
> #indexing
> hepb[grep("^B17[0,2,8]?$", hepb)]
[1] "B17" "B170" "B172" "B178"
> #replacement
> hepb[grep("^B17[0,2,8]?$", hepb)] <- "HBV"
> hepb
[1] "B16" "B160" "B161" "B162" "B163" "B164" "B165" "B167"
[9] "B168" "B169" "HBV" "HBV" "HBV" "HBV" "B18" "B180"
[17] "B181" "B188" "B189"
Using regular expression alternations allowed us to efficiently code over
14,000 death records, with over 900 ICD-10 cause of death codes, into 117
mutually exclusive cause of death categories for our San Francisco study.
Suppose sfdat were the data frame with San Francisco deaths for 2003–2004.
Then the following code would tabulate the deaths caused by hepatitis B:
> sfdat$hepb <- rep("No", nrow(sfdat)) #new field
> get.hepb <- grep("^B16[0-9]?$|^B17[0,2,8]?$|^B18[0,1,8,9]?$",
+ sfdat$icd10)
> sfdat$hepb[get.hepb] <- "Yes"
> table(sfdat$hepb)
No Yes
14125 23
3.11.7 Metacharacters
16 Although the - sign is not a metacharacter, it does have special meaning inside a
character class because it is used to specify a range of characters; e.g., [A-Za-z0-9].
> vec9[grep("[/^]", vec9)]
[1] "8^2" "y^x"
The first character in the class (/) was chosen because it does not match
anything in the data vector; its real purpose is to keep ^ from being the first
character inside the brackets, where it would act as the negation metacharacter.
Placed anywhere else in a character class, ^ is treated literally.
For most epidemiologic applications, the grep function will meet our regu-
lar expression needs. Table 3.9 summarizes other functions that use regular
expressions. Whereas the grep function enables indexing and replacing ele-
ments of a character vector, the sub and gsub functions search and replace
single or multiple pattern matches within the text string elements of a character
vector. Review the following example:
> vec10 <- c("California", "MiSSISSIppi")
> grep("SSI", vec10) #can be used for replacement
[1] 2
> sub("SSI", replacement="ssi", vec10) #replace 1st occurrence
[1] "California" "MissiSSIppi"
> gsub("SSI", replacement="ssi", vec10) #replace all occurrences
[1] "California" "Mississippi"
The regexpr function provides detailed information on the first pattern
match within the text string elements of a character vector. It returns an
integer vector in which -1 indicates no match and the nonzero positive
integers indicate the character position where the first match begins within
a text string, along with a "match.length" attribute in which the nonzero
positive integers indicate the length of each match.
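The gregexpr function reports the same information for every match, returning a list with one component per element of the character vector. For example, a call along these lines (reconstructed; output abbreviated) produces the fragment shown below for the lowercase version of the vector used above:
> gregexpr("ssi", c("California", "Mississippi"))
[[1]]
[1] -1
attr(,"match.length")
[1] -1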
[[2]]
[1] 3 6
attr(,"match.length")
[1] 3 3
Problems
3.1. Using RStudio and the data from Table 3.1 on page 107, create the
following data frame:
> dat
Status Treatment Agegrp Freq
1 Dead Tolbutamide <55 8
2 Survived Tolbutamide <55 98
3 Dead Placebo <55 5
4 Survived Placebo <55 115
5 Dead Tolbutamide 55+ 22
6 Survived Tolbutamide 55+ 76
7 Dead Placebo 55+ 16
8 Survived Placebo 55+ 69
3.2. Select 3 to 5 classmates and collect data on first name, last name, affil-
iation, two email addresses, and today’s date. Using a text editor, create a
data frame with this data.
3.3. Review the United States data on AIDS cases by year available at
http://www.medepi.net/data/aids.txt. Read this data into a data frame.
Graph a calendar time series of AIDS cases.
# Hint
plot(x, y, type = "l", xlab = "x axis label", lwd = 2,
ylab = "y axis label", main = "main title")
3.4. Review the United States data on measles cases by year available at
http://www.medepi.net/data/measles.txt. Read this data into a data
frame. Graph a calendar time series of measles cases using arithmetic and
semi-logarithmic scales.
# Hint
plot(x, y, type = "l", lwd = 2, xlab = "x axis label",
ylab="y axis label", main = "main title")
plot(x, y, type = "l", lwd = 2, xlab = "x axis label", log = "y",
ylab="y axis label", main = "main title")
3.5. Review the United States data on hepatitis B cases by year available at
http://www.medepi.net/data/hepb.txt. Read this data into a data frame.
Using the R code below, plot a time series of AIDS and hepatitis B cases.
matplot(hepb$year, cbind(hepb$cases,aids$cases),
type = "l", lwd = 2, xlab = "Year", ylab = "Cases",
main = "Reported cases of Hepatitis B and AIDS,
United States, 1980-2003")
legend(1980, 100000, legend = c("Hepatitis B", "AIDS"),
lwd = 2, lty = 1:2, col = 1:2)
3.6. Review data from the Evans cohort study in which 609 white males
were followed for 7 years, with coronary heart disease as the outcome of
interest (http://www.medepi.net/data/evans.txt). The data dictionary
is provided in Table 3.10.
a Recode the binary variables (0, 1) into factors with 2 levels.
b Discretize age into a factor with more than 2 levels.
c Create a new hypertension categorical variable based on the current clas-
sification scheme:17
Normal: SBP < 120 and DBP < 80;
Prehypertension: SBP = [120, 140) or DBP = [80, 90);
Hypertension-Stage 1: SBP = [140, 160) or DBP = [90, 100); and
Hypertension-Stage 2: SBP ≥ 160 or DBP ≥ 100.
d Using R, construct a contingency table comparing the old and new hyper-
tension variables.
3.7. Review the California 2004 surveillance data on human West Nile
virus cases available at http://www.medepi.net/data/wnv/wnv2004raw.txt.
Read in the data, taking into account missing values. Convert the calendar
dates into the international standard format. Using the write.table function,
export the data as an ASCII text file.
17 http://www.nhlbi.nih.gov/guidelines/hypertension/phycard.pdf
3.8. On April 19, 1940, the local health officer in the village of Lycoming,
Oswego County, New York, reported the occurrence of an outbreak of acute
gastrointestinal illness to the District Health Officer in Syracuse. Dr. A. M.
Rubin, epidemiologist-in-training, was assigned to conduct an investigation.
(See Appendix A.2 on page 244 for data dictionary.)
When Dr. Rubin arrived in the field, he learned from the health officer
that all persons known to be ill had attended a church supper held on the
previous evening, April 18. Family members who did not attend the church
supper did not become ill. Accordingly, Dr. Rubin focused the investigation
on the supper. He completed interviews with 75 of the 80 persons known to
have attended, collecting information about the occurrence and time of onset
of symptoms, and foods consumed. Of the 75 persons interviewed, 46 persons
reported gastrointestinal illness.
The onset of illness in all cases was acute, characterized chiefly by nau-
sea, vomiting, diarrhea, and abdominal pain. None of the ill persons reported
having an elevated temperature; all recovered within 24 to 30 hours. Approx-
imately 20% of the ill persons visited physicians. No fecal specimens were
obtained for bacteriologic examination. The investigators suspected that this
was a vehicle-borne outbreak, with food as the vehicle. Dr. Rubin put his
data into a line listing.18
The supper was held in the basement of the village church. Foods were
contributed by numerous members of the congregation. The supper began at
6:00 p.m. and continued until 11:00 p.m. Food was spread out on a table and
consumed over a period of several hours. Data regarding onset of illness and
food eaten or water drunk by each of the 75 persons interviewed are provided
in the line listing. The approximate time of eating supper was collected for
only about half the persons who had gastrointestinal illness.
a. Using RStudio, plot the cases by time of onset of illness (include appropriate
labels and a title). What does this graph tell you? (Hint: Process the text
data and then use the hist function.)
b. Are there any cases for which the times of onset are inconsistent with the
general experience? How might they be explained?
c. How could the data be sorted by illness status and illness onset times?
d. Where possible, calculate incubation periods and illustrate their distribu-
tion with an appropriate graph. Use the truehist function in the MASS
package. Determine the mean, median, and range of the incubation period.
4 Analyzing simple epidemiologic data
the variance of the estimate. The variance depends on the natural variability
of the measure (e.g., height in human populations), measurement random
error (e.g., instrument, technician), and sample size. Increasing the sample
size improves estimation precision, narrowing the confidence interval.
Third, we might be interested in knowing whether this point estimate is
consistent with a reference value. Under the assumption that the estimate
comes from a distribution with the mean equal to the reference value (null
hypothesis), we can calculate the probability (two-sided p-value) of getting
the test statistic value or more extreme values. If the null hypothesis is true
and we incorrectly reject the null hypothesis because the p value is lower than
α (arbitrarily chosen—usually 0.05), we call this a Type I error.
Fourth, if it was “consistent” (p > α), was there a sufficient sample size to
detect a meaningful difference if one existed? We don’t want to mistakenly
infer no difference when there was insufficient data to draw this conclusion.
This requires defining what we mean by “meaningful difference” (we will
call it “effect size”1 ), and then calculating the probability of detecting the
effect size (or larger), if it exists. This probability is called statistical power.
An effect size implies an alternative hypothesis. The probability of failing to
detect the effect size under the alternative hypothesis is β, the Type II error.
Power is equal to 1 − β.
Fifth, when we decide to sample a target population to estimate epidemio-
logic measures, our required sample size calculations will depend on whether
the statistic is a one-sample measure (measures of occurrence) or a two-
sample measure (measures of association). For a one-sample measure, we
may be interested in the following:
• Sufficient sample size to achieve a desired confidence interval width (used
for descriptive studies);
• Sufficient sample size for hypothesis testing (meaningful difference from
some reference value). This requires setting type I (α) and type II errors
(β).
4.1.1 Estimation
$$\mathrm{DOR} = \frac{A_1/B_1}{A_0/B_0} = \frac{A_1 B_0}{A_0 B_1}$$
$$\mathrm{EOR} = \frac{A_1/A_0}{B_1/B_0} = \frac{A_1 B_0}{A_0 B_1}$$
was assessed comparing infants with low vs. high antibody titers. The UMLE
disease odds ratio can be calculated:
$$\mathrm{OR}_{\mathrm{UMLE}} = \frac{12/2}{7/9} = \frac{(12)(9)}{(7)(2)} = 7.71$$
The odds of developing diarrhea were 7.7 times higher in infants with low
antibody titers compared to infants with high antibody titers.
The CMLE odds ratio calculation is based on using the hypergeometric
distribution for tables with small numbers such as Table 4.4. In this case, we
treat the margins as fixed and model the distribution of A1 . The hypergeo-
metric equation (Equation 4.1) is the basis of Fisher’s exact test.
$$\Pr(A_1 = a_1 \mid m_1, m_0, n_1, n_0) = \frac{\binom{n_1}{a_1}\binom{n_0}{m_1 - a_1}\,\mathrm{OR}^{a_1}}{\sum_k \binom{n_1}{k}\binom{n_0}{m_1 - k}\,\mathrm{OR}^{k}} \qquad (4.1)$$
Table 4.4 Comparison of diarrhea in 30 breast-fed infants colonized with V. cholerae O1,
by antibody titers in mother's breast milk

                 Antibody level
               Low    High    Total
Diarrhea        12       7       19
No diarrhea      2       9       11
Total           14      16       30
Therefore, with Equation 4.2 we can use the uniroot and fisher.test func-
tions to solve for the MUE odds ratio. This involves two steps:
• Create function for Equation 4.2; and
• Pass equation function and data to uniroot to solve unknown OR.
The equation function must have one unknown, in this case it will be the
odds ratio output from fisher.test.
> dat <- matrix(c(12, 2, 7, 9), 2, 2) #create data object
> or.mue.eqn <- function(x, or) {
+ fisher.test(x, or = or, alt = "less")$p.value -
+ fisher.test(x, or = or, alt = "greater")$p.value
+ }
> uniroot(function(or) {or.mue.eqn(x = dat, or)},
+ interval = c(0, 100))$root
[1] 6.88068
Therefore, $\mathrm{OR}_{\mathrm{MUE}} = 6.88$.
Here is the small-sample adjusted odds ratio formula from Jewell [6]:
$$\mathrm{OR}_{SS} = \frac{A_1 B_0}{(A_0 + 1)(B_1 + 1)} \qquad (4.3)$$
With the data from Table 4.4, $\mathrm{OR}_{SS} = (12)(9)/[(2+1)(7+1)] = 4.5$.
Table 4.5 Summary of odds ratio estimation using data from Table 4.4

Method                      Odds ratio   Comment
Unconditional MLE              7.7       Large sample
Conditional MLE                7.2       Small sample
Median Unbiased Estimate       6.9       Small sample
Small Sample Adjusted          4.5       Small sample; zero in denominator
$$H_0: \mathrm{OR} = 1 \qquad H_1: \mathrm{OR} \neq 1$$
For our hypothesis test we will calculate a two-sided p-value using the
fisher.test function.
> dat <- matrix(c(12, 2, 7, 9), 2, 2) #create data object
> fisher.test(dat, or = 1, alt = "two.sided", conf.int = F)
data: dat
p-value = 0.02589
alternative hypothesis: true odds ratio is not equal to 1
sample estimates:
odds ratio
7.166131
Therefore, under the null hypothesis, the probability of seeing the CMLE
odds ratio of 7.17 or a more extreme value is 0.026. Similarly, we can calculate and
plot two-sided p-values for a range of test hypotheses: this is called the p-
value function, and it enables us to view which test hypotheses are most
consistent with the observed data.
To create and plot a p-value function for our data (Table 4.4), we use the
fisher.test function to calculate the two-side exact p-values for a range of
test hypotheses. The following R code generates the p-value function plot in
Figure 4.1.
dat <- matrix(c(12, 2, 7, 9), 2, 2) #create data object
ors <- seq(0.1, 65, 0.1) #range of test hypotheses (odds ratios)
nn <- length(ors) #length of vector
p.vals <- rep(NA, nn) #empty vector to fill
for(i in 1:nn){
p.vals[i] <- fisher.test(dat, or = ors[i])$p.value
}
plot(ors, p.vals, cex = 0.5, log = "x",
xlab = "Odds ratio (log scale)", ylab = "P value")
abline(v = 6.88068, h = 0.05, lty = 2)
abline(v = 1) #null hypothesis (OR = 1)
In summary, the p-value function gives the two-sided p-values for a range
of test hypotheses, including the null hypothesis. The point at which the
curve peaks is the point estimate. The concentration of the curve around
the point estimate represents the precision of the estimate. Where the p =
0.05 horizontal line intersects the curve delineates the lower and upper 95%
confidence limits. In fact, a confidence interval represents only one horizontal
slice through the p-value function. The (1 − α) confidence interval comprises
the test hypothesis values for which the two-sided p > α.
[Figure 4.1: p-value function for the data in Table 4.4; x-axis: odds ratio (log scale), y-axis: P value.]
$$\Theta_L, \Theta_U = \Theta \pm Z \times SE(\Theta), \qquad (4.4)$$
where ±Z are the quantile values of the standard normal density curve such
that the area between −Z and +Z equals the confidence level, and SE(Θ) is
the standard error of the measure Θ. For a 95% confidence interval,
P(−Z ≤ z ≤ +Z) = 0.95, where Z = 1.96. This means that P(z ≤ −1.96) =
P(z ≥ 1.96) = 0.025. In truth, 1.96 is only an approximation.
To get the precise Z quantile values that correspond to a specific confidence
level, we use the qnorm function. For a given probability p (area under the
normal distribution density curve), where p = P (Z ≤ q), the qnorm function
calculates the corresponding quantile value q. The following code does the
trick:
> conf.level <- 0.90
> qnorm((1 + conf.level)/2)
[1] 1.644854
> conf.level <- 0.95
> qnorm((1 + conf.level)/2)
[1] 1.959964
> conf.level <- 0.99
> qnorm((1 + conf.level)/2)
[1] 2.575829
Now, SE(Θ) is calculated from a formula that has been derived for the
particular measure. For example, for the disease odds ratio from a cohort
study (Table 4.3), here is the standard error formula:
$$SE[\log(\mathrm{OR})] = \sqrt{\frac{1}{A_1} + \frac{1}{B_1} + \frac{1}{A_0} + \frac{1}{B_0}}$$
Confidence interval for crude 2 × 2 table: odds ratio and rate ratio
R provides two functions for calculating exact confidence intervals for crude
2 × 2 tables:
• fisher.test for CMLE odds ratios; and
• poisson.test for rate ratios.
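For example (a sketch using the data from Table 4.4 for the odds ratio and from Table 4.7 below for the rate ratio; only the confidence-interval component is extracted):
> dat <- matrix(c(12, 2, 7, 9), 2, 2)
> fisher.test(dat)$conf.int #exact CI for the CMLE odds ratio
> poisson.test(c(9, 2), c(186, 128))$conf.int #exact CI for the rate ratio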
Tail method
Resampling
$$x = 8, \quad n = 160, \quad \hat{R} = 8/160 = 0.05, \quad R_0 = 0.03$$
$$p = P(R \ge \hat{R} \mid R_0) = P(R \ge 0.05 \mid R_0 = 0.03)$$
How low should the p value be for us to conclude that R̂ is not compatible
with R0? There is no hard and fast rule. However, sometimes a decision rule is
used to reject the null hypothesis if p ≤ α, where α is arbitrarily set to be
small (also called the significance level). Therefore, under the null hypothesis,
we are willing to incorrectly reject the null hypothesis α(×100)% of the time.
This is called the Type I error and is often set at α = 0.05.
This has been an example of a one-sample measure (R̂). These concepts
equally apply to two-sample measures of association such as rate ratios, risk
ratios, and odds ratios. The reference value for these measures of association
will be the null value 1 for no association.
Table 4.6 One-sided p value example: Is a hospital severe complication risk of 5% com-
patible with a reference value ("goal") of 3% or less? Answer: It depends.

Reference (R0)   Sample (n)   Complications (x)   Proportion (R̂ = x/n)   One-sided p value P(R ≥ R̂)
0.03                 20              1                  0.05                    0.4562
0.03                 40              2                  0.05                    0.3385
0.03                 80              4                  0.05                    0.2193
0.03                160              8                  0.05                    0.1101
0.03                640             32                  0.05                    0.0041
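The rows of Table 4.6 can be reproduced with the binom.test function; for example, for n = 160:
> round(binom.test(8, 160, p = 0.03, alternative = "greater")$p.value, 4)
[1] 0.1101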
Table 4.7 Hypothetical cohort data corresponding to the p-value function in Figure 4.2

Exposure   Cases   Person-years
Yes           9        186
No            2        128
[Figure 4.2: p-value function; left axis: P value (0 to 1); right axis: confidence level (100% to 0%); x-axis: Rate Ratio; the 95% confidence interval is indicated at p = 0.05.]
Fig. 4.2 The p-value function for the hypothetical cohort study in Table 4.7. The rate
ratio median unbiased estimate ($\widehat{rr} = 2.9$) is tested against the null hypothesis (rr = 1) and
alternative hypotheses (rr ≠ 1). The alternative hypotheses that correspond to p = 0.05
are the lower and upper 95% confidence limits, respectively.
2 Using methods described in Rothman [7], we calculated median unbiased estimates and
two-sided mid-P exact p-values.
[Figure 4.3: two overlapping sampling distributions centered at µ0 and µ1, with α/2, β, the critical value Z1−α/2, and the power (1 − β) indicated.]
Fig. 4.3 Sample distribution curves to understand the relations of the components used
to calculate sample size
[Margin note: Sample size to be covered in a different section]
4.2 Evaluating a single measure of occurrence
In this section, we address the count of new cases (incidence). For convenience,
we assume the occurrence of new cases follows a Poisson distribution. The
Poisson distribution is a discrete probability distribution with the following
density function:
$$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \qquad (4.5)$$
where X is the random variable, x is the observed count, and λ is the expected
count. And here is the cumulative distribution function:
$$P(X \le x) = \sum_{k=0}^{x} \frac{e^{-\lambda} \lambda^k}{k!}. \qquad (4.6)$$
In R, we use the dpois and ppois functions for the density and distri-
bution functions, respectively. For example, in the United States, the rate of
meningococcal disease is about 1 case per 100,000 population per year [8]. In
San Francisco, with a population of about 800,000, we expect about 8 cases
per year (λ). In a given year, what is the probability of observing exactly 6
cases? Using the dpois function:
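Completing the calculation (dpois takes the observed count and the expected count λ):
> dpois(6, lambda = 8)
[1] 0.1221382
So the probability of observing exactly 6 cases in a given year is about 12%.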
4.2.1.1 Estimation
Normal approximation
Here is the standard error of an incidence rate, and the formula to construct
a confidence interval using a normal approximation:
$$SE(r) = \sqrt{x/PT^2}$$
$$r_L, r_U = r \pm Z \times SE(r)$$
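As a quick sketch, these formulas can be applied to the meningococcal example above (8 observed cases in roughly 800,000 person-years; the numbers are illustrative):
> x <- 8; PT <- 800000
> r <- x/PT
> se.r <- sqrt(x/PT^2)
> round(1e5*(r + c(-1, 1)*qnorm(0.975)*se.r), 2) #95% CI per 100,000 person-years
[1] 0.31 1.69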
Exact approximation
For low counts, an exact confidence interval is more accurate. A normal dis-
tribution is symmetric; however, a low count will have an asymmetric distri-
bution. We will use the Poisson distribution to calculate an exact confidence
interval, but first we start with an approximation to the exact Poisson limits
using Byar's method [3].
Exact methods
4.2.1.3 Comparison
4.2.2.1 Estimation
$$R = \frac{x}{N} \qquad (4.9)$$
Normal approximation
$$SE(R) = \sqrt{\frac{x(N - x)}{N^3}} \qquad (4.10)$$
R does have a function for testing a one-sample proportion. However,
it is more informative to learn how to create your own function using
known statistical formulas. Why? Because R or another software package
may not have the specific function you need, and by learning how to create
your own functions you will be able to solve many more problems effectively
and efficiently. Here is the same analysis using R's prop.test function (which
additionally provides a confidence interval):
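A call along these lines reuses the Table 4.6 numbers (a sketch; note that prop.test applies a continuity correction by default):
> prop.test(x = 8, n = 160, p = 0.03, alternative = "greater")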
Exact approximation
Exact methods
4.2.2.3 Comparison
4.3.1.1 Estimation
$$rr = \frac{r_1}{r_0} = \frac{a/PT_1}{b/PT_0} \qquad (4.12)$$
Normal approximation
$$SE[\log(rr)] = \sqrt{\frac{1}{a} + \frac{1}{b}} \qquad (4.13)$$
Exact approximation

[Margin note: None known to TJA]

Exact methods

[Margin note: Show example using the poisson.test function]

4.3.1.3 Comparison to reference value (hypothesis testing & p-values)

[Margin note: Compare to poisson.exact from the exactci package. Show example using the poisson.test function.]

4.3.1.4 Power and sample size

4.3.2 Comparing two risk estimates: Risk ratio and disease odds ratio

[Margin note: Compare to poisson.exact from the exactci package]
##Table set up
## Disease
##Exposure Yes No Total
## Yes x1 . n1
## No x0 . n0
4.3.2.1 Estimation
$$RR = \frac{R_1}{R_0} = \frac{a/N_1}{b/N_0} \qquad (4.14)$$
$$DOR = \frac{R_1/(1 - R_1)}{R_0/(1 - R_0)} = \frac{a(N_0 - b)}{b(N_1 - a)} \qquad (4.15)$$
Normal approximation
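The output below appears to come from a user-written function whose definition is not shown here. A minimal sketch consistent with the displayed output (the function name risk.ratio and its internals are assumptions; small numerical differences in the last digits are possible) is:

risk.ratio <- function(x, conf.level = 0.95){
  ## x: 2 x 2 table; row 1 = exposed, row 2 = unexposed;
  ## column 1 = cases, column 2 = noncases
  a <- x[1, 1]; N1 <- sum(x[1, ])
  b <- x[2, 1]; N0 <- sum(x[2, ])
  p1 <- a/N1; p0 <- b/N0
  RR <- p1/p0
  se.logrr <- sqrt(1/a - 1/N1 + 1/b - 1/N0) #SE of log(RR)
  Z <- qnorm((1 + conf.level)/2)
  ci <- exp(log(RR) + c(-1, 1)*Z*se.logrr)
  list(x = x, risks = c(p1 = p1, p0 = p0), risk.ratio = RR,
       conf.int = unname(ci), conf.level = conf.level)
}

Applied to the Western Collaborative Group Study data on behavior type and coronary heart disease:

> x <- matrix(c(178, 1411, 79, 1486), 2, 2, byrow = TRUE,
+   dimnames = list("Behavior type" = c("Type A", "Type B"),
+     "CHD Event" = c("Yes", "No")))
> risk.ratio(x)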
$x
CHD Event
Behavior type Yes No
Type A 178 1411
Type B 79 1486
$risks
p1 p0
0.112020 0.050479
$risk.ratio
[1] 2.2191
$conf.int
[1] 1.7186 2.8654
$conf.level
[1] 0.95
Exact approximation

[Margin note: None known to TJA]

Exact methods
A bootstrap version of the risk ratio function might be sketched as follows (the function name rr.boot and the binomial resampling scheme are assumptions; only the final collection step is from the original):

rr.boot <- function(x, conf.level = 0.95, replicates = 5000){
  ## x: 2 x 2 table; row 1 = exposed, row 2 = unexposed; col 1 = cases
  a <- x[1, 1]; N1 <- sum(x[1, ]); b <- x[2, 1]; N0 <- sum(x[2, ])
  p1 <- a/N1; p0 <- b/N0; RR <- p1/p0
  ## resample the number of cases in each group (assumed scheme)
  rr.rep <- (rbinom(replicates, N1, p1)/N1)/(rbinom(replicates, N0, p0)/N0)
  rrbar <- mean(rr.rep)
  ci <- quantile(rr.rep, c((1 - conf.level)/2, (1 + conf.level)/2))
  ##collect
  list(x = x,
       risks = c(p1 = p1, p0 = p0),
       risk.ratio = RR,
       rrboot.mean = rrbar,
       conf.int = unname(ci),
       conf.level = conf.level,
       replicates = replicates)
}
$risks
p1 p0
0.112020 0.050479
$risk.ratio
[1] 2.2191
$rrboot.mean
[1] 2.2443
$conf.int
[1] 1.7235 2.9085
$conf.level
[1] 0.95
$replicates
[1] 5000
[Margin note: See fisher.exact in the exact2x2 package]
4.3.3.1 Estimation
Normal approximation
$$SE[\log(EOR)] = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \qquad (4.18)$$
Exact approximation

Exact methods

[Margin note: Compare fisher.test vs. fisher.exact (exact2x2)]

4.3.3.3 Comparison to reference value (hypothesis testing & p-values)

[Margin note: consider mid-p method from Rothman]

4.3.3.4 Power and sample size
Problems
4.1. Using the 2 × 2 table format displayed in Table 4.11, create and test a
function that takes four integers and calculates the risk ratio (RR) and the
odds ratio (OR).
$$RR = \frac{a/(a + b)}{c/(c + d)}$$
$$OR = \frac{ad}{bc}$$
4.2. Using the 2 × 2 table format displayed in Table 4.11, create a function
that takes a 2 × 2 matrix and calculates the risk ratio (RR) and odds ratio
(OR).
4.3. Create and test a function for a measure of association that includes a 90%
confidence interval and p-value. Use input data from an actual data frame.
Table 4.11 Two-by-two table notation for deaths among subjects who received tolbutamide
and placebo in the University Group Diabetes Program (1970)

            Disease   No disease   Total
Exposed        a          b        a + b
Nonexposed     c          d        c + d
Total        a + c      b + d      a + b + c + d
5 Graphical display of epidemiologic data

5.1 Graphs
[Figure 5.1: Measles cases reported in the United States, 1950–2001; x-axis: Year; y-axis: Reported Cases (x 1,000); an arrow marks "Vaccine Licensed" at 1963.]

In Figure 5.1, the years are displayed on the x-axis and the number
of reported cases each year is on the y-axis. This type of line graph is also
called a time series plot. The number of reported cases and the y-axis scale
have not been transformed and are presented in their native scales; hence,
the term “arithmetic scale.”
The plot function was used to create Figure 5.1. Think of an arithmetic
line graph as a plotted series of {xi , yi } coordinates with a line drawn through
each point. In effect, we provide as arguments to the plot function two
numeric vectors: a vector of xi coordinates and a vector of yi coordinates.
For example, read in the measles data and draw a basic plot (Figure 5.2 on
the facing page).
md <- read.table("http://www.medepi.net/data/measles.txt",
header = TRUE, sep = "")
plot(md$year, md$cases)
What can we learn from studying Figure 5.2? (1) The x and y axis
labels come from the vector names provided as arguments to the plot
function; (2) the x and y axis scales are set by R defaults; (3) the x-y
coordinates are plotted as points using R's default symbol; and (4) the
shape of the figure is approximately square.

[Figure 5.2: default plot of md$cases (0e+00 to 4e+05) against md$year.]
Fig. 5.2 Example of basic default plot

Now try the following on your own:
## change plot type
plot(md$year, md$cases, type = "l")
plot(md$year, md$cases, type = "b")
## remove the axes, then redraw them (intermediate plot call assumed)
plot(md$year, md$cases, type = "b", axes = FALSE)
axis(side = 1)
axis(side = 2)
box()
In the plot function, the option type sets the type of plot that should be
drawn ("p" = points, "l" = lines, "b" = both, etc.). The plot function has
many options, which you can learn about by typing ?plot and ?plot.default
at the R prompt. Setting axes=FALSE removes all axes. The axis function
redraws an axis. The side option specifies which side to redraw (1 = bottom-
side, 2 = left-side, 3 = top-side, 4 = right-side). Calling axis(n) alone re-
draws the original default axis for side n.
The following R code will reproduce Figure 5.1 on page 208 but with a
square plot area. Changing the plot area is left as an exercise.
plot(md$year, md$cases/1000, type = "l",
xlab = "Year", ylab = "Reported Cases (x 1000)",
main = "Measles cases reported in the United States, 1950-2001",
axes = FALSE, xlim = c(1950, 2001), ylim = c(0,1000),
xaxs = "i", yaxs = "i")
axis(1, at = md$year, labels = FALSE, tick = TRUE)
axis(1, at = seq(1950, 2000, 5), labels = seq(1950, 2000, 5),
tick = TRUE, tcl = -1)
axis(2, at = seq(0, 1000, 100), labels = seq(0, 1000, 100),
las = 2, tick = TRUE)
arrows(1963, 660, 1963, 400, length = 0.15, angle = 10)
text(1963, 700, "Vaccine\nLicensed")
box(bty="l")
The plot options xlim and ylim set the display range for the x-axis and
y-axis, respectively. By default, extra length is added to the range of the
x-axis and y-axis. The options xaxs="i" and yaxs="i" remove this extra
length (compare to previous plot to see difference). The first axis function
is used to set the x-axis tick marks at 1-year intervals. The second axis
function is used to add x-axis labels and to set longer x-axis tick marks at
5-year intervals. The option tcl changes the tick-mark length. The third axis function
is used to add y-axis labels and tick marks. The las=2 option sets the labels
at 90 degrees to the y-axis. The arrows function adds an arrow starting at
{x0 = 1963, y0 = 660} and ending at {x1 = 1963, y1 = 400}. The text
function added “Vaccine Licensed” centered at {x = 1963, y = 700}. The
\n in Vaccine\nLicensed inserted a line break between the word “Vaccine”
and “Licensed”. Finally, the box option bty="l" sets the frame in the shape
of the (upper case) letter “L”.
To get help for these function options, type ?box, ?axis, ?arrows, or
?text. If you cannot find a specific option, type ?par. The par function
controls many of the graphical parameters. In fact, it's worth printing out the
par help pages and keeping them handy.
What are the limitations of Figure 5.1 on page 208? From 1950 to
2001, the number of measles cases ranged from a low of 86 cases to a high
of 763,000 cases. With such a wide range, the y-axis must accommodate
several orders of magnitude. Before 1970, we can appreciate the trend in measles
cases; however, after 1970, we cannot visually appreciate the trend. One
graphical solution is to take the natural logarithm of the number of
cases (y-axis values). This compresses the larger numbers and decompresses
the smaller numbers, making the measles trend after 1970 easier to visualize.
For example, try this:
plot(md$year, log(md$cases), type = "l")
A problem with this approach is that the y-axis values are difficult to in-
terpret. Although you can re-label the y-axis, it’s easier to just use the log
option. This code creates a semi-logarithmic plot.
plot(md$year, md$cases, type = "l", log = "y")
The following R code will reproduce the semi-logarithmic graph in Fig-
ure 5.3 on the following page but with a square plot area. Changing the plot
area is left as an exercise. This graph is similar to Figure 5.1 on page 208
except that we spend extra time building the y-axis labels.
cex <- 1 #text size factor (value assumed; referenced by cex.axis and text below)
plot(md$year, md$cases, type = "l", log = "y",
xlim = c(1950, 2001), ylim = c(1, 1100000),
xlab = "Year", ylab = "Reported cases",
main = "Measles cases reported in the United States, 1950-2001",
xaxs = "i", yaxs = "i", axes = FALSE)
axis(1, at = md$year, labels = FALSE, tick = TRUE)
axis(1, at = seq(1950, 2000, 5), labels = seq(1950, 2000, 5),
tick = TRUE, tcl = -1, cex.axis = cex)
axis(2, at = c(seq(1, 10, 1), seq(10, 100, 10),
seq(100, 1000, 100), seq(1000, 10000, 1000),
seq(10000, 100000, 10000), seq(100000, 1000000, 100000)),
labels = FALSE, tick = TRUE, cex.axis = cex)
axis(2, at = c(1, 5, 10, 50, 100, 500, 1000, 5000, 10000,
50000, 100000, 500000, 1000000), labels = c("1", "5",
"10", "50", "100", "500", "1,000", "5,000", "10,000",
"50,000", "100,000", "500,000", "1,000,000"),
las = 2, tick = TRUE, tcl = -0.7, cex.axis = cex)
arrows(1963, 10000, 1963, 300000, length = 0.15, angle = 10)
text(1963, 5000, "Vaccine Licensed (1963)", cex = cex)
box(bty = "l")
The trick to constructing these graphs is to build the code incrementally
and to test it often, debugging errors along the way.
[Figure 5.3: Measles cases reported in the United States, 1950–2001, on a semi-logarithmic scale; x-axis: Year; y-axis: Reported cases (1 to 1,000,000, log scale); annotation: "Vaccine Licensed (1963)".]
5.1.3 Histograms

5.2 Charts

5.2.7 Maps

5.4 Miscellaneous

[Section stub: the locator and identify functions]
Table 6.1 Stratified tables for a cohort study with risk data
Exposure
Outcome Exposed Nonexposed Total
Cases ai bi M1i
Noncases ci di M0i
Total N1i N0i Ti
6 Control confounding with stratification methods
##test function
##read UGDP data
ud <- read.table("http://www.medepi.net/data/ugdp.txt",
header=T, sep=",")
str(ud)
ud[1:6,]
tab <- table(ud$Status,ud$Treatment,ud$Agegrp)[,2:1,2:1]
tab
rr.mh(tab)
$$\mathrm{Var}(RD_{MH}) = \frac{\sum_i \left(\frac{N_{1i} N_{0i}}{T_i}\right)^2 \left[\dfrac{a_i c_i}{N_{1i}^2 (N_{1i} - 1)} + \dfrac{b_i d_i}{N_{0i}^2 (N_{0i} - 1)}\right]}{\left(\sum_i \frac{N_{1i} N_{0i}}{T_i}\right)^2} \qquad (6.4)$$
Table 6.2 Stratified tables for a cohort study with incidence rate data
Exposure
Outcome Exposed Nonexposed Total
Cases ai bi Mi
Person-time at risk P T1i P T0i Ti
$$\mathrm{Var}[\log(IR_{MH})] = \frac{\sum_i \frac{M_i\, PT_{1i}\, PT_{0i}}{T_i^2}}{\left(\sum_i \frac{a_i\, PT_{0i}}{T_i}\right)\left(\sum_i \frac{b_i\, PT_{1i}}{T_i}\right)} \qquad (6.6)$$
$$\mathrm{Var}(ID_{MH}) = \frac{\sum_i \left(\frac{PT_{1i}\, PT_{0i}}{T_i}\right)^2 \left[\dfrac{a_i}{PT_{1i}^2} + \dfrac{b_i}{PT_{0i}^2}\right]}{\left(\sum_i \frac{PT_{1i}\, PT_{0i}}{T_i}\right)^2} \qquad (6.8)$$
$$\mathrm{Var}[\log(OR_{MH})] = \frac{\sum_i G_i P_i}{2\left(\sum_i G_i\right)^2} + \frac{\sum_i (G_i Q_i + H_i P_i)}{2\left(\sum_i G_i\right)\left(\sum_i H_i\right)} + \frac{\sum_i H_i Q_i}{2\left(\sum_i H_i\right)^2} \qquad (6.10)$$
where
$$G_i = \frac{a_i d_i}{T_i}, \quad H_i = \frac{b_i c_i}{T_i}, \quad P_i = \frac{a_i + d_i}{T_i}, \quad Q_i = \frac{b_i + c_i}{T_i}$$
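As an illustration, here is a minimal sketch of a function (the name or.mh and the array orientation are assumptions) that computes the Mantel-Haenszel odds ratio with a confidence interval based on Equation 6.10:

or.mh <- function(tab, conf.level = 0.95){
  ## tab: 2 x 2 x K array oriented as in Table 6.1
  ## (rows = cases/noncases, columns = exposed/nonexposed)
  ai <- tab[1, 1, ]; bi <- tab[1, 2, ]
  ci <- tab[2, 1, ]; di <- tab[2, 2, ]
  Ti <- ai + bi + ci + di
  Gi <- ai*di/Ti; Hi <- bi*ci/Ti
  Pi <- (ai + di)/Ti; Qi <- (bi + ci)/Ti
  or.hat <- sum(Gi)/sum(Hi) #MH odds ratio
  var.log.or <- sum(Gi*Pi)/(2*sum(Gi)^2) +
    sum(Gi*Qi + Hi*Pi)/(2*sum(Gi)*sum(Hi)) +
    sum(Hi*Qi)/(2*sum(Hi)^2)
  Z <- qnorm((1 + conf.level)/2)
  conf.int <- exp(log(or.hat) + c(-1, 1)*Z*sqrt(var.log.or))
  list(OR.mh = or.hat, conf.int = conf.int, conf.level = conf.level)
}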
In this chapter we cover how to use regression models to control for confound-
ing for a continuous and binary outcome. We cover the following models:
• Continuous outcome
– Linear regression
• Binary outcome
– Unconditional logistic regression
– Conditional logistic regression
– Poisson regression
– Cox proportional hazards regression model
Our goal is not to teach statistical methods, but rather to briefly review se-
lected regression models and to implement them in R for conducting epidemi-
ologic analyses. Several excellent biostatistics books cover regression methods
for epidemiologists [9, 10, 11, 6].
7 Control confounding with regression methods
[Figure 7.1: regression plot of y (10 to 90) against the predictors x1 (0 to 50) and x2 (0 to 50).]
$$y_e = a_0 + a_1(1)$$
$$-[y_u = a_0 + a_1(0)]$$
Therefore, for this linear regression equation, the coefficient a1 is equal to the
change in y (or ye − yu) for a unit change in x (or 1 − 0). A regression model
can incorporate multiple variables, where each coefficient is now adjusted for
the presence of the other variables. For example, in Figure 7.1, the coefficient
a1 represents the change in y for a unit change in x1, adjusted for x2; and the
coefficient a2 represents the change in y for a unit change in x2, adjusted for x1.
Fig. 7.2 In epidemiology, we often use two types of regression models. The dependent
variable, y, can undergo no transformation before fitting a regression model (Approach
A), or can undergo a transformation (e.g., natural logarithm) before fitting a regression
model. For example, the measure y (e.g., counts or rates), the assumed distribution of y
(e.g., Poisson), and the transformation of y (e.g., natural logarithm) can come together as
a Poisson regression model.
Again, to interpret the coefficient a1, take the difference of the equation when
x = 1 and the equation when x = 0:
$$\log(y_e) = a_0 + a_1(1)$$
$$-[\log(y_u) = a_0 + a_1(0)]$$
Therefore, now the coefficient a1 is equal to the change in log(y) (or
log(ye) − log(yu)) for a unit change in x (or 1 − 0). Equivalently, a1 is equal
to log(y1/y0), or $e^{a_1} = y_1/y_0$.
In multivariable regression, the general formula becomes
$$\log(y) = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_p x_p$$
Each glm family has a default link function, so when we specify the
family option, it is not necessary to specify the link; i.e., we accept the
default link function. Here is the multiple linear regression analysis:
> ## Latina mothers birthweight of newborns
> bdat <- read.table("http://www.medepi.net/data/birthwt9.txt",
+ header =TRUE, sep = "")
> mod1 = glm(log(bwt) ~ age + parity + gest + sex + cigs + ht + wt,
+ family = gaussian, data = bdat)
> mod1
Coefficients:
(Intercept) age parity gest sex
6.1270855 0.0023539 0.0089355 0.0067584 -0.0326892
cigs ht wt
-0.0030201 0.0000147 0.0020536
> summary(mod1)
Call:
glm(formula = log(bwt) ~ age + parity + gest + sex + cigs + ht +
    wt, family = gaussian, data = bdat)
Table 7.1 Data dictionary for Latina mothers and their newborn infants
Variable Description Possible values
age Maternal age In years (self-reported)
parity Parity Count of previous live births
gest Gestation Reported in days
sex Gender Male = 1, Female = 2
bwt Birth weight Grams
cigs Smoking Number of cigarettes per day
(self-reported)
ht Maternal height Measured in centimeters
wt Maternal weight Pre-pregnancy weight
(self-reported)
r1 Rate of weight gain (1st trimester) Kilograms per day (estimated)
r2 Rate of weight gain (2nd trimester) Kilograms per day (estimated)
r3 Rate of weight gain (3rd trimester) Kilograms per day (estimated)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.39748 -0.07429 0.00522 0.08219 0.39097
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.1270855 0.2157182 28.403 < 2e-16 ***
age 0.0023539 0.0011364 2.071 0.03893 *
parity 0.0089355 0.0062630 1.427 0.15441
gest 0.0067584 0.0005187 13.030 < 2e-16 ***
sex -0.0326892 0.0116243 -2.812 0.00515 **
cigs -0.0030201 0.0017457 -1.730 0.08437 .
ht 0.0000147 0.0009819 0.015 0.98806
wt 0.0020536 0.0006236 3.293 0.00107 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
[Figure 7.3: two panels plotting R/(1−R) (left) and log(R/(1−R)) (right) against R for R between 0 and 1.]
Fig. 7.3 The logit transformation is a double transformation. First, the odds transfor-
mation (R/(1 − R)) unbounds the probabilities near 1; second, the logit transformation
(log(odds)) additionally unbounds the probabilities near 0.
with individual observation times and censoring we use survival analysis (see
Cox proportional hazards regression in Section 7.5 on page 239).
We recall that the logit, or log-odds, is used so that we can map probability
values, which are between 0 and 1, to values between negative infinity
and positive infinity (see Figure 7.3). This is useful for very low or very high
probabilities, which will have asymmetric distributions.
Consider a binary outcome measure Ri with a binomial distribution, where
the log-odds (or logit) is conditioned on a dichotomous predictor variable xi ,
and is represented with this formula:
$$\log\!\left(\frac{R_i}{1 - R_i}\right) = a_0 + a_1 x_i \qquad (7.2)$$
$$\mathrm{logit}(R_i) = a_0 + a_1 x_i \qquad (7.3)$$
where the log-odds is fitted modeling an additive relationship (or the odds
is fitted modeling a multiplicative relationship). The exp(ai ) is an adjusted
odds ratio for the ith covariable, adjusted for the remaining covariables.
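The output below comes from fitting this model with glm; the call shown here is reconstructed from the Call: field of the summary output that follows (adat is the individual-level analysis data frame):
> mod1 <- glm(htn ~ smoking + obesity + snoring, family = binomial,
+             data = adat)
> mod1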
Coefficients:
(Intercept) smokingYes obesityYes snoringYes
-2.37766 -0.06777 0.69531 0.87194
> summary(mod1)
Call:
glm(formula = htn ~ smoking + obesity + snoring, family = binomial,
data = adat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8578 -0.6330 -0.6138 -0.4212 2.2488
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.37766 0.38010 -6.255 3.97e-10 ***
smokingYes -0.06777 0.27812 -0.244 0.8075
$odds.ratio
coef exp(coef) lcl ucl
(Intercept) -2.37766133 0.09276728 0.04404061 0.1954053
smokingYes -0.06777489 0.93447081 0.54179029 1.6117596
obesityYes 0.69530959 2.00432951 1.14632691 3.5045298
snoringYes 0.87193919 2.39154401 1.09731504 5.2122522
Group-level data can come in two forms. Here is the first form:
> adat2
smoking obesity snoring htn Freq
1 No No No No 55
2 Yes No No No 15
3 No Yes No No 7
4 Yes Yes No No 2
5 No No Yes No 152
6 Yes No Yes No 72
7 No Yes Yes No 36
8 Yes Yes Yes No 15
9 No No No Yes 5
10 Yes No No Yes 2
11 No Yes No Yes 1
12 Yes Yes No Yes 0
13 No No Yes Yes 35
14 Yes No Yes Yes 13
15 No Yes Yes Yes 15
16 Yes Yes Yes Yes 8
This data frame provides every covariable pattern and the frequency of that pat-
tern.2 To model this data we use the glm function's weights option (abbreviated weight below):
> mod2 = glm(htn ~ smoking + obesity + snoring, weight = Freq,
+ family = binomial, data = adat2)
> or.glm(mod2)
$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.37766146 0.3801835 -6.2539830 4.001145e-10
smokingYes -0.06777489 0.2781237 -0.2436861 8.074739e-01
obesityYes 0.69530960 0.2850849 2.4389559 1.472976e-02
snoringYes 0.87193932 0.3975727 2.1931569 2.829607e-02
$odds.ratio
coef exp(coef) lcl ucl
(Intercept) -2.37766146 0.09276726 0.04403337 0.1954374
smokingYes -0.06777489 0.93447081 0.54178428 1.6117775
obesityYes 0.69530960 2.00432951 1.14631607 3.5045629
snoringYes 0.87193932 2.39154432 1.09714478 5.2130624
Here is an alternative form for group-level data, in which each covariable
pattern appears once along with the counts of subjects with and without the
outcome (data of this form are typically modeled by giving a two-column
matrix such as cbind(yes, no) as the glm response):
> adat3
$odds.ratio
coef exp(coef) lcl ucl
(Intercept) -2.37766146 0.09276726 0.04403329 0.1954377
smokingYes -0.06777489 0.93447081 0.54178381 1.6117789
obesityYes 0.69530960 2.00432951 1.14631567 3.5045641
snoringYes 0.87193932 2.39154432 1.09714274 5.2130721
To summarize, we showed how to use the glm function to conduct uncon-
ditional logistic regression for individual-level data and two forms of group-
level data. We also created the or.glm function to calculate odds ratios and
confidence intervals.
[Truncated output: a coefficient table with significance codes for smkCurrent, cut(sbp, 2)(140,160] (***), and ecgAbnormal (*), apparently from the conditional logistic regression example based on the MI case-control data set.]
$$\log(\mathrm{rate}) = a_0 + a_1 x$$
$$\log(r_1) = a_0 + a_1(1)$$
$$-[\log(r_0) = a_0 + a_1(0)]$$
$$ISR = (SIR)(R^S_{crude}) = \frac{\text{Observed}}{\text{Expected}}\, R^S_{crude} = \frac{\sum_i A_i}{\sum_i PT_i R^S_i}\, R^S_{crude},$$
where the superscript S denotes the standard population, the A_i are the
observed counts in the study population, the PT_i its person-time, and the
R^S_i the stratum-specific rates in the standard population.
In this example, we use the dataset of U.S. white male population estimates
and male cancer deaths in 1940 compared to 1960.
> #enter data
> dth60 <- c(141, 926, 1253, 1080, 1869, 4891, 14956, 30888,
+ 41725, 26501, 5928)
> pop60 <- c(1784033, 7065148, 15658730, 10482916, 9939972,
+ 10563872, 9114202, 6850263, 4702482, 1874619,
+ 330915)
> dth40 <- c(45, 201, 320, 670, 1126, 3160, 9723, 17935,
+ 22179, 13461, 2238)
> pop40 <- c(906897, 3794573, 10003544, 10629526, 9465330,
+ 8249558, 7294330, 5022499, 2920220, 1019504,
+ 142532)
> #housekeeping
> tab <- cbind(pop40, dth40, pop60, dth60)
> agelabs <- c("<1", "1-4", "5-14", "15-24", "25-34",
+ "35-44", "45-54", "55-64", "65-74", "75-84",
+ "85+")
> rownames(tab) <- agelabs
> #calculate crude rates
> CDR.1940 <- sum(dth40)/sum(pop40)
> CDR.1960 <- sum(dth60)/sum(pop60)
> #display data and crude rates
> tab
pop40 dth40 pop60 dth60
<1 906897 45 1784033 141
1-4 3794573 201 7065148 926
5-14 10003544 320 15658730 1253
15-24 10629526 670 10482916 1080
25-34 9465330 1126 9939972 1869
35-44 8249558 3160 10563872 4891
45-54 7294330 9723 9114202 14956
55-64 5022499 17935 6850263 30888
65-74 2920220 22179 4702482 41725
75-84 1019504 13461 1874619 26501
85+ 142532 2238 330915 5928
> round(100000*c(CDR.1940=CDR.1940, CDR.1960=CDR.1960), 1)
CDR.1940 CDR.1960
119.5 166.1
Now, we calculate the indirect standardized rate for 1940 using the 1960
data as the standard. Using the data we prepared above, we do the calcula-
tions in two steps:
> # sir calculation
> Ri.1960 <- dth60/pop60
> CDR.1960 <- sum(dth60)/sum(pop60)
> SIR.1940 <- sum(dth40)/sum(pop40*Ri.1960)
> ISR.1940 <- SIR.1940*CDR.1960
> c(SIR.1940=SIR.1940, ISR.1940=100000*ISR.1940,
+ CDR.1960=100000*CDR.1960)
SIR.1940 ISR.1940 CDR.1960
0.8305351 137.9414537 166.0874444
Now, to use Poisson regression we need to prepare our male cancer death
data as a data frame, and set 1960 as the standard (reference) population.
> #create data frame for poisson regression
> year <- factor(c(rep(1960, 11), rep(1940, 11)),
+ levels = c(1960, 1940)) #1960 as reference
> pop <- c(pop60, pop40)
> deaths <- c(dth60, dth40)
> age <- factor(rep(agelabs, 2), levels=agelabs)
> cad <- data.frame(year, pop, deaths, age)
> cad
year pop deaths age
1 1960 1784033 141 <1
2 1960 7065148 926 1-4
3 1960 15658730 1253 5-14
...
20 1940 2920220 22179 65-74
21 1940 1019504 13461 75-84
22 1940 142532 2238 85+
We are now in the position to use the Poisson regression model to estimate
the crude death rates for the 1940 and 1960 data.
> pmod <- glm(deaths~year, family = poisson(link="log"), data = cad,
+ offset = log(pop))
> summary(pmod)
Call:
glm(formula = deaths ~ year, family = poisson(link = "log"),
data = cad, offset = log(pop))
Deviance Residuals:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.400411 0.002772 -2309.11 <2e-16 ***
year1940 -0.328958 0.004664 -70.53 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
> pmod$coef
(Intercept) year1940
-6.4004110 -0.3289584
> cdr.1940 <- exp(pmod$coeff[1] + pmod$coeff[2])
> cdr.1940*100000 #from model
(Intercept)
119.5286
> CDR.1940*100000 #from arithmetic
[1] 119.5286
> cdr.1960 <- exp(pmod$coeff[1])
> cdr.1960*100000 #from model
(Intercept)
166.0874
> CDR.1960*100000 #from arithmetic
[1] 166.0874
We now use the Poisson model to calculate an adjusted rate ratio compar-
ing the 1940 population to the 1960 population (standard). This rate ratio
closely approximates the SIR that, when multiplied by the 1960 crude rate,
gives us the indirect standardized rate.
> pmod2 <- glm(deaths ~ year + age, family= poisson(link="log"),
+ data = cad, offset = log(pop))
> summary(pmod2)
Call:
glm(formula = deaths ~ year + age, family = poisson(link = "log"),
data = cad, offset = log(pop))
Deviance Residuals:
Min 1Q Median 3Q Max
-10.51100 -3.69907 -0.02425 3.15316 8.81875
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.521870 0.073337 -129.838 < 2e-16 ***
year1940 -0.182227 0.004679 -38.945 < 2e-16 ***
age1-4 0.408585 0.079143 5.163 2.44e-07 ***
age5-14 -0.110780 0.077538 -1.429 0.15308
age15-24 0.211467 0.077125 2.742 0.00611 **
age25-34 0.830261 0.075569 10.987 < 2e-16 ***
age35-44 1.841193 0.074167 24.825 < 2e-16 ***
age45-54 3.099207 0.073601 42.108 < 2e-16 ***
age55-64 4.101147 0.073464 55.825 < 2e-16 ***
age65-74 4.806312 0.073430 65.454 < 2e-16 ***
age75-84 5.299837 0.073494 72.112 < 2e-16 ***
age85+ 5.513262 0.074154 74.349 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
> pmod2$coef
(Intercept) year1940 age1-4 age5-14 age15-24
-9.5218697 -0.1822271 0.4085846 -0.1107800 0.2114672
age25-34 age35-44 age45-54 age55-64 age65-74
0.8302607 1.8411928 3.0992069 4.1011465 4.8063119
age75-84 age85+
5.2998370 5.5132617
> exp(pmod2$coeff[2]) #SIR from Poisson model
year1940
0.833412
> SIR.1940 #SIR from arithmetic
[1] 0.8305351
Poisson regression seems like a lot of work. However, an advantage is that
we have standard error estimates for constructing confidence intervals.
> cdr.1960 #crude death rate for 1960 from Poisson model
(Intercept)
0.001660874
> conf.level <- 0.95
> Z <- qnorm((1+conf.level)/2)
> sy <- summary(pmod2)$coef #this includes standard errors
> log.sir.1940 <- sy["year1940","Estimate"]
> sir.1940 <- exp(log.sir.1940)
> SE.log.sir.1940 <- sy["year1940","Std. Error"]
> LCL.sir.1940 <- exp(log.sir.1940 - Z*SE.log.sir.1940)
> UCL.sir.1940 <- exp(log.sir.1940 + Z*SE.log.sir.1940)
> isr.1940 <- cdr.1960*c(sir.1940, LCL.sir.1940, UCL.sir.1940)
> names(isr.1940) <- c("ISR","LCL","UCL")
> round(isr.1940*100000, 1)
ISR LCL UCL
138.4 137.2 139.7
Recall that the previously calculated (arithmetic) ISR for 1940 was 137.9 per
100,000 person-years.
library(survival)
pbc2 <- pbc
pbc2[pbc==-9] <- NA
str(pbc2)
From 1980 to 1990, data were collected on 427 Latina mothers who gave birth
at the University of California, San Francisco [12, 13]. Data were collected
on the characteristics of the mothers and their newborn infants (Table A.1).
Mothers were weighed at each prenatal visit. Rate of weight gain during each
trimester was based on a linear regression interpolation. The data set can
be viewed and downloaded from http://www.medepi.net/data/birthwt9.
txt.
Table A.1 Data dictionary for Latina mothers and their newborn infants
Variable Description Possible values
age Maternal age In years (self-reported)
parity Parity Count of previous live births
gest Gestation Reported in days
sex Gender Male = 1, Female = 2
bwt Birth weight Grams
cigs Smoking Number of cigarettes per day
(self-reported)
ht Maternal height Measured in centimeters
wt Maternal weight Pre-pregnancy weight
(self-reported)
r1 Rate of weight gain (1st trimester) Kilograms per day (estimated)
r2 Rate of weight gain (2nd trimester) Kilograms per day (estimated)
r3 Rate of weight gain (3rd trimester) Kilograms per day (estimated)
A Available data sets
On April 19, 1940, the local health officer in the village of Lycoming, Oswego
County, New York, reported the occurrence of an outbreak of acute gastroin-
testinal illness to the District Health Officer in Syracuse. Dr. A. M. Rubin,
epidemiologist-in-training, was assigned to conduct an investigation.
When Dr. Rubin arrived in the field, he learned from the health officer
that all persons known to be ill had attended a church supper held on the
previous evening, April 18. Family members who did not attend the church
supper did not become ill. Accordingly, Dr. Rubin focused the investigation
on the supper. He completed interviews with 75 of the 80 persons known to
have attended, collecting information about the occurrence and time of onset
of symptoms, and foods consumed. Of the 75 persons interviewed, 46 persons
reported gastrointestinal illness.
The onset of illness in all cases was acute, characterized chiefly by nausea,
vomiting, diarrhea, and abdominal pain. None of the ill persons reported
having an elevated temperature; all recovered within 24 to 30 hours.
Approximately 20% of the ill persons were seen by physicians. No fecal
specimens were obtained for bacteriologic examination.
The supper was held in the basement of the village church. Foods were
contributed by numerous members of the congregation. The supper began at
6:00 p.m. and continued until 11:00 p.m. Food was spread out on a table and
consumed over a period of several hours. Data regarding onset of illness and
food eaten or water drunk by each of the 75 persons interviewed are provided
in the attached line listing (Oswego dataset). The approximate time of eating
supper was collected for only about half the persons who had gastrointestinal
illness.
The data set can be viewed and downloaded from http://www.medepi.
net/data/oswego.txt. The data dictionary is provided in Table A.2 on the
facing page.
The Evans County data set is used to demonstrate standard (unconditional)
logistic regression [15]. The data are from a cohort study in which 609
white males were followed for 7 years, with coronary heart disease as the
outcome of interest.
The data set can be viewed and downloaded from http://www.medepi.
net/data/evans.txt. The data dictionary is provided in Table A.4 on the
following page.
Table A.3 Data dictionary for Western Collaborative Group Study data set
Variable Variable name Variable type Possible values
id Subject ID Integer 2001–22101
age0 Age Continuous 39–59 years
height0 Height Continuous 60–78 in
weight0 Weight Continuous 78–320 lb
sbp0 Systolic blood pressure Continuous 98–230 mm Hg
dbp0 Diastolic blood pressure Continuous 58–150 mm Hg
chol0 Cholesterol Continuous 103–645 mg/100 ml
behpat0 Behavior pattern Categorical 1 = Type A1
2 = Type A2
3 = Type B1
4 = Type B2
ncigs0 Smoking Integer Cigarettes/day
dibpat0 Behavior pattern Categorical 0 = Type B
1 = Type A
chd69 Coronary heart disease event Categorical 0 = None
1 = Yes
typechd Coronary heart disease event Categorical 0 = No CHD event
1 = Symptomatic MI
2 = Silent MI
3 = Classical angina
time169 Observation (follow up) time Continuous 18–3430 days
arcus0 Corneal arcus Categorical 0 = None
1 = Yes
Table A.5 Data dictionary for myocardial infarction (MI) case-control data set
Variable Variable name           Variable type       Possible values
match    Matching strata         Integer             1–39
person   Subject identifier      Integer             1–117
mi       Myocardial infarction   Categorical-nominal 0 = No, 1 = Yes
smk      Smoking status          Categorical-nominal 0 = Not current smoker, 1 = Current smoker
sbp      Systolic blood pressure Categorical-ordinal 120, 140, or 160
ecg      Electrocardiogram       Categorical-nominal 0 = No abnormality, 1 = Abnormality
The myocardial infarction (MI) data set [15] is used to demonstrate condi-
tional logistic regression. The study is a case-control study that involves 117
subjects in 39 matched strata (matched by age, race, and sex). Each stratum
contains three subjects, one of whom is a case diagnosed with myocardial
infarction and the other two are matched controls.
The data set can be viewed and downloaded from http://www.medepi.
net/data/mi.txt. The data dictionary is provided in Table A.5.
Additional data sets used in this book can be viewed and downloaded from
the following locations:
http://www.medepi.net/data/aids.txt
http://www.medepi.net/data/hepb.txt
http://www.medepi.net/data/measles.txt
http://www.medepi.net/data/ugdp.txt
http://www.medepi.net/data/h1n1panflu23jul09usa.txt
Read data
wnv <- read.table("http://www.medepi.net/data/wnv/wnv2004raw.txt",
sep = ",", header = TRUE, na.strings = ".")
str(wnv) #display data set structure
head(wnv) #display first 6 lines
edit(wnv) #browse data frame
fix(wnv) #browse with ability to edit (be careful!!!)
barplot(sexage, legend.text=TRUE,
xlab="Age", ylab="Frequency", main="title")
barplot(sexage, legend.text=TRUE, beside=TRUE,
xlab="Age", ylab="Frequency", main="title")
barplot(t(sexage), legend.text=TRUE, ylim=c(0, 650),
xlab="Sex", ylab="Frequency", main="title")
barplot(t(sexage), legend.text=TRUE, beside=TRUE, ylim=c(0, 300),
xlab="Sex", ylab="Frequency", main="title")
From the main menu select Packages > Install Package(s). Select a CRAN
mirror near you, then select the epitools package.
library(epitools) #load ’epitools’; only needed once per session
tab.age3 <- xtabs(~age3 + death, data = wnv)
epitab(tab.age3) #default is odds ratio
epitab(tab.age3, method = "riskratio")
prop.table(tab.age3, 1) #display row distribution (2=column)
prop.test(tab.age3[,2:1]) #remember to reverse columns
chisq.test(tab.age3) #Chi-square test
fisher.test(tab.age3) #Fisher exact test
Single variable
sink("c:/temp/job.log")
print(x)
sink()
capture.output(cbind(x, y),
file="c:/temp/job.log", append=TRUE)
Multivariable analysis
library(survival) #provides the clogit function
chd <- read.table("http://www.medepi.net/data/mi.txt",
header=TRUE)
head(chd)
chd$mi2 <- ifelse(chd$mi=="Yes", 1, 0) #re-code case status
mod1 <- clogit(mi2~smk+strata(match), data=chd)
summary(mod1)
mod2 <- clogit(mi2~smk+sbp+strata(match), data=chd)
summary(mod2)
mod3 <- clogit(mi2~smk+sbp+ecg+strata(match), data=chd)
summary(mod3)
anova(mod1,mod2,mod3, test="Chisq") #compare nested models
“Good programmers write good code, great programmers borrow good code.”
Control flow involves one or more decision points. The simplest decision point
goes like this: if a condition is TRUE, do {this} and then continue; if it is
FALSE, do not do {this} and then continue. When R continues, the next R
expression can be any valid expression, including another decision point.
Up to now the if condition had only one possible response. If there are two
mutually exclusive possible responses, add an else statement:
if(TRUE) {
execute these R expressions
} else {
execute these R expressions
}
Here is an example:
> x <- c(1, 2, NA, 4, 5);
> y <- c(1, 2, 3, 4, 5)
> if(any(is.na(x))) {
+ x[is.na(x)] <- 999; cat("NAs replaced\n")
+ } else {cat("No missing values; no replacement\n")}
NAs replaced
> if(any(is.na(y))) {
+ y[is.na(y)] <- 999; cat("NAs replaced\n")
+ } else {cat("No missing values; no replacement\n")}
No missing values; no replacement
> x
[1] 1 2 999 4 5
> y
[1] 1 2 3 4 5
Therefore, use the if and else combination when one needs to evaluate exactly
one of two possible collections of R expressions.
If one needs to evaluate at most one of two possible collections of R
expressions, then use the following pattern:
if(TRUE) {
execute these R expressions
} else if(TRUE) {
execute these R expressions
}
The if and else functions can be combined to achieve any desired control
flow.
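For example, chaining if, else if, and else lets one choose among three or
more mutually exclusive alternatives. Here is a minimal sketch using a
hypothetical x:
x <- -3
if(x < 0) {
cat("x is negative\n")
} else if(x == 0) {
cat("x is zero\n")
} else {
cat("x is positive\n")
}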
The “short circuit” && and || logical operators are used for control flow in if
functions. If logical vectors are provided, only the first element of each vector
is used. Therefore, for element-wise comparisons of 2 or more vectors, use the
& and | operators but not the && and || operators (discussed in Chapter 2).
For if function comparisons use the && and || operators.
Suppose we want to square the elements of a numeric vector but not if it
is a matrix.
> x <- 1:5
> y <- matrix(1:4, 2, 2)
> if(is.numeric(x) && !is.matrix(x)) {
+ x^2
+ } else cat("Either not numeric or is a matrix\n")
[1] 1 4 9 16 25
> if(!is.matrix(y) && is.numeric(y)) {
+ y^2
+ } else cat("Either not numeric or is a matrix\n")
Either not numeric or is a matrix
The && and || operators are called “short circuit” operators because not
all of their arguments may be evaluated: moving from left to right, only
enough arguments are evaluated to determine whether the if condition is TRUE
or FALSE. This can save considerable time if some of the arguments are
complex functions that require significant computing time to evaluate to
either TRUE or FALSE. In the previous example, because !is.matrix(y)
evaluated to FALSE, it was not necessary to evaluate is.numeric(y).
C.2.3 Looping
For example, consider using a for loop to sum the integers 1 through 10:
> x <- 1:10
> xsum <- 0
> for(i in 1:10){
+ xsum <- xsum + x[i]
+ }
> xsum
[1] 55
A much better approach is to use the sum function:
> sum(x)
[1] 55
Unless it is absolutely necessary, we avoid looping.
Looping is necessary when (1) there is no R function to conduct a vectorized
calculation, or (2) the result for an element depends on the result of a
preceding element, which may not be known beforehand (e.g., when it is
the result of a random process).
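As a sketch of the second situation, suppose we count how many Bernoulli
trials occur up to and including the first success. The number of iterations
is itself the result of a random process, so it cannot be known beforehand
(the 0.2 success probability and the seed are arbitrary choices for
illustration):
set.seed(1) #for reproducibility only
trials <- 0
success <- FALSE
while(!success) {
trials <- trials + 1
success <- runif(1) < 0.2 #each trial succeeds with probability 0.2
}
trials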
[1] "Luisito"
[1] "Angela"
The break expression will break out of a for or while loop if a condition
is met, and transfers control to the first statement outside of the inner-most
loop. Here is the general syntax:
for(i in somevector) {
do some calculation with ith element of somevector
if(TRUE) break
}
The next expression halts the processing of the current iteration and
advances the looping index. Here is the general syntax:
for(i in somevector) {
do some calculation with ith element of somevector
if(TRUE) next
}
Both break and next apply only to the innermost of nested loops.
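Here is a minimal sketch combining both expressions: negative elements are
skipped with next, and the loop stops entirely at the first missing value
with break:
x <- c(2, -1, 5, NA, 7)
for(i in x) {
if(is.na(i)) break #exit the loop at the first NA
if(i < 0) next     #skip negative elements
print(i)
}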
In the next example we nest two for loops to generate a multiplication table
for the integers 6 to 10:
> x <- 6:10
> mtab <- matrix(NA, 5, 5)
> rownames(mtab) <- x
> colnames(mtab) <- x
> for(i in 1:5){
+ for(j in 1:5){
+ mtab[i, j] <- x[i]*x[j]
+ }
+ }
> mtab
6 7 8 9 10
6 36 42 48 54 60
7 42 49 56 63 70
8 48 56 64 72 80
9 54 63 72 81 90
10 60 70 80 90 100
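Consistent with our advice to avoid looping when possible, the same table
can be produced without explicit loops using the outer function (a sketch):
mtab2 <- outer(x, x)          #outer product: all pairwise x[i]*x[j]
dimnames(mtab2) <- list(x, x) #label rows and columns
all(mtab == mtab2)            #TRUE: identical results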
As a motivating example, we calculate an odds ratio for spinach consumption
and illness from the Oswego data, step by step at the console:
> ## Prepare input
> tab1 = xtabs(~ ill + spinach, data = oswego)
> aa = tab1[1, 1]
> bb = tab1[1, 2]
> cc = tab1[2, 1]
> dd = tab1[2, 2]
>
> ## Do calculations
> crossprod.OR = (aa*dd)/(bb*cc)
>
> ## Collect results
> list(data = tab1, odds.ratio = crossprod.OR)
$data
spinach
ill N Y
N 12 17
Y 20 26
$odds.ratio
[1] 0.9176471
Now that we are familiar with what it takes to calculate an odds ratio from
a 2-way table, we can convert the code into a function and load it at the R
console. Here is the new function:
myOR = function(x){
## Prepare input
## x = 2x2 table amenable to cross-product
aa = x[1, 1]
bb = x[1, 2]
cc = x[2, 1]
dd = x[2, 2]
## Do calculations
crossprod.OR = (aa*dd)/(bb*cc)
## Collect results
list(data = x, odds.ratio = crossprod.OR)
}
Now we can test the function:
> tab.test = xtabs(~ ill + spinach, data = oswego)
> myOR(tab.test)
$data
spinach
ill N Y
N 12 17
Y 20 26
$odds.ratio
[1] 0.9176471
Next, we extend the function to also return a confidence interval for the
odds ratio:
myOR2 = function(x, conf.level){
## Prepare input
## x = 2x2 table amenable to cross-product
if(missing(conf.level)) stop("Must specify confidence level")
aa = x[1, 1]
bb = x[1, 2]
cc = x[2, 1]
dd = x[2, 2]
Z <- qnorm((1 + conf.level)/2)
## Do calculations
logOR <- log((aa*dd)/(bb*cc))
SE.logOR <- sqrt(1/aa + 1/bb + 1/cc + 1/dd)
OR <- exp(logOR)
CI <- exp(logOR + c(-1, 1)*Z*SE.logOR)
## Collect results
list(data = x, odds.ratio = OR, conf.int = CI)
}
Notice that conf.level is a new argument with no default value. If a
user forgets to specify a confidence level, the following line (included in
the function above) handles this possibility:
if(missing(conf.level)) stop("Must specify confidence level")
Now we test this function:
> tab.test = xtabs(~ ill + spinach, data = oswego)
> myOR2(tab.test)
Error in myOR2(tab.test) : Must specify confidence level
> myOR2(tab.test, 0.95)
$data
spinach
ill N Y
N 12 17
Y 20 26
$odds.ratio
[1] 0.9176471
$conf.int
[1] 0.3580184 2.3520471
If an argument has a usual value, then specify this as an argument default
value:
myOR3 = function(x, conf.level = 0.95){
## Prepare input
## x = 2x2 table amenable to cross-product
aa = x[1, 1]
bb = x[1, 2]
cc = x[2, 1]
dd = x[2, 2]
Z <- qnorm((1 + conf.level)/2)
## Do calculations
logOR <- log((aa*dd)/(bb*cc))
SE.logOR <- sqrt(1/aa + 1/bb + 1/cc + 1/dd)
OR <- exp(logOR)
CI <- exp(logOR + c(-1, 1)*Z*SE.logOR)
## Collect results
list(data = x, odds.ratio = OR, conf.int = CI)
}
We test our new function:
> tab.test = xtabs(~ ill + spinach, data = oswego)
> myOR3(tab.test)
$data
spinach
ill N Y
N 12 17
Y 20 26
$odds.ratio
[1] 0.9176471
$conf.int
[1] 0.3580184 2.3520471
Specifying a different confidence level overrides the default:
> myOR3(tab.test, conf.level = 0.90)
$data
spinach
ill N Y
N 12 17
Y 20 26
$odds.ratio
[1] 0.9176471
$conf.int
[1] 0.4165094 2.0217459
On occasion we will have a function nested inside one of our functions and we
need to be able to pass optional arguments to this nested function. This com-
monly occurs when we write functions for customized graphics but only wish
to specify some arguments for the nested function and leave the remaining
arguments optional. For example, consider this function:
myplot = function(x, y, type = "b", ...){
plot(x, y, type = type, ...)
}
When using myplot, one only needs to provide the x and y arguments. The
type option has been set to a default value of "b". The ... argument will
pass any optional arguments to the nested plot function. Of course, the
optional arguments must be valid options for the plot function.
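For example, a hypothetical call might pass the col and main options through
to the nested plot function:
myplot(1:10, (1:10)^2, col = "red", main = "Options passed via ...")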
The variables which occur in the body of a function can be divided into
three classes: formal parameters, local variables, and free variables. The
formal parameters of a function are those occurring in the argument list of
the function. Their values are determined by the process of binding the actual
function arguments to the formal parameters. Local variables are those whose
values are determined by the evaluation of expressions in the body of the
function. Variables which are not formal parameters or local variables are
called free variables. Free variables become local variables if they are
assigned to. Consider the following function definition.
f <- function(x){
y <- 2*x
print(x)
print(y)
print(z)
}
In this function, x is a formal parameter, y is a local variable and z is a free
variable. In R the free variable bindings are resolved by first looking in the
environment in which the function was created. This is called lexical scope.
If the free variable is not defined there, R looks in the enclosing environment.
For the function f this would be the global environment (workspace).
To understand the implications of lexical scope consider the following:
> rm(list = ls())
> ls()
character(0)
> f <- function(x){
+ y <- 2*x
+ print(x)
+ print(y)
+ print(z)
+ }
> f(5)
[1] 5
[1] 10
Error in print(z) : object ’z’ not found
> z = 99
> f(5)
[1] 5
[1] 10
[1] 99
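A common consequence of lexical scope is that a function created inside
another function “remembers” the environment in which it was created. The
following sketch (not from the text above) illustrates the idea:
make.power <- function(p){
function(x) x^p #p is a free variable here, bound in the environment
}               #in which the inner function was created
square <- make.power(2)
cube <- make.power(3)
square(4) #returns 16
cube(4)   #returns 64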
1 http://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
Problems of Chapter 1
[Figure: three panels plotting log(x) against x, R/(1 − R) against R, and
log(R/(1 − R)) against R.]
Fig. C.2 The logit transformation is a double transformation. First, the odds
transformation (R/(1 − R)) unbounds the probabilities near 1; second, the logit
transformation (log(odds)) additionally unbounds the probabilities near 0.
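The three panels of this figure can be reproduced with a few lines of R; this
is a sketch, and the plotting ranges are assumptions based on the figure:
op <- par(mfrow = c(1, 3)) #three panels side by side
curve(log(x), from = 0.05, to = 6, xlab = "x", ylab = "log(x)")
R <- seq(0.01, 0.99, by = 0.01)
plot(R, R/(1 - R), type = "l", xlab = "R", ylab = "R/(1-R)")
plot(R, log(R/(1 - R)), type = "l", xlab = "R", ylab = "log(R/(1-R))")
par(op) #restore graphical parameters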
1.11
n <- 365
per.act.risk <- c(0.5, 1, 5, 6.5, 10, 30, 50, 67)/10000
risks <- 1-(1-per.act.risk)^n
risks
##label risks (optional)
act <- c("IOI", "ROI", "IPVI", "IAI", "RPVI", "PNS", "RAI", "IDU")
names(risks) <- act
risks
1.12 For this problem I put the following code into an ASCII text file named
job01.R:
n <- 365
per.act.risk <- c(0.5, 1, 5, 6.5, 10, 30, 50, 67)/10000
risks <- 1-(1-per.act.risk)^n
risks
##label risks (optional)
act <- c("IOI", "ROI", "IPVI", "IAI", "RPVI", "PNS", "RAI", "IDU")
names(risks) <- act
risks
Here is what happened when I sourced it (nothing was returned to the screen):
> source("/home/tja/Documents/courses/ph251d/jobs/job01.R")
Next, I sinked the output while sourcing the file, first without and then
with echo = TRUE:
sink("/home/tja/Documents/courses/ph251d/jobs/job01.log1a")
source("/home/tja/Documents/courses/ph251d/jobs/job01.R")
sink() #closes connection
sink("/home/tja/Documents/courses/ph251d/jobs/job01.log1b")
source("/home/tja/Documents/courses/ph251d/jobs/job01.R", echo = TRUE)
sink() #closes connection
The file job01.log1a is empty. Here are the contents of job01.log1b:
> n <- 365
> per.act.risk <- c(0.5, 1, 5, 6.5, 10, 30, 50, 67)/10000
> risks <- 1-(1-per.act.risk)^n
> risks
[1] 0.01808493 0.03584367 0.16685338 0.21126678 0.30593011
[6] 0.66601052 0.83951869 0.91402762
Conclusion: running the sink function sends what would normally go to the
screen to a log file.
1.14 Sourcing job02.R at the R command line looks like this:
> source("/home/tja/Documents/courses/ph251d/jobs/job02.R")
[1] 0.01808493 0.03584367 0.16685338 0.21126678 0.30593011
[6] 0.66601052 0.83951869 0.91402762
Conclusion: The source function, without echo = TRUE, will not return
anything to the screen unless the show (or print) function is used to “show”
an R object. This makes complete sense: if one is sourcing a file with
thousands of R expressions, we do not want to see all those expressions; we
only want to see selected data objects with relevant results. Sinking only
redirects anything that would appear on the screen to a log file.
Problems of Chapter 2
2.1 n/a
2.2 See Table 2.1 on page 28.
2.3 We can index by position, by logical, and by name—if it exists.
2.4 Any R object component(s) that can be indexed, can be replaced.
2.5 Study and practice the following R code.
tab <- matrix(c(139, 443, 230, 502), nrow = 2, ncol = 2,
dimnames = list("Vital Status" = c("Dead", "Alive"),
Smoking = c("Yes", "No")))
tab
# equivalent
tab <- matrix(c(139, 443, 230, 502), 2, 2)
dimnames(tab) <- list("Vital Status" = c("Dead", "Alive"),
Smoking = c("Yes", "No"))
tab
# equivalent
tab <- matrix(c(139, 443, 230, 502), 2, 2)
rownames(tab) <- c("Dead", "Alive")
colnames(tab) <- c("Yes", "No")
names(dimnames(tab)) <- c("Vital Status", "Smoking")
tab
2.6 Using the tab object from Solution 2.5, study and practice the following
R code to recreate Table 2.38 on page 103.
rowt <- apply(tab, 1, sum)
tab2 <- cbind(tab, Total = rowt)
colt <- apply(tab2, 2, sum)
tab2 <- rbind(tab2, Total = colt)
names(dimnames(tab2)) <- c("Vital Status", "Smoking")
tab2
2.7 Using the tab object from Solution 2.5, study and practice the following
R code to calculate row, column, and joint distributions.
# row distrib
rowt <- apply(tab, 1, sum)
rowd <- sweep(tab, 1, rowt, "/")
rowd
# col distrib
colt <- apply(tab, 2, sum)
cold <- sweep(tab, 2, colt, "/")
cold
# joint distrib
jtd <- tab/sum(tab); jtd
distr <- list(row.distribution = rowd,
col.distribution = cold,
joint.distribution = jtd)
distr
2.8 Using the tab2 object from Solution 2.6, study and practice the following
R code to recreate Table 2.39 on page 103. Note that the column distributions
from Solution 2.7 can also be used.
risk = tab2[1,1:2]/tab2[3,1:2]
risk.ratio <- risk/risk[2]
odds <- risk/(1-risk)
odds.ratio <- odds/odds[2]
ratios <- rbind(risk, risk.ratio, odds, odds.ratio)
ratios
Interpretation: The risk of death among non-smokers is higher than the risk
of death among smokers, suggesting that there may be some confounding.
2.9 Implement the analysis below.
wdat = read.table("http://www.medepi.net/data/whickham-engl.txt",
sep = ",", header = TRUE)
str(wdat)
wdat.vas = xtabs(~Vital.Status + Age + Smoking, data = wdat)
wdat.vas
wdat.tot.vas = apply(wdat.vas, c(2, 3), sum)
wdat.risk.vas = sweep(wdat.vas, c(2, 3), wdat.tot.vas, "/")
round(wdat.risk.vas, 2)
Here are the final results:
> round(wdat.risk.vas, 2)
, , Smoking = No
Age
Vital.Status 18-24 25-34 35-44 45-54 55-64 65-74 75+
Alive 0.98 0.97 0.94 0.85 0.67 0.22 0.00
Dead 0.02 0.03 0.06 0.15 0.33 0.78 1.00
, , Smoking = Yes
Age
# 1-D tables
tab.a <- apply(tab.ars, 1, sum); tab.a
tab.r <- apply(tab.ars, 2, sum); tab.r
tab.s <- apply(tab.ars, 3, sum); tab.s
2.12 For this example, we’ll work with a 3-D array.
tab.ars <- table(std$Age, std$Race, std$Sex)
# row distrib
rowt <- apply(tab.ars, c(1, 3), sum)
rowd <- sweep(tab.ars, c(1, 3), rowt, "/"); rowd
#confirm
apply(rowd, c(1, 3), sum)
# col distrib
colt <- apply(tab.ars, c(2, 3), sum)
cold <- sweep(tab.ars, c(2, 3), colt, "/"); cold
#confirm
apply(cold, c(2, 3), sum)
# joint distrib
jtt <- apply(tab.ars, 3, sum)
jtd <- sweep(tab.ars, 3, jtt, "/"); jtd
#confirm
apply(jtd, 3, sum)
Problems of Chapter 3
3.1 First, we recognize that this data frame contains aggregate-level data,
not individual-level data. Each row represents a unique covariate pattern,
and the last field is the frequency of that pattern. Because the data frame
only has a few rows, here is one way:
Status <- rep(c("Dead", "Survived"), 4)
Treatment <- rep(c("Tobutamide", "Tobutamide",
"Placebo", "Placebo"), 2)
Agegrp <- c(rep("<55", 4), rep("55+", 4))
Freq <- c(8, 98, 5, 115, 22, 76, 16, 69)
dat <- data.frame(Status, Treatment, Agegrp, Freq)
dat
An alternative, and better, way is to create an array that reproduces the
core data from Table 3.1 on page 107. Then we use the data.frame and
as.table functions. Here we show a few ways to create this array object.
#answer 1
udat <- array(c(8, 98, 5, 115, 22, 76, 16, 69), dim =
c(2, 2, 2),
dimnames = list(Status = c("Dead", "Survived"),
Treatment = c("Tolbutamide", "Placebo"),
Agegrp = c("<55", "55+")))
dat <- data.frame(as.table(udat))
dat
#answer 2
Status <- rep(c("Dead", "Survived"), 4)
Treatment <- rep(rep(c("Tolbutamide", "Placebo"),
c(2, 2)), 2)
Agegrp <- rep(c("<55", "55+"), c(4, 4))
Freq <- c(8, 98, 5, 115, 22, 76, 16, 69)
dat <- data.frame(Status, Treatment, Agegrp, Freq)
dat
table(edat$cat2)
#
table(edat$smk)
edat$smk2 <- factor(edat$smk, levels = 0:1,
labels = c("Never", "Ever"))
table(edat$smk2)
#
table(edat$ecg)
edat$ecg2 <- factor(edat$ecg, levels = 0:1,
labels = c("Normal", "Abnormal"))
table(edat$ecg2)
#
table(edat$hpt)
edat$hpt2 <- factor(edat$hpt, levels = 0:1,
labels = c("No", "Yes"))
table(edat$hpt2)
Answer to (b):
quantile(edat$age)
edat$age4 <- cut(edat$age, quantile(edat$age),
right = FALSE, include.lowest = TRUE)
table(edat$age4)
Answer to (c):
hptnew <- rep(NA, nrow(edat))
normal <- edat$sbp<120 & edat$dbp<80
hptnew[normal] <- 1
prehyp <- (edat$sbp>=120 & edat$sbp<140) |
(edat$dbp>=80 & edat$dbp<90)
hptnew[prehyp] <- 2
stage1 <- (edat$sbp>=140 & edat$sbp<160) |
(edat$dbp>=90 & edat$dbp<100)
hptnew[stage1] <- 3
stage2 <- edat$sbp>=160 | edat$dbp>=100
hptnew[stage2] <- 4
edat$hpt4 <- factor(hptnew, levels=1:4,
labels=c("Normal", "PreHTN", "HTN.Stage1", "HTN.Stage2"))
table(edat$hpt4)
Answer to (d):
table("Old HTN"=edat$hpt2, "New HTN"=edat$hpt4)
3.7
wdat <- read.table("http://www.medepi.net/data/wnv/wnv2004raw.txt",
header=TRUE, sep=",", as.is=TRUE, na.strings=c(".","Unknown"))
str(wdat)
wdat$date.onset2 <- as.Date(wdat$date.onset, format="%m/%d/%Y")
wdat$date.tested2 <- as.Date(wdat$date.tested, format="%m/%d/%Y")
write.table(wdat, "c:/temp/wnvdat.txt", sep=",", row.names=FALSE)
3.8 See Appendix A.2 on page 244 for Oswego data dictionary.
a. Using RStudio plot the cases by time of onset of illness (include appropriate
labels and title). What does this graph tell you? (Hint: Process the text
data and then use the hist function.)
Plotting an epidemic curve with these data presents special challenges because
we have dates and times to process. To do this in R, we will create date
objects that contain both the date and time for each primary event of
interest: meal time and onset time of illness. From these we can plot the
distribution of onset times (the epidemic curve). An epidemic curve is the
distribution of illness onset times and can be displayed with a histogram.
First, carefully study the Oswego data set at
http://www.medepi.net/data/oswego.txt. We need to do some data preparation
in order to work with dates and times. Our initial goal is to get the
date/time data into a form that can be passed to R’s strptime function for
conversion into a date-time R object. To construct the epidemic curve, study
and implement the R code that follows:
odat <- read.table("http://www.medepi.net/data/oswego.txt",
sep = "", header = TRUE, na.strings = ".")
str(odat)
head(odat)
## create vector with meal date and time
mdt <- paste("4/18/1940", odat$meal.time)
## convert into standard date and time
meal.dt <- strptime(mdt, "%m/%d/%Y %I:%M %p")
## create vector with onset date and time
odt <- paste(paste(odat$onset.date,"/1940",sep = ""), odat$onset.time)
## convert into standard date and time
onset.dt <- strptime(odt, "%m/%d/%Y %I:%M %p")
hist(onset.dt, breaks = 30, freq = TRUE)
b. Are there any cases for which the times of onset are inconsistent with the
general experience? How might they be explained?
Now that we have our data frame in R, we can identify the subjects who
correspond to the minimum and maximum onset times. We will implement R
code that can be interpreted as “which positions in vector Y correspond to
the minimum value of Y?” We then use these position numbers to index
the corresponding rows in the data frame.
##Generate logical vectors and identify ’which’ positions
min.obs.pos <- which(onset.dt==min(onset.dt, na.rm=TRUE))
min.obs.pos
max.obs.pos <- which(onset.dt==max(onset.dt, na.rm=TRUE))
max.obs.pos
##Display the corresponding rows of the data frame
odat[min.obs.pos, ]
odat[max.obs.pos, ]
We can sort the data frame based on the values of one or more fields. Suppose
we want to sort on illness status and illness onset times. We will use the
onset.dt vector we created earlier; however, we will need to convert it to
“continuous time” in seconds to sort by this vector. Study and implement the
R code below.
onset.ct <- as.POSIXct(onset.dt)
odat2 <- odat[order(odat$ill, onset.ct), ]
odat2
d. Where possible, calculate incubation periods and illustrate their distribu-
tion with an appropriate graph. Use the truehist function in the MASS
package. Determine the mean, median, and range of the incubation period.
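Here is a sketch of one possible approach, assuming the meal.dt and onset.dt
objects created in part (a) are still available:
## incubation period in hours for each person
incub.hrs <- as.numeric(difftime(onset.dt, meal.dt, units = "hours"))
incub.hrs <- incub.hrs[!is.na(incub.hrs)] #keep computable periods only
library(MASS) #provides the truehist function
truehist(incub.hrs, xlab = "Incubation period (hours)")
mean(incub.hrs)
median(incub.hrs)
range(incub.hrs)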
References
15. Kleinbaum DG, Klein M, Pryor ER. Logistic Regression: A Self-Learning Text.
2nd ed. Springer; 2002.