R Introduction
Josep L. Carrasco
Bioestadística. Departament de Fonaments Clínics
Universitat de Barcelona
First steps
R is a programming language and free software environment for statistical computing and
graphics. Base and contributed packages can be installed from the CRAN (Comprehensive R
Archive Network) website (http://cran.r-project.org/).
R has a command-line interface, though it is more common to use a graphical user interface
such as RStudio (https://www.rstudio.com/).
Once R and RStudio have been set up, let’s open RStudio and create an R script file.
In this file we can write the code to run in a session and save it.
To embed comments in the code, use the symbol #.
Working environment
R always uses a directory on your computer to read and/or save data and objects. To know
which folder is used by default use the function getwd().
To change the working directory use the function setwd().
One appealing option in RStudio is Session –> Set Working Directory –> To Source File Location
from the menu bar. The working directory will be set to the folder where the R script file is
located.
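As a minimal sketch of getwd() and setwd(), the following lines switch the working directory and then restore it. The folder used here is R's temporary directory, an assumption chosen only so that the example runs on any machine.

```r
old_wd <- getwd()      # remember the current working directory
setwd(tempdir())       # switch to R's per-session temporary folder
new_wd <- getwd()      # now reports the temporary folder
new_wd
setwd(old_wd)          # restore the original working directory
```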
Mathematical functions
R can be used as a calculator using the built-in mathematical operators and functions.
The following table shows the main operators and functions.
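A few of R's standard arithmetic operators and functions, as a short illustrative sketch. The values are arbitrary and the selection is not exhaustive.

```r
7 + 2       # addition: 9
7 - 2       # subtraction: 5
7 * 2       # multiplication: 14
7 / 2       # division: 3.5
7 ^ 2       # power: 49
7 %/% 2     # integer division: 3
7 %% 2      # remainder of the integer division: 1
sqrt(49)    # square root: 7
exp(0)      # exponential: 1
log(1)      # natural logarithm: 0
abs(-7)     # absolute value: 7
```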
To round results one could opt for changing the global options of the R session using the
function options and its argument digits. For example, options(digits=10) would print all
the numerical outputs with up to 10 significant digits.
However, the common choice is just to round a single result. The next table gives
some options for doing that.
Operation                 Operator/Function    Example
Round to upper integer    ceiling(x)           ceiling(5.37)
Round to lower integer    floor(x)             floor(5.37)
Truncate                  trunc(x)             trunc(5.37)
Rounding                  round(x,digits)      round(5.37,digits=1)
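The examples in the table can be run directly. The sketch below adds a negative value (an illustrative addition) to show where floor and trunc differ.

```r
ceiling(5.37)            # 6
floor(5.37)              # 5
trunc(5.37)              # 5
round(5.37, digits = 1)  # 5.4

# With negative numbers floor and trunc give different results
floor(-5.37)             # -6 (rounds towards minus infinity)
trunc(-5.37)             # -5 (drops the decimal part)
```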
Assignment
x=10
x
[1] 10
y<-12
y
[1] 12
z<-x+y
z
[1] 22
w<-log(z)
w
[1] 3.091042
class(w)
[1] "numeric"
Next, let’s see the main classes for data analysis: vectors, matrices, data frames and lists.
Vectors
A vector is a set of values that can be numeric or character. To create a vector use the
function c.
The following table shows the main operators and functions that apply to vectors.
x=c(5,7,9,13,-4,8)
x
[1] 5 7 9 13 -4 8
class(x)
[1] "numeric"
y=c("A","B","C")
y
class(y)
[1] "character"
Notice that character values must be enclosed between quotation marks. Furthermore, all
values must be of the same class (numeric or character). If one character value is combined
with numeric values then all values are converted to character.
z<-c(x,y)
z
[1] "5" "7" "9" "13" "-4" "8" "A" "B" "C"
class(z)
[1] "character"
The function seq creates a sequence vector. We just need to specify the starting and ending
values, and the increment. If the increment is equal to 1 we could use the
shorthand colon operator (:).
[1] 1 3 5 7 9
[1] 1 2 3 4 5 6 7 8 9 10
1:10
[1] 1 2 3 4 5 6 7 8 9 10
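A short sketch of seq and the colon operator. The argument values are illustrative assumptions, chosen so that the results match sequences like those printed above.

```r
s1 <- seq(1, 9, by = 2)          # from 1 to 9 in steps of 2
s1                               # 1 3 5 7 9
s2 <- seq(1, 10, by = 1)         # from 1 to 10 in steps of 1
s2                               # 1 2 3 4 5 6 7 8 9 10
s3 <- 1:10                       # colon shorthand for an increment of 1
s4 <- seq(0, 1, length.out = 5)  # fix the number of values instead of the step
s4                               # 0.00 0.25 0.50 0.75 1.00
```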
Additionally, it could be interesting to replicate a vector several times. In that case the function
to use is rep. Notice the different result when the argument each is used.
rep(1,5)
[1] 1 1 1 1 1
z=c(1,2)
z
[1] 1 2
rep(z,5)
[1] 1 2 1 2 1 2 1 2 1 2
rep(z,each=5)
[1] 1 1 1 1 1 2 2 2 2 2
Vector indexing refers to the position of values in the vector. It is needed
to extract a single value or a set of values from the vector. The position is
indicated using square brackets.
[1] 7
[1] 7 9 13
[1] 5 9
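A minimal sketch of bracket indexing, using the numeric vector defined earlier in this section. The specific index expressions are assumptions, chosen to be consistent with the results printed above.

```r
x <- c(5, 7, 9, 13, -4, 8)
x[2]          # second value: 7
x[2:4]        # values in positions 2 to 4: 7 9 13
x[c(1, 3)]    # values in positions 1 and 3: 5 9
x[-1]         # a negative index drops that position: 7 9 13 -4 8
```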
The same operators and functions shown for single values also apply to vectors. The operations
are element-wise, i.e. they are applied element by element. Additionally, the symbol %*%
produces the matrix product. Here are some examples:
x<-c(1,3,4,7)
y=1:4
x+2 # Add a constant
[1] 3 5 6 9
x*2 # Product with a constant
[1] 2 6 8 14
x^2 # Power
[1] 1 9 16 49
[1] 2 5 7 11
[1] 1 6 12 28
[,1]
[1,] 47
[1] 15
[1] 84
[1] 4
[1] 1 3 4 7
order(x) # Gives the original locations of the sorted vector
[1] 1 2 3 4
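The sketch below collects common operations between two vectors and a few summary functions. Which functions produced each of the results printed above is an assumption; the calls here are standard base R functions whose results are consistent with them. The last vector, v, is a hypothetical unsorted example added to make order easier to read.

```r
x <- c(1, 3, 4, 7)
y <- 1:4
x + y        # element-wise sum: 2 5 7 11
x * y        # element-wise product: 1 6 12 28
x %*% y      # matrix (inner) product: a 1x1 matrix containing 47
sum(x)       # sum of the values: 15
prod(x)      # product of the values: 84
length(x)    # number of values: 4
sort(x)      # sorted values: 1 3 4 7

v <- c(7, 1, 4, 3)   # hypothetical unsorted vector
order(v)             # positions of the sorted values: 2 4 3 1
v[order(v)]          # same result as sort(v): 1 3 4 7
```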
Logical operators are useful to check which elements of the vector meet some condition.
The following table shows the logical operators.
Notice that the equality operator is a double equal sign. Remember that a single equal
sign is used to assign values to objects.
The result of a logical operation is a logical vector with the same length as the
vector used in the evaluation. The values of a logical vector are TRUE where
the value meets the condition, and FALSE otherwise.
Let’s see some examples.
x<-c(1,3,4,7,10)
x==4 # which values are equal to 4
x<=4 # which values are less than or equal to 4
Logical conditions also apply to character vectors. However, in that case only the operators
equal and not equal make sense.
z<-c("A","B","C","D","E")
z=="C" # which values are equal to C
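The following sketch runs the conditions above and shows two common uses of the resulting logical vector: extracting the values that meet the condition and counting them. The calls to which and sum are illustrative additions.

```r
x <- c(1, 3, 4, 7, 10)
x == 4        # FALSE FALSE  TRUE FALSE FALSE
x <= 4        #  TRUE  TRUE  TRUE FALSE FALSE
x != 4        #  TRUE  TRUE FALSE  TRUE  TRUE
x[x <= 4]     # values that meet the condition: 1 3 4
which(x <= 4) # positions that meet the condition: 1 2 3
sum(x <= 4)   # number of values that meet the condition: 3

z <- c("A", "B", "C", "D", "E")
z == "C"      # FALSE FALSE  TRUE FALSE FALSE
```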
Matrices
In R matrices are objects that store values in a two-dimensional array (rows and columns).
The generic function to create a matrix is matrix(). The main arguments are:
x<-c(1,3,4,7,10,12)
A=matrix(x,nrow=2,ncol=3)
A
A=matrix(x,nrow=2,ncol=3,byrow=T) # Notice the matrix is filled now by rows.
A
Matrices can also be created by concatenating vectors using the functions cbind (concatenate
by columns) and rbind (concatenate by rows).
x<-c(1,3,4,7,10,12)
y<-1:6
cbind(x,y)
x y
[1,] 1 1
[2,] 3 2
[3,] 4 3
[4,] 7 4
[5,] 10 5
[6,] 12 6
rbind(x,y)
One important issue to bear in mind is that all elements in a matrix have to be of the same class
(numeric or character). Let’s see what happens if a numeric vector and a character vector are
concatenated to create a matrix.
x<-c(1,3,4,7,10,12)
z<-c("A","B","C","D","E","F")
cbind(x,z)
x z
[1,] "1" "A"
[2,] "3" "B"
[3,] "4" "C"
[4,] "7" "D"
[5,] "10" "E"
[6,] "12" "F"
R automatically converts the numeric vector to character values.
Matrix indexing involves two values because of the two-dimensionality of the object. The
two values refer to the row and column number respectively, and they are separated
by a comma.
x<-c(1,3,4,7,10,12)
y<-1:6
A<-cbind(x,y)
A
x y
[1,] 1 1
[2,] 3 2
[3,] 4 3
[4,] 7 4
[5,] 10 5
[6,] 12 6
y
1
x
3
x y
1 1
[1] 1 3 4 7 10 12
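A minimal sketch of matrix indexing with the matrix A built above. The index expressions are assumptions, chosen to be consistent with the results printed above; leaving one side of the comma empty selects a whole row or column.

```r
x <- c(1, 3, 4, 7, 10, 12)
y <- 1:6
A <- cbind(x, y)
A[1, 2]     # row 1, column 2: the value 1 (named y)
A[2, 1]     # row 2, column 1: the value 3 (named x)
A[1, ]      # whole first row: x = 1, y = 1
A[, 1]      # whole first column: 1 3 4 7 10 12
A[, "y"]    # columns can also be selected by name: 1 2 3 4 5 6
```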
Matrix operations and functions
The same operators used for vectors apply to matrices. Furthermore, specific functions for
matrices are:
Some examples:
x<-c(1,3,4,7,10,12)
A<-matrix(x,nrow=2)
A-A # Element-wise subtraction of matrices
B=t(A) # Transpose
B
[,1] [,2]
[1,] 1 3
[2,] 4 7
[3,] 10 12
C=A%*%B # Product
C
[,1] [,2]
[1,] 117 151
[2,] 151 202
diag(C) # Diagonal
det(C) # Determinant
[1] 833
solve(C) #Inverse
[,1] [,2]
[1,] 0.2424970 -0.1812725
[2,] -0.1812725 0.1404562
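As a quick check of the inverse, the product of a matrix and its inverse should be the identity matrix, up to floating-point rounding. The sketch below rebuilds C from the same vector; the use of round and of solve with a right-hand side are illustrative additions.

```r
x <- c(1, 3, 4, 7, 10, 12)
A <- matrix(x, nrow = 2)
C <- A %*% t(A)              # 2x2 matrix: 117 151 / 151 202
C_inv <- solve(C)            # inverse of C
round(C %*% C_inv, 10)       # identity matrix after rounding
first_col <- solve(C, c(1, 0))   # solves C %*% b = (1, 0): first column of the inverse
first_col
```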
Lists
Lists group different R objects into a one-dimensional set. Actually, a list is an ordered
collection of objects. Thus, lists are useful to combine different objects into a single object.
For example, let’s create a list containing two vectors. The first is a numeric vector whilst
the second is a logical vector created after applying a logical condition to the first vector.
x<-c(1,3,4,7,10,12)
z<-x>=4
w<-list(x,z)
w
[[1]]
[1] 1 3 4 7 10 12
[[2]]
[1] FALSE FALSE TRUE TRUE TRUE TRUE
List objects index their elements using double brackets. Thus, the way to recover one
element of the list is:
w[[1]]
[1] 1 3 4 7 10 12
w[[2]]
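List elements can also be given names, which allows them to be recovered by name. In the sketch below the names values and flags are hypothetical, introduced only for illustration.

```r
x <- c(1, 3, 4, 7, 10, 12)
w <- list(values = x, flags = x >= 4)   # named elements (hypothetical names)
names(w)         # "values" "flags"
w$values         # same as w[[1]]
w[["flags"]]     # same as w[[2]]
w[1]             # single brackets return a list with one element, not the vector
```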
Some functions use list objects as arguments. This is the case of the dimnames argument in
the matrix function. This argument gives names to the rows and columns respectively.
x<-c(1,3,4,7,10,12)
A=matrix(x,nrow=2,ncol=3, dimnames=list(c("A","B"),c("1","2","3")))
A
1 2 3
A 1 4 10
B 3 7 12
Data frames
Data frame objects are matrices with further properties. The main difference stems from the
fact that data frames allow different types of data (quantitative and qualitative) whilst only
one kind of data is possible with matrices.
First, let’s create two vectors: one quantitative and one qualitative. Let’s suppose that these
vectors contain the age and gender of eight subjects. Notice that character values are between
quotation marks.
x<-c(65,34,44,25,38,41,59,21)
y<-c("M","F","F","M","F","M","M","F")
To create a data frame with these two vectors use the function data.frame.
dades=data.frame(x,y)
dades
x y
1 65 M
2 34 F
3 44 F
4 25 M
5 38 F
6 41 M
7 59 M
8 21 F
Names
The current names of the columns (variables) are “x” and “y”. We may change that by
applying the function names to the data frame.
It is convenient to use short names for a better handling of data. Additionally, accents and
special characters should be avoided. For example, let’s name them as “Age” and “Gender”.
names(dades)<-c("Age","Gender")
dades
Age Gender
1 65 M
2 34 F
3 44 F
4 25 M
5 38 F
6 41 M
7 59 M
8 21 F
Indexing
To extract rows or columns from a data frame we could proceed applying the matrices
indexing. For example, let’s extract the “Age” column which is the first column.
dades[,1]
[1] 65 34 44 25 38 41 59 21
However, data frames have further properties than matrices. One of them is the names of
the variables. Using this property it is possible to extract a column by using the symbol “$”
and its name, or by giving its name inside the brackets.
dades$Age
[1] 65 34 44 25 38 41 59 21
dades[,"Age"]
[1] 65 34 44 25 38 41 59 21
Additionally, we could select the data from males (“M”) by applying a logical condition to
the rows. The result is a new data frame that only contains the rows that meet the condition.
dades[dades$Gender=="M",]
Age Gender
1 65 M
4 25 M
6 41 M
7 59 M
subset(dades,Gender=="M")
Age Gender
1 65 M
4 25 M
6 41 M
7 59 M
Factors
Factors are variables that contain qualitative information. They can be created by applying
the factor function to a vector. In R versions before 4.0.0, quoted (character) values were
also converted to factors by default when placed in a data frame.
Following with the example of the eight subjects, suppose now that we recorded a binary
variable indicating whether the subjects were exposed to a risk factor (smoking, for instance)
or not. Additionally, this variable is coded as 0 or 1 meaning “No” and “Yes” respectively.
sm<-c(1,0,0,0,1,0,0,1)
sm
[1] 1 0 0 0 1 0 0 1
factor(sm)
[1] 1 0 0 0 1 0 0 1
Levels: 0 1
To transform sm into a factor, assign the result back to sm.
sm<-factor(sm)
sm
[1] 1 0 0 0 1 0 0 1
Levels: 0 1
To add the new variable to the existing data frame it is possible to use the function cbind.
cbind(dades,sm)
Age Gender sm
1 65 M 1
2 34 F 0
3 44 F 0
4 25 M 0
5 38 F 1
6 41 M 0
7 59 M 0
8 21 F 1
Reordering the factor levels can be convenient to change the appearance of outputs. One easy
way to reorder the levels is using the argument levels in the factor function. We have to use
a vector with the ordered values.
sm<-factor(sm,levels=c("1","0"))
sm
[1] 1 0 0 0 1 0 0 1
Levels: 1 0
There are different ways to change the names of the levels. Perhaps one of the most appealing
ways is using the function recode from the dplyr package.
To do that you first need to install the package, using either the function install.packages() or
the Packages tab in RStudio.
install.packages("dplyr")
After that the package needs to be loaded in memory using the library() function.
library(dplyr)
sm<-recode(sm,"0"="No","1"="Yes")
sm
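As an alternative that needs no extra package, the levels can be ordered and renamed in a single step with the levels and labels arguments of the base function factor. This is a sketch of a different technique from the recode call above, not a description of how the output above was produced; the object name sm_f is hypothetical.

```r
sm <- c(1, 0, 0, 0, 1, 0, 0, 1)
# levels gives the order of the original codes; labels gives their new names
sm_f <- factor(sm, levels = c(1, 0), labels = c("Yes", "No"))
sm_f           # Yes No No No Yes No No Yes, with levels Yes No
levels(sm_f)   # "Yes" "No"
table(sm_f)    # counts per level: Yes 3, No 5
```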
To know if a variable is already a factor we just need to apply the function is.factor or class().
is.factor(sm)
[1] TRUE
class(sm)
[1] "factor"
Finally, to add this vector to our data frame we could apply the function cbind, or assign it
to a new column of the data frame using the symbol “$”.
cbind(dades,sm)
Age Gender sm
1 65 M Yes
2 34 F No
3 44 F No
4 25 M No
5 38 F Yes
6 41 M No
7 59 M No
8 21 F Yes
dades$SM<-sm
dades
Age Gender SM
1 65 M Yes
2 34 F No
3 44 F No
4 25 M No
5 38 F Yes
6 41 M No
7 59 M No
8 21 F Yes
Notice that with the last option we could even change the name of the variable in the data
frame.
Split
On some occasions it can be of interest to select a subset of data or to divide a data frame
according to a condition. We have seen that option when applying a logical condition on the
row index. For example, let’s divide the data frame into smokers and non-smokers.
dades.Y<-dades[dades$SM=="Yes",]
dades.Y
Age Gender SM
1 65 M Yes
5 38 F Yes
8 21 F Yes
dades.N<-dades[dades$SM=="No",]
dades.N
Age Gender SM
2 34 F No
3 44 F No
4 25 M No
6 41 M No
7 59 M No
dades.SM<-split(dades,dades$SM)
dades.SM
$Yes
Age Gender SM
1 65 M Yes
5 38 F Yes
8 21 F Yes
$No
Age Gender SM
2 34 F No
3 44 F No
4 25 M No
6 41 M No
7 59 M No
The new object dades.SM is a list that contains the two data frames. To recover the data
frames that are within the list we may use the double brackets or just the symbol “$” and
the name of the element.
dades.SM[[1]]
Age Gender SM
1 65 M Yes
5 38 F Yes
8 21 F Yes
dades.SM$Yes
Age Gender SM
1 65 M Yes
5 38 F Yes
8 21 F Yes
dades.SM[[2]]
Age Gender SM
2 34 F No
3 44 F No
4 25 M No
6 41 M No
7 59 M No
dades.SM$No
Age Gender SM
2 34 F No
3 44 F No
4 25 M No
6 41 M No
7 59 M No
Let’s suppose that we want to create a new variable that classifies the subject’s age into three
groups as:
That means categorizing the quantitative variable into intervals. The appropriate function to
do that is cut(x,breaks,labels,right). The arguments are:
• breaks. Numeric vector with the cut points. They should include the minimum and
maximum values in data.
• labels. Labels for the levels of the resulting category. By default, labels are constructed
using “(a,b]” interval notation.
• right. Logical. TRUE indicates that the intervals should be closed on the right (and open
on the left); FALSE indicates the contrary.
cut(dades$Age,breaks=c(0,39,64,100),labels=c("G1","G2","G3"),right=T)
[1] G3 G1 G2 G1 G1 G2 G2 G1
Levels: G1 G2 G3
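The effect of the right argument is easiest to see with values that fall exactly on the cut points. In the sketch below the ages are hypothetical boundary values, and the second set of breaks is an assumption chosen so that both calls define the same three groups.

```r
age <- c(39, 40, 64, 65)   # hypothetical ages on the group boundaries

# Intervals closed on the right: (0,39], (39,64], (64,100]
g_right <- cut(age, breaks = c(0, 39, 64, 100),
               labels = c("G1", "G2", "G3"), right = TRUE)
g_right                    # G1 G2 G2 G3

# Intervals closed on the left: [0,40), [40,65), [65,100)
g_left <- cut(age, breaks = c(0, 40, 65, 100),
              labels = c("G1", "G2", "G3"), right = FALSE)
g_left                     # G1 G2 G2 G3

# Without labels, the levels use interval notation
g_default <- cut(age, breaks = c(0, 39, 64, 100))
levels(g_default)          # "(0,39]" "(39,64]" "(64,100]"
```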
However, when categorization involves just two groups it could be simpler to just apply a
logical condition. For example, if the categorization was
• Group 1: lower than 40
dades$Age>=40
Now, TRUE values would stand for those subjects with an age greater than or equal to 40.
There are multiple formats to save and load data in R. Here we only focus on the text (ASCII)
format, which is probably the most widely used and the simplest.
To save the data frame in text format we will use the write.table function. The main arguments
are:
write.table(dades,"dades.txt",quote=F,sep="\t")
Load data
Use the read.table function to read data from a text file. Some of the arguments are the same as
those of the write.table function. However, read.table has further arguments; some of them are:
• header. If TRUE the file contains the names of the variables as its first line.
• nrows. The maximum number of rows to read in.
• skip. The number of lines of the data file to skip before beginning to read data.
dades<-read.table("dades.txt",header=T,sep="\t")
dades
Age Gender SM
1 65 M Yes
2 34 F No
3 44 F No
4 25 M No
5 38 F Yes
6 41 M No
7 59 M No
8 21 F Yes
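A self-contained round trip, as a sketch: a small data frame is written to a tab-separated text file and read back. The file is created with tempfile() so nothing is left in the working directory, and row.names=FALSE and stringsAsFactors=FALSE are choices made here for a clean comparison; neither is part of the calls shown above. The object names df, f and back are hypothetical.

```r
df <- data.frame(Age = c(65, 34, 44, 25),
                 Gender = c("M", "F", "F", "M"),
                 stringsAsFactors = FALSE)

f <- tempfile(fileext = ".txt")        # temporary file path
write.table(df, f, quote = FALSE, sep = "\t", row.names = FALSE)
readLines(f, n = 2)                    # first lines of the file: header and first row

back <- read.table(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
back                                   # same variables and values as df
```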
fib=read.table("fibrinol.txt",header=T)
head(fib) # This function is used to show the first rows
Qualitative data
First, we have to define as factors those variables that are coded as numeric and that we will
use in the analyses.
Let’s do it with the fever and gender variables. Furthermore, we will recode these variables with
new values.
library(dplyr)
fib$fiebre<-factor(recode(fib$fiebre,"1"="Yes","2"="No"))
fib$sexo<-factor(recode(fib$sexo,"1"="M","2"="F"))
We are also going to use the size of the pleural effusion, whose values (1,2,3) mean small,
medium and large.
fib$tamano<-factor(recode(fib$tamano,"1"="Small","2"="Medium","3"="Large"),
levels=c("Small","Medium","Large"),ordered=T)
Notice that we use the argument ordered to indicate that the levels follow a hierarchical
order.
Frequency table
The basic summary for qualitative data is the count. For example, the fiebre variable indicates
whether a subject had fever or not. Using the function table we obtain the counts of this variable.
Proportions are obtained by applying the prop.table function to the object generated by table.
tab<-table(fib$fiebre) # Counts
tab
No Yes
49 51
prop.table(tab)
No Yes
0.49 0.51
prop.table(tab)*100 #Percentage
No Yes
49 51
Contingency table
A contingency table is used to describe the frequencies of two qualitative variables at the
same time. To do that we still apply the function table.
tab2<-table(fib$tamano,fib$sexo)
tab2
F M
Small 13 17
Medium 23 24
Large 9 14
Gender
Size F M
Small 13 17
Medium 23 24
Large 9 14
Furthermore, once the table object has been created we could add the marginals of the table
(rows and columns total counts).
addmargins(tab2)
Gender
Size F M Sum
Small 13 17 30
Medium 23 24 47
Large 9 14 23
Sum 45 55 100
Now, three proportions are possible depending on the total used: rows, columns, or total
data.
Gender
Size F M
Small 0.13 0.17
Medium 0.23 0.24
Large 0.09 0.14
prop.table(tab2,margin=1) # Proportions to total row
Gender
Size F M
Small 0.4333333 0.5666667
Medium 0.4893617 0.5106383
Large 0.3913043 0.6086957
Gender
Size F M
Small 0.2888889 0.3090909
Medium 0.5111111 0.4363636
Large 0.2000000 0.2545455
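The three kinds of proportions can be reproduced without the data file by rebuilding the table from the counts printed above. This is a sketch: constructing the table with matrix and as.table is an assumption made only so that the example is self-contained.

```r
counts <- matrix(c(13, 23, 9, 17, 24, 14), nrow = 3,
                 dimnames = list(Size = c("Small", "Medium", "Large"),
                                 Gender = c("F", "M")))
tab2 <- as.table(counts)

p_total <- prop.table(tab2)               # proportions of the grand total
p_row   <- prop.table(tab2, margin = 1)   # proportions of each row total
p_col   <- prop.table(tab2, margin = 2)   # proportions of each column total

p_total["Small", "F"]   # 13/100 = 0.13
p_row["Small", "F"]     # 13/30  = 0.4333333
p_col["Small", "F"]     # 13/45  = 0.2888889
```

Each set of proportions sums to 1 over the margin it conditions on: the whole table for p_total, each row for p_row, and each column for p_col.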
Quantiles
A quantile is a value that accumulates a certain proportion of the ordered data.
The most used quantiles are the quartiles, which split the data into quarters, and the percentiles,
which accumulate a specific percentage of the data.
There are three quartiles:
Quantiles are only computed with ordered data, i.e. quantitative data or qualitative data
with order. The way to compute them changes somewhat depending on the type of data
(quantitative or qualitative) and it must be specified.
Following the example, the fever variable has no order. So, it is meaningless
to compute its quantiles. However, we can find a qualitative variable with order. The tamano
variable indicates the size of the pleural effusion on an ordinal scale: small, medium and large.
table(fib$tamano)
prop.table(table(fib$tamano))
Let’s compute the main quantiles. The function quantile considers the data as numeric by
default. This can be changed by using the argument type so that the quantiles are computed
using the approach for qualitative data.
Percentile 0 is the minimum value of the data, which in this case is a “Small” value. The first
quartile is “Small”, which means that “Small” values accumulate at least 25% of the data. The
median (percentile 50) is a “Medium” value, since “Small” values account for only 30% of the
data. The third quartile (percentile 75) is also a “Medium” value.
Finally, the maximum value in data (percentile 100) is a “Large” value.
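A sketch of quantiles for an ordered factor. The data file is not used; instead an ordered factor is rebuilt from the counts in the contingency table margins (30 Small, 47 Medium, 23 Large), and the object name size is hypothetical. For ordered factors, quantile accepts only type=1 or type=3; whether the original call used type=1 is an assumption.

```r
size <- factor(rep(c("Small", "Medium", "Large"), times = c(30, 47, 23)),
               levels = c("Small", "Medium", "Large"), ordered = TRUE)

# type = 1 uses the inverse of the empirical distribution function,
# so the result is always one of the observed categories
q <- quantile(size, probs = c(0, 0.25, 0.5, 0.75, 1), type = 1)
q   # 0%: Small, 25%: Small, 50%: Medium, 75%: Medium, 100%: Large

cumsum(prop.table(table(size)))   # cumulative proportions: 0.30 0.77 1.00
```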
Plots
Plots for qualitative data are usually based on the counts. The most used are the bar chart and
the pie chart.
Some common arguments in plot functions:
• Bar plot
tab<-table(fib$tamano)
barplot(tab,ylim=c(0,60),col="red")
barplot(tab2,ylim=c(0,40),col=c("blue","red","green"),beside=T,
legend=rownames(tab2),xlab="Gender",
main="Size by Gender",args.legend=list(cex=0.7))
[Figure: grouped bar chart “Size by Gender”; x-axis: Gender (F, M); legend: Small, Medium, Large.]
• Pie chart
pie(tab)
[Figure: pie chart with sectors Small, Medium and Large.]
Quantitative data
The main statistics to describe quantitative data can be classified as:
• Central tendency. They inform about where the distribution of values is centered. Most
used are: mean and median.
• Position. They give information about where the distribution of values is located. We
already have seen these statistics in the qualitative data case: the quantiles.
• Dispersion. They are useful to know how different the values are. Low dispersion
would mean that values are mainly concentrated around similar values. On the other
hand, large dispersion would imply that the values are widely spread.
Quick summary
The summary() function computes the following statistics: minimum, quartiles, mean and
maximum.
summary(fib$edad)
Central tendency
mean(fib$edad)
[1] 57.72
median(fib$edad)
[1] 60
Position
quantile(fib$edad)
min(fib$edad)
[1] 18
max(fib$edad)
[1] 94
Dispersion
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
var(fib$edad)
[1] 414.143
sd(fib$edad)
[1] 20.3505
(sd(fib$edad)/mean(fib$edad))*100
[1] 35.25728
IQR(fib$edad)
[1] 34.5
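One detail worth keeping in mind: R's var() and sd() functions compute the sample versions, which divide by n − 1 rather than by n. The sketch below compares both divisors on a small hypothetical vector (not the fib data).

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)            # hypothetical values, mean 5
n <- length(x)

pop_var <- sum((x - mean(x))^2) / n       # divisor n: 32/8 = 4
smp_var <- var(x)                         # divisor n - 1: 32/7 = 4.571429
smp_var * (n - 1) / n                     # var() rescaled to divisor n: 4

sqrt(smp_var)                             # identical to sd(x): 2.13809
cv <- sd(x) / mean(x) * 100               # coefficient of variation, in percent
cv
```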
Summarising by a factor
1) Split the data frame by the factor and summarize the variable at each subset.
2) Use the by function.
by(fib$edad,fib$sexo,mean)
fib$sexo: F
[1] 59.6
------------------------------------------------------------
fib$sexo: M
[1] 56.18182
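Both options can be sketched on the ages and genders of the eight subjects used in the data frame examples earlier. The object names are hypothetical, and tapply is shown as a further base R alternative that is not mentioned above.

```r
age <- c(65, 34, 44, 25, 38, 41, 59, 21)
gender <- c("M", "F", "F", "M", "F", "M", "M", "F")

# Option 1: split the variable by the factor, then summarise each subset
groups <- split(age, gender)      # list with one vector per gender
m1 <- sapply(groups, mean)        # F: 34.25, M: 47.5
m1

# Option 2: the by function does the split and the summary in one call
m2 <- by(age, gender, mean)
m2

# Alternative: tapply returns the same means as a named vector
m3 <- tapply(age, gender, mean)
m3
```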
Plots
• Histogram
This is one of the most used plots to describe quantitative data. It shows the distribution of
data by counting the number of values that fall within specific intervals that cover the
range of the variable.
The rule to bear in mind is that the area of the bins (rectangles created at each interval) must be
proportional to the interval counts.
The function to generate a histogram is hist().
hist(fib$edad,main="Age",xlab="Age")
[Figure: histogram “Age”; x-axis: Age; y-axis: Frequency.]
• Histogram by factor
par(mfrow=c(1,2))
hist(fib$edad[fib$sexo=="M"],main="Age - Males",xlab="Age",col="red")
hist(fib$edad[fib$sexo=="F"],main="Age - Females",xlab="Age",col="red")
[Figure: two histograms side by side, “Age − Males” and “Age − Females”; x-axis: Age; y-axis: Frequency.]
• Density plot
A density plot shows the distribution of data as the histogram does, but now the interval
width tends to zero. Thus, the distribution of data is more realistic because it does not depend
on the width of the intervals.
To draw a density plot, first we have to create a density object using the density function.
After that, we need to apply the function plot to the density object.
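A minimal sketch of both steps, including an overlay by group. The ages are simulated with rnorm and the groups are assigned arbitrarily, so this is hypothetical data, not the fib data; the colours and legend position are also illustrative choices.

```r
set.seed(1)                                    # reproducible simulated data
age <- round(rnorm(100, mean = 58, sd = 20))   # hypothetical ages
sex <- rep(c("M", "F"), each = 50)             # hypothetical groups

# Step 1: create the density object; step 2: plot it
d <- density(age)
plot(d, main = "Density plot of Age", xlab = "Age")

# Density by group: plot one group and add the other with lines()
dM <- density(age[sex == "M"])
dF <- density(age[sex == "F"])
plot(dM, main = "Density plot of Age", xlab = "Age", col = "blue",
     ylim = range(dM$y, dF$y))                 # make room for both curves
lines(dF, col = "red")
legend("topright", legend = c("Males", "Females"),
       col = c("blue", "red"), lty = 1)
```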
[Figure: “Density plot of Age”; x-axis: Age; y-axis: Density.]
[Figure: “Density plot of Age” by gender; legend: Males, Females; x-axis: Age; y-axis: Density.]
• Boxplot
The box plot is one of the most popular charts to describe quantitative data. It is based on
five summary statistics: minimum and maximum, and the three quartiles.
The box is defined by the quartiles: box boundaries are the first and third quartile; the line
inside the box is the median. Furthermore, the lines that come out from the box (also known
as whiskers) end at the minimum and maximum. By default, R's boxplot function limits the
whiskers to 1.5 times the interquartile range from the box and draws more extreme values as
individual points.
boxplot(fib$edad,main="Boxplot of Age")
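The numbers behind a box plot can be inspected with boxplot.stats, the base R function that boxplot uses for its computations. The sketch below uses a small hypothetical vector with one extreme value, to show that R's default whiskers stop at the most extreme value within 1.5 times the box length and that values beyond are reported separately.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)   # hypothetical values, 50 is extreme
bs <- boxplot.stats(x)

bs$stats   # lower whisker, box bottom, median, box top, upper whisker:
           # 1.0 3.0 5.5 8.0 9.0
bs$out     # values drawn as individual points: 50

boxplot(x, main = "Boxplot with one extreme value")
```

With the argument range=0, boxplot extends the whiskers to the minimum and maximum of the data.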
[Figure: box plot “Boxplot of Age”.]
• Boxplot by factor
boxplot(fib$edad~fib$sexo,main="Boxplot of Age",xlab="Gender")
[Figure: box plots “Boxplot of Age” by Gender (F, M); y-axis: fib$edad.]
It is also possible to give the data frame name as an argument, avoiding the use of the $ symbol. In
that case we need to specify the two variables by means of a formula object using the symbol ~.
boxplot(edad~sexo,data=fib,main="Boxplot of Age",xlab="Gender")
[Figure: box plots “Boxplot of Age” by Gender (F, M); y-axis: edad.]
boxplot(lelastas~tamano,fib,main="Boxplot of L-elastase",xlab="Size")
[Figure: box plots “Boxplot of L−elastase” by Size; y-axis: lelastas.]