R Introduction

This document provides an introduction to using R for statistical analysis. It discusses setting up R and RStudio, creating R scripts, and the basic building blocks of R including vectors, matrices, and data frames. Vectors can contain numeric or character data and support common mathematical operations. Matrices are two-dimensional arrays that store values in rows and columns. Key classes and functions for creating, manipulating, and applying operations to vectors and matrices are also outlined.
Copyright
© All Rights Reserved

R doc: Let’s use R

Josep L. Carrasco
Bioestadística. Departament de Fonaments Clínics
Universitat de Barcelona

First steps
R is a programming language and free software environment for statistical computing and
graphics. Base and contributed packages can be installed from the CRAN (Comprehensive R
Archive Network) website (http://cran.r-project.org/).
R has a command line interface, though it is more common to use a graphical user interface
such as RStudio (https://www.rstudio.com/).
Once R and RStudio have been set up, let’s open RStudio and create an R script file.

In this file we can write the code to run in a session and save it.
To embed comments through the code use the symbol #.

Working environment
R always uses a directory on your computer to read and/or save data and objects. To know
which folder is used by default use the function getwd().
To change the working directory use the function setwd().
One appealing option is Session –> Set Working Directory –> To Source File Location from
the bar menu. The working directory will be set to the folder where the R script file is
located.
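
As a minimal sketch of the two functions above, the following lines store the current working directory, switch to the session's temporary folder, and then restore the original one. The folder returned by tempdir() is used only because it is guaranteed to exist; the names old_wd and new_wd are illustrative.

```r
old_wd <- getwd()     # current working directory
setwd(tempdir())      # switch to a folder that always exists during a session
new_wd <- getwd()     # now points to the temporary folder
setwd(old_wd)         # restore the original working directory
```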

Mathematical functions
R can be used as a calculator through its built-in mathematical operators and functions.
The following table shows the main operators and functions.

Operation            Operator/Function   Example
Addition             +                   3+4
Subtraction          -                   3-4
Product              *                   3*4
Division             /                   3/4
Power                ^                   3^4
Square root          sqrt(x)             sqrt(4)
Natural logarithm    log(x)              log(2)
Logarithm base 2     log(x,2)            log(4,2)
Logarithm base 10    log10(x)            log10(100)
Exponential          exp(x)              exp(2)

To round results, one option is to change the global options of the R session using the
function options and its argument digits. For example, options(digits=10) would print all
numerical outputs with up to 10 significant digits.
However, the common choice is just to round a single result. To do that the next table gives
some options.

Operation                Operator/Function   Example
Round to upper integer   ceiling(x)          ceiling(5.37)
Round to lower integer   floor(x)            floor(5.37)
Truncate                 trunc(x)            trunc(5.37)
Rounding                 round(x,digits)     round(5.37,digits=1)
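
The difference between these functions is easier to see with a negative number. A short sketch, where the value -5.37 is chosen only for illustration:

```r
x <- -5.37
ceiling(x)             # -5: smallest integer not below x
floor(x)               # -6: largest integer not above x
trunc(x)               # -5: drops the decimals (rounds toward zero)
round(x, digits = 1)   # -5.4
```

For positive numbers trunc and floor give the same result; they differ only for negative values.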

Allocation

Objects and classes


R is an object-oriented programming language that uses classes and objects. An object is a
data structure with attributes and methods that act upon them. The properties and methods
of an object depend on its class. For example, arithmetic operations apply to objects
of class numeric but they cannot be used on objects of class character.
To assign a name to an object we have two options: = or <-. Both give the same result.

x=10
x

[1] 10

y<-12
y

[1] 12

An object may be also the result of an operation or a function.

z<-x+y
z

[1] 22

w<-log(z)
w

[1] 3.091042

To know the class of an object we can use the function class:

class(w)

[1] "numeric"
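
The following sketch illustrates how the class of an object determines which operations are allowed. The object names a and res are illustrative, and tryCatch is used only so that the failed operation does not stop the script.

```r
a <- "10"                              # character, although it looks like a number
class(a)                               # "character"
res <- tryCatch(a + 1,                 # arithmetic on a character value fails
                error = function(e) "error")
res                                    # "error"
as.numeric(a) + 1                      # 11, once the value is converted to numeric
class(TRUE)                            # "logical"
```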

Next, let’s see the main classes for data analysis: vectors, matrices, data frames and lists.

Vectors
A vector is a set of values that can be numeric or character. To create a vector use the
function c.
The following table shows the main operators and functions that apply to vectors.

Operation               Operator/Function   Example
Add a constant          +                   x+4
Subtract a constant     -                   x-4
Product with constant   *                   x*4
Division with constant  /                   x/4
Power to a constant     ^                   x^4
Square root             sqrt(x)             sqrt(y)
Natural logarithm       log(x)              log(y)
Logarithm base 2        log(x,2)            log(4,2)
Logarithm base 10       log10(x)            log10(100)
Exponential             exp(x)              exp(2)

x=c(5,7,9,13,-4,8)
x

[1] 5 7 9 13 -4 8

class(x)

[1] "numeric"

y=c("A","B","C")
y

[1] "A" "B" "C"

class(y)

[1] "character"

Notice that character values must be enclosed between quotation marks. Furthermore, all
values must be of the same class (numeric or character). If one character value is combined
with numeric values then all values are converted to character.

z<-c(x,y)
z

[1] "5" "7" "9" "13" "-4" "8" "A" "B" "C"

class(z)

[1] "character"

The function seq creates a sequential vector. We just need to specify the starting and ending
values, and the increment. When the increment is equal to 1 we can use the
shorter colon operator (from:to).

seq(from=1, to=10, by=2)

[1] 1 3 5 7 9

seq(from=1, to=10, by=1)

[1] 1 2 3 4 5 6 7 8 9 10

1:10

[1] 1 2 3 4 5 6 7 8 9 10

Additionally, it can be useful to replicate a vector several times. In that case the function
to use is rep. Notice the different result when the argument each is used.

rep(1,5)

[1] 1 1 1 1 1

z=c(1,2)
z

[1] 1 2

rep(z,5)

[1] 1 2 1 2 1 2 1 2 1 2

rep(z,each=5)

[1] 1 1 1 1 1 2 2 2 2 2

Vector indexing refers to the position of the values in the vector. It is needed
to extract a single value or a set of values from the vector. Positions are
indicated using square brackets.

x[2] # Value at second position

[1] 7

x[2:4] # Values between positions 2 and 4

[1] 7 9 13

x[c(1,3)] # First and third value

[1] 5 9
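
Brackets also accept negative positions (to exclude values) and can be used on the left side of an assignment (to replace values). A brief sketch using a copy of the same vector, named v here to leave x untouched:

```r
v <- c(5, 7, 9, 13, -4, 8)
v[-1]            # all values except the first: 7 9 13 -4 8
v[-c(1, 3)]      # drop the first and third values: 7 13 -4 8
v[length(v)]     # last value: 8
v[2] <- 70       # replace the value at the second position
v                # 5 70 9 13 -4 8
```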

The same operators and functions shown for single values also apply to vectors. The operations
are element-wise, i.e. they are applied element by element. Additionally, the operator %*%
produces the matrix product. Here are some examples:

x<-c(1,3,4,7)
y=1:4
x+2 # Add a constant

[1] 3 5 6 9

x*2 # Product with a constant

[1] 2 6 8 14

x^2 # Power

[1] 1 9 16 49

x+y # Addition of vectors

[1] 2 5 7 11

x*y # Product of elements of a vector

[1] 1 6 12 28

x%*%y # Matrix product

[,1]
[1,] 47

Furthermore, there are other interesting functions for vectors:

sum(x) # Sum all values of the vector

[1] 15

prod(x) # Product of all values of the vector

[1] 84

length(x) # Number of values in a vector

[1] 4

sort(x) # Sorts the values of vector

[1] 1 3 4 7

order(x) # Gives the original locations of the sorted vector

[1] 1 2 3 4
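
Because the vector above is already sorted, order simply returns 1 to 4. A sketch with an unsorted vector (the values in v are chosen only for illustration) shows more clearly what the function returns:

```r
v <- c(7, 1, 4, 3)
o <- order(v)                 # positions of the values in increasing order
o                             # 2 4 3 1
v[o]                          # 1 3 4 7, the same result as sort(v)
order(v, decreasing = TRUE)   # 1 3 4 2
```

Indexing with the result of order is the usual way to sort one vector (or the rows of a data frame) according to the values of another.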

Logical operators are useful to check which elements of the vector meet some condition.
The following table shows the logical operators.

Operation Operator/Function Example


Equal == x==4
Greater > x>4
Greater or equal >= x >= 4
Lower < x<4
Lower or equal <= x <= 4
Not equal != x!=4

Notice that the equality operator is a double equal sign. Remember that a single equal
sign is used to assign names to objects.
The result of a logical operator is a logical vector with the same length as the
vector used in the logical evaluation. The values of a logical vector are TRUE when
the value meets the condition, or FALSE otherwise.
Let’s see some examples.

x<-c(1,3,4,7,10)
x==4 # which values are equal to 4

[1] FALSE FALSE TRUE FALSE FALSE

x>4 # which values are greater than 4

[1] FALSE FALSE FALSE TRUE TRUE

x>=4 # which values are greater than or equal to 4

[1] FALSE FALSE TRUE TRUE TRUE

x<4 # which values are lower than 4

[1] TRUE TRUE FALSE FALSE FALSE

x<=4 # which values are lower than or equal to 4

[1] TRUE TRUE TRUE FALSE FALSE

x!=4 # which values are different to 4

[1] TRUE TRUE FALSE TRUE TRUE

Logical conditions also apply to character vectors. However, in that case usually only the
operators equal and not equal make sense.

z<-c("A","B","C","D","E")
z=="C" # which values are equal to C

[1] FALSE FALSE TRUE FALSE FALSE

z!="C" # which values are different to C

[1] TRUE TRUE FALSE TRUE TRUE
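
A logical vector can be placed inside the brackets to extract the values that meet a condition, and conditions can be combined with & (and) and | (or). A short sketch with the numeric vector used above:

```r
x <- c(1, 3, 4, 7, 10)
x[x > 4]             # values greater than 4: 7 10
which(x > 4)         # positions of those values: 4 5
sum(x > 4)           # how many values meet the condition: 2
x[x > 1 & x < 10]    # both conditions must hold: 3 4 7
x[x < 3 | x > 7]     # at least one condition must hold: 1 10
```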

Matrices
In R, matrices are objects that store values in a two-dimensional array (rows and columns).
The generic function to create a matrix is matrix(). The main arguments are:

• data: a vector with the values to store in the matrix.
• nrow: number of rows.
• ncol: number of columns.
• byrow: a logical value (TRUE or FALSE) indicating if the matrix has to be filled by
rows. The default value is FALSE.

Let’s see some examples:

x<-c(1,3,4,7,10,12)
A=matrix(x,nrow=2,ncol=3)
A

[,1] [,2] [,3]


[1,] 1 4 10
[2,] 3 7 12

A=matrix(x,nrow=2,ncol=3,byrow=T) # Notice the matrix is filled now by rows.
A

[,1] [,2] [,3]


[1,] 1 3 4
[2,] 7 10 12

Matrices can also be created by concatenating vectors using the functions cbind (concatenate
by columns) and rbind (concatenate by rows).

x<-c(1,3,4,7,10,12)
y<-1:6
cbind(x,y)

x y
[1,] 1 1
[2,] 3 2
[3,] 4 3
[4,] 7 4
[5,] 10 5
[6,] 12 6

rbind(x,y)

[,1] [,2] [,3] [,4] [,5] [,6]


x 1 3 4 7 10 12
y 1 2 3 4 5 6

One important issue to bear in mind is that all elements in a matrix have to be of the same class
(numeric or character). Let’s see what happens if a numeric vector and a character vector are
concatenated to create a matrix.

x<-c(1,3,4,7,10,12)
z<-c("A","B","C","D","E","F")
cbind(x,z)

x z
[1,] "1" "A"
[2,] "3" "B"
[3,] "4" "C"
[4,] "7" "D"
[5,] "10" "E"
[6,] "12" "F"

R automatically converts the numeric vector to character values.
Matrix indexing involves two values because of the two-dimensionality of the object. The
two values refer to the row and column number respectively, and they are separated
by a comma.

x<-c(1,3,4,7,10,12)
y<-1:6
A<-cbind(x,y)
A

x y
[1,] 1 1
[2,] 3 2
[3,] 4 3
[4,] 7 4
[5,] 10 5
[6,] 12 6

A[1,2] # First row, second column

y
1

A[2,1] # Second row, first column

x
3

A[1,] # First row, all columns

x y
1 1

A[,1] # All rows, first column

[1] 1 3 4 7 10 12
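
A few more indexing forms are sketched below for the same matrix: selecting several rows, selecting a column by name, and keeping the matrix structure with drop=FALSE. The name col1 is illustrative.

```r
x <- c(1, 3, 4, 7, 10, 12)
y <- 1:6
A <- cbind(x, y)
dim(A)                         # 6 2: number of rows and columns
A[2:3, ]                       # rows 2 and 3, all columns
A[, "y"]                       # column selected by name: 1 2 3 4 5 6
col1 <- A[, 1, drop = FALSE]   # keep the result as a 6 x 1 matrix
dim(col1)                      # 6 1
```

Without drop=FALSE, extracting a single row or column returns a plain vector, as in the examples above.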

Matrix operations and functions

The same operators used for vectors apply to matrices. Furthermore, specific functions for
matrices are:

• t(). Transposes the matrix.
• diag(). Extracts the diagonal of a matrix, or creates a diagonal matrix.
• det(). Determinant of a square matrix.
• solve(). Inverse of a square matrix.

Some examples:

x<-c(1,3,4,7,10,12)
A<-matrix(x,nrow=2)

A+2 # Add a constant to all elements

[,1] [,2] [,3]


[1,] 3 6 12
[2,] 5 9 14

A*2 # Multiply by a constant all elements

[,1] [,2] [,3]


[1,] 2 8 20
[2,] 6 14 24

A^2 # Power to a value all elements

[,1] [,2] [,3]


[1,] 1 16 100
[2,] 9 49 144

A+A # Element-wise addition of matrices

[,1] [,2] [,3]


[1,] 2 8 20
[2,] 6 14 24

A-A # Element-wise subtraction of matrices

[,1] [,2] [,3]


[1,] 0 0 0
[2,] 0 0 0

B=t(A) # Transpose
B

[,1] [,2]
[1,] 1 3
[2,] 4 7
[3,] 10 12

C=A%*%B # Product
C

[,1] [,2]
[1,] 117 151
[2,] 151 202

diag(C) # Diagonal

[1] 117 202

diag(1:3) # Create a diagonal matrix

[,1] [,2] [,3]


[1,] 1 0 0
[2,] 0 2 0
[3,] 0 0 3

det(C) # Determinant

[1] 833

solve(C) #Inverse

[,1] [,2]
[1,] 0.2424970 -0.1812725
[2,] -0.1812725 0.1404562
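
As a check on the inverse, the product of a matrix and its inverse should give the identity matrix. The sketch below rebuilds C from the values printed above and also shows that solve can solve a linear system directly; the right-hand side b is a hypothetical vector added for illustration.

```r
C <- matrix(c(117, 151, 151, 202), nrow = 2)
C_inv <- solve(C)
round(C %*% C_inv, 10)    # identity matrix, up to rounding error
b <- c(1, 2)              # hypothetical right-hand side
sol <- solve(C, b)        # solves C %*% sol = b without forming the inverse
C %*% sol                 # recovers b
```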

Lists
Lists group R objects of possibly different classes into a one-dimensional set. More precisely, a list is an ordered
collection of objects. Thus, lists are useful to combine different objects into a single object.
For example, let’s create a list containing two vectors. The first is a numeric vector whilst
the second is a logical vector created after applying a logical condition to the first vector.

x<-c(1,3,4,7,10,12)
z<-x>=4
w<-list(x,z)
w

[[1]]
[1] 1 3 4 7 10 12

[[2]]
[1] FALSE FALSE TRUE TRUE TRUE TRUE

List objects index their elements using double brackets. Thus, the way to recover one
element of the list is:

w[[1]]

[1] 1 3 4 7 10 12

w[[2]]

[1] FALSE FALSE TRUE TRUE TRUE TRUE

Some functions use list objects as arguments. This is the case of the dimnames argument in
the matrix function. This argument gives names to the rows and columns respectively.

x<-c(1,3,4,7,10,12)
A=matrix(x,nrow=2,ncol=3, dimnames=list(c("A","B"),c("1","2","3")))
A

1 2 3
A 1 4 10
B 3 7 12
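
The elements of a list can also be given names when the list is created, and then recovered by name. A minimal sketch, where the names values and above4 are chosen only for illustration:

```r
x <- c(1, 3, 4, 7, 10, 12)
w <- list(values = x, above4 = x >= 4)
names(w)              # "values" "above4"
w$values              # same element as w[[1]]
w[["above4"]]         # same element as w[[2]]
class(w["values"])    # "list": single brackets keep the list structure
```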

Data frames
Data frame objects are similar to matrices but have further properties. The main difference stems from the
fact that data frames allow columns with different types of data (quantitative and qualitative) whilst only
one type of data is possible in a matrix.
First, let’s create two vectors: one quantitative and one qualitative. Let’s suppose that these
vectors contain the age and gender of eight subjects. Notice that character values are between
quotation marks.

x<-c(65, 34, 44, 25, 38, 41, 59, 21)


y<-c("M","F","F","M","F","M","M","F")

To create a data frame with these two vectors use the function data.frame.

dades=data.frame(x,y)
dades

x y
1 65 M
2 34 F
3 44 F
4 25 M
5 38 F
6 41 M
7 59 M
8 21 F

Names

The current names of the columns (variables) are “x” and “y”. We may change that by
applying the function names to the data frame.
It is convenient to use short names for a better handling of data. Additionally, accents and
special characters should be avoided. For example, let’s name them as “Age” and “Gender”.

names(dades) # Current names

[1] "x" "y"

names(dades)<-c("Age","Gender") # New names


names(dades)

[1] "Age" "Gender"

dades

Age Gender
1 65 M
2 34 F
3 44 F
4 25 M
5 38 F
6 41 M
7 59 M
8 21 F

Indexing

To extract rows or columns from a data frame we can apply matrix
indexing. For example, let’s extract the “Age” column, which is the first column.

dades[,1]

[1] 65 34 44 25 38 41 59 21

However, data frames have further properties than matrices. One of them is the names of
the variables. Using this property it is possible to extract a column by using the symbol “$”
followed by its name.

dades$Age

[1] 65 34 44 25 38 41 59 21

Furthermore, the variable names can also be used inside the brackets.

dades[,"Age"]

[1] 65 34 44 25 38 41 59 21

Additionally, we could select the data from males (“M”) by applying a logical condition to
the rows. The result is a new data frame that only contains the rows that meet the condition.

dades[dades$Gender=="M",]

Age Gender
1 65 M
4 25 M
6 41 M
7 59 M

The same result is obtained using the function subset

subset(dades,Gender=="M")

Age Gender
1 65 M
4 25 M
6 41 M
7 59 M
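
Conditions can be combined, and rows and columns can be selected at the same time. The sketch below rebuilds the data frame of the example; the result names r1, r2 and r3 are illustrative.

```r
dades <- data.frame(Age = c(65, 34, 44, 25, 38, 41, 59, 21),
                    Gender = c("M", "F", "F", "M", "F", "M", "M", "F"))
r1 <- subset(dades, Gender == "M" & Age > 40)           # males older than 40
r1                                                      # rows 1, 6 and 7
r2 <- dades[dades$Gender == "F" & dades$Age < 40, "Age"]
r2                                                      # 34 38 21
r3 <- subset(dades, Age >= 59, select = Age)            # keep only the Age column
r3                                                      # rows 1 and 7
```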

Factors

Factors are variables that contain qualitative information. They are created by applying
the factor function to a vector; quoting the values alone gives a character vector.
Continuing with the example of the eight subjects, suppose now that we recorded a binary
variable indicating whether the subjects were exposed to a risk factor (smoking, for instance)
or not. Additionally, this variable is coded as 0 or 1 meaning “No” and “Yes” respectively.

sm<-c(1,0,0,0,1,0,0,1)
sm

[1] 1 0 0 0 1 0 0 1

R will understand this variable as a quantitative one. In principle there is no problem in
coding a qualitative variable with numbers. However, in further analyses it will be more
appropriate to define this variable as a factor and use the properties of this kind of object.
Applying the function factor to the variable gives the same vector values, but additional
information is printed now. “Levels” shows the different values of the factor.

factor(sm)

[1] 1 0 0 0 1 0 0 1
Levels: 0 1

To transform sm into a factor, assign the result back to sm:

sm<-factor(sm)
sm

[1] 1 0 0 0 1 0 0 1
Levels: 0 1

To add the new variable to the existing data frame it is possible to use the function cbind

cbind(dades,sm)

Age Gender sm
1 65 M 1
2 34 F 0
3 44 F 0
4 25 M 0
5 38 F 1
6 41 M 0
7 59 M 0
8 21 F 1

Reordering the factor levels

This can be convenient to change the appearance of outputs. One easy way to reorder the factor
levels is using the argument levels in the factor function. We have to use a vector with the
ordered values.

sm<-factor(sm,levels=c("1","0"))
sm

[1] 1 0 0 0 1 0 0 1
Levels: 1 0

Recoding: change the names of the levels

There are different ways to change the names of the levels. Perhaps one of the most appealing
ways is using the function recode from the dplyr package.
To do that you first need to install the package, using either the function install.packages() or the
Packages tab in RStudio.

install.packages("dplyr")

After that the package needs to be loaded in memory using the library() function.

library(dplyr)

Warning: replacing previous import 'lifecycle::last_warnings' by
'rlang::last_warnings' when loading 'pillar'

Warning: replacing previous import 'lifecycle::last_warnings' by
'rlang::last_warnings' when loading 'tibble'

sm<-recode(sm,"0"="No","1"="Yes")
sm

[1] Yes No No No Yes No No Yes
Levels: Yes No

To know if a variable is already a factor we just need to apply the function is.factor or class().

is.factor(sm)

[1] TRUE

class(sm)

[1] "factor"
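
As an alternative that needs no additional package, the levels can be ordered and renamed in a single step with the levels and labels arguments of factor. This is a sketch of a different route from the recode approach above; the names sm_raw and sm2 are illustrative.

```r
sm_raw <- c(1, 0, 0, 0, 1, 0, 0, 1)        # original 0/1 coding
sm2 <- factor(sm_raw,
              levels = c(1, 0),            # order of the levels
              labels = c("Yes", "No"))     # names given to the levels, in that order
sm2
levels(sm2)                                # "Yes" "No"
table(sm2)                                 # Yes: 3, No: 5
```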

Finally, to add this vector to our data frame we could proceed by applying the function cbind

cbind(dades,sm)

Age Gender sm
1 65 M Yes
2 34 F No
3 44 F No
4 25 M No
5 38 F Yes
6 41 M No
7 59 M No
8 21 F Yes

or just using the symbol “$”.

dades$SM<-sm
dades

Age Gender SM
1 65 M Yes
2 34 F No
3 44 F No
4 25 M No
5 38 F Yes
6 41 M No
7 59 M No
8 21 F Yes

Notice that with the last option we could even change the name of the variable in the data
frame.

Split

On some occasions it can be of interest to select a subset of the data or to divide a data frame
according to a condition. We have seen how to do so by applying a logical condition to the
row index. For example, let’s divide the data frame into smokers and non-smokers.

dades.Y<-dades[dades$SM=="Yes",]
dades.Y

Age Gender SM
1 65 M Yes
5 38 F Yes
8 21 F Yes

dades.N<-dades[dades$SM=="No",]
dades.N

Age Gender SM
2 34 F No
3 44 F No
4 25 M No
6 41 M No
7 59 M No

An alternative approach is to apply the function split to the data frame.

dades.SM<-split(dades,dades$SM)
dades.SM

$Yes
Age Gender SM
1 65 M Yes
5 38 F Yes
8 21 F Yes

$No
Age Gender SM
2 34 F No
3 44 F No
4 25 M No
6 41 M No
7 59 M No

The new object dades.SM is a list that contains the two data.frames. To recover the data
frames that are within the list we may use the double brackets or just the symbol “$” and
the name of the object.

dades.SM[[1]]

Age Gender SM
1 65 M Yes
5 38 F Yes
8 21 F Yes

dades.SM$Yes

Age Gender SM
1 65 M Yes
5 38 F Yes
8 21 F Yes

dades.SM[[2]]

Age Gender SM
2 34 F No
3 44 F No
4 25 M No
6 41 M No
7 59 M No

dades.SM$No

Age Gender SM
2 34 F No
3 44 F No
4 25 M No
6 41 M No
7 59 M No
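
Once the data frame is split, the same function can be applied to every element of the resulting list with sapply. A minimal sketch that rebuilds the Age and SM columns of the example; the name groups is illustrative.

```r
dades <- data.frame(Age = c(65, 34, 44, 25, 38, 41, 59, 21),
                    SM = factor(c("Yes", "No", "No", "No", "Yes", "No", "No", "Yes"),
                                levels = c("Yes", "No")))
groups <- split(dades, dades$SM)
n_by_group <- sapply(groups, nrow)                          # Yes: 3, No: 5
mean_by_group <- sapply(groups, function(g) mean(g$Age))    # mean age in each group
mean_by_group
```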

Categorizing a quantitative variable

Let’s suppose that we want to create a new variable that classifies the subject’s age into three
groups as:

• Group 1: lower than 40

• Group 2: between 40 and 64

• Group 3: greater than 64

That means categorizing the quantitative variable into intervals. The appropriate function to
do that is cut(x,breaks,labels,right). The arguments are:

• x. Vector of values to be categorized.

• breaks. Numeric vector with the cut points. They should cover the whole range of the
data, from the minimum to the maximum value.

• labels. Labels for the levels of the resulting category. By default, labels are constructed
using “(a,b]” interval notation.

• right. Logical. If TRUE the intervals are closed on the right (and open
on the left); if FALSE they are closed on the left and open on the right.

In our example the appropriate syntax is:

cut(dades$Age,breaks=c(0,39,64,100),labels=c("G1","G2","G3"),right=T)

[1] G3 G1 G2 G1 G1 G2 G2 G1
Levels: G1 G2 G3
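
The effect of the argument right is easiest to see at the cut points. In the sketch below the ages are hypothetical integer values placed next to the group boundaries; for such values the breaks used above with right=TRUE and the breaks c(0,40,65,Inf) with right=FALSE give the same groups.

```r
age <- c(39, 40, 64, 65)     # hypothetical ages next to the group boundaries
a <- cut(age, breaks = c(0, 39, 64, 100), labels = c("G1", "G2", "G3"), right = TRUE)
b <- cut(age, breaks = c(0, 40, 65, Inf), labels = c("G1", "G2", "G3"), right = FALSE)
a                            # G1 G2 G2 G3
identical(a, b)              # TRUE
r_closed <- cut(40, breaks = c(0, 40, 100))                  # falls in (0,40]
l_closed <- cut(40, breaks = c(0, 40, 100), right = FALSE)   # falls in [40,100)
```

With non-integer ages the two sets of breaks would differ, so the cut points should be chosen according to how the groups are defined.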

However, when the categorization involves just two groups it can be simpler to apply a
logical condition. For example, if the categorization was

• Group 1: lower than 40

• Group 2: greater than or equal to 40.

dades$Age>=40

[1] TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE

Now, TRUE values stand for those subjects with an age greater than or equal to 40.

Save data frame in a file

There are multiple formats to save and load data in R. Here we only focus on plain text (ASCII)
files, which are probably the most widely used and simplest format.
To save the data frame in text format we will use the write.table function. The main arguments
are:

• x. The name of the object (data frame) that is going to be saved.


• file. The path and name of the file.
• append. If TRUE, the output is appended to the file.
• quote. If TRUE, any character or factor columns will be surrounded by double quotes.
• sep. Values within each row of x are separated by this string.
• dec. The string to use for decimal points

write.table(dades,"dades.txt",quote=F,sep="\t")

Load data

Use read.table function to read data from a text file. Some of the arguments are the same as
those from write.table function. However read.table has further arguments, some of them are:

• header. If TRUE the file contains the names of the variables as its first line.
• nrows. The maximum number of rows to read in.
• skip. The number of lines of the data file to skip before beginning to read data.

dades<-read.table("dades.txt",header=T,sep="\t")
dades

Age Gender SM
1 65 M Yes
2 34 F No
3 44 F No
4 25 M No
5 38 F Yes
6 41 M No
7 59 M No
8 21 F Yes
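
The following sketch writes a small data frame to a temporary file and reads it back, so it can be run without creating files in the working directory. The three-row data frame and the names f and dades2 are illustrative; row.names=FALSE and stringsAsFactors=TRUE are optional arguments of write.table and read.table that were not used above.

```r
dades_small <- data.frame(Age = c(65, 34, 44), Gender = c("M", "F", "F"))
f <- tempfile(fileext = ".txt")                    # temporary file path
write.table(dades_small, f, quote = FALSE, sep = "\t",
            row.names = FALSE)                     # do not write the row numbers
dades2 <- read.table(f, header = TRUE, sep = "\t",
                     stringsAsFactors = TRUE)      # read character columns as factors
str(dades2)                                        # structure of the data read
unlink(f)                                          # delete the temporary file
```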

Summarizing the data


This section shows the basics of summarizing data with R.
Let’s use the data in the file fibrinol.txt, which correspond to subjects that suffered a pleural effusion
and were treated with fibrinolytics.

fib=read.table("fibrinol.txt",header=T)
head(fib) # This function is used to show the first rows

  edad sexo fiebre dolortor tamano lelastas drenaje fibrinol drenxcir
1   64    1      2        1      1       81       1        1        1
2   79    1      2        2      2     8487       2        2        1
3   34    1      2        2      2     8035       2        1        1
4   81    1      1        1      3      374       2        1        1
5   75    2      2        2      2     1883       1        1        1
6   54    1      1        2      2     5351       2        1        1

Notice that fib is already a data frame.

Qualitative data
First we have to define as factors those variables that are coded as numeric and that we will
use in the analyses.
Let’s do it with the fever and gender variables. Furthermore, we will recode these variables with
new values.

library(dplyr)
fib$fiebre<-factor(recode(fib$fiebre,"1"="Yes","2"="No"))
fib$sexo<-factor(recode(fib$sexo,"1"="M","2"="F"))

We are also going to use the size of the pleural effusion, whose values (1,2,3) mean small,
medium and large.

fib$tamano<-factor(recode(fib$tamano,"1"="Small","2"="Medium","3"="Large"),
levels=c("Small","Medium","Large"),ordered=T)

Notice that we use the argument ordered to indicate that the levels follow a hierarchical
order.

Frequency table

The basic summary for qualitative data are counts. For example, fiebre variable indicates if
a subject had fever or not. Using the function table we obtain the counts of this variable.
Proportions are obtained by applying the prop.table function to the object generated by table.

tab<-table(fib$fiebre) # Counts
tab

No Yes
49 51

prop.table(tab)

No Yes
0.49 0.51

prop.table(tab)*100 #Percentage

No Yes
49 51

Contingency table

A contingency table is used to describe the frequencies of two qualitative variables at the
same time. To do that we still apply the function table.

tab2<-table(fib$tamano,fib$sexo)
tab2

F M
Small 13 17
Medium 23 24
Large 9 14

It is possible to assign labels to the table by using the dnn argument.

tab2<-table(fib$tamano,fib$sexo,dnn=list("Size","Gender")) #Add labels


tab2

Gender
Size F M
Small 13 17
Medium 23 24
Large 9 14

Furthermore, once the table object has been created we could add the marginals of the table
(rows and columns total counts).

addmargins(tab2)

Gender
Size F M Sum
Small 13 17 30
Medium 23 24 47
Large 9 14 23
Sum 45 55 100

Now, three proportions are possible depending on the total used: rows, columns, or total
data.

prop.table(tab2) # Proportions to total

Gender
Size F M
Small 0.13 0.17
Medium 0.23 0.24
Large 0.09 0.14

prop.table(tab2,margin=1) # Proportions to total row

Gender
Size F M
Small 0.4333333 0.5666667
Medium 0.4893617 0.5106383
Large 0.3913043 0.6086957

prop.table(tab2,margin=2) # Proportions to total column

Gender
Size F M
Small 0.2888889 0.3090909
Medium 0.5111111 0.4363636
Large 0.2000000 0.2545455
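
The meaning of the margin argument can be checked by adding up the proportions. Since the data file is not needed for this, the sketch below rebuilds the contingency table from the counts printed above; the name counts is illustrative.

```r
counts <- matrix(c(13, 23, 9, 17, 24, 14), nrow = 3,
                 dimnames = list(Size = c("Small", "Medium", "Large"),
                                 Gender = c("F", "M")))
tab2 <- as.table(counts)
rowSums(prop.table(tab2, margin = 1))          # each row adds up to 1
colSums(prop.table(tab2, margin = 2))          # each column adds up to 1
sum(prop.table(tab2))                          # the whole table adds up to 1
round(100 * prop.table(tab2, margin = 2), 1)   # column percentages, one decimal
```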

Quantiles

A quantile is a value at or below which a certain proportion of the ordered data lies.
The most used quantiles are the quartiles, which split the data into quarters, and the percentiles, which
accumulate a specific percentage of the data.
There are three quartiles:

• The first quartile accumulates 25% of the ordered data, so it is also the 25th percentile.
• The second quartile accumulates 50% of the ordered data. It is also known as the median and
the 50th percentile.
• The third quartile accumulates 75% of the ordered data, so it is also the 75th percentile.

Quantiles can only be computed with data that can be ordered, i.e. quantitative data or qualitative data
with order. The way to compute them changes somewhat depending on the type of data
(quantitative or qualitative) and it must be specified.
Following the example, the fever variable has no order. So, it is meaningless
to compute its quantiles. However, we can find a qualitative variable with order. The tamano
variable indicates the size of the pleural effusion on an ordinal scale: small, medium and large.

table(fib$tamano)

Small Medium Large


30 47 23

prop.table(table(fib$tamano))

Small Medium Large


0.30 0.47 0.23

Let’s compute the main quantiles. The function quantile considers the data as numeric by
default. This can be changed by using the argument type so that the quantiles are computed
using the approach for qualitative data.

quantile(fib$tamano,type=1) # Quantile for ordinal data

0% 25% 50% 75% 100%


Small Small Medium Medium Large
Levels: Small < Medium < Large

Percentile 0 is the minimum value of the data, which in this case is “Small”. The first quartile
is “Small”, which means that the “Small” category already accumulates at least 25% of the data.
The median (percentile 50) and the third quartile (percentile 75) are both “Medium”.
Finally, the maximum value in the data (percentile 100) is “Large”.
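
These quantiles can be reproduced without the data file by rebuilding the ordered factor from the counts shown above (30 Small, 47 Medium, 23 Large). In this sketch the names size and q are illustrative; the cumulative proportions show where each quantile falls.

```r
size <- factor(rep(c("Small", "Medium", "Large"), times = c(30, 47, 23)),
               levels = c("Small", "Medium", "Large"), ordered = TRUE)
cumsum(prop.table(table(size)))      # cumulative proportions: 0.30 0.77 1.00
q <- quantile(size, probs = c(0.25, 0.5, 0.75), type = 1)
q                                    # 25%: Small, 50%: Medium, 75%: Medium
```

“Small” accumulates 30% of the data, enough to reach the first quartile, while “Medium” is needed to reach both 50% and 75%.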

Plots

Plots for qualitative data are usually based on the counts. Most used are the bar chart and
the pie chart.
Some common arguments of the plotting functions are:

• xlim, ylim. Limits of the x and y axes.
• xlab, ylab. Labels for the x and y axes.
• main. Main title.
• col. Plotting color.
• pch. Symbol or single character to be used in plots using points.
• lty. Line type for plots using lines.
• lwd. Line width for plots using lines.
• cex. Character size, relative to the default.

To explore more graphical parameters run ?par.

• Bar plot

The input data are the counts.

tab<-table(fib$tamano)
barplot(tab,ylim=c(0,60),col="red")

[Figure: bar plot of the counts of Small, Medium and Large; y axis from 0 to 60]

• Bar plot by factor

barplot(tab2,ylim=c(0,40),col=c("blue","red","green"),beside=T,
legend=rownames(tab2),xlab="Gender",
main="Size by Gender",args.legend=list(cex=0.7))

[Figure: grouped bar plot “Size by Gender”; x axis: Gender (F, M); legend: Small, Medium, Large; y axis from 0 to 40]

• Pie chart

pie(tab)

[Figure: pie chart with sectors Small, Medium and Large]

Quantitative data
The main statistics to describe quantitative data can be classified as:

• Central tendency. They inform about where the distribution of values is centered. The most
used are the mean and the median.

• Position. They give information about where the distribution of values is located. We
have already seen these statistics in the qualitative data case: the quantiles.

• Dispersion. They are useful to know how different the values are. Low dispersion
means that the values are concentrated around similar values. On the other
hand, large dispersion implies that the values are widely spread.

Quick summary

The summary() function computes the following statistics: minimum, quartiles, mean and
maximum.

summary(fib$edad)

Min. 1st Qu. Median Mean 3rd Qu. Max.


18.00 40.50 60.00 57.72 75.00 94.00

Nevertheless, there are specific functions to compute the descriptive statistics.

Central tendency

mean(fib$edad)

[1] 57.72

median(fib$edad)

[1] 60

Position

quantile(fib$edad)

0% 25% 50% 75% 100%


18.0 40.5 60.0 75.0 94.0

min(fib$edad)

[1] 18

max(fib$edad)

[1] 94

Dispersion

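• Variance: average squared distance to the mean

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Note that the R function var computes the sample variance, which divides by n − 1 instead of n.

• Standard deviation: square root of the variance

• Coefficient of variation: ratio of the standard deviation to the mean

• Interquartile range: difference between Q3 and Q1.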
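
The dispersion statistics can be computed step by step to see how they relate to the built-in functions. In this sketch the vector x contains hypothetical values chosen so that the arithmetic is easy to follow (mean 5, sum of squared deviations 32).

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)                # hypothetical values
n <- length(x)
var_n  <- sum((x - mean(x))^2) / n            # divides by n: 4
var_n1 <- sum((x - mean(x))^2) / (n - 1)      # divides by n - 1: 32/7
var(x)                                        # equals var_n1, not var_n
sd(x)                                         # square root of var(x)
sd(x) / mean(x) * 100                         # coefficient of variation, in percent
IQR(x)                                        # third quartile minus first quartile
```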

var(fib$edad)

[1] 414.143

sd(fib$edad)

[1] 20.3505

(sd(fib$edad)/mean(fib$edad))*100

[1] 35.25728

IQR(fib$edad)

[1] 34.5

Summarising by a factor

Sometimes it can be of interest to compute the summary of a quantitative variable at each
level of a factor. In this case we have (at least) two options:

1) Split the data frame by the factor and summarize the variable in each subset.
2) Use the by function.

by(fib$edad,fib$sexo,mean)

fib$sexo: F
[1] 59.6
------------------------------------------------------------
fib$sexo: M
[1] 56.18182
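
Two other base R functions give the same kind of summary in a more compact form: tapply returns a named vector and aggregate returns a data frame. The sketch below uses only the six rows shown by head(fib) above, with sexo already recoded, so the means differ from those of the full data set; the names d, m1 and m2 are illustrative.

```r
d <- data.frame(edad = c(64, 79, 34, 81, 75, 54),
                sexo = factor(c("M", "M", "M", "M", "F", "M")))
m1 <- tapply(d$edad, d$sexo, mean)                    # F: 75.0, M: 62.4
m1
m2 <- aggregate(edad ~ sexo, data = d, FUN = mean)    # same means as a data frame
m2
```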

Plots

• Histogram

This is one of the most used plots to describe quantitative data. It shows the distribution of
data by counting the number of values that lie in specific intervals that cover the
range of the variable.
The rule to bear in mind is that the area of the bins (rectangles drawn over each interval) must be
proportional to the interval counts.
The function to generate a histogram is hist().

hist(fib$edad,main="Age",xlab="Age")

[Figure: histogram titled “Age”; x axis: Age (20 to 100); y axis: Frequency]
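
The intervals of the histogram can be set with the breaks argument, and the counts behind the plot can be inspected without drawing it by using plot=FALSE. A brief sketch with the six ages shown by head(fib) above; the names age and h are illustrative.

```r
age <- c(64, 79, 34, 81, 75, 54)
h <- hist(age, breaks = seq(20, 100, by = 20), plot = FALSE)
h$breaks     # interval limits: 20 40 60 80 100
h$counts     # values per interval: 1 1 3 1
h$mids       # interval midpoints: 30 50 70 90
```

Calling hist with the same breaks and without plot=FALSE draws the corresponding histogram.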

• Histogram by factor

par(mfrow=c(1,2))
hist(fib$edad[fib$sexo=="M"],main="Age - Males",xlab="Age",col="red")
hist(fib$edad[fib$sexo=="F"],main="Age - Females",xlab="Age",col="red")

[Figure: two histograms side by side, “Age − Males” and “Age − Females”; x axis: Age; y axis: Frequency]

• Density plot

A density plot shows the distribution of data as the histogram does, but as a smooth curve, as if the interval
width tended to zero. Thus, the picture does not depend on the choice of
intervals, although it does depend on the amount of smoothing applied.
To draw a density plot, first we have to create a density object using the density function.
After that, we need to apply the function plot to the density object.

plot(density(fib$edad),main="Density plot of Age",xlab="Age")

[Figure: “Density plot of Age”; x axis: Age (0 to 120); y axis: Density]

• Density plot by factor

plot(density(fib$edad[fib$sexo=="M"]),main="Density plot of Age",xlab="Age")
# lines() adds a curve to the existing plot, so it takes no title or axis label
lines(density(fib$edad[fib$sexo=="F"]),col="red")
legend(0,0.017,c("Males","Females"),col=c("black","red"),lty=c(1,1),cex=0.5)

[Figure: “Density plot of Age” with one curve for Males (black) and one for Females (red); x axis: Age; y axis: Density]

• Boxplot

The boxplot is one of the most popular charts to describe quantitative data. It is based on
five summaries: minimum and maximum, and the three quartiles.
The box is defined by the quartiles: box boundaries are the first and third quartiles; the line
inside the box is the median. Furthermore, the lines that come out from the box (also known
as whiskers) end at the minimum and maximum when there are no outliers.

boxplot(fib$edad,main="Boxplot of Age")

[Figure: “Boxplot of Age”; y axis from 20 to 80]

• Boxplot by factor

boxplot(fib$edad~fib$sexo,main="Boxplot of Age",xlab="Gender")

[Figure: “Boxplot of Age” by gender; x axis: Gender (F, M); y axis: fib$edad]

It is also possible to give the data frame name as an argument, avoiding the use of the $ symbol. In
that case we need to specify the two variables by means of a formula object using the symbol ~.

boxplot(edad~sexo,data=fib,main="Boxplot of Age",xlab="Gender")

[Figure: “Boxplot of Age” by gender; x axis: Gender (F, M); y axis: edad]

Outlier detection with boxplot


When the data have extreme values (outliers) in relation to the remaining data, the boxplot
function marks them with circles. In such a case, the ends of the whiskers represent the most
extreme values that are not considered outliers.

boxplot(lelastas~tamano,fib,main="Boxplot of L-elastase",xlab="Size")

[Figure: “Boxplot of L−elastase” by size; x axis: Size (Small, Medium, Large); y axis: lelastas]
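
The values that a boxplot would mark as outliers can be listed with the function boxplot.stats, which by default flags values lying more than 1.5 times the box length beyond the box. A minimal sketch with a hypothetical vector that contains one extreme value; the names x and bs are illustrative.

```r
x <- c(10, 12, 13, 14, 15, 16, 18, 60)   # hypothetical values, 60 is extreme
bs <- boxplot.stats(x)
bs$stats    # lower whisker, box limits and median, upper whisker: 10 12.5 14.5 17 18
bs$out      # values marked as outliers: 60
```

Here the upper whisker ends at 18, the largest value that is not considered an outlier.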
