Getting Started With R and RStudio - Feb2018
© 2017
This content has been provided to be used in the Statistical Methods for Data Scientists subject.
Apart from being reformatted to an online delivery and some minor modifications, the content has
been created by the authors Rhondda Jones and Robin Gilliver and they retain their copyright.
However, the authors are not responsible for the delivery of the content in the Quantitative Methods
for Scientists subject.
The authors can be contacted about this material via the email: [email protected].
Preface
The first version of R was written largely for teaching purposes by two New Zealand statisticians,
Ross Ihaka and Robert Gentleman, in 1995: it evolved from the S language developed by John
Chambers at the AT&T Bell laboratories. It was an instant hit! By now, R has become the fastest
growing analytical software available, in terms of both its usage and its capacity. The ability to use it
effectively is now an essential and marketable skill for most quantitatively-oriented professions, from
bioinformatics to data science. Because it is free and open-source, R has also become the
environment where new analytical methods first appear. It has a huge and growing online user
community, and numerous websites devoted to assisting and coaching new users. Welcome to the R
community! We hope you enjoy the journey.
https://cran.r-project.org/
This page provides links to let you download installers for the latest R versions for Windows, Mac,
and Linux machines.
https://www.rstudio.com/products/rstudio/download3/
This website has links to download either personal (free) or commercial versions of RStudio. Most
users will use the desktop personal license. As with R, you can choose an installer appropriate to
your computer.
You should install R before installing RStudio. In each case, double-click on the downloaded installer
and follow the installation instructions provided. At the end of this process, you should have icons
for both R and RStudio on your desktop.
You can double-click on the icon in the usual way to start either program. Starting RStudio will also
start R automatically – but initially, start R on its own to see how to use it without RStudio.
Figure 1-1: The RGui window containing the R console window.
We will not dwell on the RGui menus and toolbar buttons here because their functions are superseded
by the capabilities of RStudio.
> (3 + 9) * 5
Press the Enter key. R responds with the answer – in this case
[1] 60
Notice that the command appears in red and the response in blue. Depending on the command,
there can be multiple values in the response: the number at the beginning of the output line is the
index number for the first value in the line.
Now enter the same calculation again, but with an error – omit the closing bracket after the 9.
> (3 + 9 * 5
This time R notices the absence of a closing bracket and indicates that it expects further input. We
can give it a closing bracket and press Enter
+)
[1] 48
but the answer is wrong because the final bracket should not be at the end. Commands can be
retrieved using the up and down arrows on your keyboard. In this case, press the up arrow twice to
retrieve the command with the mistake, and then correct it. Then try the following, and more of
your own …
> 12/8
[1] 1.5
> 12%%8
[1] 4
> 12%/%8
[1] 1
You can type more than one command on a line if you separate the commands with a semicolon.
[1] 36
[1] 8
10^2 == 99
[1] FALSE
(10==8)&&(5^2==25)
This time, instead of displaying the result, R has stored it in a variable
called x. To display it, use:
> x
[1] 7
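An assignment can be written with a left arrow, a right arrow, or an equals sign. The commands
Lizzie <- "My cat"
"My cat" -> Lizzie
Lizzie = "My cat"
are all equivalent: they all say “Assign the character string "My cat" to the name Lizzie”. In this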
manual we will use the last of these options. The assignment command creates an R object called
Lizzie which has the value "My cat" and stores it in your workspace. Assignment to a previously used
name will destroy the old object and create a new one.
An assignment command will not automatically display the value of the new object. To display the
value, either enter the name separately, or place the assignment in brackets (which will assign and
display in the one operation ... ).
height = 172
height
[1] 172
(height = 172)
[1] 172
To clear the Console window, right click on it and select Clear All.
At this point we will move on to using RStudio as an environment for R, so close the RGui window.
1.2.2 Using R with RStudio
Start RStudio by double-clicking its startup icon.
When you start RStudio for the first time, it should resemble the screenshot below, with a large
window to the left, and two smaller windows to the right.
The top right-hand window has two tabs – Environment and History. The Environment tab gives you
the contents of your current R workspace – which will be empty at this point. The History tab keeps
track of all the commands you have used – and again, is empty when you first open RStudio.
The bottom right-hand window has five tabs. The first one, Files, gives you the files within the folder
where your workspace is located. The second, Plots, will display plots you create in your R session.
The third, Packages, gives you a list of the packages you currently have access to, and the fourth,
Help, lets you search the available help files. The fifth (Viewer) can display local web content.
You will find the right-hand windows very helpful. But the most useful part of RStudio is not visible
yet: to obtain it, we need a new window which appears when you create a script, and gives you
access to the source editor of RStudio.
A new window is created above the Console window, so that RStudio now looks as shown below:
Figure 1-3: The RStudio windows including a script window.
Notice that the script window tab is labelled “Untitled1”. This is currently the name of the script. To
change the name to something more descriptive, save the script: in the save dialogue,
enter Introduction for the File name and press Save. Notice that it is saved as
Introduction.R – RStudio expects any file containing an R script to end in .R.
To illustrate how the script window works, we will enter, edit, and execute a few simple commands.
x = (50 + 6) / 2^3
As you type the line, notice that RStudio helpfully matches brackets as you go. It will do the same for
quote marks.
With the cursor still on the line, press the Run button at the top of the script window.
Notice that the command is echoed in the Console window and has been successfully executed.
Also, in the Environment window on the top right, x is now listed under Values, together with its
value (7).
Now select 2^3 within the typed line, and press Run again. The selected expression is again echoed,
and this time (because the x = part of the command was not selected), the expression is evaluated
and the result is shown in the Console window. Both commands are echoed under the History tab.
The general rule is that when you press Run, if nothing is selected, RStudio runs the line in the script
where the cursor is located. If the line is an incomplete command, it will keep reading lines until the
command is completed. If something is selected – part of a line, or several lines – it runs whatever is
selected.
Now close RStudio by clicking the red Close button on the top right. RStudio will ask you if you want
to save the data and script you have created in the session.
Figure 1-4 Quit dialogue for RStudio
If you press the Save Selected button, both the script and any saved data will appear the next time
you open RStudio – press it and reopen RStudio to check.
2 Introduction to R functions
A function is a set of R operations which are given a name and executed as a group for a particular
purpose. Most statistical operations can be executed with a single command which calls a
ready-made function.
Functions have the form:
somename()
usually with one or more values inside the brackets. The quantities inside the brackets are called
the arguments of the function: they are the data on which the function operates. For example, the
function to calculate square roots is sqrt().
Note: Parentheses must always be included when you use a function in a command. A function may
have no arguments if it has a value which is not affected by user input.
Any arithmetic expression you type in for R to evaluate can include any of the R functions with
appropriate arguments in parentheses ( ). If the expression cannot be evaluated, R will respond with
a missing value (NaN - for "not a number") and a warning message. Division by zero gives Inf (for
"infinity").
Type the expressions below on the left into the RStudio script window and run each one. The
Console window should then appear as shown below on the right.
5 + sqrt(16)
round(25.1872, 1)
log(-1)
exp(2) / log(1)
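As a rough check on what to expect from these four expressions, the sketch below stores each result in a variable (the names a, b, c1 and d are illustrative only) and notes the value in a comment. The call to suppressWarnings() is an addition here, used only to hide the warning that log(-1) would otherwise print.

```r
# Each result is stored so it can be inspected afterwards
a = 5 + sqrt(16)                 # 9
b = round(25.1872, 1)            # 25.2
# log(-1) cannot be evaluated: the result is NaN ("not a number")
c1 = suppressWarnings(log(-1))
# log(1) is 0, so this is a division by zero and the result is Inf
d = exp(2) / log(1)
print(c(a, b, c1, d))
```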
For the rest of this manual, to demonstrate a command we will generally provide the command as
entered in the script window in black, followed by the output produced on the console in blue – eg
5 + sqrt(16)
[1] 9
Where a command is too long to fit on one line of this manual, we will indent the second and
subsequent lines slightly.
Each of the arguments used by a function has a name, and for some of them, R will provide a default
value. For example, the full definition of the round() function - found by running the command
?round is
round(x, digits = 0)
x has no default value, because that is data that you must provide - the number(s) to be rounded.
But digits has a default value of 0: this means that if you do not specify a value, R will round x to
the nearest whole number, with no digits after the decimal place. If you provide a value - as we did
in the earlier example, it replaces the default.
You can specify the argument name within the command: if you provide argument names,
arguments can be in any order. If you do not provide the name, then arguments must be in the same
order as in the function definition.
round(digits=1, x=25.1872)
[1] 25.2
There is an enormous number of R functions – to create and modify data, or to analyze it, search for
it, model it, or plot it in a variety of ways. If you can’t find a function to do exactly what you want, it
is possible to write your own. Most people have a set of functions that they use frequently and
remember, and use Google to search for those they don’t use very often.
So far we have applied functions to single values, as in the round() example above. But most R
functions are designed to operate on groups of values, which R calls vectors. A vector is a set of
values of the same data type (eg all numeric, or all character) which has a single name for the whole
set. This means that you can do calculations which apply to all elements of the vector with a single
command.
myData
[1] 25 30 12 15 8 9 32
weights
myData
[1] 25 30 12 15 8 9 32 16 22
myNames
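The commands that build these vectors are not shown above. One way they could be created is with the c() function, which combines values into a vector. This is a sketch only: the values for myData come from the displays above, while the values for weights and myNames are assumed from the named display of weights shown further on.

```r
# Combine individual values into a vector with c()
myData = c(25, 30, 12, 15, 8, 9, 32)
# c() can also extend an existing vector with further values
myData = c(myData, 16, 22)
# Assumed values: two weights and one name for each of them
weights = c(125, 147)
myNames = c("Mary", "Fred")
myData
```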
Individual elements of a vector can be given names with the names() function. So to attach
“Mary” and “Fred” to the values in the weights vector we created earlier:
names(weights) = myNames
weights
Mary Fred
125 147
You can also give names to each value when a vector is created using the c() function.
roles
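The command that creates roles is not shown. As an illustration of naming values at creation, the sketch below uses name = value pairs inside c(). The names and role descriptions are hypothetical and are not taken from the manual.

```r
# Hypothetical example: each value is given a name as the vector is created
roles = c(Mary = "manager", Fred = "analyst")
roles
# names() retrieves the labels, and a label can be used to select a value
names(roles)
roles["Fred"]
```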
myData
[1] 25 30 12 15 8 9 32 16 22
myData[3]
[1] 12
myData[3] = 16
myData
[1] 25 30 16 15 8 9 32 16 22
You can select more than one element using a vector of index numbers inside the square brackets:
myData[c(1,3,5,7,9)]
[1] 25 16 8 32 22
myData[myData > 20]
[1] 25 30 32 22
my.series = 1:10
my.series
[1] 1 2 3 4 5 6 7 8 9 10
my.second.series = 4:-4
my.second.series
[1] 4 3 2 1 0 -1 -2 -3 -4
my.third.series
[1] 10 8 6 4 2 0 -2 -4 -6 -8 -10
my.fourth.series
my.fifth.series
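The colon operator only counts in steps of 1. A series such as the third one displayed above needs a different step, and one way to produce it is the base R seq() function. This is a sketch: the commands actually used for the third, fourth and fifth series are not shown, and the names evenly.spaced and repeated are hypothetical.

```r
# by= gives the step size; a negative step counts downwards
my.third.series = seq(from = 10, to = -10, by = -2)
my.third.series
# length.out= asks for a fixed number of evenly spaced values instead
evenly.spaced = seq(from = 0, to = 1, length.out = 5)
# rep() repeats values rather than counting through them
repeated = rep(c(1, 2), times = 3)
```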
• A labelled vector containing the numbers of described species in different vertebrate groups
• A vector called Year which holds a sequence of 10 years from 2006 to 2015
Year = 2006:2015
• A vector called Density which holds the 10 values: 5, 3, 7, 8, 6, 9, 8, 10, 14, 11, representing
hypothetical densities of seedlings in each of the years listed in the Year vector.
Density = c(5, 3, 7, 8, 6, 9, 8, 10, 14, 11)
Notice that these four vectors and some or all of the values they hold are now listed under Values in
RStudio’s Environment window.
Once a vector has been created, functions or arithmetic operations applied to a vector act on all
elements of the vector.
vertebrateSpecies / 1000
sum(vertebrateSpecies)
[1] 66178
And then, to estimate the percentage of species in each group, rounded to 1 decimal place:
percentageSpecies = round(100 * vertebrateSpecies /
    sum(vertebrateSpecies), 1)
percentageSpecies
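The same pattern works for any labelled numeric vector. The sketch below uses hypothetical counts (not the vertebrate data) to show that arithmetic is applied to every element and that the labels are carried through to the result.

```r
# Hypothetical counts, used only to illustrate element-wise arithmetic
counts = c(Apples = 30, Pears = 50, Plums = 20)
counts / 10                                   # divides every element by 10
total = sum(counts)                           # one value for the whole vector
percentages = round(100 * counts / total, 1)  # the names are kept
percentages
```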
With a labelled vector, it is also straightforward to draw pie plots or bar plots to display the data.
When you run each plot command, the plot should appear in the Plots tab on the bottom right
window of RStudio.
pie(vertebrateSpecies)
[Figure: pie chart of vertebrateSpecies with slices labelled Fish, Amphibia, Mammals, Reptiles, and Birds]
For the bar plot, you will probably need to make the plot window wider to allow all the bar labels to
be displayed – this can be done by dragging the corner of the window after the plot has been drawn.
barplot(percentageSpecies)
[Figure: default bar plot of percentageSpecies]
This plot could do with a bit of improvement – below we modify the command extensively to
produce the final product. You should try adding each new argument one at a time to see what each
does. Note the use of \n in the main argument to force part of the title to be on a new line.
ylim=c(0,55))
box()
[Figure: final bar plot of percentageSpecies]
There are several useful buttons above the RStudio plot window. Try them out. The Zoom button
gives you the plot in a separate window whose shape and size it is easier to change. The Export
button allows you to either save the plot or copy it to the Clipboard to transfer to another document
– eg a Word document or a PowerPoint slide. If you have drawn several plots, the arrows allow you
to move between them. The remaining two buttons delete either the current plot or all the plots you
have drawn.
[1] 10.42242
[1] 108.6269
[1] 500
There are two standard ways to plot a distribution of numbers like the bodyWeights data: a
boxplot and a histogram.
boxplot(bodyWeights)
hist(bodyWeights)
[Figure: default histogram titled "Histogram of bodyWeights"; x-axis bodyWeights, y-axis Frequency]
Now edit the hist(bodyWeights) command in the script window to change some of the
attributes of the plot, and run the edited command.
ylab="Number of individuals",
labels=T, ylim=c(0,120))
Then put a box around the plot to obtain the final result below.
box()
[Figure: final histogram of bodyWeights with the count shown above each bar; y-axis "Number of individuals"]
mean(Density2)
[1] NA
R returns a missing value – because two values are missing, the mean of all 10 values in Density2
cannot be calculated. If you want the mean of the remaining values, the command needs to specify
that missing values should be removed before the calculation is made.
mean(Density2, na.rm=TRUE)
[1] 9.125
The function is.na() returns TRUE for values which are missing and FALSE otherwise:
is.na(Density2)
[1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
To find the indices of missing values:
which(is.na(Density2))
[1] 3 5
To find the indices of values which are NOT missing:
which(!is.na(Density2))
[1] 1 2 4 6 7 8 9 10
2.2.6 Calculations and plots with two numeric vectors
Here we use the two associated vectors Year and Density, which each have 10 observations. Initially
draw a line plot to show how the density changes with year.
Key arguments in plot()
The initial part of this plot command:
Density~Year
is a formula which can be read as “Density (on the y-axis) is a function of Year (on the x-axis)”.
The type argument specifies what kind of plot – “b” means “both points and lines”. “l” would plot
only lines, and “p” would plot only points. There are other options too: see ?plot
pch specifies the plotting character – in this case a solid circle – there are a lot of options,
see ?points
[Figure: line plot of Density (y-axis) against Year (x-axis)]
The plot suggests that density increases with year – that is, there is a positive correlation between
Year and Density. To calculate the correlation coefficient (a measure of the strength of the
correlation) use
cor(Year, Density)
[1] 0.8699182
Since the maximum possible value of the correlation coefficient is +1, the correlation appears quite
strong.
To calculate the intercept and slope of the straight line which best describes the relationship
between Density and Year:
lm(Density~Year)
Call:
lm(formula = Density ~ Year)

Coefficients:
(Intercept)         Year
  -1807.442        0.903
To draw the regression line superimposed on an existing scatterplot
(ie a plot which uses points instead of both points and lines):
text(x=2008, y=12,
Key arguments in abline()
lwd changes the thickness of the line.
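The full commands for this step are not shown. A minimal sketch of the idea, assuming pch = 16 and lwd = 2 and using a hypothetical label in text(), is:

```r
Year = 2006:2015
Density = c(5, 3, 7, 8, 6, 9, 8, 10, 14, 11)
# Scatterplot: points only
plot(Density ~ Year, pch = 16)
# abline() accepts a fitted lm object and draws its straight line;
# lwd = 2 makes the line twice the default thickness
fit = lm(Density ~ Year)
abline(fit, lwd = 2)
# text() writes a label at the given plot coordinates (label is hypothetical)
text(x = 2008, y = 12, labels = "fitted regression line")
coef(fit)
```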
3 Data in R
3.1 Data types
Data in R come in a variety of different types, some of them (like dates) quite complex. The data
types you will probably use most often are:
• Integer
• Numeric (Double)
• Character
• Factor
• Logical
The properties of each are outlined below; we have used the first three in earlier sections.
Integer: integer data are whole numbers – eg 12, -103, 0. The Year vector created earlier
was an integer variable.
Numeric (Double): numeric data with a decimal part – eg 12.15, -103.0, 0.007. “Double” is
short for “double precision”, which means that each value is stored to about 15 to 16 significant
digits. The bodyWeights variable created earlier stored its data as doubles.
Character: character data are character strings, used most often to label or name individual
values or data rows. When you surround a value with quotation marks, it is created as character
data even if what is in quotes is a number, for example
textStrings
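The command that creates textStrings is not shown. The sketch below uses hypothetical values to illustrate the point that quoted values are stored as character data, and uses class() to report how R is storing each kind of value.

```r
# Quoted values are character data, even when they look like numbers
textStrings = c("12", "abc", "3.5")   # hypothetical values
class(textStrings)
# as.numeric() converts where it can; "abc" cannot be converted and becomes NA
numbers = suppressWarnings(as.numeric(textStrings))
numbers
# For comparison: a colon sequence is integer, a decimal value is double
class(2006:2015)
class(12.15)
```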
Factor: when a variable is a category which is used to group data into subsets, it should be stored
as a factor variable. Factor data has two parts – an integer which defines the level of the factor, and
a label which is a text string used to describe each level. For example, consider the bodyWeights
vector holding 500 hypothetical male body weights we created in chapter 2 via the command
Add 500 hypothetical female body weights with a mean of 65 and a standard deviation of 10 to the
original bodyWeights vector via the command below:
bodyWeights = c(bodyWeights, rnorm(n=500, mean=65, sd=10))
Notice that in the Environment window, the bodyWeights vector now has 1000 values: the first
500 are the original 500 males, and the second 500 are the females. We now create a vector called
gender which can be used to specify which individuals are male and which are female:
Check the vector in the Environment window: notice that it has been created as a character
variable. It can be converted to a factor with the command:
gender = factor(gender)
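The command that builds the gender vector is not shown above. One way to build it, offered as an assumption rather than the manual's own command, is rep() with the each argument, followed by the conversion to a factor:

```r
# each = 500 repeats every label 500 times in turn:
# 500 "Male" values followed by 500 "Female" values
gender = rep(c("Male", "Female"), each = 500)
class(gender)            # character at this point
gender = factor(gender)
levels(gender)           # levels are sorted alphabetically
as.integer(gender[1:3])  # the first values are males, stored as level 2
```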
Notice that the Environment window now tells us that gender is a Factor with 2 levels, and that
“Female” is listed before “Male”. This means that Female is level 1 for the factor, and Male is
level 2 (note that the early values of gender are all 2, as they should be). In general, unless you
specify otherwise, the factor levels will be defined in alphabetical order: in this case “Female” is
defined as level 1 because F comes before M in the alphabet. The order can be changed: the
command below creates a new gender1 vector with “Male” as the first level.
The order of levels determines how values are plotted and analyzed, as illustrated in the two plots
generated below.
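The reordering command is not shown. A sketch of one way to do it, using the levels argument of factor(), is given below. The gender vector is rebuilt here so that the example runs on its own.

```r
gender = factor(rep(c("Male", "Female"), each = 500))
# levels= sets the order explicitly; the data values themselves do not change
gender1 = factor(gender, levels = c("Male", "Female"))
levels(gender)    # alphabetical order
levels(gender1)   # "Male" now comes first
table(gender1)    # counts are reported in the new level order
```

Plots and tables that group by gender1 would then list Male before Female.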
Logical: logical variables can have only two possible values – TRUE and FALSE. Like factor
variables, there is also a number associated with them (FALSE is 0 and TRUE is 1). Logical variables
are often created by comparisons of some kind.
For example, consider the relationship between Density and Year which we evaluated at the end
of Chapter 2, producing the regression equation and plot shown again below.
Year = 2006:2015
lm(Density~Year)
text(x=2008, y=12,
We can use the equation to obtain the fitted values for each year:
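The command is not shown, but the fitted values listed later in this chapter (3.976, 4.879, 5.782 for the first three years) are consistent with applying the printed coefficients directly to Year. A sketch under that assumption:

```r
Year = 2006:2015
# Coefficients as printed by lm(Density~Year): intercept -1807.442, slope 0.903
fitted = -1807.442 + 0.903 * Year
round(fitted, 3)
```

Because the printed coefficients are rounded, these values differ slightly from those returned by fitted(lm(Density ~ Year)), which uses the unrounded estimates.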
The fitted points should of course sit exactly on the regression line, as we can demonstrate by
plotting them on the same graph:
So which years had densities higher than expected (ie above the fitted line)? We can identify them
by creating a logical variable which compares Density with fitted:
isHigh
[1] TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
If isHigh = TRUE, the relevant year had a density above the line. We can show those years
using the which() function to find the indices of the TRUE values:
which(isHigh)
[1] 1 3 4 6 9
Year[which(isHigh)]
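The whole chain can be sketched in one place. The comparison line is an assumption about how isHigh was created, but it reproduces the TRUE and FALSE pattern and the indices shown above.

```r
Year = 2006:2015
Density = c(5, 3, 7, 8, 6, 9, 8, 10, 14, 11)
fitted = -1807.442 + 0.903 * Year
# An element-wise comparison gives one TRUE or FALSE per year
isHigh = Density > fitted
which(isHigh)         # indices of the TRUE values
Year[which(isHigh)]   # the corresponding years
Year[isHigh]          # a logical vector can also be used as an index directly
sum(isHigh)           # TRUE counts as 1, so this is the number of high years
```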
Usually, you will start with the data in a spreadsheet of some kind, and then read the whole data set
into R – we will demonstrate how to do that shortly.
However, data frames can also be created manually by creating the individual vectors and then
copying them into a data frame with the data.frame() function. So, for example, we might want
to copy the Year, Density, fitted and isHigh vectors created in Section 3.1 into a data
frame called seedlings:
Similarly we can combine the bodyWeights and gender vectors into a data frame called
weights:
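The data.frame() commands are not shown. A sketch for the seedlings data frame, with the four vectors rebuilt so the example runs on its own, might look like this:

```r
Year = 2006:2015
Density = c(5, 3, 7, 8, 6, 9, 8, 10, 14, 11)
fitted = -1807.442 + 0.903 * Year
isHigh = Density > fitted
# Each vector becomes a column named after the vector;
# all vectors must have the same length
seedlings = data.frame(Year, Density, fitted, isHigh)
str(seedlings)
head(seedlings, 3)
```

The weights data frame could presumably be built the same way from the gender and bodyWeights vectors.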
In the RStudio Environment window, notice that there are still “loose” Year, Density,
bodyWeights and gender vectors under the Values heading, but there is now also a Data
heading which lists the seedlings and weights data frames: click the little blue arrows to the
left of seedlings and weights to display their vectors. Then click the grid symbol to the right of
seedlings to show the whole data frame in a new tab to the left, as in the screenshot below.
You can refer to (and change) any value in the data frame by specifying its row index and column
index in square brackets using dataframeName[row index, column index]. So to display
the value in the first row and first column:
seedlings[1, 1]
[1] 2006
Similarly you can obtain blocks of values by specifying vectors of row and column numbers: for
example
seedlings[1:3, c(1,3)]
  Year fitted
1 2006  3.976
2 2007  4.879
3 2008  5.782
Leaving out the row (or the column) number requests a whole column (or row) – notice that the
comma still has to be provided.
seedlings[, 1]
[1] 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
seedlings[1, ]
Once vectors have been copied into a data frame, it is good practice to go back to the script tab and
remove the originals from the workspace to avoid confusion:
Notice that the vectors have disappeared from the Values part of the environment window, but they
are still in the seedlings data frame.
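The removal command is not shown. In base R, objects are deleted from the workspace with rm(). The sketch below uses a cut-down, two-column version of seedlings to show that removing the loose vectors leaves the copies inside the data frame untouched.

```r
# A small stand-in for the vectors and data frame described above
Year = 2006:2015
Density = c(5, 3, 7, 8, 6, 9, 8, 10, 14, 11)
seedlings = data.frame(Year, Density)
# rm() deletes the named objects from the workspace
rm(Year, Density)
exists("Year", inherits = FALSE)   # the loose vector is gone
seedlings$Year                     # the copy in the data frame is unaffected
```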
To use vectors stored within a data frame, the name of the data frame must be specified within the
command. There are four common ways to do this:
1. Specify the name of the data frame whenever you refer to one of its vectors, separating the
two names with a $ sign.
2. Specify the vectors via their column numbers (leaving out the row numbers requests the
whole column):
3. Use the with() function. This instructs R to look inside the specified data frame for the
vector names. This is often the most economical approach.
4. Commands which include a formula will usually allow the data frame to be specified in the
command itself.
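The four approaches listed above can be sketched side by side. The choice of cor() and lm() as the example commands is an assumption made for illustration; the first three approaches give the same correlation.

```r
seedlings = data.frame(Year = 2006:2015,
                       Density = c(5, 3, 7, 8, 6, 9, 8, 10, 14, 11))
# 1. dataframe$column
r1 = cor(seedlings$Year, seedlings$Density)
# 2. Column numbers, leaving the row index empty
r2 = cor(seedlings[, 1], seedlings[, 2])
# 3. with() evaluates the expression inside the data frame
r3 = with(seedlings, cor(Year, Density))
# 4. Commands that take a formula usually accept a data= argument
fit = lm(Density ~ Year, data = seedlings)
c(r1, r2, r3)
```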
When you run the command, you will get a Select file window which lets you navigate to the file,
select it, and press Open to load it into R. Then inspect the file in the Environment window. By
clicking the little grid button to the right of the file name, display it in a script window tab, as shown
below. You can also obtain this display with the View() function.
[Screenshot: the data frame displayed in a script window tab. The rows recoverable from the screenshot are:
B None 24 180
B Low 27 158
B Low 34 196
B High 32 213
B High 30 174
B High 39 262]
Notice that RStudio is telling you that plantGrowthExpt has 18 observations of 4 variables, and
that the first two variables – variety and treatment – are factor variables – that is, they define
categories. Variety has two factor levels, A and B, and treatment has three – High, Low, and None. In
R versions before 4.0.0, read.csv() assumes by default that any non-numeric variables should be
imported as factor variables rather than as character variables. From R 4.0.0 onwards the default is
to import them as character variables, so add the argument stringsAsFactors=TRUE to obtain factors.
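The effect of the stringsAsFactors argument can be seen without a file by passing a small piece of CSV text to read.csv() through its text argument. The column names and values below are hypothetical.

```r
# A tiny CSV supplied inline, so the example needs no file on disk
csvText = "variety,treatment,height\nA,None,20\nA,High,31\nB,Low,27"
asCharacter = read.csv(text = csvText, stringsAsFactors = FALSE)
asFactor = read.csv(text = csvText, stringsAsFactors = TRUE)
class(asCharacter$variety)    # stored as character strings
class(asFactor$variety)       # stored as a factor
levels(asFactor$treatment)    # factor levels are sorted alphabetically
```

Stating the argument explicitly makes a script behave the same way in old and new versions of R.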
For importing csv files like this one, using the read.csv() function is probably the simplest option,
and we will continue to use it in this manual. However, RStudio provides support for importing files
saved in several different formats. To look at the options available, click the Import Dataset button at
the top of the Environment window.
(The first time you use any of these import options, RStudio will require you to load additional
libraries from the internet which underpin the import process.)
The other two data structures commonly used in R – arrays and lists - are less often needed by
beginning R users, and we only deal with them briefly here.
4 Manipulating and describing data
This chapter uses the plantGrowthExpt, weights, and seedlings data frames created
earlier to illustrate some options for manipulating data and summarizing it numerically. The next two
chapters will review a range of ways to display data graphically.
The simplest option for describing the data held in a data frame is the summary() function, which
provides summary statistics for each vector in the data frame:
summary(plantGrowthExpt)
summary(weights)
gender bodyWeights
Median : 72.50
Mean : 73.04
Max. :117.98
summary(seedlings)
Min. :2006 Min. : 3.00 Min. : 3.976 Mode :logical
Notice that for factor and logical variables, using summary() provides the number of observations
for each level. For numeric variables (both integer and double), it provides the minimum, maximum,
first and third quartiles, and the median and mean.
This is straightforward, but doesn’t begin to cover all the ways you might want to manipulate and
summarize the data. For example, you might want to apply a log transform to the biomass data, or
calculate mean body weights for males and females separately. In fact, most data analysts spend
more time getting their data into the right form than they do on formal analysis.
Below we look at a number of techniques for manipulating, transforming and summarizing data.
Initially (section 4.1) we look at some techniques available in the R base package. Then, in section
4.2 we look at the options provided by a specialist R package, dplyr, written specifically to make
data manipulation and description easier, faster, and more flexible.
plantGrowthExpt$bh.ratio = with(plantGrowthExpt,
biomass.gm / height.cm)
plantGrowthExpt$sqrt.biom = with(plantGrowthExpt,
sqrt(biomass.gm))
R includes a wide range of functions which may be used to transform data. The table below gives a
brief list of some of the most commonly used. In each case x is a number or vector of numbers.
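The table is not reproduced here. As an illustration only, the sketch below applies a few base R transformation functions to a small vector. These functions are chosen as examples and are not necessarily the ones listed in the table.

```r
x = c(1, 4, 9, 100)
sqrt(x)            # square roots
log(x)             # natural logarithms
log10(x)           # base-10 logarithms
exp(log(x))        # exp() reverses log(), returning the original values
abs(c(-2, 3))      # absolute values
round(10 / 3, 2)   # rounding to 2 decimal places
```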
4.1.2 Logical transforms
It is often useful to create a logical variable which simply takes the value TRUE or FALSE. For
example in the plantGrowthExpt data, we might want to create a variable isLarge which has
the value TRUE if the plant's biomass is greater than the mean biomass for the whole group, and
FALSE otherwise. This is straightforward using a logical expression:
plantGrowthExpt$isLarge = with(plantGrowthExpt,
    biomass.gm > mean(biomass.gm))
Then we might check whether large plants are similarly distributed between varieties: the result
suggests that larger plants may be more likely for variety B than for variety A.
varietyTable
       isLarge
variety FALSE TRUE
      A     7    2
      B     2    7
mosaicplot(varietyTable, col=c("red","blue"))
4.1.3 Creating categories from numeric data
To demonstrate this we use the TullySugar.ONI csv file. It contains three vectors:
TullySugar.ONI = read.csv(file.choose())
The command below creates a variable called phase which specifies the climatic phase for each
year. It uses the cut() function to match the right phase label with each ONI.SON value.
TullySugar.ONI$phase = with(TullySugar.ONI,
Notice the use of -Inf and Inf to represent minus and plus infinity. The breaks define three
categories – all values up to -0.49, values between -0.49 and 0.49, and all values greater than 0.49.
Because the ONI.SON values are reported with only one decimal place, setting the break points
with an additional decimal place means that we don’t have to worry about whether values which
exactly equal a break point are inside or outside a particular category (because breaks are defined so
that no value will ever exactly equal a break point). Notice also that the categories are created as
factor data, and that the order of the levels is not alphabetic – they are in order of increasing size of
the cut variable.
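The cut() command is only partly shown above. The sketch below applies the break points described in the text to a short vector of hypothetical ONI values. The level labels are taken from the bar plot labels used later in this section, and the rest is an assumption about how the command might be completed.

```r
# Hypothetical ONI values, reported to one decimal place
ONI.SON = c(-1.2, -0.5, -0.4, 0.0, 0.4, 0.5, 1.8)
# Three intervals: up to -0.49, from -0.49 to 0.49, and above 0.49
phase = cut(ONI.SON,
            breaks = c(-Inf, -0.49, 0.49, Inf),
            labels = c("La.Nina", "Neutral", "El.Nino"))
phase
levels(phase)   # in order of increasing ONI, not alphabetical
table(phase)    # number of values falling in each category
```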
Then the commands below find how many years were in each phase and display the results in a bar
plot.
with(TullySugar.ONI, table(phase))
phase
La.Nina Neutral El.Nino
     19      17      14
with(TullySugar.ONI, barplot(table(phase),col="slateblue",
box()
[Figure: bar plot of the number of years in each phase; x-axis La.Nina, Neutral, El.Nino; y-axis "# years"]
5 Plotting data with base R
R has a very extensive plotting and plot-editing capability, and is particularly well suited to
exploratory data analysis. Here we expand on the introduction to plots given in chapter 2 to cover
some of the more common plotting functions which you are likely to need.
For the first few plots demonstrated below, we show you both the plot with default options and the
code changes needed to put it in its final form.
Samoa 2000
Tonga 5300
Tuvalu 3600
Vanuatu 2600
The data shown in the table above can be found in the file OceaniaGDP.csv. Read the file into a
data frame, check the structure of the data in the environment window and examine the first few
rows, and then use arrange() to sort the countries in order of per capita GDP:
OceaniaGDP = read.csv(file.choose())
head(OceaniaGDP)
Country per.capita.GDP
1 Australia 47600
3 Fiji 9400
4 Kiribati 1800
library(dplyr)
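The arrange() call itself is not shown. As a sketch of the same sort that needs no extra package, the example below uses the base R order() function instead of dplyr, on five of the countries whose values appear above. This is a substitute technique, not the manual's own command.

```r
# Five countries and values taken from the rows shown above
OceaniaGDP = data.frame(
  Country = c("Australia", "Fiji", "Kiribati", "Samoa", "Tonga"),
  per.capita.GDP = c(47600, 9400, 1800, 2000, 5300))
# order() returns the row indices that put per.capita.GDP in increasing order
OceaniaGDP = OceaniaGDP[order(OceaniaGDP$per.capita.GDP), ]
OceaniaGDP
# With dplyr loaded, the equivalent call would be:
#   OceaniaGDP = arrange(OceaniaGDP, per.capita.GDP)
```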
The names argument in barplot() can be used to specify a vector holding the bar labels – it
needs a vector of character strings, but the Country variable has been read as factor data, so the
code below creates a character vector called labels.
labels = as.character(OceaniaGDP$Country)
Below is a first try at the bar plot, using the las=1 argument to force the y-axis labels to be
horizontal.
1. The left margin of the plot is much too narrow for the labels.
2. The x-axis should extend to 50000 so that it extends a little past the largest bar.
3. The colour is a little uninspired.
4. The x-axis should have a label to say what the numbers mean.
5. It might be good to have a title for the plot.
The first of these is the trickiest since margins have to be set as a global graphical parameter using
the par() function. (Use ?par to check the function’s numerous arguments.) The required
argument here is mar (for margins), which takes a vector of four values, one for each margin,
starting at the bottom and moving clockwise. The default settings are approximately 5,4,4,2 (exactly 5.1, 4.1, 4.1, 2.1), so we need to increase
the second one – to 14 should be enough. When the par() settings are changed, they stay changed
until you change them back, so it is useful to store the old settings in order to restore them later.
The remaining changes can be handled via arguments within the barplot() command.
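The pattern of saving, changing and restoring the margins can be sketched as below. The bar values, labels and x-axis limit are hypothetical and serve only to give the plot some long labels. The key point is that par() returns the previous settings when it changes them.

```r
# Hypothetical values; one label is deliberately long
gdp = c(1800, 2000, 5300)
labels = c("Kiribati", "Samoa", "Federated States of Micronesia")
# Widen the left margin (second value) and keep the old settings in opar
opar = par(mar = c(5, 14, 4, 2))
barplot(gdp, horiz = TRUE, names.arg = labels, las = 1, xlim = c(0, 6000))
box()
# Restore the margins that were in force before the change
par(opar)
```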
barplot(OceaniaGDP$per.capita.GDP, horiz=TRUE, names=labels, las=1,
Finally, add a frame for the plot and restore the default graphics settings.
box()
par(opar)
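[Figure: horizontal bar plot of per capita GDP with one bar per country; bar labels from top to bottom: Australia, New Zealand, Palau, Fiji, Papua New Guinea, Tonga, Tuvalu, Marshall Islands, Federated States of Micronesia, Vanuatu, Samoa, Kiribati, Solomon Islands]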
Below is a table of counts resulting from a poll about Australian voter intentions in three towns. We
will construct a stacked-bar plot which compares the percentages of each voting intention in each
town.
In R, a stacked bar plot requires a matrix for data entry: the values in each column of the matrix are
stacked. Since we want each bar to represent a town, the rows in the table above will need to
become columns in the matrix. The code below creates a vector of percentages for each town, uses
cbind() to combine them into a matrix called VoterIntention, and then uses the party
affiliations as row names for the matrix.
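# Town.A is created in the same way from its row of the table
Town.B = c(1125,1318,404,130)
Town.B = 100*Town.B/sum(Town.B)     # convert counts to percentages
Town.C = c(978,525,215,101)
Town.C = 100*Town.C/sum(Town.C)
VoterIntention = cbind(Town.A, Town.B, Town.C)     # one column per town
rownames(VoterIntention)= c("LNP","Labour","Green","Other")
VoterIntention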
Then we can try the default plot. Setting the legend.text argument to TRUE requests the plot to
include a legend and to use the row names as labels for the legend.
barplot(VoterIntention, legend.text=TRUE)
[Figure: default stacked bar plot of VoterIntention; y-axis from 0 to 100; legend entries Other, Green, Labour, LNP]
1. Make space for the legend. Since the legend is within the plot area, not in the margin, changing
the margins won’t help this time. However, we can extend the x-axis to make space within the
plot area (by default each bar is 1 unit wide with a gap of 0.2, so the three bars end at 3.6:
trial and error suggests x-axis limits of about 0.2 and 4.4 are OK).
2. Make the y-axis values horizontal.
3. Provide a y-axis label.
4. Change the colour scheme.
5. Add a title for the plot.
6. Add a horizontal line at y=0 to serve as an x-axis.
col=c("blue","red","green","grey"),
abline(h=0)
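Putting the six changes together gives a call along the following lines. This is a sketch under stated assumptions: the Town.A counts are invented (the original table is not reproduced here), and prop.table() is used as a shortcut for the column-by-column percentage calculation shown earlier. The limits, colours, axis label and title are those given in the text and the figure.

```r
# Town.A counts are hypothetical; Town.B and Town.C use the counts from the text.
Town.A = c(1000, 900, 300, 100)
Town.B = c(1125, 1318, 404, 130)
Town.C = c(978, 525, 215, 101)
counts = cbind(Town.A, Town.B, Town.C)
rownames(counts) = c("LNP", "Labour", "Green", "Other")

# Convert each column (town) to percentages of that column's total
VoterIntention = 100 * prop.table(counts, margin = 2)

barplot(VoterIntention, legend.text = TRUE,
        xlim = c(0.2, 4.4),   # extra room on the right for the legend
        las = 1,              # horizontal y-axis values
        ylab = "Percentage of respondents",
        col = c("blue", "red", "green", "grey"),
        main = "Voter intentions in three towns")
abline(h = 0)                 # horizontal line at y = 0 to serve as an x-axis
```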
[Figure: stacked bar plot titled "Voter intentions in three towns"; y-axis "Percentage of respondents" from 0 to 100; bars for Town.A, Town.B and Town.C; legend entries Other, Green, Labour, LNP]
• Boxplots allow a simple comparison of the median, interquartile range, and total range of a
numerical variable;
• Histograms split the data values into consecutive intervals (bins) in order to plot frequencies
in each interval.
To demonstrate how these are drawn and used to compare distributions, we use a sample of blood
haemoglobin concentrations for human males living in different locations. In this section we use
boxplots; in the next, we will use histograms.
Three of the locations are at high altitude (Tibet, Ethiopia, and the Andes) and one is at sea-level
(USA). The data are available in the file haemoglobin.csv. Import the data into a new R data
frame called haemoglobin, check its structure in the Environment window, and examine its first
few rows.
haemoglobin = read.csv(file.choose())
head(haemoglobin)
group Hb
1 Ethiopia 12.47
2 Ethiopia 12.68
3 Ethiopia 12.54
4 Ethiopia 12.55
5 Ethiopia 12.77
6 Ethiopia 12.57
The Environment window tells us that group is a factor variable with four levels. Generating the
default boxplot is straightforward.
The upper and lower boundaries of the boxes show the interquartile range, and the line inside each
box marks the median. The whiskers extend to the most extreme data values lying within 1.5 times
the interquartile range of the box. More distant points are plotted as individual outliers.
This is the most conventional form of boxplot, but other forms are possible. Run ?boxplot to
investigate alternative options.
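The whisker rule can be checked by hand. The sketch below uses simulated values, not the data in haemoglobin.csv; the data frame demo, its two groups and their means are assumptions made purely for illustration. It draws a default boxplot and then uses boxplot.stats() to recover the box edges and whisker ends for one group.

```r
set.seed(1)
# Simulated haemoglobin-like values; NOT the data in haemoglobin.csv
demo = data.frame(
  group = factor(rep(c("Andes", "USA"), each = 50)),
  Hb = c(rnorm(50, mean = 19, sd = 1), rnorm(50, mean = 15, sd = 1))
)

bp = boxplot(Hb ~ group, data = demo)   # default boxplot, one box per group

# Check the whisker rule by hand for one group
x = demo$Hb[demo$group == "USA"]
st = boxplot.stats(x)                   # default coef = 1.5
box.lo = st$stats[2]                    # lower box edge (lower hinge)
box.hi = st$stats[4]                    # upper box edge (upper hinge)
fence.lo = box.lo - 1.5 * (box.hi - box.lo)
fence.hi = box.hi + 1.5 * (box.hi - box.lo)

# Whisker ends are the most extreme data values inside the fences;
# anything outside the fences is listed in st$out
c(whisker.lo = st$stats[1], whisker.hi = st$stats[5])
```

The box edges are computed as hinges, which are close to (but not always identical to) the quartiles returned by quantile().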
Modifications are also straightforward. The command below adds a y-axis label and a plot title, and
changes the colour scheme.
col="orange", border="darkred")
The plot makes it very obvious that the sample from the Andes tends to have higher values than the
other three locations.
It is also possible to examine the effects of two factors simultaneously. Below we load and examine
the plantGrowthExpt data set and then plot the effect of variety and hormone treatment on plant
height via two different layouts. Each layout draws six groups in two colours. In the first, we want
two colours to alternate, so we need to specify only two, and the pair will repeat. In the second,
we want two sets of three, so we need to specify values for all six.
plantGrowthExpt = read.csv(file.choose())
head(plantGrowthExpt, 4)
1 A None 14 73
2 A None 17 65
3 A None 20 84
4 A Low 24 119
boxplot(height.cm~variety*treatment, data=plantGrowthExpt,
col=c("gold","lightgreen"),las=1,
boxplot(height.cm~treatment*variety, data=plantGrowthExpt,
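The two layouts can be sketched with simulated data. Everything in the data frame demo below is an assumption for illustration (two varieties, three treatment levels named None, Low and High, and random heights); only the formula structure and the colours gold and lightgreen come from the text.

```r
set.seed(2)
# Simulated stand-in for plantGrowthExpt: 2 varieties x 3 treatments, 5 plants each.
# Level names and heights are assumptions for illustration.
demo = expand.grid(rep = 1:5,
                   variety = c("A", "B"),
                   treatment = c("None", "Low", "High"))
demo$variety = factor(demo$variety, levels = c("A", "B"))
demo$treatment = factor(demo$treatment, levels = c("None", "Low", "High"))
demo$height.cm = rnorm(nrow(demo), mean = 80, sd = 10)

# Layout 1: variety changes fastest, so two colours simply alternate
b1 = boxplot(height.cm ~ variety * treatment, data = demo,
             col = c("gold", "lightgreen"), las = 1)

# Layout 2: treatment changes fastest, so all six colours are given: three, then three
b2 = boxplot(height.cm ~ treatment * variety, data = demo,
             col = rep(c("gold", "lightgreen"), each = 3), las = 1)

b1$names   # group order in layout 1
b2$names   # group order in layout 2
```

In a formula such as y ~ f1 * f2, the levels of the first factor change fastest from box to box, which is why swapping the two factors changes the grouping and therefore the colour vector that is needed.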
check of this assumption. A quantile-quantile (Q-Q) plot plots the quantiles of one data set against
the quantiles of another. If they come from the same distribution, the points will lie close to a
straight line with a slope of 1, through the origin. So to check if a set of data are Normally distributed
we plot the quantiles of the data against what the quantiles should be if the data really are Normally
distributed. (If the data have a mean other than 0 or a standard deviation other than 1, the slope and
intercept of the line change, but the points should still fall on a straight line.) Substantial curvature
of the points away from the line would indicate a failure of the normality assumption.
trialdata = rnorm(100)
qqnorm(trialdata, las=1)
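To see what a failure looks like, it helps to draw a Normal sample and a clearly non-Normal sample side by side. The sketch below is an illustration only: the exponential sample, the object names, and the use of par(mfrow=...) to place two plots in one window are choices made here, not part of the original example. qqline() adds a reference line through the first and third quartiles.

```r
set.seed(3)
normal.sample = rnorm(100)   # Normally distributed values
skewed.sample = rexp(100)    # right-skewed values, for contrast

opar = par(no.readonly = TRUE)
par(mfrow = c(1, 2))         # two plots side by side

q1 = qqnorm(normal.sample, las = 1, main = "Normal sample")
qqline(normal.sample)        # reference line through the first and third quartiles

q2 = qqnorm(skewed.sample, las = 1, main = "Skewed sample")
qqline(skewed.sample)

par(opar)                    # restore the saved settings
```

The points for the Normal sample should stay close to the line, while those for the skewed sample should bend away from it, most visibly at the upper end.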
CO2Australia = read.csv(file.choose())
We provided code to produce a line plot in section 2.3. Equivalent code for this data set is shown
below (note the use of \n to spread the title over two lines):
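As a rough sketch of such a call, with the caveat that the data frame demoCO2, its column names, its values and the title wording are all invented here and are not taken from CO2Australia:

```r
# Invented values standing in for CO2Australia; the column names are assumptions.
demoCO2 = data.frame(
  Year = 2000:2006,
  emissions = c(0.536, 0.541, 0.547, 0.552, 0.556, 0.561, 0.565)
)

# \n inside a character string starts a new line when the title is drawn
title.text = "Per capita CO2 emissions\nAustralia"

plot(demoCO2$Year, demoCO2$emissions, type = "l", las = 1,
     xlab = "Year", ylab = "Per capita CO2 emissions",
     main = title.text)
```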
[Figure: line plot of per capita CO2 emissions (y-axis, about 0.535 to 0.565) against Year]
The y-axis label is a little cramped. The code below expands the y-axis margin slightly to
accommodate it, and uses the title() function to add the axis label (title() allows the distance
between the axis label and the axis line to be modified). (The problem could also be fixed by
reducing the size of all tick labels using cex.axis=0.8 within the plot() command – try it!)
opar = par(no.readonly=TRUE)     # save the settings that can be changed
par(mar=c(5,6,4,2))              # widen the left margin from 4 to 6 lines
ylab ="",
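A minimal sketch of the whole sequence is given below. The data frame demoCO2 and its values are invented stand-ins for CO2Australia, and the line=4.5 setting is an assumed value that would need adjusting by eye for a real plot.

```r
# Invented values standing in for CO2Australia; the column names are assumptions.
demoCO2 = data.frame(
  Year = 2000:2006,
  emissions = c(0.536, 0.541, 0.547, 0.552, 0.556, 0.561, 0.565)
)

opar = par(no.readonly = TRUE)       # save the settings that can be changed
par(mar = c(5, 6, 4, 2))             # left margin widened from 4 to 6 lines

plot(demoCO2$Year, demoCO2$emissions, type = "l", las = 1,
     xlab = "Year", ylab = "")       # suppress the default y-axis label

# Add the label separately; line sets its distance from the axis (assumed value)
title(ylab = "Per capita CO2 emissions", line = 4.5)

mar.used = par("mar")                # margins in force while plotting
par(opar)                            # restore the saved settings
```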
[Figure: line plot of per capita CO2 emissions (y-axis, about 0.535 to 0.565) against Year, redrawn with the wider left margin]