R SPiR
Lars Snipen
Preface
This text is the result of teaching the course STIN300 Statistical program-
ming in R at the Norwegian University of Life Sciences over a few years. It has
evolved by gradual interaction between the teachers and the students. It is still
evolving, and this is the 2015 version.
Contents
1 Getting started 7
1.1 Installing R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 The R software . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 The windows in R . . . . . . . . . . . . . . . . . . . . . . 7
1.3 The RStudio software . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Help facilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 The Help window . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.2 Help on commands . . . . . . . . . . . . . . . . . . . . . . 8
1.4.3 Searching for commands . . . . . . . . . . . . . . . . . . . 9
1.4.4 Partly matching command names . . . . . . . . . . . . . . 10
1.4.5 Web search . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.6 Remarks on quotes . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Demos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Important functions . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.7.1 Install and customize . . . . . . . . . . . . . . . . . . . . . 11
1.7.2 Demos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.7.3 Help facilities . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Scripting 21
3.1 R sessions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Working directory . . . . . . . . . . . . . . . . . . . . . . 21
3.2 The first script . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 The Global Environment . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5 Important functions . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6.1 Directories . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6.2 Cylinder volumes again . . . . . . . . . . . . . . . . . . . 24
7 Control structures 51
7.1 Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.1.1 The for loop . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.1.2 The while loop . . . . . . . . . . . . . . . . . . . . . . . . 52
7.1.3 Avoiding loops . . . . . . . . . . . . . . . . . . . . . . . . 52
7.2 Conditionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.2.1 The if statement . . . . . . . . . . . . . . . . . . . . . . . 53
7.2.2 The else branch . . . . . . . . . . . . . . . . . . . . . . . 54
7.2.3 The switch statement . . . . . . . . . . . . . . . . . . . . 55
7.2.4 Short conditional assignment . . . . . . . . . . . . . . . . 56
7.3 Logical operators . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.4 Stopping loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.4.1 The keyword break . . . . . . . . . . . . . . . . . . . . . . 57
7.4.2 Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.5 Important functions . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
8 Building functions 61
8.1 About functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
8.1.1 What is a function? . . . . . . . . . . . . . . . . . . . . . 61
8.1.2 Documentation of functions . . . . . . . . . . . . . . . . . 61
8.1.3 Why functions? . . . . . . . . . . . . . . . . . . . . . . . . 62
8.2 The R function syntax . . . . . . . . . . . . . . . . . . . . . . . . 63
8.2.1 The output: return . . . . . . . . . . . . . . . . . . . . . 64
8.2.2 The input: Arguments . . . . . . . . . . . . . . . . . . . . 64
8.3 Using a function . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
8.4 Organizing functions . . . . . . . . . . . . . . . . . . . . . . . . . 65
8.5 Local variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
8.6 Important functions . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.7.1 Yatzy - bonus . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.7.2 Yatzy - straights . . . . . . . . . . . . . . . . . . . . . . . 68
8.7.3 Yatzy - pair . . . . . . . . . . . . . . . . . . . . . . . . . . 68
9 Plotting 69
9.1 Scatterplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
9.2 Lineplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
9.3 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
9.4 Barplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
9.5 Pie charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
9.6 Boxplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
9.7 Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
9.8 Contourplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
9.9 Color functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
9.9.1 Making your own palette . . . . . . . . . . . . . . . . . . 81
9.10 Multiple panels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9.11 Manipulating axes . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.12 Adding text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
9.13 Graphics devices . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
9.14 Important functions . . . . . . . . . . . . . . . . . . . . . . . . . 89
9.15 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9.15.1 Plotting function . . . . . . . . . . . . . . . . . . . . . . . 89
9.15.2 Weather data . . . . . . . . . . . . . . . . . . . . . . . . . 89
10 Handling texts 91
10.1 Some basic text functions . . . . . . . . . . . . . . . . . . . . . . 91
10.2 Merging texts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
10.2.1 Special characters . . . . . . . . . . . . . . . . . . . . . . 93
10.3 Splitting texts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
10.4 Extracting subtexts . . . . . . . . . . . . . . . . . . . . . . . . . . 94
10.5 Regular expressions . . . . . . . . . . . . . . . . . . . . . . . . . . 95
10.5.1 Functions using regular expressions . . . . . . . . . . . . . 95
11 Packages 103
11.1 What is a package? . . . . . . . . . . . . . . . . . . . . . . . . . . 103
11.2 Default packages . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
11.3 Where to look for packages? . . . . . . . . . . . . . . . . . . . . . 104
11.4 Installing packages . . . . . . . . . . . . . . . . . . . . . . . . . . 104
11.4.1 Installing from CRAN . . . . . . . . . . . . . . . . . . . . 104
11.4.2 Installing from file . . . . . . . . . . . . . . . . . . . . . . 105
11.4.3 Installing from the Bioconductor . . . . . . . . . . . . . . 105
11.5 Loading packages . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
11.6 Building packages . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
11.6.1 Without RStudio . . . . . . . . . . . . . . . . . . . . . . . 106
11.6.2 Using RStudio . . . . . . . . . . . . . . . . . . . . . . . . 106
11.6.3 How to create the package archive? . . . . . . . . . . . . . 107
11.7 Important functions . . . . . . . . . . . . . . . . . . . . . . . . . 108
11.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
11.8.1 Imputation of data . . . . . . . . . . . . . . . . . . . . . . 108
11.8.2 Making a package . . . . . . . . . . . . . . . . . . . . . . 109
14 Cross-validation 147
14.1 The bias-variance trade-off . . . . . . . . . . . . . . . . . . . . . . 147
14.1.1 Model selection . . . . . . . . . . . . . . . . . . . . . . . . 147
14.1.2 Why model selection? . . . . . . . . . . . . . . . . . . . . 148
14.1.3 Prediction error . . . . . . . . . . . . . . . . . . . . . . . . 150
14.2 The cross-validation algorithm . . . . . . . . . . . . . . . . . . . 150
14.2.1 The leave-one-out cross-validation . . . . . . . . . . . . . 151
14.2.2 C-fold cross-validation . . . . . . . . . . . . . . . . . . . . 151
14.3 Example: Temperature and radiation . . . . . . . . . . . . . . . . 152
14.3.1 The data . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
14.3.2 Fitting a linear model . . . . . . . . . . . . . . . . . . . . 153
14.3.3 The local model alternative . . . . . . . . . . . . . . . . . 154
14.3.4 Fitting the final local model . . . . . . . . . . . . . . . . . 158
14.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
14.4.1 Classification of bacteria again . . . . . . . . . . . . . . . 159
14.4.2 Faster computations . . . . . . . . . . . . . . . . . . . . . 159
Chapter 1
Getting started
1.1 Installing R
The latest version of R is available for free from http://www.r-project.org/.
The Comprehensive R Archive Network (CRAN, http://cran.r-project.
org/) has several mirror sites all over the world, and you choose one to download
from. The Norwegian mirror is at the University of Bergen (UiB).
The R software consists of a base distribution plus a large number of packages
that you can add if you like. You start by downloading and installing the base
distribution. Choose the proper platform (Windows, Mac or Unix) for your
computer. We will come back to the packages later.
Our own resource for R is found at http://repository.umb.no/R/. Here
you can find documents that may be helpful for installing R. Please note that
we will not use R Commander in this course; you only need to install R.
(e.g. Notepad, Wordpad, etc.) as long as you can save your text as simple text
files. However, R also has a built-in editor. If you choose the File menu on
the Console window, and choose New script you should get the Editor window
opened. In this window you can now type any text, and save it on file in the
same way you do with any simple text editor.
Sometimes R does not respond to our commands by outputting text, but
rather by displaying some graphical output. This will appear in the Graphics
window. Move to the Console window and type the command
> plot(c(1,2,3,4), pch=16, col="red")
on the command line, and return. This should produce a Graphics window with
a simple plot.
Let us now move on to the RStudio software that we will use throughout
this course.
in the Console window. As an example, let us look for help for the function
called lm in R: Write
> ?lm
in the Console window, and then return. The Help-file for this function will be
displayed. The command help(lm) is just an alternative to the question mark,
and produces identical results.
NOTE: When we start a line of code with >, as above, it indicates we write it
directly in the Console window. You do not actually type the >. We will see
later that we often write code in files instead, and then we have no > at the start
of the lines.
which will give you a list of available functions where the term "linear" is
mentioned. If you scroll down this list you should find somewhere the entry
stats::lm. This means that in the package called stats there is a function called
lm and that the help file for this function uses the term linear. But, as you
can see, the list of hits is rather long, and the term linear is found in many
Help-files.
and get a listing of all available functions containing the term part in its name.
1.5 Demos
There are several demos in the base distribution of R. Type
> demo()
in the Console window, and return, to get a list of available demos. To run a
specific demo, e.g. graphics, type
> demo(graphics)
1.7 Exercises
1.7.1 Install and customize
Install R and RStudio on your laptop. Customize the appearance of RStudio.
If you want to have the same pane layout as the teacher, take a look at Figures
1.1 and 1.2 for guidance.
Open the Help on RStudio (Help and then RStudio Docs). This will open in
a separate web-browser. We will not spend time on learning RStudio as such,
but you may find it helpful to take some time exploring this.
1.7.2 Demos
Run some of the demos available. At least try out demo(graphics) and
demo(persp).
number of Help files. If you search Help for exp you will be guided to the same
Help file.
Some questions:
3. Read the Usage, Arguments and Details sections, and compute the natural
logarithm of 10 in the Console window.
4. Compute the logarithm with base 3 of 10.
5. Compute the logarithm of 0 and of -1.
6. At the end of the file there are always some examples. Copy these into the
Console window line by line, observe the output and try to understand
what happens.
Chapter 2
Some other frequently used operators are exponentiation, square-root and log-
arithm:
2^2
sqrt(2)
log(2)
log2(2)
log10(2)
This line of code should be read as follows: Create the variable with name a and
assign to it the value 2. Notice how we never really specified that a should be
a numeric data type, this is understood implicitly by R since we immediately
give it a numerical value. This is different from most programming languages,
where an explicit declaration of data type is required for all variables.
Notice also the assignment operator <-, which is quite unique to R (the
S-language). It is a left-arrow, which is a nice symbol, since it means that
whatever is on its right hand side should now be stored in the memory location
indicated by whatever is on its left hand side. Information flows along the arrow
from right to left. You can even write
> 2 -> a
and the assignment is identical! Again, the value 2 is assigned to the variable
a, information flows along the arrow direction. In R you are also allowed to
use = as the assignment operator, i.e. we could have written a=2 above. This
is similar to some other languages, but notice that it says nothing about flow
direction, and you must always assign from right to left (e.g. you cannot write
2=a). We will stick to the <- for assignments in this text.
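The different assignment forms can be tried directly in the Console. A small sketch (the variable names are arbitrary):

```r
# Three ways to assign the value 2 to a variable in R
a <- 2      # standard assignment, value flows right to left
2 -> b      # the same assignment, value flows left to right
d = 2       # also legal, but says nothing about flow direction
a           # typing a name prints the value: 2
```

All three variables end up holding the same value; only the notation differs.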
The arithmetic operators can be used on variables of type numeric:
> a <- 2
> b <- 3
> c <- a * b
which results in three variables (a, b and c), all of type numeric, and with
values 2, 3 and 6. To see the value of a variable, just type its name in the
console window, and upon return R will print its value.
> a
[1] 2
2.3.2 Text
In addition to numbers, we often need texts when dealing with real problems.
A variable that can store a text has the data type character in R. We can create
a character variable in the same way as a numeric:
> name <- "Lars"
Notice the quotes; a text must always be enclosed in quotes in R. The arithmetic
operators are in general meaningless for texts. Text-handling is something we
will come back to in later chapters.
2.3.3 Logicals
A logical data type is neither a number nor a text, but a variable that can take
either the value TRUE or FALSE. Here is an example:
> male <- TRUE
where the variable male now is a logical. Notice that there are no quotes around
TRUE since this is not a text. Logicals are often used as the outcome of some
test, which we will see more of soon.
2.3.4 Factors
Since R has been developed by statisticians, a separate data type has been
created to handle categorical data. Categorical data are discrete data without
an ordering. Here is an example:
> nationality <- factor("norwegian", levels = c("swedish", "norwegian", "danish"))
This example shows several things. First, the variable nationality is created,
and a factor is assigned to it. The factor value is specified as a text
("norwegian") but then changed to a factor by giving it as input to the function
factor(). This function also takes another input, namely a specification of which
category levels this factor is allowed to have. Hence, the text we use as first
input to factor() must be one of those listed in the second input. Factors can
be created from numbers as well, replacing the texts in the example above with
numbers.
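As a small illustration of the levels restriction (the value "finnish" below is just a made-up example), a value not listed among the levels results in NA:

```r
lvls <- c("swedish", "norwegian", "danish")
ok  <- factor("norwegian", levels = lvls)  # a valid level
bad <- factor("finnish", levels = lvls)    # not among the allowed levels
is.na(bad)   # TRUE: the value could not be matched to any level
```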
This results in the variable a having the value NA, which means Not Available.
This is the R way of saying that a value is missing. It was simply not possible
to convert the text "12a" to a number, but the variable a is still created and
assigned NA as value. We will talk more about NA later in this chapter.
Now the variable nat is a numeric with value 2. The reason for this value is that
nationality has the value norwegian, which is the second level of this factor.
Levels are numbered by their occurrence in the levels specification. Notice
that even if the factor was specified by the use of characters (texts), the factor
itself only stores levels, and these levels can always be converted to numeric.
Hence, we can always convert texts to numbers by first converting to factor and
then to numeric.
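A minimal sketch of this two-step text-to-number conversion (the example vector is made up for illustration):

```r
txt <- c("low", "high", "low")                 # texts, not numbers
f   <- factor(txt, levels = c("low", "high"))  # convert to factor
as.numeric(f)                                  # 1 2 1: the level numbers
```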
2.5 Non-types
It is possible to create a variable without any data type in R:
> a <- NULL
The NULL type is a ’neutral’ or unspecified data type. We create the variable,
give it a name, but do not yet specify what type of content it can take.
We briefly encountered the NA value above, indicating missing information.
This value is special in the sense that all data types can be assigned NA. Any
variable, regardless of data type, can take the value NA without causing any
problems. This makes sense, since any type of data can be missing.
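A short sketch of how NA behaves inside a numeric vector; note that many functions propagate NA unless told to ignore it:

```r
x <- c(12, NA, 7)        # a numeric vector with one missing value
is.na(x)                 # FALSE TRUE FALSE: locate the missing value
mean(x)                  # NA: the missing value propagates
mean(x, na.rm = TRUE)    # 9.5: remove NA before computing
```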
The first four are ’larger/smaller than’ and ’larger/smaller or equal to’. The fifth
operator is ’equals’. Notice the double equal symbol. The last is ’not equal’.
The outcome of all these comparisons is a logical value, TRUE or FALSE. We
can store the outcome of such tests just as any other value:
> a.is.greater.than.b <- a > b
and the variable a.is.greater.than.b now has the value TRUE or FALSE depending
on the values of a and b.
In R we have a clear distinction between assignment, which is done with <-,
and comparison, which is done with ==. This distinction is important, and not
always clear to people new to programming. Typing a<-b means we copy the value
of b into a. Typing a==b just compares the value of a to the value of b; no
assignment or copying is made.
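The distinction can be demonstrated with a couple of Console lines:

```r
a <- 2
b <- 3
a == b    # FALSE: only a comparison, nothing is stored
a <- b    # assignment: the value of b is copied into a
a == b    # TRUE: a now holds the same value as b
```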
Variables of the data type logical are usually the result of some comparison.
First we create the variable a and assign to it the value 2. This makes a a
numeric. But, then we assign a text to this variable, and R does not complain,
but immediately changes the type of a to character! There is no warning or
error, and it is easy to see how such re-assignments can cause problems in larger
programs.
You are hereby warned: Always keep track of your variables, R will not help
you!
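A sketch of such a silent re-assignment; class() reports the current data type:

```r
a <- 2
class(a)     # "numeric"
a <- "two"   # R silently turns a into a character, no warning
class(a)     # "character"
# a + 1 would now stop with an error: non-numeric argument
```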
2.10 Exercises
2.10.1 Volume of cylinder
We will compute the volume of a cylinder. Create two numeric variables named
height and radius, and assign values to them. Let them have values 10 and
1, respectively. Then, let the variable volume be the volume, and use height
and radius to compute the corresponding value for volume. HINT: The number
π (3.1415...) is always available in R as pi. Give new values to height and/or
radius, and recompute the volume. HINT: Use the up-arrow on your keyboard
(multiple times) to repeat previous commands in the Console window.
Scripting
3.1 R sessions
So far we have typed commands directly in the Console window. This is not
the way we usually work in an R session.
3.1.1 Scripts
Instead we create a file (or files), and put all our commands in this file, line by
line as we did in the console window. Such files are called scripts. Then, we save
the file under a name using the extension .R as a convention. In order to execute
the commands, we can either copy them from the file and paste them into the
Console window, or more commonly, use the source() function. This means we
spend 99% of the time in the Editor window, and only visit the Console window
each time we want to execute our script.
In RStudio we have a separate editor for writing scripts. From the menu
you choose File - New, and then R Script to create a new script-file. It will
show in the Source pane. Here you can type in your commands and save the
file as in any simple editor. In the header of the editor window you find some
shortcut buttons to run the code. The Source button will run the entire script,
and corresponds exactly to the source() function mentioned above. The Run
button is used to run only parts of the script: either the single line of code
where your cursor is located, or, if you first select some lines of code with
your mouse, pressing Run executes only the selected lines.
of this text we will use RHOME to symbolize the Working Directory, i.e. replace
RHOME with the full path to your Working Directory.
It is quite customary to create a folder for each project where you are going
to make R-scripts and other R-files that we will see later. In RStudio you can
define such projects using the Project menu. Try to create a new
project, and you will see that the first thing you must decide is the directory
(folder) where it resides. Get used to organizing everything in directories. It
is one of the most common errors for R-beginners to store their files in one
directory and then run R in another, and of course nothing works!
In RStudio there are several ways to change the Working Directory during
a session. If you have a script in the editor, and this has been saved to some
directory, you can make this the Working Directory by the menu Session, then
Set Working Directory and then To Source File Location. Another option is
to use the Files-window (look for the Files tab in the lower two panes). In this
window you can maneuver through your file-tree. In the header of this window
there is a pop-up menu named More. From this you can choose Set As Working
Directory, and the folder shown in the Files-window is now your working folder.
Your Working Directory is always listed in the header of the Console-window.
Note that in R we use the slash / and not the backslash \ whenever we specify
a file-path, even in Windows.
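From code, getwd() and setwd() query and change the Working Directory. A sketch using tempdir() as a stand-in for a real project folder; in practice you would give your own path, with forward slashes even on Windows (e.g. "C:/Users/lars/STIN300"):

```r
old <- getwd()      # remember where we are
setwd(tempdir())    # move to another directory (here: the temp folder)
getwd()             # verify that the Working Directory changed
setwd(old)          # move back again
```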
and save the file under the name script_3_1.R. NOTE: Lines of code in this
text that do not start with a > are supposed to be written in a file (script), not
directly in the Console window.
Click the Source button. If your script is error free it will produce no output,
and nothing really seems to happen. Well, something did happen. The four lines
of code were executed from the top down. First the variable v1 was created,
and given the value 2. Next, the variable v2 was created, and so on. The four
variables still exist in the R memory, even if they are not displayed. If you want
to list all existing variables in the Console window, use the command ls(). If
you type ls() in the Console-window it should look something like this now:
> ls ()
[1] " v1 " " v2 " " w " " d "
which means there are 4 variables existing, having the names listed. If you type
the name of any of them in the Console window, R will output its value. These
variables will now exist in the R memory until you either delete them or end
your R session.
where the first line deletes variable v1 only and the second deletes everything
listed by ls(). The latter command has a shortcut in RStudio. In the header of
the Workspace-window there is a broom (a Harry Potter flying device) button,
and if you push it the entire workspace is cleared.
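In code, the same clearing is done with rm(); a minimal sketch:

```r
v1 <- 2
v2 <- 4
rm(v1)             # delete the variable v1 only
rm(list = ls())    # delete everything that ls() lists
ls()               # character(0): the workspace is now empty
```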
When you end R (use the command q() or the File - Quit RStudio menu)
you are always asked if the ’workspace image’ should be saved (workspace means
Global Environment here). In most cases we do not save this. The reason is
that we try to build programs such that everything is created by the running
of scripts (and functions), and the next time we start R we can re-create the
results by just running the same scripts again. This is the whole idea of scripting.
We store the recipes, and re-compute the results whenever needed. There are
exceptions: if the results take a very long time to re-compute, you should of
course store them!
When we write R-programs (scripts) it is often a good practice to create
all variables within the program, not relying on certain variables already
existing in the Global Environment. It is actually a good habit to clear the Global
Environment before you re-run a script, as existing variables may cause your
program to behave erroneously. Make a habit of always clearing the Global
Environment before you Source your scripts.
3.4 Comments
As soon as we start to build larger programs, we will need to make some com-
ments in our script files. A comment is a note we leave in our program to make
it easier to read, both for others and ourselves. It is amazing how much you
have forgotten if you open a script you made a month ago! Comments are not
read by R, they do not affect the program execution at all, they are only there
for our eyes. In R you use the symbol # to indicate the start of a comment, and
the rest of that line will be ignored by R when the script is run. Sometimes
we use this to comment out some lines in our script, usually in order to test the
remaining code to search for errors. Here is an example in the script we just
made:
a <- 2
b <- 3
# c <- a * b
d <- a^b  # d should be 8
If you run this script, line 3 is skipped because it starts with a #, and the
variable c is never created. The comment on the last line does not affect
anything because it comes after the code on that line.
The RStudio editor recognizes an R-comment, and will display it in a differ-
ent font and color, making it easy to spot comments in larger programs.
3.6 Exercises
3.6.1 Directories
Make a directory on your computer for this course, and make RStudio use
this as startup-directory. You may find it convenient to create subdirectories
under this later, e.g. separate subdirectories for exercises, data, the compulsory
project etc.
4.2 Vectors
4.2.1 Creating a vector
The basic data structure in R is the vector. It is a linear structure of elements,
and corresponds very well to what we think of as vectors in mathematics and
statistics. However, a vector in R can be of any data type, i.e. not restricted
only to numbers but can also be filled with texts, logicals or factors. Note
the ’either-or’ here: a vector holds either numbers, or texts, or logicals, etc.
We cannot mix different data types inside the same vector.
A vector can be created by the function c() (short for concatenate). Let us
make a new script-file, and fill in the following lines of code:
# Creating vectors using the c() function
month <- c("January", "February", "March")
days <- c(31, 28, 31, 30)
where the first vector is of data type character and has 3 elements, and the sec-
ond is of type numeric and has 4 elements. Note that each element is separated
by a comma. Save this file under the name script_vectors.R, and run it (use
the Source button in RStudio or type source("script_vectors.R") in the console
window). Verify that the two vectors are created, they should appear in the
Workspace-window of RStudio.
Vectors are indexed which means every element has a ’position’ in the vector
and that we can refer directly to an element by its position-number, the index.
If we write month[2] we refer to element number 2 in the vector month. The
brackets [] are only used for indexing in R. We will look at this in detail below.
and the vector month now has 4 elements. The new element is added at the end.
Had we written c("April",month) it would have been put first. In general, the
c() function can be used to splice together several vectors.
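A sketch of growing a vector with c():

```r
month <- c("January", "February", "March")
month <- c(month, "April")       # append a new element at the end
month <- c("December", month)    # or splice one in at the front
length(month)                    # 5
```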
which fills idx with the integer values 1,2,3,4. To create more general sequential
vectors, use the function seq(). Then you can specify the starting value, the
ending value and the step-length, and none of them need be integers (see ?seq).
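A few typical seq() calls, as a sketch:

```r
seq(1, 4)               # 1 2 3 4, the same as 1:4
seq(0, 1, by = 0.25)    # 0.00 0.25 0.50 0.75 1.00
seq(10, 2, by = -2)     # 10 8 6 4 2: the step may be negative
```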
Another systematic vector can be created by rep(). Here is an example of
two frequent uses of this command:
> rep(idx, times=3)
[1] 1 2 3 4 1 2 3 4 1 2 3 4
> rep(idx, each=3)
[1] 1 1 1 2 2 2 3 3 3 4 4 4
NOTE: Whenever we start a line of code with a > as above, it just indicates this
is done directly in the Console window, not in a script-file! The output directly
below the statement is what you will see in the Console window.
and here we see an example of the data type integer. We mentioned already in
Chapter 2 that numbers can be either numeric or integer. When we create a
vector as a systematic sequence, like we did above, R will make it an integer. If
we made it ’manually’ it would be a numeric:
> idx <- c(1,2,3,4)
> class(idx)
[1] "numeric"
The reason for this distinction may be more apparent after the next subsection.
Vectors can have names! This means every element in a vector can have
its separate name. This is something we make use of from time to time when
handling data. The function names is used both to extract the names from a
vector and to assign (new) names to a vector. We can give names to the numeric
vector days; add this to the script (and re-run):
names(days) <- month  # adding names to a variable
which means each element of days now also has a name in addition to the value.
Type days in the console to see its values and names. Notice that the names
are themselves a vector, a character vector, that is stored along with the ’real’
vector. Remember, the value of, say, days[2] is still 28. Its name is just a label.
Notice that the new vector produced, with the increased values, ’inherits’ the
names from the vector days. This is a general rule in R, if the input is named,
the output inherits the names, if possible.
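A sketch of this inheritance, using the days vector from above:

```r
days <- c(31, 28, 31, 30)
names(days) <- c("January", "February", "March", "April")
leap <- days + c(0, 1, 0, 0)   # element-wise addition
names(leap)                    # the names are inherited from days
leap[["February"]]             # 29
```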
The same arithmetic operator can also be used between vectors, as long as
they have the same number of elements, e.g.
> days - days
January February March April
      0        0     0     0
and the operators are used elementwise. This means the result is a new vector,
having the same number of elements as the two input vectors, and where every
element is the result of using the operator on the corresponding pair of input
elements.
The same applies to comparisons, see chapter 2 for an overview. We can
compare two vectors as long as they are of the same length, and the result is
a new vector where each element is the result of comparing the corresponding
input elements. Comparing a vector to a single value is also straightforward:
> days > 30
January February March April
   TRUE    FALSE  TRUE FALSE
i.e. every element of the vector is compared to the value, and the result of
this comparison is returned in the corresponding element of the logical vector
produced. Notice again how the element names are inherited from days.
4.3.1 Indexing
We can retrieve an element by referring to its position using the brackets []:
> a.month <- month[3]
which means we create a variable called a.month and assign to it the value of
element number 3 in month. We can also assign new values to an element in a
similar way:
> days[2] <- 29
resulting in months now being a vector of length 2, having the same values as
elements 2 and 3 of month. We should stop for a second at this example, because
it is important to realize exactly what we are doing here. In fact, the example
can be illuminated by replacing the statement above by
idx <- 2:3
months <- month[idx]
Add these two lines to the script, clear the workspace, and re-run.
We first create a vector called idx containing the values 2,3. Next, we use
this vector to index a subset of month, and create months to store a copy of this
subset. Let us refer to idx here as the index vector. The index vector should
typically only contain integers, since it will be used to refer to the elements in
another vector. It should also contain integers in the range from 1 to length(
species), i.e. no negative values, no zeros and no values above 4 in this case.
Apart from this, there are no restrictions. How about if we decided to make
> idx <- c(1,2,3,3,2,3,2,2,1)
Could we still use idx to extract elements from month? The answer is yes:
> month[idx]
[1] "January" "February" "March" "March" "February" "March" "February" "February" "January"
Notice that the index vector can be either shorter or longer than the vector
we retrieve from, as long as it only contains integers within the proper range.
Notice also that it is the length of the index vector that determines the length
of the resulting vector. If idx has 100 elements, the result here will also have
100 elements, regardless of how many elements month had.
It should now be apparent that any index vector should be of data type
integer, since it should only contain integers. In the last example above, we
created idx in a ’manual’ way, and if we look at its data type, it is a numeric,
not an integer. However, as soon as we use it as an index vector in month[idx]
it is converted to integer. Let us illustrate this by creating this strange index
vector:
> idx <- c(1,2.5,3,2)
> month[idx]
[1] "January" "February" "March" "February"
Notice how our index vector idx now contains a non-integer, which is silly really,
there is no element number 2.5. However, as soon as we use it as an index vector
it is converted to integer, and this is what happens:
> as.integer(idx)
[1] 1 2 3 2
An alternative to an integer index vector is a logical vector of TRUE and FALSE
values, e.g. lgic <- c(FALSE,TRUE,TRUE,FALSE). We can then use this as we used
the index vector, month[lgic]. Notice that in this
case the logical vector must have the same number of elements as the vector
we are indexing, in this case 4. If the logical vector is longer than the vector
we index, R will fill in NA in the extra positions of the resulting vector. If the
logical vector is shorter than the vector we index, R will ’circulate’ the logical
vector. We will look into this in more detail below.
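As a small sketch of these two cases (re-using the month vector from earlier; the logical vectors here are just made-up examples):

```r
month <- c("January", "February", "March", "April")

# Same length as month: the TRUE positions are extracted
month[c(FALSE, TRUE, TRUE, FALSE)]        # "February" "March"

# Longer than month: the extra TRUE position has no element, giving NA
month[c(TRUE, FALSE, TRUE, FALSE, TRUE)]  # "January" "March" NA
```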
Logical vectors are typically created by performing some kind of comparison.
Let us consider the numeric vector days from above. Add these lines to the
script:
is.long <- days > 30       # is.long will be TRUE for elements
                           # in days larger than 30
long.month <- month[is.long]
save, clear workspace and re-run. The variable long.month should now be a
vector of length 2, containing the texts "January" and "March".
The function which will produce an index vector from a vector of logicals.
Instead of using is.long directly for indexing, we could have done as follows
> idx <- which(is.long)
> idx
January   March
      1       3
Notice how which will return a vector with the index of all elements in is.long
being TRUE. We can now use idx as an index vector to produce the same result
as before (month[idx] instead of month[is.small]). It is in many ways easier to
read the code if we use which. The line which(is.long) is almost meaningful in
our own language: Which (month) is long? The answer is month number 1 and
3.
We can also create a logical vector shorter than month, e.g. lgic <- c(FALSE,TRUE),
and then use it to index month. This sounds difficult, since month has 4 elements
and lgic only 2. But, in this case R will re-use the elements in lgic enough
times to extend it to the proper length. Let us try:
> month[lgic]
[1] "February" "April"
and we see that elements 2 and 4 are extracted. Implicitly R creates the vector
c(lgic,lgic), having the proper 4 elements, and uses this as the logical vector.
The same applies when we use operators on vectors. Let us create two simple
numeric vectors, and compute their difference:
> a <- 1:5
> b <- 1:3
> a-b
[1] 0 0 0 3 3
Warning message :
In a - b : longer object length is not a multiple of shorter
object length
4.4 Matrices
4.4.1 Creating a matrix
The other basic data structure in R is a matrix. It is a two-dimensional data
structure, and we may think of a matrix as a vector of vectors. We create a
matrix from a vector like this:
> avec <- 1:10
> amat <- matrix(avec, nrow=5, ncol=2, byrow=TRUE)
> amat
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10
Since our vector avec has 10 elements we need not specify both nrow and ncol;
if we just state that nrow=5, R will understand that ncol must be 2 (and vice
versa). The argument byrow should be a logical indicating if we should fill in
the matrix row-by-row or column-by-column:
> amat <- matrix(avec, nrow=5, byrow=FALSE)
> amat
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
We can bind together matrices using the functions rbind and cbind. If we
have two matrices, A and B, with the same number of columns, say a 10 × 4 and
a 5 × 4, we can use rbind(A,B) to produce a third matrix of dimensions 15 × 4.
The function rbind binds the rows of the input matrices, i.e. puts them on top
of each other, and they need to have the same number of columns. Conversely,
cbind binds the columns, putting the matrices next to each other, and they need
to have the same number of rows. Here is an example:
> A <- matrix ( " A " , nrow =3 , ncol =2)
> B <- matrix ( " B " , nrow =3 , ncol =4)
> cbind (A , B )
[ ,1] [ ,2] [ ,3] [ ,4] [ ,5] [ ,6]
[1 ,] " A " " A " " B " " B " " B " " B "
[2 ,] " A " " A " " B " " B " " B " " B "
[3 ,] " A " " A " " B " " B " " B " " B "
> rbind (A , B )
Error in rbind (A , B ) :
number of columns of matrices must match ( see arg 2)
The dimensions of a matrix can be retrieved by the function dim. It always
returns a vector with 2 elements, the number of rows and the number of
columns.
Instead of just names as we had for a vector, a matrix has colnames and
rownames. Again, such names are just labels we may put on columns/rows if we
like.
The class function works just as before. Notice that all elements in a matrix
must be of the same data type, just as for vectors. All basic data types can be
stored in matrices, not just numbers, but in real life matrices usually contain
numbers.
We refer to an element in a matrix by a row and a column index, e.g. amat[2,1]
is the element in row 2, column 1, while amat[2,] gives us all of row 2. The
latter is an example of how we can refer to an entire row: We specify the row
number, but leave the column number unspecified. This means 'all columns'
and we get the entire row. The same applies to columns, try amat[,1] and you
should get the entire column number 1.
It is fruitful to think of a matrix as a vector of vectors, and realize that
indexing is similar, only in two dimensions instead of one. It is in fact possible
to refer to an element in a matrix by a single index only, just as if it was a vector.
If we retrieve element number 7 in our matrix amat, R will count elements in
the first column (5 elements) and then proceed in the second column until the
index has been reached. Since amat has 5 rows and 2 columns we reach 7 at
element [2,2]. This illustrates how ’vector-like’ a matrix really is.
Again things are very similar to vectors. The rule is that operators work element
by element. We can add/subtract/multiply/divide by scalars and other matrices
of similar dimensions. Basic mathematical functions like sqrt or log will also
work elementwise.
Matrices can be transposed, i.e. flipping rows and columns, and the function
for this is t:
> t(amat)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
The data structure matrix can be seen as a table, but this is not really a proper
description. We will see tables in later chapters, and there is a distinction, e.g.
a matrix cannot contain a mixture of data types (a table can).
A matrix can also be seen as a mathematical object, and in this case it must
be filled with numbers to make sense. We have functions for computing inner
products, determinants and eigenvalues etc. in R, and in such cases a matrix is
clearly a mathematical object.
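A small sketch of the matrix as a mathematical object; the numbers below are arbitrary, but %*% (matrix product), det and eigen are standard R functions:

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)  # a 2x2 symmetric matrix
b <- c(1, 2)

A %*% b          # matrix product (%*% rather than the elementwise *)
det(A)           # determinant: 2*3 - 1*1 = 5
eigen(A)$values  # the two eigenvalues of A
```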
In this text we will most often see matrices as mathematical objects, and
rarely use them as ’containers’ for storing data. Vectors we meet all the time,
matrices only occasionally.
4.6 Exercises
4.6.1 Sequential vectors
Create a vector called x that takes the values from 0.0 to 10.0 with steps of 0.1
(use the seq function). Then compute the sine (sinus) of x using the function
sin (see ?sin) and store this in y. Plot y against x using the basic plotting
function plot. Try plot(x,y,type="l",col="blue"). Read about plot.
Extend the script by also reversing the order of the elements of y. Use an
index vector to do this. Make a new plot with y in reversed order.
Extend the script by making new variables x3 and y3 that contain every
third element of x and y, respectively. Then make the plot above, but add the
line points(x3,y3,pch=16,col="red"). Read about points.
Type monthdata in the console and press return, or click the name monthdata
in the Workspace window to open a display of the data.frame in RStudio. It
looks like a matrix, but notice how we can have texts in the first column and
numbers in the second. This was not possible in a matrix. In a data.frame
all elements in the same column must be of the same data type, but different
columns can have different data types. This column-orientation reflects the data
table convention used in all statistical literature: The columns of the data table
are the variables, and rows are the samples. Note that it is not required to
have a line-break between the column-specifications as above, it is just added
for convenience here.
We gave names to the columns directly in the above example. Names on
the columns are not required, but is of course convenient, and we should make
efforts to always have informative names on the columns of a data.frame. The
function colnames that we used for matrices works just as well on data.frames,
in case you would like to add/change names on the columns. The same applies
to rownames, but quite often the rows are just numbered and any description
of the samples is included as separate columns inside the data.frame.
We can also add new rows or columns to an existing data.frame, here we
add a row:
Notice how we create a new data.frame on the fly here, having a single row, but
the same columns as the old monthdata, and then use rbind to ’glue’ it below
the existing monthdata rows, producing a new, extended, data.frame.
New columns can be added in a similar way, but perhaps more common is
this approach: We want to add a column indicating the season to our existing
monthdata, and do it simply by
# Adding a new column
monthdata$Season <- c("Winter","Winter","Spring","Spring","Spring")
Here we create a character vector using the c function again, and assign this to
the column named Season in monthdata. Since this column does not exist, it is
created on the fly as an extra column. This is a short and simple way of adding
a column and giving it a name at the same time.
Consider the statement monthmat <- as.matrix(monthdata[,c(1,3)]). Here we
extracted the text columns 1 and 3 of monthdata, and converted them to a
matrix (verify that monthmat is a matrix). Notice how the column names
were inherited. The statement monthdata[,c(1,3)] produces a new data.frame
consisting of the two specified columns:
> class(monthdata[,c(1,3)])
[1] "data.frame"
Notice that when we retrieve a single column, the result is a vector (see the
example with w from above), but if we retrieve two or more columns it is still a
data.frame. Can you imagine why?
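A small sketch of this 'dropping' behaviour, using a made-up data.frame; the extra argument drop=FALSE is the standard way to keep the data.frame structure even for a single column:

```r
df <- data.frame(Name = c("January", "February"),
                 Days = c(31, 28),
                 stringsAsFactors = FALSE)

class(df[, 2])                # "numeric" -- one column drops to a vector
class(df[, c(1, 2)])          # "data.frame" -- several columns stay a data.frame
class(df[, 1, drop = FALSE])  # "data.frame" -- drop=FALSE prevents the dropping
```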
We can also convert a matrix to a data.frame, which is always straightfor-
ward:
> atable <- as.data.frame(monthmat)
> class(atable)
[1] "data.frame"
Since R has been made by statisticians, they have decided that text entered in
a data table should probably be used as a factor in the end, and the default
behavior of the function data.frame is to convert all texts to factors. Personally,
I rate this as a design failure. Text should be text, and we should convert it
to a factor when we need to. However, it is possible to override the default
behavior. We can either add the argument stringsAsFactors=FALSE to the
data.frame command during creation, or we can state
options(stringsAsFactors=FALSE)
in the console or at the top of our script, and the default conversion to factors is
turned off for the entire R session. Note the capital letters in stringsAsFactors!
Put the above statement at the top of the script, and re-run.
5.2 Lists
A list is the most general data structure we will look into here. Briefly, a list
is a linear structure, not unlike a vector, but in each element we can store any
other data structure.
We create a list like this:
> a.list <- list(2, "Hello", c(1,1,1,2,3), list("STIN300",5,TRUE))
which is very similar to how we created vectors, except that we use list instead
of c before the comma-separated listing of elements.
Here we created a list with 4 elements. The first element contains the numeric
2, the second the character "Hello", the third a numeric vector and the fourth
element contains another list. Thus, any list element can contain any data
structure irrespective of the other list elements. This makes lists extremely
flexible containers of data. If we have several variables, of different types, that
in one way or another belong together, we can bundle them all into the same
list.
The list elements can be given names, just like the columns of a data.frame
(actually, a data.frame is just a special list). Let us specify names:
> a.list <- list(Value = 2,                         # a single number
                 Word = "Hello",                    # a text
                 Vector = c(1,1,1,2,3),             # a vector
                 List2 = list("STIN300", 5, TRUE))  # another list
where the 4 list elements now have the names Value, Word, Vector and List2.
Just as for data.frames we should always try to use informative names on list
elements. We are free to choose names as we like, the convention of having a
capital first letter is just a personal habit I have inherited from Java program-
ming.
We retrieve the content of a single list element by double brackets, e.g.
a.list[[2]] returns the text "Hello", while single brackets, a.list[2], return a
sub-list containing that element, i.e. we must use the double brackets. The
distinction between a list element (which is itself a single-element list) and its
content is important to be aware of.
A list cannot take an index vector in the same elegant way as vectors. We
can refer to a sub-list of the three first elements by a.list[1:3], but we cannot
write
> a.list[[1:3]]
Error in a.list[[1:3]] : recursive indexing failed at level 2
If we want to retrieve the contents of all list elements, we simply have to loop
through the list and retrieve them one by one. This means lists are less suited
for doing heavy computations, their use is as containers of data.
If the list elements have names we refer to the content of a list element
directly by using its name:
> a.list$Word
[1] "Hello"
The nice thing about using the name is that we do not need to keep track of
the numbering of the elements. As long as we know there is an element named
Word, we can retrieve it no matter where in the list it occurs.
5.3 Arrays
An array is simply an extension of a matrix to more than 2 dimensions. We can
see it as a vector of vectors of vectors... We can create a 3-dimensional array
like this:
> a.box <- array(1:30, dim=c(2,3,5))
If you type a.box in the console window you will see it is displayed as 5 matrices,
each having dimensions 2 × 3.
Arrays are very similar to matrices, just having more indices. We refer to
the elements just like for matrices, e.g. a.box[2,2,4] refers to the element in
row 2, column 2 of matrix 4 in the array. We don’t use arrays very often, but
some functions may produce them as output.
5.5 Exercises
5.5.1 Lists
Make a list containing the month-data from the text. Each list element should
contain the data for one month. Use only the Name and Days data. The list
should have length 5. Add the following single-element list to this list:
list(Name="June",Days=30). Verify that the extended list has 6 elements.
Make another list where the first element has a vector containing the number
3 three times, the second element has the number 2 two times and the last
element has the number 1 once. Use the function unlist on this list. What
happens? Try unlist on the month-data list as well.
1. Create a logical vector of the same length as Month and with TRUE on all
January days.
3. The output from which is the index vector to specify the rows.
Compute the mean of every column of the matrix, using colMeans (read its Help
file).
Save the script under the name script_bears.R (or choose your own file name).
Clear the memory and run the script. In the Workspace window of RStudio
you should now see the beardata variable, which is a data.frame containing 24
observations (samples) of 11 variables. If you type beardata in the console you
get the data set listed. Alternatively, you can click on the beardata variable in
the Workspace window, and RStudio will display the data in a nicer way in the
editor.
The first and basic input to read.table is the name of the file you want to
read. If this file is in a different folder than your R session you need to specify
the entire path of that file, e.g. if the file was in a subfolder called data we would
have to enter "data/bears.txt" as the first input. Notice that the file separator
is / in R (just like in UNIX), not \, even if you run R under Windows.
Additional inputs to read.table are optional; see ?read.table for a full
description. The argument sep indicates how the columns are separated in the
file. Here it is just a single blank, but other separators may be used. If the
columns are tab-separated we must specify sep="\t" which is the symbol for a
tab. The logical header indicates if the first row should be seen as headers, i.e.
column names.
If you have specified options(stringsAsFactors=FALSE) before you read the data,
you don't need this argument here. The read.table function also has other
ways of overriding the default conversion of texts to factors. The argument
as.is can be used to specify which columns not to convert to factors, while
colClasses can be used to specify the data type of every column. Read about
them in the help files.
We need to first specify the data.frame to output, then the file to put it into.
After that we have a number of options we may use if we like. Here we specify
that columns should be tab-separated (sep="\t") and that no row numbers
should be output. The write.table is in many ways quite similar (inverse) to
read.table, see help files for details.
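The statement described above could look something like this; the data.frame and file name are made up here so the sketch is self-contained (a temporary file stands in for a real output file):

```r
df <- data.frame(Name = c("January", "February"), Days = c(31, 28))
fname <- tempfile(fileext = ".txt")  # stand-in for e.g. "months.txt"

# Tab-separated columns, and no row numbers in the output file
write.table(df, file = fname, sep = "\t", row.names = FALSE)

# Reading it back should give the same table
df2 <- read.table(fname, sep = "\t", header = TRUE)
```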
Notice that we first list all variables we want to save, and then finally file= and
the name of the file to create. It is a convention that these types of files should
have the .RData extension.
Files saved from R like this are binary files and can only be read by R. If
you want to send data to other people in such files, you must make certain they
have R and know how to read such files.
Reading these files is very easy:
load ( " dataset1 . RData " )
6.3.1 Connections
A connection can be thought of as a pipe that R sets up between its memory and
some data source, typically a file or a URL on the internet. There are several
functions for creating such connections, see ?connections. Once a connection
has been set up, we open it, read or write data, and finally close it:
> conn <- file("influenza.gbk", open="rt")
> # here we can do some reading...
> close(conn)
In the first line we create a connection using the function file. This function
opens a connection to a file (surprise, surprise...). The first argument is therefore
the file name, in this case "influenza.gbk". The second argument opens the
connection. This argument takes a text describing the mode of the connection.
The string "rt" means read text, i.e. this pipe can only be used for reading (not
writing) and it will only read text. There are several other functions besides
file that creates and opens connections, but we will only focus on file and url
here.
We can then read all lines of the file through the connection by
lines <- readLines(conn)
This should result in the entire file being read, line by line, into the variable
lines. This variable is now a character vector, with one element for each line
of text. Retrieving the relevant information from these text lines is a problem
we will look at later, when we talk more specifically about handling texts.
Instead of reading from a file, we can also read directly from the internet.
The following code retrieves the document from the Nucleotide database at
NCBI for the same file that we read above:
co <- url("http://www.ncbi.nlm.nih.gov/nuccore/CY134463.1", open="rt")
lines <- readLines(co)
close(co)
In this case the variable lines contains all the code needed to format the
web-page, in addition to the text we saw in the file.
6.5 Exercises
6.5.1 Formatted text
Download the file bears.txt from Fronter and save it. Make a script that reads
this file into R. Note that if you save the file in a directory different from your
working directory (quite common) you need to specify the path in the file name,
e.g. read.table("data/bears.txt") if the file is in the subdirectory data.
Use the function summary to get an overview of this data set. Read the
Help-file for summary. When we are looking at a new data set, we usually want
to do some plotting before we proceed with more specific analyses. We will
look into plotting in more detail later, but we will take a 'sneak peek' at some
possibilities right away.
The first 8 columns of the data.frame you get by reading bears.txt are
numerical observations from a set of 24 bears. We can quickly make pairwise
plots of each of these variables against every other variable. Do this by
plot(beardata[,1:8]) (here I have assumed beardata is the variable name).
Which columns correlate?
Chapter 7
Control structures
7.1 Loops
Fundamental to all programming is the ability to make programs repeat some
tasks under varying conditions, known as looping. R has these basic facilities
as well. The short words written in blue below are known as keywords in the
language, and we are not allowed to use these words as variable names. They
usually appear in blue font in RStudio, hence the coloring here.
Consider a loop like
for(i in 1:length(alist)){
  # some code to repeat
}
Here we specify that the variable i should take on the values 1,2,...,length(alist),
and that for each of these values the code between { and } should be
repeated. The keywords here are for and in.
The ’skeleton’ of a for loop looks like this: for(var in vec){}. The variable
named var here is the looping variable. This variable is created by the for
statement, and takes on new values for each iteration. We often use very short
names for the counting variable, e.g. i, j etc., but any name could be used.
The values it will take are specified in the vector vec. The length of the vector
vec determines the number of iterations the loop will make. Any vector can be
used here, but most often we construct a vector on-the-fly just as we did in the
example above (remember 1:length(alist) will create a systematic vector).
Notice how, already before we start the loop, we know there will be a need
for exactly length(alist) iterations. The for loop is the most common type of
loop in R programs.
Consider this example:
x <- 1
while(is.finite(x)){
  x <- x*2
  cat(x, "\n")
}
Here we give the value 1 to the variable x. Then, we start a loop where we inside
the loop let the new value of x be the old value times 2. We also print x in
the console window (the cat function). The while loop has in the first line
a test. After the keyword while we have some parentheses, and inside them is
something that produces a logical, TRUE or FALSE. As long as we produce a TRUE
the loop goes on. Once we see a FALSE it halts. The function is.finite takes a
numeric as input and outputs a logical TRUE if the input is finite, or FALSE if it
is infinite. Run these lines of code, it gives you an impression of what R thinks
is 'infinite'...
Notice that in the test of the while above we have to involve something that
changes inside the loop. Here the testing involves x, and x changes value at each
iteration of the loop. If we wrote y <- x*2 instead of x <- x*2 inside the loop,
it would go on forever because x no longer changes value! This is a common
error when people make use of while loops. Such never-ending loops can be
terminated by Session-Interrupt R in the RStudio file menu.
Vectorized code
Create two numeric vectors, a and b, each having 100 000 elements. Then
compute c as the sum of a and b. We have learnt that as long as vectors have
identical length, we can compute this as
c <- a + b
or by an explicit loop:
for(i in 1:length(c)){
  c[i] <- a[i] + b[i]
}
On my computer the latter took 390 times longer to compute! The concept of
vectorized code simply means we use operations directly on vectors (or other
data structures) as far as possible, and try to avoid making loops traversing
through the data structure. This is often possible, but not always. If you want
to do something to the contents of a list you have to loop, there is no way to
’vectorize’ a list.
Functions like sapply and lapply, which we will meet later, will do exactly the
same looping, but faster, since the 'hard labour' of the looping is done by some
compiled code inside these functions. The code is also much shorter.
However, sometimes we deliberately use explicit loops also in R programs.
Programs are usually easier to read for non-experts when they contain loops.
The gain in speed is sometimes so small that we are willing to sacrifice this for
more readable code. It should also be said that for smaller problems, where
loops run over only a smallish number of iterations, the speed consideration can
be ignored altogether. Avoiding loops is most important when we build
functions, because these may themselves be used inside other loops, and the
speed issue becomes a more real problem.
7.2 Conditionals
As soon as we have loops, we will need some conditionals. Conditionals mean
we can let our program behave differently depending on what has happened
previously in the program. Consider this example:
for(i in 1:10){
  cat("i = ", i, sep="")
  if(i > 5){
    cat(" which is larger than 5!")
  }
  cat("\n")
}
The loop runs with i taking the values 1,2,...,10. Inside the loop we have
an if statement. There is a test, if(i>5), where the content of the parentheses
must produce a logical. If this logical has the value TRUE, the content of the
statement (between { and }) will be executed. If it is FALSE the code between
{ and } will be skipped. Note that all code outside the if statement is always
executed, independent of the test. Only the code between the curly braces is
affected.
Most if statements are seen inside loops, where the execution of the loop
depends on some status that changes as the loop goes on. But, we can put if
statements anywhere we like in our programs.
An if statement can be extended with an else branch:
if(test){
  # code here
} else {
  # code here
}
The only difference is that this statement has two sets of braces. The code
inside the first set is executed if the test is TRUE, and the other if it is FALSE.
It is possible to extend the else branch with another if, like this:
if(test1){
  # code here
} else if(test2){
  # code here
} else {
  # code here
}
and this can be extended in as many cases as you need. Often a branching like
this can be better handled by a switch statement, see below. Note the space
between else and the following if, there is no keyword called elseif in R!
The loop has a text n as the looping variable. This takes on the different values
in names, hence the loop runs 5 iterations. Inside the loop we assign values to
the variable age. The EXPR of the switch is a text, one of the names. The three
following code-lines are each identified by a label (e.g. Lars), and switch will
try to match the text in n to one of these labels. If a specific match is found, the
corresponding line of code is executed. Note that these code-lines do nothing
but assign a value to the variable age. If no match is found, the last line of
code, with no label, is the one to be executed. This is the default choice when
no specific match is found.
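The listing discussed above is not shown here; a minimal sketch of such a loop could look like this, where the names and ages are made-up stand-ins:

```r
names <- c("Lars", "Kari", "Per", "Lars", "Eva")  # hypothetical names
for (n in names) {
  age <- switch(n,
                Lars = 52,  # executed when n matches the label Lars
                Kari = 30,
                Per  = 41,
                NA)         # the unlabelled line is the default choice
  cat(n, "has age", age, "\n")
}
```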
Consider a conditional assignment like
if(test){
  var1 <- x
} else {
  var1 <- y
}
After this statement the variable var1 has either value x or y. Since this
type of conditional assignment occurs quite frequently in programs, there is a
short version for writing this kind of statement in R:
var1 <- ifelse(test, x, y)
It means var1 is assigned the value x if the test comes out as TRUE and y if it
comes out as FALSE. Notice that the function name is ifelse without any space
inside.
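Note that ifelse is vectorized: if the test is a logical vector, the choice between x and y is made elementwise. A small sketch:

```r
x <- c(-2, 5, 0, 7)
ifelse(x > 0, "positive", "not positive")
# "not positive" "positive" "not positive" "positive"
```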
The comparisons in the last two lines are straightforward since a, b and c all
have 5 elements, and the result is in each case a logical vector of 5 elements.
Make a script and run this example.
The logical operators are operators used on logical variables. The three
operators we will look into here are &, | and !. The first operator is called
AND, and combines two logical variables into a new logical variable. If we
consider the vectors from the example above we can write
lgic3 <- lgic1 & lgic2
and the vector lgic3 will now have 5 elements where any element is TRUE if and
only if the corresponding elements in lgic1 and lgic2 are both TRUE. In all other
cases it will have the value FALSE. You can combine as many logical variables
as you like this way (with & between each) and the result will be TRUE only for
elements where all input variables are TRUE.
The second operator is OR, and
lgic3 <- lgic1 | lgic2
will now result in lgic3 being TRUE if either lgic1 or lgic2 is TRUE. Again you
can combine many variables in this way.
The last operator is NOT. Unlike the other two, this is a unary operator (the
others are binary), meaning it involves only one variable, and !lgic1 means that
we flip TRUE and FALSE, i.e. any elements originally TRUE become FALSE and vice
versa.
We can combine the operators into making more complex logical expressions.
What does lgic3 look like now:
lgic3 <- ( lgic1 |! lgic2 ) & lgic2
As usual, expressions inside parentheses are evaluated first. Sort it out in your
head before you run the code and see the answer.
Logical operators are very common, and you should make certain you un-
derstand how they operate.
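A small sketch with two made-up logical vectors (standing in for lgic1 and lgic2 from the example above):

```r
lgic1 <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
lgic2 <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

lgic1 & lgic2            # TRUE only where both are TRUE
lgic1 | lgic2            # TRUE where at least one is TRUE
!lgic1                   # flips TRUE and FALSE
(lgic1 | !lgic2) & lgic2 # parentheses are evaluated first
```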
The keyword break can be used to stop a loop if some condition occurs that
makes continued looping senseless. Then the only reasonable solution may
be to stop the looping, perhaps issue some warning, and proceed with the rest
of the program. This can be done by the break statement:
# some code up here...
for(i in 1:N){
  # more code here...
  if(test){
    break
  }
  # more code here...
}
# some code down here...
Here we see that inside the loop is an if statement, and if the test becomes
TRUE we perform a break. This results in the loop terminating, and the program
continues with the code after the loop.
We do not use break very often. In most cases where we might think of using
a break, it is possible to use a while loop instead.
7.4.2 Errors
If we want to issue a warning to the console window, we can use the function
warning taking the text to display as input. Notice that a warning is just a
warning: It does not stop the program from continuing.
If a critical situation occurs, and we have to terminate the program, we use
the function stop. This function also takes a text as input, the text we feel is
proper to display to explain why the program stops. We could use stop instead
of break in the example above. This would result in not only stopping the loop,
but stopping the entire program from running.
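A small sketch of the difference; the function and its messages are made up for illustration:

```r
checkDays <- function(days) {
  if (any(days > 31)) {
    stop("impossible number of days")    # terminates the program
  }
  if (any(days < 28)) {
    warning("suspiciously short month")  # only a message, execution continues
  }
  return(days)
}
```

Here checkDays(c(31, 40)) halts with an error, while checkDays(c(27, 30)) issues a warning and still returns the vector.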
For those familiar with other programming languages, exception handling
might be a familiar concept. This is a way to deal with errors in a better
way than just giving up and terminating the program. In R we can also handle
exceptions, but this is beyond the scope of this text.
7.6 Exercises
Simulation is much used in statistics, and it can be a nice way to illustrate
and analyze certain problems. There is a range of methods called Monte Carlo
methods in statistics, and they all rely on simulation of some kind. In order to
simulate we need 1) a way to sample random numbers, and 2) looping to repeat
this many many times. We will have a first look at stochastic simulation in the
following exercises.
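Both ingredients are available in base R; here is a minimal sketch (the numbers are arbitrary):

```r
u <- runif(10)                               # 10 random numbers from (0,1)
die <- sample(1:6, size = 30, replace = TRUE)  # 30 rolls of a fair die
table(die)                                   # count the outcomes
```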
store in the umeans vector, one by one. After the loop you make a histogram of
umeans. Does it look like a normal distribution?
To compare our observations to the fair die we compute the likelihood ratio:
λ = Σ_{i=1}^{6} y_i log(ρ_i / ρ̂_i)
Notice that if all yi are equal then ρ̂i = 1/6 = ρi and λ = 0. The more λ
deviates from 0 the more likely it is that the die is dishonest. But how much
can it deviate from 0 and still be a fair die?
According to the theory −2λ should have an approximate χ2 distribution
(chi-square) with 6−1 degrees of freedom. We will investigate this by simulation.
In R, roll a fair die 30 times, compute the vector y (use table) and compute
λ. Repeat this 1000 times, and store the λ values from each round. Make a
histogram of −2λ and compare to the χ2 density (use dchisq). (Tip: Use the
option freq=FALSE in hist).
We can now make probabilistic statements about the deviation that λ has
from 0. We want a fixed value for λ that indicates a dishonest die, and accept
that a fair die will have a 5% probability of also crossing this limit. Which value
of λ is this? Hint: Use qchisq.
Chapter 8
Building functions
having any idea of what it does? Next, we need to know what type of input
it takes. We may categorize input into two levels: essential input and options.
The essential input must always be provided; without it the function will not
work properly, or at all. The options are inputs we may give, but they are not
absolutely necessary. They can be used to change the behavior of this machine.
Finally, we need to know what the output of the function is. An example is the
function plot that we have used many times already. If we write
plot(x, type="l", col="blue"), the first input, x, is essential. This is the
data to plot; without it the function call has no meaning. The last two inputs,
type="l" and col="blue", are options. The function would work without them,
but we may provide them to change the function's behavior. In the help files
for R functions we find a system:
Description
First we have a short Description of what the function does. Sometimes this
is too short to really give us the whole picture, but it gives you an impression
of what this machine does.
Usage
Next, we find Usage. This is what we refer to as the function call, i.e. how
you call upon this function to do a job. It gives you a skeleton for how to
use this function, what to type in order to invoke it. Often only some of the
possible inputs are shown in this call, and ... (three dots) indicates there are
other possibilities.
Arguments
Next we usually find Arguments. This lists all possible inputs. Arguments is
just another word for input here.
Value
Next we may find Value, which is a description of the output from the function.
Details
There may also be a section called Details. Here you find a more detailed
description of how the function works.
Other sections
There may be other sections as well, depending on what the maker of the func-
tion found necessary. At the end of the help files you always find some example
code.
• Re-use of code. Once we have written a function we can use and re-use
this in all other programs without re-writing it.
• Hiding code. When we use a function we do not need to look inside it.
The function itself is often stored away on a separate file. This means
programs get much shorter.
• Saving memory. All variables and data structures we make use of inside a
function exist only for the short time the function is running,
and once it is finished they are automatically deleted. Only the output
from the function remains available to us.
We start out by specifying the function name in the first line. This is always
followed by the assignment operator <- and the keyword function. Then we
have the ordinary parentheses in which we specify the names of the arguments.
In this example we have only one argument. This first line is the signature of
the function. After this we have the curly braces indicating the start and end
of the function body. Inside the body we can write whatever code we like, but
the function should always end with the keyword return and, in parentheses,
the name of the variable to use as output from this function.
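The example function discussed here is not reproduced above; based on the descriptions in this chapter it might look like the following sketch (the body, sampling with sample and sorting the result into d.sorted, is an assumption consistent with how roll.dice is described later):

```r
roll.dice <- function(n.dice) {
  d <- sample(1:6, size = n.dice, replace = TRUE)  # roll n.dice dice
  d.sorted <- sort(d)                              # sort the outcomes
  return(d.sorted)
}
```

Calling roll.dice(5) should then return five sorted integers between 1 and 6.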
A function in R can only return one single variable; hence we can only name one
variable in the parentheses after return. If you want to return many different
variables as output, you have to put them into a list, and then return the list.
Notice that it is what we specify in the return statement that is used as output.
Any code you (for some silly reason) enter after the return statement will never
be executed; the function terminates its job as soon as it reaches the return.
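A sketch of this, with an invented function returning three results bundled in a list:

```r
# Hypothetical function returning several outputs in one list
vec.summary <- function(v) {
  res <- list(mean = mean(v), sd = sd(v), n = length(v))
  return(res)
}

out <- vec.summary(c(1, 2, 3, 4))
out$n       # the list elements are retrieved by name
```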
which means we have no essential input, only one option. If we call this function
without specifying n.dice it will take the value 1, but if we provide an input
value this will override the default value. We can call it without any input, like
this: roll.dice(). You will see there are many functions in R that take no
input. This means they can do their job without any input. Try for instance
the function Sys.time() in the console window.
in the console window, and the function should return, at random, a single
integer from 1 to 6.
Notice that no function can be called until it has been sourced into memory.
Also, if you make changes to your function, you must remember to save the file
and then re-source it into memory; it is only the last sourced version that is
used. A very common beginner mistake is to edit a function but forget to save
and re-source it before trying it out again.
When we call a function we have to provide it with values for the essential
arguments. In our roll.dice function there are no essential arguments, and we
could get away by calling it without any input. If we want to roll five dice we
have to specify
> roll.dice(5)
indicating that the argument n.dice should take the value 5. Sometimes we also
write
> roll.dice(n.dice=5)
which is identical to the previous call. The difference is that in the latter case
we specified that it is the input argument n.dice that should take the value
5. For this function this was not necessary, since there is only one argument.
But, most functions have multiple arguments, and specifying the name of the
argument is a good idea to make certain there is no confusion about which
argument should take which value. We have seen this when we used the plot
function. We wrote something like plot(vec,type="l",col="red"), where the
second and third argument is specified by their name.
If you do not specify the names of the arguments in the call, R expects
the values you give as input to come in the exact same order as the arguments
were defined in the function. This makes it cumbersome for functions with
many input arguments; one mistake leads to error. However, if you name each
input argument, their order does not matter at all! Naming the arguments also
allows you to leave many optional arguments unspecified. Try the following uses
of plot:
> plot(x=1:5, y=c(1,2,3,2,1), pch=16, cex=3, col="red")
> plot(x=1:5, y=c(1,2,3,2,1), col="red", cex=3, pch=16)
Notice they both produce exactly the same plot, even if some of the arguments
are in a different order in the second call. As long as we name them there is no
confusion. The conclusion is that it is a good idea to always name arguments
in function calls.
functions in the same file, but some prefer to have one function per file, and
even name the file after the function. This is your choice.
I have a habit of naming all my scripts along the pattern script_blablabla.R.
Files containing functions lack the prefix script_. In this way I can quickly
see which files in my folder contain functions and which are scripts. I
prefer to have several related functions in the same file, simply to have fewer
files.
Typically, a script makes use of several functions. If I make a script about
Yatzy-simulation I would start the script-file like this:
source("yatzyfun.R")
and source it. Upon completion you will find that the variables number.of.dice
and roll7 exist in the workspace, but not the local variables inside the roll.dice
function. This is what happens:
First we create number.of.dice and give it the value 7. Next, we call the
function roll.dice with number.of.dice as argument. This means the value
of our variable number.of.dice is copied into the function and assigned to the
argument n.dice. Thus, when the function starts to execute its body code lines,
the variable n.dice exists and has the same value as number.of.dice. Note, these
are two different variables, we just copied the value from one to the other.
Inside the function the job is completed, and the variable d.sorted has some
values. Then, as the function terminates, the value of d.sorted is copied to the
workspace variable roll7. By the line
roll7 <- roll.dice(number.of.dice) we specify that the variable roll7 should be
created, and assigned the value that is output from the call roll.dice(number.of.dice).
It is imperative that you understand the copying going on here. The value of
number.of.dice is copied into the function, and the value of d.sorted is copied
out of the function.
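Putting the pieces together, the script described above amounts to something like this sketch (roll.dice is re-defined here, with an assumed body, so that the example is self-contained):

```r
roll.dice <- function(n.dice = 1) {      # assumed body, as sketched earlier
  d.sorted <- sort(sample(1:6, size = n.dice, replace = TRUE))
  return(d.sorted)
}

number.of.dice <- 7                      # workspace variable
roll7 <- roll.dice(number.of.dice)       # value copied into n.dice,
                                         # value of d.sorted copied back out
print(roll7)
```

After running this, roll7 exists in the workspace, but the function-local n.dice and d.sorted do not.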
Why is it designed like this? Why can the variables inside the function not
be visible in the workspace, just like the variables we create in our scripts? We
will not dwell on this issue here, but a short answer is to save space and to make
functions easier to use. Remember a function is a machine, like a car.
It is much more convenient to use a car if all the details are hidden away, and
the interface we have to relate to is simple. All machines are designed like this,
even virtual machines.
8.7 Exercises
Let us expand on the file yatzyfun.R and create some more functions relating
to the Yatzy game. In a regular game of Yatzy you always roll the five dice up
to three times in each round. The first part of the game consists of six rounds
where you collect as many as possible of the values 1,2,...,6. In the first round
you must collect the value 1, i.e. after each roll you hold all dice having the
value 1, and you roll the remaining dice. After three rolls you count the number
of dice with value 1, which is your score in this round. In the second round it
is similar but you must now collect the value 2. The score you get after three
rolls is the number of 2’s times 2. Similar applies to the values 3, 4, 5, and 6.
After these six rounds you sum up the scores so far, and if you have more than
62 points you get 50 extra bonus points.
times, and holds dice after each roll to get as many as possible of the target
value. The function should return the score you achieve on the round. You
should make use of the function roll.dice from the text.
Make a script that uses the functions to simulate the first six rounds of Yatzy
thousands of times, and compute the score to see if the total after six rounds
qualifies for the bonus. What is your estimate of the probability of getting the
bonus? What does the score distribution look like after this first part, before
you include the bonus points?
Chapter 9
Plotting
In this chapter we will look closer at how to visually display data and results,
which is an important part of statistics and data analysis in general. We will
introduce certain aspects of R graphics along the way. There is no way we can
dig into the depths of R graphics; there are simply too many possibilities.
We will restrict ourselves to the most common ways of plotting.
9.1 Scatterplots
A scatterplot is simply a plot of the values in one vector against those in an-
other vector. The vectors must be of the same length. The scatterplot shows
a two-dimensional coordinate system, and a marker symbol is displayed at each
coordinate pair. The commands we use to make scatterplots in R are the func-
tions plot and points that we have already seen. Let us make an example with
some random values from the normal distribution that shows some possibilities:
many.x <- rnorm(1000) * 2
many.y <- rnorm(1000) * 2
medium.x <- rnorm(30)
medium.y <- rnorm(30)
few.x <- rnorm(5) * 0.75
few.y <- rnorm(5) * 0.75
plot(x=many.x, y=many.y, pch=16, cex=0.5, col="gray70",
     xlab="The x-values", ylab="The y-values",
     main="A scatterplot")
points(x=medium.x, y=medium.y, pch=15, col="sandybrown")
points(x=few.x, y=few.y, pch=17, cex=2, col="steelblue")
A result of running this code is shown in Figure 9.1. Notice how we use both
plot and points. The reason is that each time you call plot any previous plot
is erased before the new one is displayed. Thus, if you want to add several plots
to the same window, you use plot the first time, and then add more by using
points. You cannot use points unless there is already an existing plot.
There are many options we can give to the plot and points functions. The
option pch takes an integer that sets the marker type. In the example above
the gray markers are round, the brown are squares and the blue are triangles
(pch values 16, 15 and 17, respectively).

[Figure 9.1: A scatterplot with three different markers and colors for three
groups of data.]

See ?points for the details on which marker corresponds to which integer. The
option cex scales the size of the marker. By default it is 1.0, i.e. no scaling.
The gray markers have been scaled down (cex=0.5) and the blue scaled up
(cex=2). The color is set by col; type colors() in the console window to get a
listing of all available named colors.
When we make the first plot there are some options we can set that cannot
be changed by later additional use of points. The ranges of the axes are two
such settings: the options xlim and ylim specify the outer limits of the two axes.
If these are not specified, they are chosen to fit the data given to plot. This is
fine, but if you want to add more plots you must make certain the axis limits
are wide enough to include later data as well. Before you start plotting multiple
plots you should always check for the smallest and largest x- and y-values that
you are going to display, and specify xlim and ylim accordingly. The function
range can be handy for checking this. Other options you set in the first plot are
the texts to put on the axes, i.e. xlab and ylab, and the title main.
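A sketch of this practice (the data are made up): compute the full ranges first, then hand them to plot so that the second group fits inside the axes:

```r
x1 <- rnorm(50);  y1 <- rnorm(50)
x2 <- rnorm(50, mean = 3);  y2 <- rnorm(50, mean = 3)
xr <- range(c(x1, x2))   # smallest and largest x-value over both groups
yr <- range(c(y1, y2))
plot(x=x1, y=y1, pch=16, col="gray50", xlim=xr, ylim=yr)
points(x=x2, y=y2, pch=17, col="steelblue")
```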
9.2 Lineplots
If the values on the x-axis (horizontal axis) are sampled from a continuous
variable, like time-points, it is more common to use lines than markers. It also
requires the x-data to be ordered. Here is an example of some plots of this type:
time <- 1:10
y1 <- rnorm(10, mean=time)
y2 <- y1 + 1
y3 <- y1 + 2
y4 <- y1 + 3
plot(x=time, y=y1, type="l", lwd=1.0, lty="solid",
     col="saddlebrown", xlim=range(time),
     ylim=range(c(y1, y2, y3, y4)), xlab="Time (days)",
     ylab="y values", main="A lineplot")
points(x=time, y=y2, type="l", lwd=2.0, lty="dashed",
       col="slategray3")
points(x=time, y=y3, type="l", lwd=1.5, lty="dotted",
       col="purple4")
points(x=time, y=y4, type="b", lwd=5, lty="longdash",
       col="seagreen")
The result is shown in Figure 9.2. The functions we use are the same as for the
scatterplot, but some of the options are different. We specify type="l" for line
or type="b" for both (both marker and line) as in the last code line. Instead of
cex we use lwd to specify line width, but the scaling principle is the same. The
option lty is used to specify the line type.
There are many more options you can give to plot and points, and a good
way to explore them is to explore the function par. This is used to set graphics
parameters, see ?par. Options to par can also be used as options to plot and
points.
9.3 Histograms
We have already seen histograms. A histogram takes a single vector of values
as input, splits the range of values into a set of intervals, and then counts the
number of occurrences in each interval. The function we use is called hist.
Here is an example:
v1 <- rnorm(1000)
hist(x=v1, breaks=30, col="lemonchiffon2",
     border="lemonchiffon4", main="A histogram")
The result is shown in Figure 9.3. The option breaks indicates how many intervals
we should divide the range into, i.e. how many bars we get in the plot. It is
just a suggestion; the hist function makes the final decision. If breaks is not
specified, R chooses a number of intervals automatically, based on the data.
The more data you give as input, the more intervals you should use. The col
option specifies the color of the interior of the bars, while border sets the color
of the bar borders. If you set these to the same color the bars will 'grow' into
each other. You give texts to the title and the axes as for plot.
[Figure 9.2: A lineplot of the four series y1-y4 against time in days, with
different line widths, types and colors.]
9.4 Barplot
A barplot is somewhat similar to the lineplot in Figure 9.2, but we use bars if
the x-axis is discrete (not continuous). It is different from a histogram in the
sense that a histogram bar represents an interval, while a barplot bar represents
a single point on the x-axis. For this reason, the histogram bars have no 'air'
between them, while barplot bars should always have some space between each
bar. Let us look at an example:
parties <- c("R","SV","Ap","Sp","V","Krf","H","Frp","Other")
poll <- c(1.4,4.9,28.4,4.7,4.5,5.3,32.9,15.6,2.2)
barplot(height=poll, names.arg=parties, col="brown",
        xlab="Political parties", ylab="Percentage votes",
        main="A barplot")
The result is shown in Figure 9.4. The option names.arg should be a vector
of texts to display below each bar; thus it must be of the same length as the
vector of bar heights.
[Figure 9.3: A histogram of the 1000 values in v1.]
In Figure 9.5 we show the result, where the brighter bars show the distribution
of the fair die, and the darker bars of the unfair die, having larger probabilities
of getting a 1 or a 6.

[Figure 9.4: A barplot of the poll percentages for the political parties.]

We used the option beside=TRUE to put the bars beside
each other. Try to run this code with beside=FALSE to see the effect. We also
used the option horiz=TRUE to put the bars horizontally.
See Figure 9.6 for how it looks. Pie charts can be used for giving an overview
of proportions.

[Figure 9.5: A barplot with two sets of count data as bars beside (above!)
each other.]
9.6 Boxplots
A boxplot, or box-and-whisker plot, is a way to visualize a distribution. We
can have several distributions in the same boxplot. Let us sample data from
three different distributions: the standard normal distribution, the Student's
t-distribution with 3 degrees of freedom, and the uniform distribution on the
interval (0,1). Then we make boxplots of them.
[Figure 9.6: A pie chart of the poll numbers for the political parties.]
dist.samples <- list(Normal=rnorm(25), Student=rt(30, df=3),
                     Uniform=runif(20))
boxplot(dist.samples, main="A boxplot",
        col=c("bisque1","azure1","wheat1"))
In Figure 9.7 you can see the boxplot produced. For each group of data we have
a box. The horizontal line in the middle is the median observation. The height
of the box and the whiskers reaching out from the box describe the spread of the
data. Their exact meaning can be adjusted by options; see ?boxplot for details.
Finally, extreme observations are plotted as single markers. We see from the
current boxplot that the uniform distribution has the smallest spread, and is
also centered above zero. The normal distribution is centered close to zero,
and the same applies to the Student-t distribution. The latter has one extreme
observation.
[Figure 9.7: A boxplot of the three samples.]
9.7 Surfaces
We can also plot surfaces in 3D. A surface is a matrix of heights (the z-
coordinate) where each row and column corresponds to a location in two di-
rections (x-coordinate and y-coordinate). Here is an example:
vx <- 1:20
vy <- (1:30)/5
vz <- matrix(0, nrow=20, ncol=30)
for (i in 1:20) {
  for (j in 1:30) {
    vz[i,j] <- max(0, log10(vx[i]) + sin(vy[j]))
  }
}
persp(x=vx, y=vy, z=vz, theta=-60, phi=30, main="A surface",
      col="snow2")
The surface can be seen in Figure 9.8.

[Figure 9.8: A surface plotted by persp.]

The essential arguments to persp are the
data (x-, y- and z-coordinates, where the latter is a matrix). The two additional
options specified here control the angle from where we see the surface. Try
different values for theta and phi and see how you can see the surface from
various perspectives, or replace the line
persp(x=vx, y=vy, z=vz, theta=-60, phi=30) with
for (i in 1:100) {
  th <- i - 150
  ph <- (100 - i)/2
  persp(x=vx, y=vy, z=vz, theta=th, phi=ph)
  Sys.sleep(0.1)
}
9.8 Contourplots
Even if surfaces may look pretty, they are difficult to read, and a better way of
displaying them is often by a contourplot. The function filled.contour can be
used for this purpose, and here we use exactly the same data as in the surface
example to make such a plot:
vx <- 1:20
vy <- (1:30)/5
vz <- matrix(0, nrow=20, ncol=30)
for (i in 1:20) {
  for (j in 1:30) {
    vz[i,j] <- max(0, log10(vx[i]) + sin(vy[j]))
  }
}
filled.contour(x=vx, y=vy, z=vz, main="A contourplot")
The plot it produces is shown in Figure 9.9. A contourplot sees the surface
from ’above’ and divides it into levels of different colors, just like a map. Until
now we have just used named colors in our plots. This is no longer the case for
contourplots. It is time we take a closer look at colors.
See the colors in the barplot in Figure 9.10. If you inspect the vector cols you
will see it contains texts, and these texts are actually hexadecimal numbers that
R converts to colors. In the example above we used these color codes instead of
named colors as input to barplot. In all cases where we have used named colors
we could have used such color codes instead. Notice that it was our choice to
look at 80 colors in the example above; you could just as well have chosen
another integer.
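The code producing Figure 9.10 is not reproduced above; it may have looked something like this sketch (the choice of the built-in palette function rainbow is an assumption, the original may have used another palette):

```r
cols <- rainbow(80)                  # 80 color codes from a built-in palette
barplot(rep(1, 80), col=cols, border=cols)
head(cols)                           # texts with hexadecimal color codes
```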
The function filled.contour from above takes as one of its options a color
palette. The default is a palette called cm.colors, but let us try the palette
called terrain.colors:
[Figure 9.9: A contourplot is like a map, and usually a more informative
display than a 3D surface.]
vx <- 1:20
vy <- (1:30)/5
vz <- matrix(0, nrow=20, ncol=30)
for (i in 1:20) {
  for (j in 1:30) {
    vz[i,j] <- max(0, log10(vx[i]) + sin(vy[j]))
  }
}
filled.contour(x=vx, y=vy, z=vz, color.palette=terrain.colors,
               main="A contourplot with terrain.colors")
[Figure 9.10: The colors produced by a built-in palette. Each vertical bar has
a different color.]

[Figure 9.11: A contourplot can take a palette function as input, and display
the colors produced by the palette.]
base.cols <- c("blue4","blue2","blue","green","yellow",
               "orange","red4","brown")
my.colors <- colorRampPalette(base.cols)
filled.contour(x=vx, y=vy, z=vz, color.palette=my.colors,
               main="A contourplot with our own colors")
Here we assume the data for making the contourplot is still available in the
workspace, if not, run the previous examples first. The result is shown in Figure
9.12.
[Figure 9.12: A contourplot where we have used our own palette, see text for
details.]
the first value of mfrow is the number of rows, the second the number of columns.
If we want to make two plots beside each other, we divide the window
into one row and two columns, hence we set mfrow=c(1,2). Here is an example
where we divide into 2 rows and 3 columns:
par(mfrow=c(2,3))
plot(1:10, rnorm(10), pch=15, col="red4")
plot(rnorm(1000), rnorm(1000), pch=16, cex=0.5, col="steelblue")
plot(1:10, sin(1:10), type="l")
barplot(c(4,5,2,3,6,4,3))
hist(rt(1000, df=10), breaks=30, col="tan")
pie(1:5)
See Figure 9.13 for the result. Notice that for each plot command the next
position in the window is used. The alternative option mfcol will do the same
job, but new plots are added column-wise instead of row-wise.
[Figure 9.13: Plotting several plots in the same window, in this case the plots
are arranged into two rows and three columns.]
Another approach is to use the layout function. This allows us to divide the
graphics window into different regions where the plots should appear. Again
we divide into rows and columns, but a plot can occupy several 'cells'. This is
indicated by a matrix, where all cells having the same value make up a region.
Let us divide the window into 2 rows and 2 columns, but let both cells in the
first row be one region, while the cells of the second row are separate regions.
This means the first plot we add takes up the top half of the window, and the
next two plots share the bottom half:
lmat <- matrix(c(1,1,2,3), nrow=2, byrow=TRUE)
layout(lmat)
barplot(rnorm(20), col="pink")
plot(rnorm(1000), rnorm(1000), pch=16, cex=0.7)
hist(rchisq(1000, df=10), col="thistle", border="thistle",
     main="")
Have a look at the matrix lmat in the console, and compare it to the result in
Figure 9.14. Notice how the regions are specified in lmat and how the plots
appear along the same pattern.
[Figure 9.14: Here we have created regions of different size for the plots, with
one wide region at the top and two smaller below.]

9.11 Manipulating axes
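The plot command referred to here is not shown above; a sketch that produces a plot without any axes (the data are made up) could be:

```r
x <- rnorm(100)
y <- rnorm(100)
plot(x, y, pch=16, axes=FALSE, xlab="", ylab="")  # suppress all axes
```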
This produces a plot completely void of axes. Let us now add a horizontal axis
at the top of the plot.
axis(side=3, at=c(-3,-2,-1,0,1,2,3),
     labels=c("This","is","a","silly","set","of","labels"),
     col="red")
Notice how we specified side=3 to place it at the top, and how we can completely
specify where to put the tick marks (at) and how to label them (labels). We
also added color to the axis (col) as well as the tick labels (col.axis). Let us
add another axis to the right
axis(side=4, at=c(-2.5,0,1), labels=c("A","B","C"),
     col="green", col.axis="blue", las=2)
By not specifying anything but the side we get an axis just like we usually have:
axis(side=1)
axis(side=2)
Try to run these lines of code, and observe the results. Study the help file ?axis.
Try to run these lines of code, and observe the results. Study the help file ?axis.
When you start to manipulate axes you will soon also need to manipulate
the margins surrounding the plot, to make more or less space for tick labels
and axis labels. See the mar option to the par function for how to manipulate
margins. Again the four sides of a plot are numbered starting at the bottom:
1 is the bottom, 2 the left, 3 the top and 4 the right side.
9.12 Adding text
Notice the option pos that directs where to put the text relative to the point
specified. Here we used pos=3 to put the text above the point. You can place
many texts at many positions by using vectors as input to x, y and labels. See
?text for more details.
A legend is a box with some explanatory text added to a plot. The function
legend allows you to do this in a simple way. Here is an example:
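The legend example itself appears to be missing from this copy. A plausible reconstruction, assuming sine and cosine curves as in Figure 9.15 (the colors and line types are guesses):

```r
x <- seq(0, 10, by = 0.1)
plot(x, sin(x), type = "l", col = "blue", ylab = "s")   # the sine curve
points(x, cos(x), type = "l", col = "red")              # add the cosine curve
legend(2.7, 0.95, legend = c("Cosine", "Sine"),         # upper left corner of box
       col = c("red", "blue"), lty = 1)
```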
The result is shown in Figure 9.15. Notice how we specify the upper left corner
of the box to be located at (2.7, 0.95). This must usually be fine-tuned by trial
and error to prevent it from obscuring the lines or markers of the plot.
[Figure 9.15: sine and cosine curves (y from -1 to 1, x from 0 to 10) with a legend box labeling the curves Cosine and Sine.]
9.15 Exercises
9.15.1 Plotting function
Make a function that creates a scatterplot of y versus x, but in addition to this
also put histograms showing the distribution of y and x in the margins. See
Figure 9.16 for an example.
[Figure 9.16: an example scatterplot of y versus x with histograms of x and y shown in the margins.]
Chapter 10
Handling texts
R is a tool developed for statistics and data analysis, and computations and
’number-crunching’ is an essential part of this. However, in many applications
we also need to handle texts. Texts are often part of data sets, as categorical
indicators or as numeric data ’hidden’ in other formats. In the field of bioinfor-
matics the data themselves are texts, either DNA or protein sequences, usually
represented as long texts. Thus, in order to be able to analyze data in general
we should have some knowledge of how to handle texts in an effective way.
The function nchar returns the number of characters in a text. It can also take
as input a vector of texts, and will return the number of characters of each
element in a vector of integers:
> words <- c("This", "is", "a", "test")
> nchar(words)
[1] 4 2 1 4
Notice the difference between length(words) and nchar(words). The first gives
you the length of the vector (in this case 4), while the second gives you the
number of characters in each element of the vector.
The functions tolower and toupper will convert to lowercase and uppercase
text, respectively. Again they can take a vector as input, producing a vector as
output:
> toupper(c("This", "is", "a", "test"))
[1] "THIS" "IS"   "A"    "TEST"
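The paste example the next paragraph discusses is missing here; a reconstruction, with the two input vectors inferred from the output shown further below:

```r
name1 <- c("Lars", "Hilde", "Thore")
name2 <- c("Snipen", "Vinje", "Egeland")
paste(name1, name2)   # "Lars Snipen" "Hilde Vinje" "Thore Egeland"
```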
Notice how we gave 2 vectors of texts as input, and got a vector of texts as
output. The first elements are merged with a single space between them, and
similarly for all other elements. The sep option specifies the separating symbol
(a single space is the default choice):
> paste(name1, name2, sep = "_")
[1] "Lars_Snipen"   "Hilde_Vinje"   "Thore_Egeland"
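The example the next paragraph refers to also seems lost; a reconstruction matching the description (a single text recycled against a numeric vector of 4 elements — the actual text used is an assumption):

```r
paste("No", 1:4)   # "No 1" "No 2" "No 3" "No 4"
```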
Here our first input is a single text (vector with 1 element). The second input is
a numeric vector with 4 elements. The first vector is then re-used 4 times. The
second vector is converted to text, and finally they are merged, using a single
space as separator.
If we want to merge all elements in a vector into a single text, we use the
collapse option:
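A minimal example of collapse (the input vector here is an assumption):

```r
words <- c("This", "is", "a", "test")
paste(words, collapse = " ")   # "This is a test"
```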
If you want no separator between the elements, again use the empty text:
> paste(c("A", "A", "C", "G", "T", "G", "T", "C", "G", "G"),
        collapse = "")
[1] "AACGTGTCGG"
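The strsplit example the next paragraph describes (a list of length 1 containing a vector of 4 elements) is not shown in this copy; a reconstruction:

```r
strsplit("This is a test", split = " ")
# [[1]]
# [1] "This" "is"   "a"    "test"
```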
Notice from the output here that it is a list of length 1, having a vector of 4
elements as its content. This function will always output a list. The reason is
that it can split several texts, and since they may split into a different number
of subtexts, the result must be returned as a list. Here is a short example:
> strsplit(c("This is a test", "and another test"), split = " ")
[[1]]
[1] "This" "is"   "a"    "test"

[[2]]
[1] "and"     "another" "test"
Both input texts are split, resulting in 4 subtexts in the first case and 3 subtexts
in the second.
Notice also that the splitting symbol (here we used a space " ") has been
removed from the output; only the texts between the splitting symbols are
returned.
If we give one single text as input to strsplit, the ’list-wrapping’ is pointless
and we should eliminate it to simplify all further handling of the output. To
do this we use the unlist function. This takes as input a list, and returns the
content of the list after all ’list-wrapping’ has been peeled off:
> unlist(strsplit("This is a test", split = " "))
[1] "This" "is"   "a"    "test"
The unlisting of lists should only be done when you are absolutely certain of
what you are doing. It is in general safe to use when the list has 1 element only.
If the list has more than one element it can produce strange and undesired
results.
In many cases the result of strsplit is a list of several elements, and we
should in general not use the unlist function. We will discuss later in this chap-
ter how to deal with lists in an effective way, since many other text-manipulating
functions also produce such lists.
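The extraction example referred to next is not shown in this copy; a sketch extracting characters 5 through 10 (the input text here is an assumption):

```r
txt <- "ATGTTCTGATCT"
substring(txt, 5, 10)   # "TCTGAT"
```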
Here we extract characters number 5, 6, 7, 8, 9 and 10 from the input text. Both
substr and substring will extract subtexts this way, but the latter is slightly
more general, and the one we usually use. Here is an example of how we can
extract different parts of a text in a quick way:
> dna <- "ATGTTCTGATCT"
> starts <- 1:(nchar(dna) - 2)
> stops <- 3:nchar(dna)
> substring(dna, starts, stops)
 [1] "ATG" "TGT" "GTT" "TTC" "TCT" "CTG" "TGA" "GAT" "ATC" "TCT"
The first subtext is from position 1 to 3, the second is from 2 to 4, the third is
from 3 to 5, etc. The function substring will recycle the first argument, since
this is a vector of length 1, while arguments two and three are vectors of many
elements. It is possible to provide substring with many input texts, and then
specify equally many starts and stops, extracting different parts of every input
text.
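The grep call discussed in the next paragraph appears to be missing; a reconstruction consistent with the description:

```r
grep(pattern = "is", x = c("This", "is", "a", "test"))   # 1 2
```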
The first argument is the pattern, in this case simply the text "is". The second
argument is the vector of texts in which we want to search. The pattern
"is" is found in elements 1 and 2 of this vector (in "This" and in "is"), and the
output is accordingly.
The grep function does not tell us where in the texts we found the pattern,
just in which elements of the vector it is found. The function regexpr is an
extension to grep in this respect:
> regexpr(pattern = "is", text = c("This", "is", "a", "test"))
[1]  3  1 -1 -1
attr(,"match.length")
[1]  2  2 -1 -1
attr(,"useBytes")
[1] TRUE
The basic output (first line) is a vector of four elements since the second argu-
ment to regexpr also has four elements. It contains the values 3, 1, -1 and -1.
This indicates that in the first text vector element the pattern is found starting
at position 3, in the second text vector element it is found starting position 1
and in the last two text vector elements it is not found (-1).
After the basic output we see attr(,"match.length") and then another vector
of four elements. This is an example of an attribute to a variable. All R variables
can have attributes, i.e. some extra information tagged to them in addition to
the actual content. We have seen how variables can have names, and this is
something similar. In this case the output has two attributes, one called
"match.length" and one called "useBytes". The first indicates how long the pattern
match is in those cases where we have a match. Since our pattern is the text
"is" the "match.length" must always be 2 (or -1 if there are no hits). See
?regexpr for more on "useBytes". Note that attributes like these are just extra
information attached to the variable; the main content is still the four integers
displayed in the first line.
The function regexpr also has a limitation, it will only locate the first occur-
rence of the pattern in each text. In order to have all occurrences of the pattern
in every text we use gregexpr. Consider the following example:
> DNA <- c("ATTTCTGTACTG", "CCTGTAACTGTC", "CATGAATCAA")
> gregexpr(pattern = "CT", text = DNA)
[[1]]
[1]  5 10
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 2 8
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE

[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
We first notice that the output is a list (since we have the double brackets
[[1]]). The list has the same number of elements as the vector in the second
input argument (three). Each list element contains a result similar to what we
saw for regexpr, but now with one result for each hit in the corresponding input
text. Element two, for example, shows that in the second input text "CCTGTAACTGTC"
we find the pattern "CT" at positions 2 and 8.
A frequent use of regular expressions is to replace a pattern (subtext) with
some text. The function sub will replace the first occurrence of the pattern,
while gsub replaces all occurrences, and it is this latter we use in most cases.
We can illustrate using the previous example:
> DNA <- c("ATTTCTGTACTG", "CCTGTAACTGTC", "CATGAATCAA")
> gsub(pattern = "CT", replacement = "X", x = DNA)
[1] "ATTTXGTAXG" "CXGTAAXGTC" "CATGAATCAA"
The text we use as replacement can be both shorter or longer than the pattern.
In fact, using replacement="" (the empty text) will just remove the pattern
from the texts. Replacing a pattern with nothing (removing) is perhaps the
most frequent use of gsub.
Here are some of the frequently used elements of the regular expression
syntax in R:
"AG[AG]GA"   The brackets mean either-or, i.e. either "A" or "G". This pattern
             will match both "AGAGA" and "AGGGA".
"AG[^TC]GA"  The 'hat' inside brackets means not, i.e. neither "T" nor "C".
             This will match the same as above, plus subtexts like "AGXGA"
             or "AG@GA".
"AG.GA"      The dot is the wildcard symbol in R, and means any symbol.
             Be careful with ".", it could make the pattern too
             unspecific (it matches everywhere).
"The[a-z]"   A range of symbols, here all lower-case English letters.
             Other frequently used ranges are "A-Z" and "0-9".
"^This"      A 'hat' starting the expression means matching at the start
             of the text only.
"TA[AG]$"    A 'dollar' ending the pattern means matching at the end of
             the text only.
"NC[0-9]+"   A 'plus' means the previous symbol or group of symbols can be
             matched multiple times. This is typically used to include an
             unspecific number in a pattern, i.e. both "NC1" and "NC001526"
             will match here.
There are of course many more possibilities, and ?regex will show you the help
file on this. The last example above also shows why we can have matches of
different lengths, hence the need for the "match.length" attribute in regexpr
and gregexpr.
We saw from the example on gregexpr above that each list element it returns
has a vector where each hit is indicated by a positive number. If there are no
hits, the vector contains only the value -1. Thus, the simple sum(x > 0) will
give us the number of hits. Notice that we expect the input to this function (x)
to be the content of a list element produced by gregexpr.
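The count.hits function itself is not shown in this copy; a minimal version consistent with the description above:

```r
count.hits <- function(x){
  # x is one element of the list returned by gregexpr:
  # positive values are hit positions, a single -1 means no hits
  return(sum(x > 0))
}
```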
After this function has been sourced into the R workspace, we can use it in
sapply:
> DNA <- c("ATTTCTGTACTG", "CCTGTAACTGTC", "CATGAATCAA")
> lst <- gregexpr(pattern = "CT", text = DNA)
> sapply(lst, count.hits)
[1] 2 2 0
The pattern "CT" is found twice in input text one and two, and zero times in
input text three. The function sapply loops through the list, sending the content
of each list element as input to the provided function count.hits. This function
returns exactly one integer for each list element, and then sapply outputs these
results as a vector of integers.
Notice that the output from sapply is a vector, not a list. This makes it
possible to extract something (depending on the function you apply) from each
list element, and put the result into a vector. This of course requires that
one quantity or text is extracted from every list element, regardless of what it
contains.
Notice how the in-line function has no name, it exists only during the call to
sapply. This way of writing R code makes programs less readable for newcomers.
The detailed version above is easier to understand, but this latter approach is
something you will meet in real life.
Suppose we have a set of data files named Data10.txt, Data20.txt, and so on
up to Data120.txt. We can read them all by the same call to read.table; we
only need to change the filename each time.
We then make a for-loop going from 10 to 120 by step 10, and use the
paste-function to create the names:
for(k in seq(from = 10, to = 120, by = 10)){
  fname <- paste("Data", k, ".txt", sep = "")
  dta <- read.table(fname)   # additional options to read.table may be needed
  # extract what you need from dta before the next iteration,
  # because then dta will be over-written...
}
Actually, it is often a good idea to first read one file, and then create the data
structures you need to store whatever you want from the file. Then you proceed
with the loop (starting at the second file) and read the files, adding their content
to the existing data structure.
If you call the function dir() without any arguments, you will get listed all
files and folders in the current working directory. By specifying a folder, e.g.
dir("C:/Download/Data"), the function returns a text-vector containing the
names of all files and folders in C:\Download\Data. NOTE! In R we use the
slash (/) but in Windows we use the backslash (\) when specifying a path.
Here are some lines that indicate how to read the data-files in
C:\Download\Data:
folder <- "C:/Download/Data"
all.files <- dir(folder)
idx <- grep("txt$", all.files)   # use a regular expression to
                                 # locate the txt-files
data.files <- all.files[idx]
for(fname in data.files){
  dta <- read.table(file.path(folder, fname))   # dir() gives names only,
  # so we prepend the folder. Extract what you need from dta before the
  # next iteration, because then dta will be over-written...
}
Notice that we used grep to select only the filenames ending with .txt. In some
cases this is not enough, there may be many .txt-files, and we only want to
read some of them. Then we need to make another grep where we look for some
pattern found only in the names of those files we seek.
10.9 Exercises
10.9.1 Poetry
Can R understand poetry? Well, at least we can make R-programs recognize
poetry to some degree. We will pursue this exercise when we come to modelling
in later chapters.
In the file poem_unknown.txt you find a poem written by some unknown
author. According to some expert in English literature the author of this poem
is very likely to be either Shakespeare, Blake or Eliot. To investigate this, we
need to convert this poem into a ’numerical fingerprint’.
Make a script that reads the file poem_unknown.txt line by line using the
readLines function from chapter 6. Also, load the file called symbols.RData. This
creates the vector symbols in your workspace. This vector has 30 elements, all
single-character texts.
Make a function called poem2num that takes as input a vector of symbols, like
symbols and a poem (text vector). The function should return the number of
occurrences of each of the symbols in the poem. Remember to convert all text
in the poem to lower-case. Use the poem and the symbols from above to test
the function. It should return a vector of 30 integers. This is our (simplistic)
version of the poem's numerical fingerprint.
10.9.2 Dates
In the weather data we have seen, each date is registered by its Day, Month and
Year as three integers. In the raw data files this was not the case. Most years
the date is given like "01.01.2001" (day-month-year). Typically, the format has
changed over the years, and in some cases the year is only written with the last
two digits, e.g. "01.01.93" for 1 January 1993. In some years the format is
"01011998" (no dots). It is quite typical that raw data are messy like this, and
some kind of programming is needed to get it all into identical formats.
Make a function that takes as input texts like the three mentioned above,
extracts the date as three integers, Day, Month and Year, and returns these
in a vector of length 3. The year must be the 4-digit integer (1993, not just
93). This means the function should recognize all three formats and behave
accordingly.
Chapter 11
Packages
These are the same Help-files you see if you type ? and the function name
in the Console window.
The keyword library is used to load a package, e.g. library(pls) loads the pls
package. You can also use require in the same way.
If I make a script where I know that I will use some functions from a certain
package, I always make certain the required library statements are entered at
the start of the script. This guarantees that the packages I need have been
loaded before the functions are used later in the script. You do not need
to load a package more than once, but there is no problem executing the
library statement repeatedly; it is simply ignored by the system if the package
has already been loaded.
In the Packages pane in RStudio, all currently loaded packages have their
check-box ticked.
The R subdirectory
This is where you put all R-programs of the package. All functions in the
package must be defined in files with the .R extension and put into this directory.
Nothing else should be added to this subdirectory.
The man subdirectory
This is where you put all the Help-files. A huge part of any package development
is to create the Help-files required for every function and every data set in the
package. The Help-files must all follow a certain format. In RStudio we can
create a skeleton for a Help-file from the File menu. Select New File, but instead
of choosing R Script as before, we scroll down to the Rd File option. A small
window pops up asking for the Topic. This is usually the name of the function
that we want to document. Type in the name, and a file is created containing
the skeleton for an R documentation file. You will recognize most of the sections
from any Help-file in R. Filling in files like these is a significant job in every R
package development. In the man directory there will typically be one such file
for each function, but in some cases (as we have seen) very related functions
may be documented within the same file.
Other subdirectories
We may add other subdirectories to our package-directory, but there are some
rules we must obey. If the package contains data these are usually saved in
.RData files (using the save() function, see chapter 6). Such files should be
placed in a subdirectory called data. Such data sets can be accessed by the
function data() once the package has been loaded. If we have loaded a package
containing the data set daily.weather (stored in the file daily.weather.RData),
we can load it by
> data(daily.weather)
We can also have subdirectories for other types of data, for external (non-R)
programs etc, but you will have to read about them from sources outside this
text.
11.8 Exercises
11.8.1 Imputation of data
Larger data sets will often have some missing data, for various reasons. If we
want to apply some multivariate statistical methods to our data, these will in
general not tolerate missing data. Think of the data set as a data.frame or
matrix. If the data in cell [i,j] is missing we must either discard the entire
column j, the entire row i or impute the value in this cell. The latter means
finding some realistic and non-controversial value to use in this cell.
Load the matrix in the file weather.matrix.RData. This is a small subset of
the weather data we have seen before. It is a matrix (not data.frame!) with 5
columns and 22 rows. Notice that column 3 (Humidity) has two missing data
(NA).
Imputation by KNN
However, the assumption that the only contribution to variation is a completely
random day-to-day fluctuation seems too severe. It is reasonable to assume that
Humidity varies systematically and not at random, by how the other weather
variables behave, at least to some degree. For instance, on a rainy day (much
Precipitation) it is reasonable to assume that the Humidity should be quite
high compared to a clear day. Notice that other weather variables have been
observed on those days where Humidity is missing, and we should make use of
this information.
The K-nearest-neighbour (KNN) imputation method uses the values for the
other variables to find the most likely missing values, i.e. based on the values
for Air.temp, Soil.temp.10, Precipitation and Wind, find other days in the data
set with similar values, and use the Humidity for those days to compute the
imputation value (e.g. the mean of these Humidity-values).
In the Bioconductor package impute you find a function for doing this. Install
the package, load it and read the Help-file for the function impute.knn. This
has been purpose-made for gene expression data, but any matrix can be used
Create Help-files for the function plotTrends above, as well as the data set in
daily.weather.RData. Save these in the man subdirectory.
Build and load the package, first locally in RStudio and then build the
package-archive.
Chapter 12
Data modeling basics
In general a data set may have several response variables, but in this course
we will only consider problems with a single response variable.
By convention, the explanatory variables are stored in a matrix named X
in statistical literature. If we have all data in a data.frame named D, and we
want to use columns 1 and 3 as explanatory variables, we can think of X as
X <- as.matrix(D[, c(1, 3)])
In the training set the values of the response ytrain are always known. When we
talk about the number of objects n we usually refer to the number of objects in
the training set only.
The test set consists of the data objects (ytest , Xtest ), which are different
observations of the same variables as in the training set. In principle the response
ytest is missing/unobserved in this case, and we want to find it. We have the
observed data in Xtest , and based on the model we trained on the training data,
we want to combine this with Xtest to predict the values of ytest .
Both data sets are sampled from the same population, i.e. any relation that
holds between Xtrain and ytrain should also hold between Xtest and ytest , and
vice versa. Think of them as two subsets of the same data.frame.
In some cases the values of ytest are also known. We then pretend they are
not, and predict them as if they were indeed missing. Then we can compare this
prediction to the true values of ytest . This is a valuable exercise for evaluating
the ability of a model to make good predictions, as we will see below.
The distinction between training data and test data is again purely opera-
tional, i.e. data objects used in the training set can in another exercise be used
in the test set, and vice versa. In some cases the training data and test data
are identical, i.e. we use the same data for both purposes. This is possible, but
then we have to remember that the ’predictions’ of ytest are no longer actual
predictions, since the same data were used for training as well.
12.2 Regression
Let us repeat the regression idea for a very simple data set. In Chapter 6 we
met a data set on bears, in the file called bears.txt. This data.frame contains
some measurements made on 24 different bears. More specifically, there is a
variable named Weight, which is the body weight of each bear. Weighing a bear
can be quite cumbersome (and dangerous?), and it would be very nice if we
could predict the weight of a bear from its body length. Measuring the length
of a bear is, after all, much easier. Hence, we want to predict Weight based on
some observed value of Length. We consider only these two variables from the
data set, and the response variable y is the Weight and the only explanatory
variable in X (which means p = 1) is Length. Notice that both variables are
continuous, i.e. they may take any numerical value within a reasonable range
of values.
We first use all n = 24 data objects as training data, and we will split into
training and test data later. First, we read the data into R:
beardata <- read.table(file = "bears.txt", header = TRUE)
The model is ytrain = β[1] + Xtrain·β[2] + e, where β[1] and β[2] are unknown
coefficients and e is some noise or error term. We usually refer to β[1] as the
intercept and β[2] as the slope. The simple formula β[1] + Xtrain·β[2]
describes a straight line, i.e. our model says that the
relation between Xtrain and ytrain is a linear relation. This means that if we
had no noise (if e was absent) we could have plotted ytrain against Xtrain and
found all points on the same straight line. For real data there are always some
deviations from the straight line, and this we assume is due to the random term
e. The basic step of the data modelling is to find proper values for β[1] and
β[2], i.e. to estimate these unknown coefficients from data. This is the training
step.
In our case we now use the column Weight as ytrain and Length as Xtrain .
We will not go into any details on the theory of estimation, these are topics in
other statistics courses. Fitting the above linear model to data can be done by
the function lm in R.
fitted.mod <- lm(Weight ~ Length, data = beardata)
Figure 12.1: A scatter plot of Weight versus Length from the bears data set.
Each marker corresponds to an object, having one Weight-value and one Length-
value. The straight line is the fitted linear regression line, see the text for the
details.
We can think of this as an ordinary list, but with some added properties. For
those familiar with object-oriented programming, we would say that the lm-object
inherits from the list class. The summary function can take an lm-object as input,
and produce some helpful print-out:
> summary(fitted.mod)

Call:
lm(formula = Weight ~ Length, data = beardata)

Residuals:
    Min      1Q  Median      3Q     Max
 -64.76  -26.78   -9.42   30.74   80.67

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -212.8522    47.2818  -4.502 0.000177 ***
Length         2.1158     0.3027   6.989 5.15e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the Coefficients section of this print-out we find some t-test results for each
of the coefficients of the model. These tests are frequently made to see if the
explanatory variables have a significant impact on the response variable. In
this particular case we would be interested in the null-hypothesis H0: β[2] = 0
versus H1: β[2] ≠ 0. If H0 is true it means Length has no linear relation to
Weight. The print-out row starting with Length displays the result for this test.
Clearly the p-value (Pr(>|t|)) is small (5.15·10⁻⁷), and there is a significant
relation.
Another useful function for lm-objects is anova:
> anova(fitted.mod)
Analysis of Variance Table

Response: Weight
          Df Sum Sq Mean Sq F value    Pr(>F)
Length     1  78560   78560   48.85 5.147e-07 ***
Residuals 22  35380    1608
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case we perform an F-test on the overall fit of the model. Read about
this test in AISL.
We can also inspect the lm-object as if it was a straightforward list. First
we take a look inside:
> names(fitted.mod)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
We can also retrieve the fitted values, and use these to plot the fitted line.
Here is the code that produced the plot in Figure 12.1:
attach(beardata)   # Read ?attach to repeat...
plot(Length, Weight, pch = 16, cex = 1.5, xlab = "Length (cm)",
     ylab = "Weight (kg)", ylim = c(-20, 250))
points(Length, fitted.mod$fitted.values, type = "l", lwd = 2,
       col = "brown")
# Alternative to the two last lines above:
# abline(fitted.mod, col = "red", lwd = 1.5)
The residuals are the differences between observed Weight and predicted
Weight (the straight line). We often plot the residuals versus the explanatory
variables, to see if there are any systematic deviations. Remember, the assump-
tion of the linear model is that all deviations from the straight line are due to
random noise. We can plot the residuals by:
residuals <- fitted.mod$residuals
plot(Length, residuals, pch = 16, xlab = "Length (cm)",
     ylab = "Residual (kg)")
points(range(Length), c(0, 0), type = "l", lty = 2, col = "gray")
As seen in Figure 12.2 there is a tendency for the residuals to be negative for
medium Length values, and positive at each end. This indicates that the
assumptions of the linear model are too simple, and in this case it is the linear
relation between Length and Weight that is too simple.
[Figure 12.2: the residuals (kg) plotted against Length (cm), with a dashed gray line at zero.]
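The predict call discussed next is not shown in this copy; a sketch, assuming the fitted model fitted.mod from above and some hypothetical new length measurements:

```r
new.bears <- data.frame(Length = c(120, 160))   # hypothetical new bears
predict(fitted.mod, newdata = new.bears)        # predicted Weight for each
```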
Notice that predict takes as input the fitted model and a data.frame with the
test set data objects. This data.frame must contain the same column names as
we used in the model, in our case Length. You can have data for as many new
data objects (bears) as you like, and predict will compute the predicted y-value
(Weight) for each.
We now extend the model with a second-order term, ytrain = β[1] +
Xtrain[,1]·β[2] + Xtrain[,2]·β[3] + e, where the matrix Xtrain now has two
columns. The second column is just the first column squared. Using the bear
data again, let us create a new data.frame containing only the variables of
interest:
beardata2 <- beardata[, 1:2]                 # retrieving Weight and Length
beardata2$Length.Sq <- (beardata2$Length)^2  # adding Length^2
We can fit a new model including the second order term like this
fitted.mod2 <- lm(Weight ~ Length + Length.Sq, data = beardata2)
If you run a summary on this object, you will see the second order term is highly
significant. Notice also that the first order coefficient β[2] has changed its sign!
It is in general difficult to interpret coefficients as soon as you add second (or
higher) order terms.
In Figure 12.3 we have plotted the fitted values as we did before, and it
seems clear this model gives a better description of how weight is related to
length. It is no longer a straight line, due to the second order term, and here
are the lines of code we used to create this plot:
attach(beardata2)
plot(Length, Weight, pch = 16, cex = 1.5, xlab = "Length (cm)",
     ylab = "Weight (kg)", ylim = c(-20, 250))
idx <- order(Length)
points(Length[idx], fitted.mod2$fitted.values[idx], type = "l",
       lwd = 2, col = "brown")
Notice the use of the function order. This gives us an index vector that, when
applied to Length, will arrange its elements into ascending order. When making
a line-plot, the values along the x-axis must be in ascending order, otherwise the
line will cross back and forth, making it look like a spider's web! Try to make a
plot of the residuals of this model versus Length.
Instead of using a higher order term, we could of course have used some of
the other variables in the original data set in our model. The variable Chest.G
(chest girth) measures how ’fat’ the bear is. Together with Length this says
Figure 12.3: The fit of the model that includes a second order term of Length.
something about the size of the bear. Think of a cylinder, its volume is given
by
V = π · r² · h    (12.4)
where r is the radius and h is the height (length) of the cylinder. Imagine a
bear standing up, and visualize a cylinder where this bear just fits inside. Then
Length is the height h and Chest.G is the circumference 2πr. Thus, the term
Length*Chest.G should come close to describing the volume of this cylinder, and
the bear's volume is some fraction of this. If we can (almost) compute the bear's
volume, the weight is just a matter of scaling. Thus we try the model
volume.mod <- lm(Weight ~ Length + Chest.G + Length * Chest.G,
                 data = beardata)
Run anova() on volume.mod and compare it to the similar outcome for fitted.mod
and fitted.mod2 from above. In Figure 12.4 we have plotted how the 'volume
model' fits the data. Here is the code that produced this plot:
Figure 12.4: The fit of the model that includes both Length and Chest.G,
referred to as the ’volume model’ in the text. Here we have displayed the fit in
two panels, once against Length and once against Chest.G.
attach(beardata)
par(mfrow = c(1, 2))
plot(Length, Weight, pch = 16, cex = 1.5, xlab = "Length (cm)",
     ylab = "Weight (kg)", ylim = c(0, 250))
W.predicted <- volume.mod$fitted.values
idx <- order(Length)
points(Length[idx], W.predicted[idx], type = "l", lwd = 2,
       col = "brown")
plot(Chest.G, Weight, pch = 16, cex = 1.5, ylim = c(0, 250),
     xlab = "Chest girth (cm)", ylab = "Weight (kg)")
idx <- order(Chest.G)
points(Chest.G[idx], W.predicted[idx], type = "l", lwd = 2,
       col = "brown")
Call:
lm(formula = Weight ~ Length + Gender, data = beardata)

Residuals:
    Min      1Q  Median      3Q     Max
-53.861 -26.893   2.048  14.629  70.982

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -206.000     43.496  -4.736 0.000112 ***
Length         1.935      0.289   6.695 1.26e-06 ***
GenderM       35.882     15.854   2.263 0.034333 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice that GenderM is listed as a 'variable'. The lm function will consider level
1, F, of the factor as the 'default' level. The estimated intercept and effect of
Length apply to bears of type F. The estimated value for GenderM is the added
effect of level M, i.e. if a male and a female bear have the exact same Length, the
male bear is predicted to weigh 35.882 kilograms more. The following code
produces the plot in Figure 12.5.
attach(beardata)
plot(Length, Weight, pch = 16, cex = 1.5, xlab = "Length (cm)",
     ylab = "Weight (kg)", ylim = c(-20, 250))
Weight.predicted <- gender.mod1$fitted.values
is.female <- (Gender == "F")
points(Length[is.female], Weight.predicted[is.female],
       type = "l", col = "magenta", lwd = 2)
points(Length[!is.female], Weight.predicted[!is.female],
       type = "l", col = "cyan", lwd = 2)
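To see what this dummy coding means in practice, here is a tiny sketch using the coefficient estimates from the summary output above (the example length of 150 cm is our own choice):

```r
# Coefficients read off the summary of the Length + Gender model
b <- c(intercept = -206.000, length = 1.935, genderM = 35.882)

len <- 150                                       # an example Length in cm
w.female <- b["intercept"] + b["length"] * len   # F is the default level
w.male <- w.female + b["genderM"]                # M adds a constant shift
```

For any common Length the two predictions differ by exactly the GenderM estimate, which is why the fitted lines for the two genders are parallel.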
We can of course include interaction terms as well, i.e. higher order terms
where we mix numerical and factor variables. The following model
gender.mod2 <- lm(Weight ~ Length + Gender + Length*Gender,
                  data = beardata)
Figure 12.5: The fit of the model where Gender has an additive effect. The
magenta line applies to female bears, the cyan line to male bears.
includes both the additive and the interaction effect of Gender, and the fitted
model is shown in Figure 12.6. This actually corresponds (almost) to fitting two
separate models here, one for male and one for female bears.
It is quite common to also have models with only factors as explanatory vari-
ables. This is typical for designed experiments, where the explanatory variables
have been systematically varied over some levels. If you take some course in
analysis-of-variance (ANOVA) of designed experiments, this will be the topic.
The lm function can still be used.
Figure 12.6: The fit of the model where Gender has both an additive and
interaction effect. The code for making this plot is the same as for Figure 12.5,
just replace the fitted models.
longitudinal data, pedigree information, etc.). In the next chapter we will return
to the regression problem when we take a look at localized models.
12.3 Classification
A classification problem differs from a regression problem in that the response y
is a factor instead of a numerical variable. In principle this factor can have many
levels, but in many cases we have just two. Classification problems are common,
perhaps even more common than regression problems.
One example of a classification problem is to diagnose a patient. A patient
belongs to either the class "Healthy" or "Disease A" (or "Disease B" or "Disease
C" or... etc). A number of measurements/observations are taken from the pa-
tient. These make up the data object for this patient (Xtest, containing a single
row). Based on these we want to classify the patient into one of the pre-defined
classes. To accomplish this, we need a model (or Gregory House) that tells us
how to classify (predict) given the current Xtest values. The model must be
fitted to a training set, Xtrain and ytrain, which is a set of patients where we
know the true class and where we have measured the same explanatory vari-
ables. Thus, it is almost identical to a regression problem, but the computations
must be done differently due to the factor response.
In the bear data set the variable Gender is a factor, and we will use it as
our response. Let us try to classify bears into male or female based on data for
Weight and Chest.G (we assume the beardata have been read into the workspace,
see previous section):
library(MASS)   # the MASS package is loaded
fitted.mod <- lda(Gender ~ Weight + Chest.G, data = beardata)
The prior is the vector of prior probabilities, and these are estimated by the
proportion of each class in the training data:
> fitted.mod$prior
        F         M
0.4166667 0.5833333
Since male bears are slightly more common than females in this (small) data
set, this will make our model classify more bears as males in the future. If
we want both genders to have the exact same prior probability, we can specify
this when we fit the model. The lda() function has an argument for priors, and
if we add prior = c(0.5, 0.5) to the call we achieve this.
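A sketch of the two calls; since the bear data are not included here, a synthetic stand-in with the same 10/14 gender split is used:

```r
library(MASS)   # lda() lives in the MASS package

# Synthetic stand-in for beardata: 10 females and 14 males
set.seed(1)
bears <- data.frame(
  Gender  = factor(rep(c("F", "M"), times = c(10, 14))),
  Weight  = c(rnorm(10, 75, 10), rnorm(14, 140, 20)),
  Chest.G = c(rnorm(10, 80, 5), rnorm(14, 97, 8))
)

mod.default <- lda(Gender ~ Weight + Chest.G, data = bears)
mod.flat <- lda(Gender ~ Weight + Chest.G, data = bears,
                prior = c(0.5, 0.5))
mod.default$prior   # 0.4166667 0.5833333, the class proportions
mod.flat$prior      # 0.5 0.5
```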
The class centroids are also interesting to inspect:
> fitted.mod$means
    Weight  Chest.G
F  74.6000 79.80000
M 139.7857 97.28571
As expected, male bears are characterized by being heavier and having a larger
chest girth.
$posterior
          F         M
1 0.3908931 0.6091069

$x
         LD1
1 -0.0249895
First we see that the list contains a variable named class, and this is the predic-
tion for this bear; in this case it is classified as M. Then, we have the posterior
probabilities for this bear. They are roughly 60%/40% in favour of male,
indicating a substantial uncertainty attached to this prediction.
In Figure 12.7 we have plotted the training data in the left panel and the
predicted class for all bears, including the new bear, in the right panel. Here is
the code that produced this plot:
attach(beardata)
par(mfrow = c(1, 2))
is.male <- (Gender == "M")
xlims <- range(Weight)
ylims <- range(Chest.G)
plot(Weight[is.male], Chest.G[is.male], pch = 15, cex = 1.5,
     col = "cyan", xlab = "Weight (kg)", ylab = "Chest girth (cm)",
     main = "Training data")
points(Weight[!is.male], Chest.G[!is.male], pch = 15, cex = 1.5,
       col = "magenta")
legend(x = 20, y = 140, legend = c("Male", "Female"), cex = 0.8, pch = 15,
       pt.cex = 1.5, col = c("cyan", "magenta"))
fitted.lst <- predict(fitted.mod)       # using training as test data
is.male <- (fitted.lst$class == "M")    # predicted class for each bear
plot(Weight[is.male], Chest.G[is.male], pch = 0, cex = 1.5,
     col = "cyan3", xlab = "Weight (kg)", ylab = "Chest girth (cm)",
     main = "Predicted class")
points(Weight[!is.male], Chest.G[!is.male], pch = 0, cex = 1.5,
       col = "magenta")
points(new.bear$Weight, new.bear$Chest.G, pch = 17, cex = 1.5,
       col = "cyan3")
legend(x = 100, y = 70, legend = c("Predicted male",
                                   "Predicted female",
                                   "New bear"),
       cex = 0.8, pch = c(0, 0, 17), pt.cex = 1.5,
       col = c("cyan3", "magenta", "cyan3"))
Figure 12.7: Classifying bears as male or female based on their weight and
chest girth, using LDA. The left panel shows the training data, with the correct
separation of males and females. The right panel shows how the LDA-model
would classify them. The blue triangle in the right panel is a new bear, which is
classified as male by this model.
This table has 2 rows and 2 columns (since we have 2 classes). Here the rows
correspond to the correct classification and the columns to the predicted. We
see that in 8 cases a female bear is predicted as female and in 9 cases a male
bear is classified as male (the diagonal elements). These are the 17 correct
classifications behind the accuracy we computed above. There are 7 errors, and
we see that 5 of these are male bears being predicted as female, while 2 female
bears have been incorrectly assigned as males.
We can now compute the sensitivity and the specificity of the fitted model.
In this example we may define sensitivity as the ability to detect a female
bear, and specificity as the ability to detect a male, but we could just as well
reverse it. Here both classes are equally interesting, but in many situations one
outcome is of special interest (think of the diagnosing example) and then we
define sensitivity as the ability to detect this class.
If we consider the confusion table above, sensitivity is simply the number of
correctly classified females divided by the total number of females, i.e. 8/(8 +
2) = 0.8. The specificity is similar for the males, 9/(9 + 5) = 0.64. Here is the
R code that computes it:
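A sketch of that computation, using stand-in class vectors with the same counts as in the confusion table above (with the real data, the table would come from table(Gender, fitted.lst$class)):

```r
# Stand-in vectors reproducing the counts from the text:
# rows of the table are the correct classes, columns the predicted
truth <- factor(c(rep("F", 10), rep("M", 14)))
pred <- factor(c(rep("F", 8), rep("M", 2),    # females: 8 right, 2 wrong
                 rep("F", 5), rep("M", 9)))   # males:   5 wrong, 9 right
confusion <- table(truth, pred)

sensitivity <- confusion["F", "F"] / sum(confusion["F", ])  # 8/10 = 0.8
specificity <- confusion["M", "M"] / sum(confusion["M", ])  # 9/14 = 0.64
```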
Notice that we sum the rows since we gave the correct classes as the first input
to table(). Had we reversed the order of the inputs, the correct classes would
have been listed in the columns, and we should have summed over columns
instead.
Despite the elevated prior probability for males, this fitted model seems to
have a lower ability to detect male bears than female bears, as seen from
the numbers above. We must stress that since we have used the same data
first as training data and then as test data, the actual error is most likely larger
than we see here. Quantities like accuracy, sensitivity and specificity should be
computed from real test data. The results we get here are 'best case scenarios';
in a real situation they will most likely be lower.
12.5 Exercises
12.5.1 LDA on English poems
This is the continuation of the poetry exercise from chapter 10. It gives you a
flavour of the power of pattern recognition and statistical learning...
In order to reveal the identity behind the unknown poem we need to train
a classification model on some training data. This means we need some poems
that we know are written by Blake, Eliot and Shakespeare, and we must compute
the numerical fingerprint of each poem.
Load the file training_poems.RData. This contains a list of 28 poems, and
a vector indicating which author has written each poem. Also load the file
symbols.RData from the poetry exercise in chapter 10. Make a vector y.train, a
factor based on the authors. Make a matrix X.train with one row for each poem
and one column for each symbol. Make a loop counting the number of symbols
in each poem, putting these into the rows of X.train. Use the symbol-counting
function from the poetry exercise in chapter 10.
Fit an LDA-model based on y.train and X.train. Make a plot of the result-
ing lda-object. What does this tell you?
Finally, predict which author wrote the unknown poem. Look at the poste-
rior probabilities. What is your conclusion?
should all be "1_1988", the next 29 elements should be "2_1988" and so on.
Then use this in tapply to compute monthly average temperatures from the
daily temperatures. Call this vector Temp.daily. It should now have names
corresponding to the factor levels, i.e. the months. The statement
Time.daily <- names(Temp.daily)
will then give us the corresponding time labels as texts.
Try out the match function and make certain you understand how it works; this
is a very handy function! Try this small example in the Console window:
match(c(2,4,1), c(2,2,4,3,6,4,3,2,1)). Make a plot of Temp.longest.overlap versus
Temp.daily; these should be temperatures from the exact same months.
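Running that small example, match() returns, for each element of its first argument, the position of the first match in its second argument:

```r
idx <- match(c(2, 4, 1), c(2, 2, 4, 3, 6, 4, 3, 2, 1))
idx   # 1 3 9: the first 2 is at position 1, the first 4 at 3, the 1 at 9
```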
If you make the plot, you will see the temperatures are not identical, i.e.
we cannot just use the Temp.daily temperatures directly to fill in the holes in
Temp.longest.overlap. Instead we fit a linear model, predicting Temp.longest.overlap
from Temp.daily. Create a data.frame called train.set containing
the non-missing data in Temp.longest.overlap and the corresponding values in
Temp.daily. Then, fit a linear model with Temp.longest.overlap as response
and Temp.daily as explanatory variable. Next, make another data.frame called
test.set, similar to train.set except that you use only the missing data in
Temp.longest.overlap and the corresponding values of Temp.daily. Finally, predict
the missing values in test.set using the fitted linear model.
Before we are truly finished, we need to put the imputed data back into their
correct positions in Temp.longest. Can you do this? (This requires very good
insight into the use of index vectors...)
REMARK: Notice how small a part of this exercise the actual statistical
modelling is, and how much more labour is related to the handling of data. This
is quite typical of real-life problems!
Chapter 13
Local models
In the previous chapter we introduced data modelling problems and the response
variable y and the matrix of explanatory variables X. In this chapter we will
focus on the prediction part of data modelling. We seek to establish some
relation (fitted model) between the response and the explanatory variables in
such a way that if we observe a new set of explanatory variables (new rows of
X) we can use the fitted model to predict the corresponding outcome of y.
The number of data objects in a data set, i.e. the number of rows in X, is
by convention named n. The number of variables (columns) in the X matrix is
named p. In many branches of modern science we often meet large data sets.
This could mean we have a large number of objects (large n) and/or a large
number of variables (large p). The cases where we have n > p, and even
n >> p (>> means 'much larger than'), usually call for a different approach
to data analysis than the cases where n < p. In this chapter we will consider
problems where n > p, i.e. the matrix X is 'tall and slim'.
These types of data sets arise in many situations. In some cases an
automated measurement system samples some quantities over and over
again, producing long 'series' of data. Observations of weather and climate
are examples of such data. A data object typically contains the measurements
from the same time (e.g. the same day). Another example is image data. Each
pixel corresponds to a data object, with data for location (row and column
in the image) and some color intensities. A third example is modern biology,
where a large number of persons/animals/plants/bacteria have been sequenced
to produce some genetic markers for each individual. From these markers we
can obtain some numbers, corresponding to a data object for that individual,
and we can do this for a large number of individuals.
Due to ever-improving technologies, we often find data sets containing thou-
sands or even millions of data objects. These are situations where ’push-button’
solutions in standard statistical software packages often do not work very well,
if at all. The only solution is to do some programming.
says that the response variable is related to the explanatory variables through
some function f. It also has some contribution from an error term e. The
function f could be any function, and this makes the model very general. In
the case of simple linear regression in the previous chapter, we assumed f to be
a linear function, i.e.

y = \beta_0 + \beta_1 \cdot x + e

However, relations are rarely exactly linear, and sometimes very far from it, and
the more data we have, the more apparent this becomes.
Instead of searching for the ’true’ function f that produced the data, we
often approximate this function by splitting the data set into smaller regions,
and then fit a local model within that region. These local models can be simple,
often linear models, and even if the function f is far from linear, it can usually
be approximated well by splicing together many local models. In Figure 13.1
we illustrate the idea.
Figure 13.1: An example of the local model idea. The thick blue curve is the
true function relating X and y. Note that this function is in general unknown
to us! It is clearly not linear. The gray dots are the training data sampled,
and due to the error term (noise) they fluctuate around the blue curve, but the
non-linear shape of the relation is visible in the data as well. Instead of trying to
find the (perhaps complex) function behind that blue curve, we approximate it
by splitting the data set into regions, and then fit a simple linear model in each
region. The red lines are the fitted local models in each region. Here we chose
to split into 4 regions, but this will depend on how much the data fluctuate and
the size of the data set.
Let us predict the Weight of the bear in the test-set to illustrate the procedure.
In the algorithm above we start by computing the distances from X.test[i,]
to all data objects in X.train. In this case the test-set has only one object, but
for the sake of generality we still use the index i, and set it to i<-1.
What is a distance? We will talk more about distances below, but for now we
say that distance is simply the absolute difference in Length (a distance cannot
be negative). We compute all distances in one single line of code:
i <- 1
d <- abs(X.test[i, ] - X.train)   # abs gives absolute values
This results in the vector d with 23 elements, one for each data object in X.train.
If we look at this distance vector
> as.vector(d)
 [1] 41 34 20 16 28 28  2 23 32 19  8  8 30 61  5 15 23  5  8 64  5 28  9
we notice that data object 7 has distance 2 to X.test[i,], meaning this bear
has a Length 2 cm longer or shorter than our bear of interest. We also notice
that data objects 15 and 18 have small distances to our X.test[i,]. Next, we
must find which distances are the smallest. Notice, we are not interested in the
actual distances. We are only interested in which data objects have the smallest
distances (not in how small they are).
To find this index vector (called Ii above) we use the function order in R.
It takes a vector as input, and returns the index vector that, when applied
to the input vector, will re-arrange its elements in ascending order.
This may sound complicated, so let us illustrate:
> idx <- order(d)
> idx
 [1]  7 15 18 21 11 12 19 23 16  4 10  3  8 17  5  6 22 13  9  2  1 14 20
> d[idx]
 [1]  2  5  5  5  8  8  8  9 15 16 19 20 23 23 28 28 28 30 32 34 41 61 64
We give d as input, and the output is stored in idx. Notice that idx specifies
that element 7 of d should be put first, then element 15, then element 18, then
21, and so on, to obtain a sorted version of d. We also use the index vector to
produce this sorting, by d[idx], and indeed we see the elements are now shuffled
in a way that sorts them in ascending order. You should really make certain
you understand how order works; this is one of the most convenient functions
in R!
In the moving average method there is one parameter, the number of neigh-
bours we should consider. This is the K in the KNN name. In this example
we will use K = 3, but we will return to this choice later. The prediction is made
by adding the following lines of code:
idx <- order(d)
K <- 3
y.test.hat <- mean(y.train[idx[1:K]])
Notice how we use the K first elements of idx to directly specify which elements
in y.train to compute the mean from. The whole procedure is illustrated in
Figure 13.2.
Notice that the moving average method does not fit an overall model to the
training data first, and then use some predict function, as we saw for lm and
lda in the previous chapter. Instead, we use the Xtest data object and make a
look-up in the training data for similar data objects, and fit a very simple model
to these. This model is only valid at this particular point, and as soon as we
move to some other Xtest values, we must repeat the procedure.
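The whole look-up can be collected in a small function; this is our own sketch, with argument names mimicking those used in the text:

```r
# Moving average (KNN regression) prediction for one test value, using
# the absolute difference as distance
knn.mean <- function(x.test, y.train, x.train, K = 3) {
  d <- abs(x.test - x.train)    # distances to all training objects
  idx <- order(d)               # nearest neighbours first
  mean(y.train[idx[1:K]])       # average response among the K nearest
}

# Toy example: the three nearest x-values to 2.1 are 2, 3 and 1
knn.mean(2.1, y.train = c(10, 20, 30, 100), x.train = c(1, 2, 3, 10))  # 20
```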
Figure 13.2: The filled dots (blue and red) are the training data. The red
vertical broken line marks the length of the test-set bear. Based on its length,
we find the K = 3 bears in the training data with most similar lengths, these
are the red filled dots. The green triangle is the predicted weight, computed as
the mean of the weights behind the red dots. The open red dot is the actual
observed weight for the test-set bear.
If we source this function, we can use it to predict the Weight of all bears in the
data set by the following script:
y.hat <- rep(0, 24)
for (i in 1:length(y.hat)) {
  y.hat[i] <- locLinReg(beardata$Length[i],
                        beardata$Weight[-i],
                        beardata$Length[-i], K = 6)
}
Figure 13.3: The blue dots are the observed and the brown triangles the pre-
dicted weights of bears using the local linear regression function from the text.
Figure 13.4: The curve shows the predicted weights of bears with lengths from
91 to 187 cm using the loess function as described in the text.
1. For a data object in the test-set (Xtest [i, ]), compute the distance to all
data objects in the training-set (Xtrain ).
2. Find the K training-set objects having the smallest distance to the test-set
data object. We call these the nearest neighbours, and the index vector
Ii tells us where we find them among the training-set objects.
3. Classify ytest to the most common class among the nearest neighbours,
i.e. a simple majority vote.
Thus, the only difference is that instead of computing a mean value (which is
senseless since the response is no longer numerical) we count how often we see
the various classes (factor levels) among the nearest neighbours, and choose the
most common one.
In the class package in R you find a standard implementation of this method
in the knn function. There are, however, good reasons for us to make our own
version of the KNN method. First, it is the best and perhaps only way of really
understanding the method. In fact, unless you can build it, you haven’t really
understood it (who said this?). Second, the knn function computes unweighted
euclidean distances only, which is something we may want to expand, as we will
see below. In the exercises we will return to the KNN classification.
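As a first step towards such a home-made version, the majority vote for a single test value can be sketched like this (our own function and argument names, with absolute difference as the distance):

```r
# KNN classification of one test value by majority vote
knn.class <- function(x.test, y.train, x.train, K = 3) {
  d <- abs(x.test - x.train)        # distances to all training objects
  nn <- y.train[order(d)[1:K]]      # classes of the K nearest neighbours
  votes <- table(nn)                # count each class among them
  names(votes)[which.max(votes)]    # most common class (first one on ties)
}

# Toy example: the 3 nearest neighbours of 1.5 have classes M, M and F
knn.class(1.5, factor(c("M", "M", "F", "F")), c(1, 2, 3, 10))  # "M"
```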
13.6 Distances
Common to all the local methods we have looked at here is the need for comput-
ing distances. The nearest neighbours are simply defined by how we compute
the distances, and this step becomes the most important in all these methods.
In the bears example we used a single explanatory variable. In general, we will
more often use several, and the matrices Xtrain and Xtest can have two or more
columns. In such cases we need to give some thought to how we compute a
distance between two data objects.
d[i, j] = \left( \sum_{k=1}^{p} |X[i, k] - X[j, k]|^q \right)^{1/q} \qquad (13.1)
where the exponent q > 0. This is a general formula, and by choosing different
values for q we get slightly different ways of computing a distance.
If we use q = 2 the Minkowski formula gives us the standard euclidean dis-
tance

d[i, j] = \sqrt{ \sum_{k=1}^{p} |X[i, k] - X[j, k]|^2 } \qquad (13.2)
which is what we call 'distance' in daily life. Can you see how this follows from
Pythagoras' theorem? Imagine we have two explanatory variables, i.e. p = 2.
Any data object can then be plotted as a point in the plane spanned by these
two variables. The euclidean distance between two points (data objects X[i, ]
and X[j, ]) is the length of the straight line between them.
Now, imagine that in this plane you can only travel in straight lines, either hori-
zontal or vertical (no diagonals). This is typically the case if you walk in a city
centre like Manhattan. If you are going from one point to another, you have
to follow the streets and avenues, and there is no way you can follow diagonal
straight lines (unless you can walk through concrete walls). Thus, the distance
between two points is no longer euclidean! Instead, if we use q = 1 in the
Minkowski formula, we get the Manhattan distance

d[i, j] = \sum_{k=1}^{p} |X[i, k] - X[j, k]| \qquad (13.3)

which is the distance between two points X[i, ] and X[j, ] trapped inside a grid.
Notice how we just sum the absolute difference in each coordinate.
We can also use q = 3 or larger. If we allow q to approach infinity we get

d[i, j] = \lim_{q \to \infty} \left( \sum_{k=1}^{p} |X[i, k] - X[j, k]|^q \right)^{1/q} = \max_{k} |X[i, k] - X[j, k]| \qquad (13.4)
and we see that the second data object in Xtrain is the nearest neighbour of
Xtest. If we compute the Manhattan distance from the same data we get

d = \begin{pmatrix} |0.0 - 1.0| + |0.0 - 1.1| \\ |0.0 - 0.4| + |0.0 - 1.2| \\ |0.0 - 0.0| + |0.0 - 1.5| \end{pmatrix} = \begin{pmatrix} 2.1 \\ 1.6 \\ 1.5 \end{pmatrix}
and this time the third data object is the nearest neighbour. Finally, if we use
the maximum-distance

d = \begin{pmatrix} \max_{k=1,2}(|0.0 - 1.0|, |0.0 - 1.1|) \\ \max_{k=1,2}(|0.0 - 0.4|, |0.0 - 1.2|) \\ \max_{k=1,2}(|0.0 - 0.0|, |0.0 - 1.5|) \end{pmatrix} = \begin{pmatrix} 1.1 \\ 1.2 \\ 1.5 \end{pmatrix}
which means the first data object produces the smallest distance from Xtest .
This shows that the choice of distance metric may have an effect, and that we
should be conscious of what we use.
There is a function in R for computing distances between the objects, called
dist. It takes as input a matrix and computes distances between all pairs of
data objects of this matrix. However, keep in mind that we do not really seek
the distances between the data objects of the training-set (or test-set). Instead,
we want to have the distance from a test data object, Xtest [i, ], to all training
data objects in Xtrain . Since we are now experts in R programming, we can
just as well make this function ourselves!
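Here is a sketch of such a function based on the Minkowski formula; the data below are our reconstruction of the small example above (test object at the origin, three training objects with two variables each):

```r
# Minkowski distance from one test object to every row of X.train
# (q = 2 euclidean, q = 1 Manhattan, q = Inf the maximum-distance)
minkowski.dist <- function(x.test, X.train, q = 2) {
  if (is.infinite(q)) {
    apply(X.train, 1, function(x) max(abs(x.test - x)))
  } else {
    apply(X.train, 1, function(x) sum(abs(x.test - x)^q)^(1 / q))
  }
}

x.test <- c(0.0, 0.0)
X.train <- rbind(c(1.0, 1.1), c(0.4, 1.2), c(0.0, 1.5))
which.min(minkowski.dist(x.test, X.train, q = 2))    # 2: euclidean
which.min(minkowski.dist(x.test, X.train, q = 1))    # 3: Manhattan
which.min(minkowski.dist(x.test, X.train, q = Inf))  # 1: maximum-distance
```

Each choice of q points out a different nearest neighbour, just as in the worked example above.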
13.8 Exercises
13.8.1 Classification of bacteria
We have a data set where thousands of bacteria have been assigned to three
different Phyla. All bacteria are divided into a hierarchy starting at Phylum
(division), then Class, ..., down to Species. Thus, Phylum is the top level
within the kingdom of bacteria. For each bacterium we have also measured the
size of the genome (DNA) in megabases, and the percent of GC in the genome.
DNA consists of the four bases A,C,G and T, and GC is simply the percentage
of G and C among all bases.
Is it possible to recognize the Phylum of a bacterium just by measuring the
genome size and the GC-percentage?
In order to answer this we need to do some pattern-recognition (which is
another term for classification).
There will be two values in both cases, since we have two explanatory variables.
Now, we subtract the cntrs values from the corresponding columns of phyla.tst
and then divide each column of phyla.tst by the corresponding value in sdvs.
You should make a plot as in part A above, of the training-set using open
circles as markers, and the test-set with crosses as markers, just to verify that
they are found more or less in the same part of the scaled Size-GC-plane. Notice
how the numbers on the axes are now comparable in size.
Part C - LDA
Fit an LDA-model to the training-set, and predict the classes of the test-set (see
chapter 12). Compute the accuracy of this fitted model.
Make a plot to compare the predicted and the true classes. Make two panels,
and plot the predicted classes as in part A above in the left panel and the true
classes in a similar way in the right panel.
Part D - KNN
Next, we classify each test-set object by KNN. You can use the function knn in
the class package for this. Again, compute the accuracy, and make plots like
for LDA above. How does it compare to LDA?
Chapter 14
Cross-validation
Searching, and finding, the optimal value for K is often referred to as model
selection.
It should be noted that the bias-variance trade-off applies to all models, not
only the local models of the previous chapter. In multiple linear regression or
LDA we include several explanatory variables to predict the response. As we saw
briefly in chapter 12, these explanatory variables can be either distinct variables
or just higher-order terms of some other variable, or interactions between two or
more variables (with higher order terms...etc). Anyway, we can choose to include
or exclude an explanatory variable in such a model. The inclusion/exclusion of
variables in this setting corresponds exactly to choosing the value of K above:
• Include many variables, which means smaller bias, but larger variance.
• Include few variables, which means larger bias, but smaller variance.
Again there is always some optimal level of inclusion/exclusion, and finding the
best set of explanatory variables to include in the model is again called model
selection.
Notice that with the local methods, the model as such is fixed but the data
set used for fitting varies by the size of the neighbourhood. If we use a moving
average, it has one single parameter, the mean inside the given neighbourhood.
This parameter must be estimated from the data in the neighbourhood, and the
more data we have, the better it is estimated. For LDA and multiple regression,
the data set used for fitting is fixed, but the number of parameters (e.g. the β's
in the regression model) increases/decreases depending on how many variables we
include/exclude.
Figure 14.1: A schematic overview of the trade-off between bias and variance.
The horizontal axis indicates the choice of neighborhood size K in the local
models. Using a small K will produce a small bias (blue curve), but a large
variance (red curve). The sum of them (green curve) will be quite large. At the
other end of the K-axis we get opposite effects, and again the sum is quite large.
Somewhere in between we often find a balance where both bias and variance
are fairly small, and their sum is minimized. Note: The shape of the total error
(green curve) is in reality never as smooth as shown here, but this illustrates
the principle.
where m is the number of data objects in the test-set, and ŷtest [i] is the predicted
value of ytest [i].
For classification problems the MSEP is replaced by the Classification Error
Rate (CER):

CER = \frac{1}{m} \sum_{i=1}^{m} I(y_{test}[i] \neq \hat{y}_{test}[i]) \qquad (14.2)

where the function I() takes the value 1 if its input is TRUE, and 0 otherwise.
It just counts the number of cases where we have mis-classified. In chapter
12 we mentioned the accuracy of a classification, and the CER is 1 minus the
accuracy.
Both the MSEP and the CER are quantities we can compute, and they
can both be seen as a sum of bias + variance. Notice there are some important
aspects here:
1. The training- and the test-sets are strictly separated. When we fit the
model to (ytrain , Xtrain ) we do not involve any information about (ytest , Xtest ).
The fitted model is based only on (ytrain , Xtrain ).
2. When predicting and computing ŷtest [i], we use the fitted model and Xtest .
No information about ytest is used, we pretend it is unknown.
3. Both training- and test-set data must come from the same population, i.e.
all objects in the test set could have been part of the training set, and vice
versa, and we think of it as random which objects are in which subsets.
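These points can be sketched as a leave-one-out loop. Here is our own toy example with lm() on synthetic data; only the fitting and prediction lines change if another method is plugged in:

```r
# Leave-one-out cross-validation of a simple linear regression
set.seed(42)
x <- runif(20, 0, 10)
y <- 2 + 3 * x + rnorm(20)                   # the 'truth' plus noise

L <- length(y)
err <- rep(0, L)
for (i in 1:L) {
  train <- data.frame(x = x[-i], y = y[-i])  # everything except object i
  mod <- lm(y ~ x, data = train)             # fit without object i
  y.hat <- predict(mod, newdata = data.frame(x = x[i]))
  err[i] <- (y[i] - y.hat)^2                 # squared prediction error
}
MSEP <- mean(err)
```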
As you can see, the fitting and prediction have been left as comments here, and
must be filled in depending on the method we use. The cross-validation itself
will work for any method plugged in.
Notice that if the full data set has L = 1000000 data objects, the loop above will run a million times. The model is (re-)fitted inside the loop, and if this takes some time, the looping may take a long time. If the fitting takes 1 second, then 1 million iterations will take more than 11 days to complete!
Also, if the data set is huge, the training-set is almost identical in each
iteration, and the fitted models will be almost identical. In a huge data set
almost nothing changes by leaving out a single data object. Leaving out larger
subsets of the data is required to get some real variation between the fitted
models.
The vector segment will now have one element for each data object. Each element is an integer from 1 to C, and objects with the same integer belong to the same segment. With this segment vector, two consecutive data objects (in the sorted order) never belong to the same segment, and since we first sorted the objects this guarantees maximum spread over the segments.
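A sketch of how such a segment vector can be constructed (the toy response y is an assumption; only the sort-then-deal idea matters): sort the objects by the response, then deal out the labels 1,...,C round-robin along the sorted order.

```r
# Construct C-fold segments with maximum spread over the response values.
C <- 10
y <- rnorm(95)                 # toy response; stands in for the real data
ord <- order(y)                # the data objects in sorted order
segment <- integer(length(y))
# deal labels 1..C repeatedly along the sorted order
segment[ord] <- rep(1:C, length.out = length(y))
```

Consecutive objects in the sorted order now always get different labels, and the segment sizes differ by at most one.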
The C-fold cross-validation algorithm can then be sketched as
err <- rep(0, L)
for (j in 1:C) {
  # Split into (y.train, X.train) and (y.test, X.test)
  idx.test <- which(segment == j)
  y.test  <- y[idx.test]      # elements idx.test in y
  X.test  <- X[idx.test, ]    # rows idx.test in X
  y.train <- y[-idx.test]     # elements NOT idx.test in y
  X.train <- X[-idx.test, ]   # rows NOT idx.test in X
The only difference from the leave-one-out code is the way we find idx.test, and that we compute the error for several data objects in each iteration.
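A complete C-fold run can be sketched as below. The toy data and the use of lm() as the plugged-in method are assumptions for the sake of a runnable example; note how each fold fills in the errors for a whole segment at once before the final averaging.

```r
# C-fold cross-validation sketch: errors for a whole segment per iteration.
set.seed(7)
L <- 100; C <- 10
X <- runif(L, 0, 30)                       # single explanatory variable
y <- 3 + 0.8 * X + rnorm(L, sd = 2)        # toy response
segment <- integer(L)
segment[order(y)] <- rep(1:C, length.out = L)   # segments as described above

err <- rep(0, L)
for (j in 1:C) {
  idx.test <- which(segment == j)
  y.train <- y[-idx.test];  X.train <- X[-idx.test]
  mod <- lm(y.train ~ X.train)                       # the plugged-in method
  y.hat <- coef(mod)[1] + coef(mod)[2] * X[idx.test] # predict all test objects
  err[idx.test] <- (y[idx.test] - y.hat)^2           # errors for the segment
}
MSEP <- mean(err)   # average over all L data objects
```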
[Figure 14.2 about here: a scatterplot; x-axis Global radiation (0 to 30), y-axis Maximum air temperature (−10 to 30).]
Figure 14.2: Each of the 9108 dots represents a daily weather observation of maximum air temperature and global radiation.
In Figure 14.2 we have plotted the response (maximum air temperature) versus
the single explanatory variable (radiation).
y = β[1] + β[2]x + e
where y is the temperature and x is the radiation. The result looks like this:
Call:
lm(formula = Air.temp.max ~ Radiation, data = daily.maxtemp.rad)

Residuals:
     Min       1Q   Median       3Q      Max
-21.3020  -3.8528   0.1647   4.1500  16.0160

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.709310   0.090538   29.93   <2e-16 ***
Radiation   0.818983   0.007475  109.56   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see the slope is estimated at 0.82, and in Figure 14.3 we have added the fitted line to the scatterplot of the data.
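The fit and the plot can be reproduced along these lines. The synthetic stand-in data below is an assumption (the real daily.maxtemp.rad data set is not reproduced here), so the numbers will differ from the output above, but the code pattern is the same.

```r
# Synthetic stand-in for the book's daily.maxtemp.rad data set (assumption).
set.seed(1)
daily.maxtemp.rad <- data.frame(Radiation = runif(300, 0, 30))
daily.maxtemp.rad$Air.temp.max <- 2.7 + 0.82 * daily.maxtemp.rad$Radiation +
  rnorm(300, sd = 5)

# Fit the simple linear regression and inspect it
fitted.mod <- lm(Air.temp.max ~ Radiation, data = daily.maxtemp.rad)
summary(fitted.mod)   # prints a summary table like the one shown above

# Scatterplot of the data with the fitted line added
plot(Air.temp.max ~ Radiation, data = daily.maxtemp.rad, pch = 20)
abline(fitted.mod, col = "red", lwd = 2)
```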
[Figure 14.3 about here: the scatterplot of maximum air temperature versus global radiation (x-axis 0 to 30), now with the fitted regression line added.]
Notice that this simple version of knn-regression only takes a single explanatory variable as input, i.e. both X.test and X.train must be vectors (they cannot have more than one column). For a given choice of K (the number of neighbours) we can choose any radiation value within the span of this data set, and compute the corresponding predicted maximum air temperature by this function. The problem is how to choose a proper K, and we will first use cross-validation to give us some hint about this choice.
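A minimal version of such a knn-regression function could look like the sketch below. The function name and argument order are assumptions; the book's own version may differ in details, but the idea is the same: predict at x0 as the mean response of the K nearest training points.

```r
# Sketch of knn-regression with a single explanatory variable.
knn.regression <- function(x0, X.train, y.train, K) {
  d <- abs(X.train - x0)        # distances from x0 to all training points
  nearest <- order(d)[1:K]      # indices of the K nearest neighbours
  mean(y.train[nearest])        # prediction = mean of their responses
}
```

For example, knn.regression(15, Radiation, Air.temp.max, 1000) would predict the maximum air temperature at a global radiation of 15, using the 1000 nearest observations.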
First we need to decide upon some values of K that we want to try out. We
choose
K.values <- c(50, 100, 300, 500, 1000, 2000, 3000)
nK <- length(K.values)
Since the computations will take some time, we start out with a small number of K values, and may increase this later. For each value in K.values we want to compute the MSEP from equation (14.1), as a measure of prediction error. Thus we need a vector
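Such a vector would hold one MSEP value per candidate K; it could be set up as follows (the name MSEP.values is a hypothetical choice, since the book's own variable name is not shown here):

```r
K.values <- c(50, 100, 300, 500, 1000, 2000, 3000)
nK <- length(K.values)
MSEP.values <- rep(0, nK)   # one prediction error per candidate K (hypothetical name)
```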
which means the data.frame daily.maxtemp.rad has been sorted by Air.temp.max and has then been given a new column named Segment. All data objects having the same value of Segment belong to the same segment.
We can then start the looping:
attach(daily.maxtemp.rad)
# The first loop is over the K-values
for (i in 1:nK) {
  cat("For K =", K.values[i], sep = " ")
  # Then we loop over the cross-validation segments
  err <- rep(0, L)
  for (j in 1:C) {
    # Split into (y.test, X.test) and (y.train, X.train)
    idx.test <- which(Segment == j)
    y.test  <- Air.temp.max[idx.test]
    X.test  <- Radiation[idx.test]
    y.train <- Air.temp.max[-idx.test]
    X.train <- Radiation[-idx.test]
Notice the double looping: we first have to consider each element in K.values, and then for each of them consider every split into test- and training-sets. This makes the computations slow, which is also why we add some cat statements here and there. It is good to see some output during long computations, just to verify that things are proceeding as they should. Here is the output we get:
14.3. EXAMPLE: TEMPERATURE AND RADIATION 157
[Figure 14.4 about here: MSEP (roughly 31.0 to 32.5 on the vertical axis) plotted against the candidate values of K.]
Figure 14.4: The brown dots show the MSEP values computed for the corresponding choice of K. The shape is typical in the sense that both a too small and a too large choice of K produce larger errors than some 'optimal' value. In this case it seems like the optimal choice of K is somewhere between 500 and 1000.
In Figure 14.4 we have plotted how the prediction error MSEP varies over the different choices of K. It looks like a K close to 1000 is a good choice for this particular data set, giving the best balance between a too small K (inflating the variance) and a too large K (inflating the bias).
[Figure about here: the scatterplot of maximum air temperature versus global radiation once more; the figure continues beyond this excerpt.]
●●
●
●●
● ●
●●●
●
●●● ●●●
●
●
●
●●● ●● ●●●●● ●
● ● ●● ●●
●● ● ●●● ●● ●
●
●
●
●
●
●●
●
●●●
●●●
●
●●
●
●
●
●●
●●
●●
●
●●
●●
●● ●
●
●●
●
●●
●
●●●● ●●●●● ●
● ●
● ●●●●●●
●● ●● ● ●● ●●
●●
●
● ●
● ● ●● ● ●●● ● ● ●●● ●● ● ● ●●
●● ● ● ● ● ●
●●
●
●●
●●
●
●●
● ●
● ●
●●● ● ●●●
● ●
●●
● ● ●●
●●●●
● ●
●●
●●●●
●● ●
●●
●●●● ●● ●● ●● ●●
●●● ●● ●● ●●● ●●●
● ●● ● ●● ●● ●●●● ●●● ●● ●●● ●
●
●
●
●
●
●
●●
●
●
●
●●
●●
●
●●
●
●
●
●
●
●
●
●●
●
●●
●
●
●●
●
● ●
●
●●
●
●●
●
●
●
●●
●●
● ●
●●
●●●
●●●
●
●●
●●
●
●●
●● ●●●
●●●●●●
●● ●
●●●● ●●●●
●● ●● ●●●● ●
●
●●● ●●●
●
●● ●
●●
●●●●●●●●● ●● ●●● ●● ● ● ● ● ● ●● ●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●●
●
●●
●●
●
●
●
●
●
●
●●
●●●
●
●
●●●●●●
●
●
●
●●
●
●●
●●
●
● ●●
●
●
●●
●●●●●●●●
●●
●
●
●
●●●●●●
●
●
●● ● ●● ●●●●●●
● ●
●●●
●
●
●
●●●● ●
●●●●● ●●
●● ● ●
● ●● ●● ● ●● ● ●● ● ●●● ●●●● ●●
●
●
●
●●
●
●
●●
●●
●●
●●
●●
●
●
●
●●
●●
●
●
●●●●●
●●●●● ● ● ●●
●
●● ●●
●● ●
●●●●●●
● ●
●●●
● ●
●●● ●●● ●
●● ●●●
●●● ●
●●● ●● ●●
●●●● ● ●
●
●●
●
●
●
●
●
●●
●●
●
●
●●
●
●●
●●
●●
●
●
●●
●●
●●●
●
●●
●●●
●●
●●●●●●
●
●
●
●●●
●●
●●
●
●● ●
●
●
●
● ●●●●●● ●
●
●●
●
●●●
● ●●●
●● ●
●● ●
●
●● ●●
● ●●
●●● ●
●●
●●
●
●
●
●
●●●● ●●● ● ●● ● ●●
● ●
●
●●●●● ● ● ●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●●● ●●●
●●●
●
●●
●
●●
●●●
●
●
●●●●
● ●●●●●●
●
● ●●
● ●● ●●● ● ●●●
●
●●
●
●
●
●●●● ●
●●●●●●●● ● ● ●●● ●●● ●●● ●
●
● ●
●●●●●
●●●
●
●●●●●
● ●●●
● ● ●
● ●
● ●●●●
●● ●●●
●● ● ●●● ● ● ●● ●● ●● ● ●
●●
●
●
●
●●
●●
●●
●●
●
●●
●
●
●●
●●
●
●●
●●
●
●●●●
● ●
●
●●●
●●
●●
●
● ●● ●
●
●●●●
●●●●
●● ●●●● ●● ●
● ●●●● ● ●●● ●●●●● ●●● ● ● ●●● ●● ● ●●
●
●
●
●
●
●●
●
●
●●
●
●●
●
●
●●
●
●
●
●●
●
●●
●
●●
●●
●
●●●
●●
●
●●
●
●
● ●
●
●
●●
●●
● ● ● ●●●
●●
●●
● ●●
●●●●●●●●●●
●
●
●●●
● ● ●
●●●●●●●●●● ●●●● ● ●● ●●
● ●● ● ●● ●●●
●● ●● ●●● ●●● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●●
●
●
●●●
●
●
● ●
●
●
●
●
●●
●●
●●●
● ●●
●
●
●●●●●●
●
●
●
●
●
●
●
●
●
●●
●
●
●●●●
●●
●
●● ●● ●●●● ●●
●
●● ●● ● ●●● ●●
●●● ●● ●●● ●● ●●● ● ●
●
●●
●
●
●●
●
●●●
●
●●
●
●
●
●●
●
●●
●
●
●●
●
●●
●
●●
●●●●●●
●●
●● ●●
● ●●
●
●●●
●
●●●● ●●
●● ●●● ●●● ● ● ● ●
● ● ● ●●
●
●● ●
● ●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●●
●
●
●●●
●●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●● ●●
●●
●●●
●●
●
●
●●
●
●
●●
●
●
●●●●●●●
●
●●
●●● ●● ●●●● ●● ●
●●●●● ●●● ● ●●
●● ●
●
●
●
●●
●
●
●●
●
●
●●
● ●
●
●
●
●●
●
●
●●
●
●●
●
●
●
●●
●
●●●
●
●●
●
●
●●●
●●
●
●●●
●●
● ●● ●
● ●●
●
●●
●● ● ●●
●●● ● ● ●
● ● ● ● ● ● ●
0
●●
●
●●
●●
●●
●●●
●
●●
●
●
● ●
●● ●●●
●
●●●
●● ●●● ●
●
●●●●● ● ● ●● ●●●● ● ●●●●● ●●●● ●
●
●●
●
●
●●
●
●
●●
●●
●
●●
●●
●●
●
●
●●
●●
●
●●
●●
●●
●
●●
●●
●●
●●
●●●● ●● ●
● ●●
●●● ● ●●●
● ●●● ●● ● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●
●●●
●
●●
●●
●●
●
●●
●●●●●●
●●● ●● ●●● ●●● ●●●●
● ● ●●
●
●
●● ●
● ● ●● ● ●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●●●
●
●
●●
●
●●●
●
●
●
●
●
●
●
●●●●
●●
●
●
●●●●●●
●
●●●● ●●● ●●●●● ● ●● ● ●
●● ●● ●
●● ●
● ●
●
●●
●
●
●●
●●●
●
●●
●●●
● ●
●
● ●●
●●
●●● ●
●● ● ●● ●●●●● ●●● ●● ●
●
● ●
●
●●
●
●
●●
●●
●●
●●●
●
●●
●
●●
●
●●●
●
●●●
● ●
●●
●
●●●
● ●
● ●
● ● ● ●
● ● ●
●
●
●
●●●
●●
●●●
●
●
●
●
●●●● ●●
●●●●● ●●●●● ● ● ● ● ● ●
●●
●●●
●
●
●
●
●
●●
●
●●●
●●
●
● ●●
●
●●
●●●● ●●●
●●●●●●● ●●● ●●●
● ●●● ●
●
●● ●●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●●
●
●● ● ●
●
●●
●
●
●●●
● ●● ● ● ● ● ● ● ●
●●●
●●
●
●
●●●●
●●●●●●
●
● ●
●
●●
●●●●● ● ●●●● ● ● ● ●●● ●
●
●
●
●
●
●●
●●
●
● ●
●
●●●
●
●●
●●
● ●●●●
●
●●● ● ●
●●
●●● ● ● ●● ●● ●●
● ●
●
●●●●●●
●●
●●
● ●●● ●
●
●● ●
●●● ● ●● ● ●● ●●
●
● ● ●
●
●
●
●●
●
●
●●●●●
●
●●
●
●● ●● ●● ● ●● ● ● ●● ●
●
●
●●
●●
●●●
●●●
●●
●
●●
●●●
●●●
●
●
● ● ●● ●● ● ● ● ●●● ●
●●
●
●
●●● ● ●
● ● ● ● ●
−10
●●●●●●● ● ●● ●
●●
●
●●● ●●
● ●●● ●●● ●
● ●●● ●● ●
●
●
●
●
●
●
● ●● ●●● ●● ●
●
●●
●
●
●
●●●●● ● ●● ●
●● ●●
●●●● ● ●
●
●● ●● ●
●● ●
●● ●
● ●
●●●
●●
●●●
● ●
●
0 5 10 15 20 25 30
Global radiation
Figure 14.5: The curve shows the relation between radiation and maximum air
temperature as described by the local knn-regression method.
skies will have slightly larger radiation (close to 1), but at the same time also
very cold weather with lower maximum temperatures. Thus, the negative effect
of radiation on maximum temperature is actually present on roughly one out of
six days during a year here at NMBU! This is not something we can 'see' by
plotting, since the data are so dense, and the linear model, which 'looks' fine
to our eye, will completely obscure this effect.
In conclusion, a local model that has been fine-tuned by model selection may
detect relations that we would otherwise have missed.
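To make the idea concrete, here is a minimal sketch of a knn-regression with a single predictor, in the spirit of the radiation-maxtemp relation above. The function name knn.reg.simple and the simulated data are our own illustrations, not code from the example:

```r
# A minimal knn-regression sketch: predict at each new x by averaging the
# responses of its K nearest training points (1 predictor only).
knn.reg.simple <- function(x.train, y.train, x.new, K = 5) {
  sapply(x.new, function(x0) {
    d <- abs(x.train - x0)      # distances to all training points
    idx <- order(d)[1:K]        # indices of the K nearest neighbours
    mean(y.train[idx])          # predict by their average response
  })
}

# Tiny illustration on simulated data
set.seed(1)
x <- runif(100, min = 0, max = 30)
y <- 2 + 0.5 * x + rnorm(100)
y.hat <- knn.reg.simple(x, y, x.new = c(5, 15, 25), K = 7)
```

Evaluating knn.reg.simple over a fine grid of x.new values and plotting the result produces a curve like the one in Figure 14.5; the choice of K controls how smooth it is.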
14.4 Exercises
14.4.1 Classification of bacteria again
Expand last week's exercise on classification of bacteria by implementing a
cross-validation to estimate the CER (classification error rate). Use the code
from the radiation-maxtemp example above. Try different choices of K to see
(approximately) which choice of K is optimal. HINT: Try small values of K.
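As a starting point for this exercise, a cross-validation loop around a simple knn-classifier could look like the sketch below. Both function names (knn.classify and cv.cer) are hypothetical helpers of our own, not functions from earlier chapters:

```r
# Majority-vote knn-classifier: X.train is a matrix (rows = objects),
# y.train a vector of class labels, X.new the objects to classify.
knn.classify <- function(X.train, y.train, X.new, K = 3) {
  apply(X.new, 1, function(x0) {
    d <- sqrt(colSums((t(X.train) - x0)^2))   # Euclidean distances
    nn <- y.train[order(d)[1:K]]              # labels of the K neighbours
    names(which.max(table(nn)))               # majority vote among them
  })
}

# n.folds-fold cross-validation estimate of the classification error rate
cv.cer <- function(X, y, K = 3, n.folds = 10) {
  fold <- sample(rep(1:n.folds, length.out = nrow(X)))  # random fold labels
  err <- 0
  for (f in 1:n.folds) {
    y.hat <- knn.classify(X[fold != f, , drop = FALSE], y[fold != f],
                          X[fold == f, , drop = FALSE], K = K)
    err <- err + sum(y.hat != y[fold == f])   # count misclassifications
  }
  err / nrow(X)                               # the estimated CER
}
```

Calling cv.cer for a range of K values, e.g. K = 1, 3, 5, ..., and plotting the estimated CER against K shows approximately which choice of K is optimal.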