A Little Book of R For Multivariate Analysis
Release 0.1
Avril Coghlan
CONTENTS

1 How to install R
  1.1 Introduction to R
  1.2 Installing R
  1.3 Installing R packages
  1.4 Running R
  1.5 A brief introduction to R
  1.6 Links and Further Reading
  1.7 Acknowledgements
  1.8 Contact
  1.9 License
2 Using R for Multivariate Analysis
3 Acknowledgements
4 Contact
5 License
By Avril Coghlan, Wellcome Trust Sanger Institute, Cambridge, U.K. Email: [email protected]
This is a simple introduction to multivariate analysis using the R statistics software.
There is a pdf version of this booklet available at: https://media.readthedocs.org/pdf/little-book-of-r-for-multivariate-analysis/latest/little-book-of-r-for-multivariate-analysis.pdf.
If you like this booklet, you may also like to check out my booklet on using R for biomedical statistics, http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/, and my booklet on using R for time series analysis, http://a-little-book-of-r-for-time-series.readthedocs.org/.
CHAPTER ONE
HOW TO INSTALL R
1.1 Introduction to R
This little booklet has some information on how to use R for multivariate analysis.

R (www.r-project.org) is a commonly used free statistics software. R allows you to carry out statistical analyses in an interactive mode, as well as allowing simple programming.
1.2 Installing R
To use R, you first need to install the R program on your computer.
New releases of R are made very regularly (approximately once a month), as R is actively being improved all the
time. It is worthwhile installing new versions of R regularly, to make sure that you have a recent version of R (to
ensure compatibility with all the latest versions of the R packages that you have downloaded).
1.3 Installing R packages

3. Click on the Start button at the bottom left of your computer screen, and then choose All programs, and start R by selecting R (or R X.X.X, where X.X.X gives the version of R, eg. R 2.10.0) from the menu of programs.
4. The R console (a rectangle) should pop up.
5. Once you have started R, you can now install an R package (eg. the rmeta package) by choosing Install
package(s) from the Packages menu at the top of the R console. This will ask you what website you
want to download the package from, you should choose Ireland (or another country, if you prefer). It will
also bring up a list of available packages that you can install, and you should choose the package that you
want to install from that list (eg. rmeta).
6. This will install the rmeta package.
7. The rmeta package is now installed. Whenever you want to use the rmeta package after this, after
starting R, you first have to load the package by typing into the R console:
> library("rmeta")
Note that there are some additional R packages for bioinformatics that are part of a special set of R packages called Bioconductor (www.bioconductor.org), such as the yeastExpData R package, the Biostrings R package, etc. These Bioconductor packages need to be installed using a different, Bioconductor-specific procedure (see How to install a Bioconductor R package below).
6. This will install a core set of Bioconductor packages (affy, affydata, affyPLM, annaffy, annotate, Biobase, Biostrings, DynDoc, gcrma, genefilter, geneplotter, hgu95av2.db, limma, marray, matchprobes, multtest, ROC, vsn, xtable, affyQCReport). This takes a few minutes (eg. 10 minutes).
7. At a later date, you may wish to install some extra Bioconductor packages that do not belong to the core set
of Bioconductor packages. For example, to install the Bioconductor package called yeastExpData, start
R and type in the R console:
> source("http://bioconductor.org/biocLite.R")
> biocLite("yeastExpData")
8. Whenever you want to use a package after installing it, you need to load it into R by typing:
> library("yeastExpData")
1.4 Running R
To use R, you first need to start the R program on your computer. You should have already installed R on your
computer (see above).
To start R, you can either follow step 1 or 2:

1. Check if there is an R icon on the desktop of the computer that you are using. If so, double-click on the R icon to start R. If you cannot find an R icon, try step 2 instead.
2. Click on the Start button at the bottom left of your computer screen, and then choose All programs, and
start R by selecting R (or R X.X.X, where X.X.X gives the version of R, eg. R 2.10.0) from the menu of
programs.
This should bring up a new window, which is the R console. Within the console you will see the ">" symbol, which is the R prompt. We type the commands needed for a particular task after this prompt. The command is carried out after you hit the Return key.
1.5 A brief introduction to R

Once you have started R, you can start typing in commands, and the results will be calculated immediately, for example:
> 2*3
[1] 6
> 10-3
[1] 7
All variables (scalars, vectors, matrices, etc.) created by R are called objects. In R, we assign values to variables
using an arrow. For example, we can assign the value 2*3 to the variable x using the command:
> x <- 2*3
To view the contents of any R object, just type its name, and the contents of that R object will be displayed:
> x
[1] 6
There are several possible different types of objects in R, including scalars, vectors, matrices, arrays, data frames,
tables, and lists. The scalar variable x above is one example of an R object. While a scalar variable such as x
has just one element, a vector consists of several elements. The elements in a vector are all of the same type (eg.
numeric or characters), while lists may include elements such as characters as well as numeric quantities.
To create a vector, we can use the c() (combine) function. For example, to create a vector called myvector that has
elements with values 8, 6, 9, 10, and 5, we type:
> myvector <- c(8, 6, 9, 10, 5)
To see the contents of the variable myvector, we can just type its name:
> myvector
[1]  8  6  9 10  5
The [1] is the index of the first element in the vector. We can extract any element of the vector by typing the vector
name with the index of that element given in square brackets. For example, to get the value of the 4th element in
the vector myvector, we type:
> myvector[4]
[1] 10
In contrast to a vector, a list can contain elements of different types, for example, both numeric and character
elements. A list can also include other variables such as a vector. The list() function is used to create a list. For
example, we could create a list mylist by typing:
> mylist <- list(name="Fred", wife="Mary", myvector)
We can then print out the contents of the list mylist by typing its name:
> mylist
$name
[1] "Fred"
$wife
[1] "Mary"
[[3]]
[1]  8  6  9 10  5
The elements in a list are numbered, and can be referred to using indices. We can extract an element of a list by
typing the list name with the index of the element given in double square brackets (in contrast to a vector, where
we only use single square brackets). Thus, we can extract the second and third elements from mylist by typing:
> mylist[[2]]
[1] "Mary"
> mylist[[3]]
[1] 8 6 9 10
Elements of lists may also be named, and in this case the elements may be referred to by giving the list name, followed by $, followed by the element name. For example, mylist$name is the same as mylist[[1]] and mylist$wife
is the same as mylist[[2]]:
> mylist$wife
[1] "Mary"
We can find out the names of the named elements in a list by using the attributes() function, for example:
> attributes(mylist)
$names
[1] "name" "wife" ""
When you use the attributes() function to find the named elements of a list variable, the named elements are always
listed under a heading $names. Therefore, we see that the named elements of the list variable mylist are called
name and wife, and we can retrieve their values by typing mylist$name and mylist$wife, respectively.
Another type of object that you will encounter in R is a table variable. For example, if we made a vector variable
mynames containing the names of children in a class, we can use the table() function to produce a table variable
that contains the number of children with each possible name:
> mynames <- c("Mary", "John", "Ann", "Sinead", "Joe", "Mary", "Jim", "John", "Simon")
> table(mynames)
mynames
   Ann    Jim    Joe   John   Mary  Simon Sinead 
     1      1      1      2      2      1      1 
We can store the table variable produced by the function table(), and call the stored table mytable, by typing:
> mytable <- table(mynames)
To access elements in a table variable, you need to use double square brackets, just like accessing elements in a
list. For example, to access the fourth element in the table mytable (the number of children called John), we
type:
> mytable[[4]]
[1] 2
Alternatively, you can use the name of the fourth element in the table (John) to find the value of that table
element:
> mytable[["John"]]
[1] 2
Functions in R usually require arguments, which are input variables (ie. objects) that are passed to them, and on which they then carry out some operation. For example, the log10() function is passed a number, and it then calculates the log to the base 10 of that number:
> log10(100)
[1] 2
In R, you can get help about a particular function by using the help() function. For example, if you want help
about the log10() function, you can type:
> help("log10")
When you use the help() function, a box or webpage will pop up with information about the function that you
asked for help with.
If you are not sure of the name of a function, but think you know part of its name, you can search for the function name using the help.search() and RSiteSearch() functions. The help.search() function searches to see if you already have a function installed (from one of the R packages that you have installed) that may be related to some topic you're interested in. The RSiteSearch() function searches all R functions (including those in packages that you haven't yet installed) for functions related to the topic you are interested in.
For example, if you want to know if there is a function to calculate the standard deviation of a set of numbers, you
can search for the names of all installed functions containing the word deviation in their description by typing:
> help.search("deviation")
Help files with alias or concept or title matching 'deviation' using
fuzzy matching:

genefilter::rowSds
nlme::pooledSD
stats::mad
stats::sd
vsn::meanSdPlot
Among the functions that were found is the function sd() in the stats package (an R package that comes with the standard R installation), which is used for calculating the standard deviation.

In the example above, the help.search() function found a relevant function (sd() here). However, if you did not find what you were looking for with help.search(), you could then use the RSiteSearch() function to see if a search of all functions described on the R website may find something relevant to the topic that you're interested in:
> RSiteSearch("deviation")
The results of the RSiteSearch() function will be hits to descriptions of R functions, as well as to R mailing list
discussions of those functions.
We can perform computations with R using objects such as scalars and vectors. For example, to calculate the
average of the values in the vector myvector (ie. the average of 8, 6, 9, 10 and 5), we can use the mean() function:
> mean(myvector)
[1] 7.6
We have been using built-in R functions such as mean(), length(), print(), plot(), etc. We can also create our own functions in R to do calculations that we want to carry out very often on different input data sets. For example, we can create a function to calculate the value of 20 plus the square of some input number:
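> myfunction <- function(x) { return(20 + (x*x)) }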
This function will calculate the square of a number (x), and then add 20 to that value. The return() statement
returns the calculated value. Once you have typed in this function, the function is then available for use. For
example, we can use the function for different input numbers (eg. 10, 25):
> myfunction(10)
[1] 120
> myfunction(25)
[1] 645
To quit R, type:
> q()
1.7 Acknowledgements
For very helpful comments and suggestions for improvements on the installation instructions, thank you very
much to Friedrich Leisch and Phil Spector.
1.8 Contact
I will be very grateful if you will send me (Avril Coghlan) corrections or suggestions for improvements to my
email address [email protected]
1.9 License
The content in this book is licensed under a Creative Commons Attribution 3.0 License.
CHAPTER TWO
USING R FOR MULTIVARIATE ANALYSIS
The examples in this chapter use a data set of the chemical composition of wine samples from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). There is one row per wine sample. The first column contains the cultivar of a wine sample (labelled 1, 2 or 3), and the following thirteen columns contain the concentrations of the 13 different chemicals in that sample. The columns are separated by commas.
When we read the file into R using the read.table() function, we need to use the sep= argument in read.table()
to tell it that the columns are separated by commas. That is, we can read in the file using the read.table() function
as follows:
> wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",
                     sep=",")
> wine
    V1    V2   V3   V4   V5  V6   V7   V8   V9  V10       V11   V12  V13  V14
1    1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29  5.640000 1.040 3.92 1065
2    1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28  4.380000 1.050 3.40 1050
3    1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81  5.680000 1.030 3.17 1185
4    1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18  7.800000 0.860 3.45 1480
5    1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82  4.320000 1.040 2.93  735
...
176  3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43 1.35 10.200000 0.590 1.56  835
177  3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53 1.46  9.300000 0.600 1.62  840
178  3 14.13 4.10 2.74 24.5  96 2.05 0.76 0.56 1.35  9.200000 0.610 1.60  560
In this case the data on 178 samples of wine has been read into the variable wine.
You can then use the scatterplotMatrix() function to plot the multivariate data.
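The scatterplotMatrix() function is part of the car R package, so you first need to install the car package (for example, by typing install.packages("car"), or via the Packages menu as described above) and then load it:

> library("car")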
To use the scatterplotMatrix() function, you need to give it as its input the variables that you want included in the
plot. Say for example, that we just want to include the variables corresponding to the concentrations of the first
five chemicals. These are stored in columns 2-6 of the variable wine. We can extract just these columns from
the variable wine by typing:
> wine[2:6]
     V2   V3   V4   V5  V6
1 14.23 1.71 2.43 15.6 127
2 13.20 1.78 2.14 11.2 100
3 13.16 2.36 2.67 18.6 101
4 14.37 1.95 2.50 16.8 113
5 13.24 2.59 2.87 21.0 118
...
To make a matrix scatterplot of just these five variables using the scatterplotMatrix() function we type:
> scatterplotMatrix(wine[2:6])
In this matrix scatterplot, the diagonal cells show histograms of each of the variables, in this case the concentrations
of the first five chemicals (variables V2, V3, V4, V5, V6).
Each of the off-diagonal cells is a scatterplot of two of the five chemicals, for example, the second cell in the first
row is a scatterplot of V2 (y-axis) against V3 (x-axis).
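To make a scatterplot of two of the variables, say V4 and V5, we can use the plot() function, for example:

> plot(wine$V4, wine$V5)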
If we want to label the data points by their group (the cultivar of wine here), we can use the text() function in R to plot some text beside every data point. In this case, the cultivar of wine is stored in the column V1 of the variable wine, so we type:

> text(wine$V4, wine$V5, wine$V1, cex=0.7, pos=4, col="red")

If you look at the help page for the text() function, you will see that pos=4 will plot the text just to the right of the symbol for a data point. The cex=0.7 option will plot the text at 70% of the default size, and the col="red" option will plot the text in red. This gives us the following plot:
We can see from the scatterplot of V4 versus V5 that the wines from cultivar 2 seem to have lower values of V4
compared to the wines of cultivar 1.
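Another type of plot that is useful is a profile plot, which shows the value of each of the variables for each of the samples. A profile plot can be made with a user-defined function such as makeProfilePlot() below; this is a minimal sketch consistent with the usage that follows (it assumes the RColorBrewer package, loaded below, for the line colours):

> makeProfilePlot <- function(mylist, names)
  {
     # find out how many variables we want to include
     numvariables <- length(mylist)
     # choose one colour per variable (brewer.pal needs at least 3)
     colours <- brewer.pal(max(numvariables, 3), "Set1")
     # find the overall minimum and maximum values across the variables
     mymin <- min(sapply(mylist, min))
     mymax <- max(sapply(mylist, max))
     # plot each variable as a line in its own colour
     for (i in 1:numvariables)
     {
        vectori <- mylist[[i]]
        if (i == 1) { plot(vectori, col=colours[i], type="l", ylim=c(mymin, mymax)) }
        else        { points(vectori, col=colours[i], type="l") }
        # write the variable name near the right-hand end of its line
        text(length(vectori) - 10, vectori[length(vectori)], names[i], col="black", cex=0.6)
     }
  }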
To use this function, you first need to copy and paste it into R. The arguments to the function are a list variable containing the variables that you want to plot, and a vector containing the names of those variables.
For example, to make a profile plot of the concentrations of the first five chemicals in the wine samples (stored in
columns V2, V3, V4, V5, V6 of variable wine), we type:
> library(RColorBrewer)
> names <- c("V2","V3","V4","V5","V6")
> mylist <- list(wine$V2,wine$V3,wine$V4,wine$V5,wine$V6)
> makeProfilePlot(mylist,names)
It is clear from the profile plot that the mean and standard deviation for V6 are quite a lot higher than those for the other variables.
Another thing that you are likely to want to do is to calculate summary statistics such as the mean and standard deviation for each of the variables in your multivariate data set. For example, to calculate the mean of each of the 13 chemical concentrations in the wine samples, we can apply the mean() function to each column using sapply():

> sapply(wine[2:14],mean)
         V2          V3          V4          V5          V6          V7 
 13.0006180   2.3363483   2.3665169  19.4949438  99.7415730   2.2951124 
         V8          V9         V10         V11         V12         V13 
  2.0292697   0.3618539   1.5908989   5.0580899   0.9574494   2.6116854 
        V14 
746.8932584 

This tells us that the mean of variable V2 is 13.0006180, the mean of V3 is 2.3363483, and so on.
Similarly, to get the standard deviations of the 13 chemical concentrations, we type:
> sapply(wine[2:14],sd)
         V2          V3          V4          V5          V6          V7 
  0.8118265   1.1171461   0.2743440   3.3395638  14.2824835   0.6258510 
         V8          V9         V10         V11         V12         V13 
  0.9988587   0.1244533   0.5723589   2.3182859   0.2285716   0.7099904 
        V14 
314.9074743 
We can see here that it would make sense to standardise in order to compare the variables because the variables
have very different standard deviations - the standard deviation of V14 is 314.9074743, while the standard deviation of V9 is just 0.1244533. Thus, in order to compare the variables, we need to standardise each variable so that
it has a sample variance of 1 and sample mean of 0. We will explain below how to standardise the variables.
We can then calculate the mean and standard deviations of the 13 chemical concentrations for just the cultivar 2 samples:
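First we extract the cultivar 2 samples into a variable cultivar2wine, by selecting the rows of wine whose first column equals 2 (the subsetting command is implied by its use below):

> cultivar2wine <- wine[wine$V1==2,]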
> sapply(cultivar2wine[2:14],mean)
        V2         V3         V4         V5         V6         V7         V8 
 12.278732   1.932676   2.244789  20.238028  94.549296   2.258873   2.080845 
        V9        V10        V11        V12        V13        V14 
  0.363662   1.630282   3.086620   1.056282   2.785352 519.507042 
> sapply(cultivar2wine[2:14],sd)
         V2          V3          V4          V5          V6          V7 
  0.5379642   1.0155687   0.3154673   3.3497704  16.7534975   0.5453611 
         V8          V9         V10         V11         V12         V13 
  0.7057008   0.1239613   0.6020678   0.9249293   0.2029368   0.4965735 
        V14 
157.2112204 
You can calculate the mean and standard deviation of the 13 chemical concentrations for just cultivar 1 samples, or for just cultivar 3 samples, in a similar way.
However, for convenience, you might want to use the function printMeanAndSdByGroup() below, which prints
out the mean and standard deviation of the variables for each group in your data set:
> printMeanAndSdByGroup <- function(variables, groupvariable)
  {
     # find the names of the variables
     variablenames <- c(names(groupvariable), names(as.data.frame(variables)))
     # within each group, find the mean of each variable
     groupvariable <- groupvariable[,1] # ensures groupvariable is not a list
     means <- aggregate(as.matrix(variables) ~ groupvariable, FUN = mean)
     names(means) <- variablenames
     print(paste("Means:"))
     print(means)
     # within each group, find the standard deviation of each variable:
     sds <- aggregate(as.matrix(variables) ~ groupvariable, FUN = sd)
     names(sds) <- variablenames
     print(paste("Standard deviations:"))
     print(sds)
     # within each group, find the number of samples:
     samplesizes <- aggregate(as.matrix(variables) ~ groupvariable, FUN = length)
     names(samplesizes) <- variablenames
     print(paste("Sample sizes:"))
     print(samplesizes)
  }
To use the function printMeanAndSdByGroup(), you first need to copy and paste it into R. The arguments of the
function are the variables that you want to calculate means and standard deviations for, and the variable containing
the group of each sample. For example, to calculate the mean and standard deviation for each of the 13 chemical
concentrations, for each of the three different wine cultivars, we type:
> printMeanAndSdByGroup(wine[2:14],wine[1])
[1] "Means:"
V1
V2
V3
V4
V5
V6
V7
V8
V9
V10
V11
1 1 13.74475 2.010678 2.455593 17.03729 106.3390 2.840169 2.9823729 0.290000 1.899322 5.528305
2 2 12.27873 1.932676 2.244789 20.23803 94.5493 2.258873 2.0808451 0.363662 1.630282 3.086620
3 3 13.15375 3.333750 2.437083 21.41667 99.3125 1.678750 0.7814583 0.447500 1.153542 7.396250
[1] "Standard deviations:"
V1
V2
V3
V4
V5
V6
V7
V8
V9
V10
1 1 0.4621254 0.6885489 0.2271660 2.546322 10.49895 0.3389614 0.3974936 0.07004924 0.4121092 1.
2 2 0.5379642 1.0155687 0.3154673 3.349770 16.75350 0.5453611 0.7057008 0.12396128 0.6020678 0.
3 3 0.5302413 1.0879057 0.1846902 2.258161 10.89047 0.3569709 0.2935041 0.12413959 0.4088359 2.
[1] "Sample sizes:"
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1 1 59 59 59 59 59 59 59 59 59 59 59 59 59
2 2 71 71 71 71 71 71 71 71 71 71 71 71 71
3 3 48 48 48 48 48 48 48 48 48 48 48 48 48
The function printMeanAndSdByGroup() also prints out the number of samples in each group. In this case, we
see that there are 59 samples of cultivar 1, 71 of cultivar 2, and 48 of cultivar 3.
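To calculate the separation achieved by a variable between groups, we need its within-groups variance. This can be calculated with a function such as calcWithinGroupsVariance() below, a minimal sketch assuming the usual pooled estimate of within-groups variance (the group variances weighted by their degrees of freedom), which reproduces the value shown below:

> calcWithinGroupsVariance <- function(variable, groupvariable)
  {
     # find out how many values the group variable can take
     groupvariable2 <- as.factor(groupvariable[[1]])
     levels <- levels(groupvariable2)
     numlevels <- length(levels)
     # sum the (n_i - 1) * s_i^2 terms over the groups
     numtotal <- 0
     denomtotal <- 0
     for (i in 1:numlevels)
     {
        leveli <- levels[i]
        levelidata <- variable[groupvariable==leveli,]
        levelilength <- length(levelidata)
        sdi <- sd(levelidata)
        numtotal <- numtotal + (levelilength - 1)*(sdi*sdi)
        denomtotal <- denomtotal + levelilength
     }
     # calculate the pooled within-groups variance
     Vw <- numtotal / (denomtotal - numlevels)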
     return(Vw)
  }
You will need to copy and paste this function into R before you can use it. For example, to calculate the within-groups variance of the variable V2 (the concentration of the first chemical), we type:
> calcWithinGroupsVariance(wine[2],wine[1])
[1] 0.2620525
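The between-groups variance can be calculated with a companion function, calcBetweenGroupsVariance(); the following is a minimal sketch, assuming the sum of squared deviations of the group means from the grand mean, weighted by group size and divided by the number of groups minus one:

> calcBetweenGroupsVariance <- function(variable, groupvariable)
  {
     # find out how many values the group variable can take
     groupvariable2 <- as.factor(groupvariable[[1]])
     levels <- levels(groupvariable2)
     numlevels <- length(levels)
     # calculate the grand mean of the variable
     grandmean <- mean(variable[[1]])
     # sum the n_i * (mean_i - grandmean)^2 terms over the groups
     numtotal <- 0
     for (i in 1:numlevels)
     {
        leveli <- levels[i]
        levelidata <- variable[groupvariable==leveli,]
        levelilength <- length(levelidata)
        meani <- mean(levelidata)
        numtotal <- numtotal + levelilength*((meani - grandmean)^2)
     }
     # calculate the between-groups variance
     Vb <- numtotal / (numlevels - 1)
     return(Vb)
  }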
Once you have copied and pasted this function into R, you can use it to calculate the between-groups variance for
a variable such as V2:
> calcBetweenGroupsVariance (wine[2],wine[1])
[1] 35.39742
If you want to calculate the separations achieved by all of the variables in a multivariate data set, you can use the
function calcSeparations() below:
> calcSeparations <- function(variables, groupvariable)
  {
     # find out how many variables we have
     variables <- as.data.frame(variables)
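     numvariables <- length(variables)
     # find the variable names
     variablenames <- colnames(variables)
     # (the rest of the body below is a reconstruction consistent with the
     #  output shown after this function: for each variable it computes the
     #  within-groups variance Vw, the between-groups variance Vb, and the
     #  separation Vb/Vw, and prints them)
     for (i in 1:numvariables)
     {
        variablei <- variables[i]
        variablename <- variablenames[i]
        Vw <- calcWithinGroupsVariance(variablei, groupvariable)
        Vb <- calcBetweenGroupsVariance(variablei, groupvariable)
        sep <- Vb/Vw
        print(paste("variable", variablename, "Vw=", Vw, "Vb=", Vb, "separation=", sep))
     }
  }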
For example, to calculate the separations for each of the 13 chemical concentrations, we type:
> calcSeparations(wine[2:14],wine[1])
[1] "variable V2 Vw= 0.262052469153907 Vb= 35.3974249602692 separation= 135.0776242428"
[1] "variable V3 Vw= 0.887546796746581 Vb= 32.7890184869213 separation= 36.9434249631837"
[1] "variable V4 Vw= 0.0660721013425184 Vb= 0.879611357248741 separation= 13.312901199991"
[1] "variable V5 Vw= 8.00681118121156 Vb= 286.41674636309 separation= 35.7716374073093"
[1] "variable V6 Vw= 180.65777316441 Vb= 2245.50102788939 separation= 12.4295843381499"
[1] "variable V7 Vw= 0.191270475224227 Vb= 17.9283572942847 separation= 93.7330096203673"
[1] "variable V8 Vw= 0.274707514337437 Vb= 64.2611950235641 separation= 233.925872681549"
[1] "variable V9 Vw= 0.0119117022132797 Vb= 0.328470157461624 separation= 27.5754171469659"
[1] "variable V10 Vw= 0.246172943795542 Vb= 7.45199550777775 separation= 30.2713831702276"
[1] "variable V11 Vw= 2.28492308133354 Vb= 275.708000822304 separation= 120.664018441003"
[1] "variable V12 Vw= 0.0244876469432414 Vb= 2.48100991493829 separation= 101.3167953903"
[1] "variable V13 Vw= 0.160778729560982 Vb= 30.5435083544253 separation= 189.972320578889"
[1] "variable V14 Vw= 29707.6818705169 Vb= 6176832.32228483 separation= 207.920373902178"
Thus, the individual variable which gives the greatest separations between the groups (the wine cultivars) is V8
(separation 233.9). As we will discuss below, the purpose of linear discriminant analysis (LDA) is to find the
linear combination of the individual variables that will give the greatest separation between the groups (cultivars
here). This hopefully will give a better separation than the best separation achievable by any individual variable
(233.9 for V8 here).
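To examine how two variables co-vary within and between groups, we need their within-groups and between-groups covariances. The within-groups covariance can be calculated with a function such as calcWithinGroupsCovariance() below (a minimal sketch, assuming a pooled within-groups covariance analogous to the pooled within-groups variance above):

> calcWithinGroupsCovariance <- function(variable1, variable2, groupvariable)
  {
     # find out how many values the group variable can take
     groupvariable2 <- as.factor(groupvariable[[1]])
     levels <- levels(groupvariable2)
     numlevels <- length(levels)
     # sum the within-group cross-products over the groups
     Covw <- 0
     totallength <- 0
     for (i in 1:numlevels)
     {
        leveli <- levels[i]
        levelidata1 <- variable1[groupvariable==leveli,]
        levelidata2 <- variable2[groupvariable==leveli,]
        mean1 <- mean(levelidata1)
        mean2 <- mean(levelidata2)
        Covw <- Covw + sum((levelidata1 - mean1)*(levelidata2 - mean2))
        totallength <- totallength + length(levelidata1)
     }
     # calculate the pooled within-groups covariance
     Covw <- Covw / (totallength - numlevels)
     return(Covw)
  }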
For example, to calculate the within-groups covariance for variables V8 and V11, we type:
> calcWithinGroupsCovariance(wine[8],wine[11],wine[1])
[1] 0.2866783
> calcBetweenGroupsCovariance <- function(variable1, variable2, groupvariable)
  {
     # find out how many values the group variable can take
     groupvariable2 <- as.factor(groupvariable[[1]])
     levels <- levels(groupvariable2)
     numlevels <- length(levels)
     # calculate the grand means
     variable1mean <- mean(variable1[[1]])
     variable2mean <- mean(variable2[[1]])
     # calculate the between-groups covariance
     Covb <- 0
     for (i in 1:numlevels)
     {
        leveli <- levels[i]
        levelidata1 <- variable1[groupvariable==leveli,]
        levelidata2 <- variable2[groupvariable==leveli,]
        mean1 <- mean(levelidata1)
        mean2 <- mean(levelidata2)
        levelilength <- length(levelidata1)
        term1 <- (mean1 - variable1mean)*(mean2 - variable2mean)*(levelilength)
        Covb <- Covb + term1
     }
     Covb <- Covb / (numlevels - 1)
     Covb <- Covb[[1]]
     return(Covb)
  }
For example, to calculate the between-groups covariance for variables V8 and V11, we type:
> calcBetweenGroupsCovariance(wine[8],wine[11],wine[1])
[1] -60.41077
Thus, for V8 and V11, the between-groups covariance is -60.41 and the within-groups covariance is 0.29. Since
the within-groups covariance is positive (0.29), it means V8 and V11 are positively related within groups: for
individuals from the same group, individuals with a high value of V8 tend to have a high value of V11, and
vice versa. Since the between-groups covariance is negative (-60.41), V8 and V11 are negatively related between
groups: groups with a high mean value of V8 tend to have a low mean value of V11, and vice versa.
To calculate the linear (Pearson) correlation coefficient for a pair of variables, you can use the cor.test() function in R. For example, to calculate the correlation coefficient for the first two chemical concentrations, V2 and V3, we type:
> cor.test(wine$V2, wine$V3)
Pearson's product-moment correlation
data: wine$V2 and wine$V3
t = 1.2579, df = 176, p-value = 0.2101
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.05342959 0.23817474
sample estimates:
cor
0.09439694
This tells us that the correlation coefficient is about 0.094, which is a very weak correlation. Furthermore, the P-value for the statistical test of whether the correlation coefficient is significantly different from zero is 0.21. This is much greater than 0.05 (which we can use here as a cutoff for statistical significance), so there is very weak evidence that the correlation is non-zero.
If you have a lot of variables, you can use cor.test() to calculate the correlation coefficient for each pair of
variables, but you might be just interested in finding out what are the most highly correlated pairs of variables. For
this you can use the function mosthighlycorrelated() below.
The function mosthighlycorrelated() will print out the linear correlation coefficients for each pair of variables in
your data set, in order of the correlation coefficient. This lets you see very easily which pair of variables are most
highly correlated.
> mosthighlycorrelated <- function(mydataframe, numtoreport)
  {
     # find the correlations
     cormatrix <- cor(mydataframe)
     # set the correlations on the diagonal or lower triangle to zero,
     # so they will not be reported as the highest ones:
     diag(cormatrix) <- 0
     cormatrix[lower.tri(cormatrix)] <- 0
     # flatten the matrix into a dataframe for easy sorting
     fm <- as.data.frame(as.table(cormatrix))
     # assign human-friendly names
     names(fm) <- c("First.Variable", "Second.Variable", "Correlation")
     # sort and print the top n correlations
     head(fm[order(abs(fm$Correlation), decreasing=TRUE),], n=numtoreport)
  }
To use this function, you will first have to copy and paste it into R. The arguments of the function are the variables
that you want to calculate the correlations for, and the number of top correlation coefficients to print out (for
example, you can tell it to print out the largest ten correlation coefficients, or the largest 20).
For example, to calculate correlation coefficients between the concentrations of the 13 chemicals in the wine
samples, and to print out the top 10 pairwise correlation coefficients, you can type:
> mosthighlycorrelated(wine[2:14], 10)
    First.Variable Second.Variable Correlation
84              V7              V8   0.8645635
150             V8             V13   0.7871939
149             V7             V13   0.6999494
111             V8             V10   0.6526918
157             V2             V14   0.6437200
110             V7             V10   0.6124131
154            V12             V13   0.5654683
132             V3             V12  -0.5612957
118             V2             V11   0.5463642
137             V8             V12   0.5434786
This tells us that the pair of variables with the highest linear correlation coefficient are V7 and V8 (correlation =
0.86 approximately).
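To standardise the variables, we can use the scale() function (the same command is used again in the principal component analysis section below):

> standardisedconcentrations <- as.data.frame(scale(wine[2:14])) # standardise the variables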
Note that we use the as.data.frame() function to convert the output of scale() into a data frame, which is the same type of R variable as the wine variable.
We can check that each of the standardised variables stored in standardisedconcentrations has a mean of 0 and
a standard deviation of 1 by typing:
> sapply(standardisedconcentrations,mean)
           V2            V3            V4            V5            V6            V7 
-8.591766e-16 -6.776446e-17  8.045176e-16 -7.720494e-17 -4.073935e-17 -1.395560e-17 
           V8            V9           V10           V11           V12           V13 
 6.958263e-17 -1.042186e-16 -1.221369e-16  3.649376e-17  2.093741e-16  3.003459e-16 
          V14 
-1.034429e-16 
> sapply(standardisedconcentrations,sd)
 V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 
  1   1   1   1   1   1   1   1   1   1   1   1   1 
We see that the means of the standardised variables are all very tiny numbers and so are essentially equal to 0, and
the standard deviations of the standardised variables are all equal to 1.
Once you have standardised your variables, you can carry out a principal component analysis using the prcomp()
function in R.
For example, to standardise the concentrations of the 13 chemicals in the wine samples, and carry out a principal
components analysis on the standardised concentrations, we type:
> standardisedconcentrations <- as.data.frame(scale(wine[2:14])) # standardise the variables
> wine.pca <- prcomp(standardisedconcentrations)                 # do a PCA
You can get a summary of the principal component analysis results using the summary() function on the output
of prcomp():
> summary(wine.pca)
Importance of components:
                         PC1   PC2   PC3    PC4    PC5    PC6    PC7    PC8    PC9   PC10
Standard deviation     2.169 1.580 1.203 0.9586 0.9237 0.8010 0.7423 0.5903 0.5375 0.5009
Proportion of Variance 0.362 0.192 0.111 0.0707 0.0656 0.0494 0.0424 0.0268 0.0222 0.0193
Cumulative Proportion  0.362 0.554 0.665 0.7360 0.8016 0.8510 0.8934 0.9202 0.9424 0.9617
                         PC11   PC12    PC13
Standard deviation     0.4752 0.4108 0.32152
Proportion of Variance 0.0174 0.0130 0.00795
Cumulative Proportion  0.9791 0.9920 1.00000
This gives us the standard deviation of each component, and the proportion of variance explained by each component. The standard deviation of the components is stored in a named element called sdev of the output variable
made by prcomp:
> wine.pca$sdev
[1] 2.1692972 1.5801816 1.2025273 0.9586313 0.9237035 0.8010350 0.7423128 0.5903367
[9] 0.5374755 0.5009017 0.4751722 0.4108165 0.3215244
The total variance explained by the components is the sum of the variances of the components:
> sum((wine.pca$sdev)^2)
[1] 13
In this case, we see that the total variance is 13, which is equal to the number of standardised variables (13
variables). This is because for standardised data, the variance of each standardised variable is 1. The total variance
is equal to the sum of the variances of the individual variables, and since the variance of each standardised variable
is 1, the total variance should be equal to the number of variables (13 here).
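A scree plot of the variances of the components can be drawn with the screeplot() function, for example:

> screeplot(wine.pca, type="lines")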
The most obvious change in slope in the scree plot occurs at component 4, which is the "elbow" of the scree plot. Therefore, it could be argued, on the basis of the scree plot, that the first three components should be retained.

Another way of deciding how many components to retain is to use Kaiser's criterion: that we should only retain principal components for which the variance is above 1 (when principal component analysis was applied to standardised data). We can check this by finding the variance of each of the principal components:
> (wine.pca$sdev)^2
[1] 4.7058503 2.4969737 1.4460720 0.9189739 0.8532282 0.6416570 0.5510283 0.3484974
[9] 0.2888799 0.2509025 0.2257886 0.1687702 0.1033779
We see that the variance is above 1 for principal components 1, 2, and 3 (which have variances 4.71, 2.50, and 1.45, respectively). Therefore, using Kaiser's criterion, we would retain the first three principal components.
A third way to decide how many principal components to retain is to decide to keep the number of components
required to explain at least some minimum amount of the total variance. For example, if it is important to explain
at least 80% of the variance, we would retain the first five principal components, as we can see from the output of
summary(wine.pca) that the first five principal components explain 80.2% of the variance (while the first four
components explain just 73.6%, so are not sufficient).
The loadings for the principal components are stored in a named element rotation of the variable returned by prcomp(). Therefore, to obtain the loadings for the first principal component in our analysis of the 13 chemical concentrations in wine samples, we type:
> wine.pca$rotation[,1]
          V2           V3           V4           V5           V6           V7 
-0.144329395  0.245187580  0.002051061  0.239320405 -0.141992042 -0.394660845 
          V8           V9          V10          V11          V12          V13 
-0.422934297  0.298533103 -0.313429488  0.088616705 -0.296714564 -0.376167411 
         V14 
-0.286752227 

This means that the first principal component is a linear combination of the variables: -0.144*Z2 + 0.245*Z3 + 0.002*Z4 + 0.239*Z5 - 0.142*Z6 - 0.395*Z7 - 0.423*Z8 + 0.299*Z9 - 0.313*Z10 + 0.089*Z11 - 0.297*Z12 - 0.376*Z13 - 0.287*Z14, where Z2, Z3, Z4...Z14 are the standardised versions of the variables V2, V3, V4...V14 (that each have mean of 0 and variance of 1).
Note that the squares of the loadings sum to 1, as this is a constraint used in calculating the loadings:
> sum((wine.pca$rotation[,1])^2)
[1] 1
To calculate the values of the first principal component, we can define our own function to calculate a principal
component given the loadings and the input variables values:
> calcpc <- function(variables, loadings)
  {
     # make sure the input is a data frame, and find the number of samples
     variables <- as.data.frame(variables)
     numsamples <- nrow(variables)
     # make a vector to store the component
     pc <- numeric(numsamples)
     # find the number of variables
     numvariables <- length(variables)
     # calculate the value of the component for each sample
     for (i in 1:numsamples)
     {
        valuei <- 0
        for (j in 1:numvariables)
        {
           valueij <- variables[i,j]
           loadingj <- loadings[j]
           valuei <- valuei + (valueij * loadingj)
        }
        pc[i] <- valuei
     }
     return(pc)
  }
We can then use the function to calculate the values of the first principal component for each sample in our wine
data:
> calcpc(standardisedconcentrations, wine.pca$rotation[,1])
[1] -3.30742097 -2.20324981 -2.50966069 -3.74649719 -1.00607049 -3.04167373 -2.44220051
[8] -2.05364379 -2.50381135 -2.74588238 -3.46994837 -1.74981688 -2.10751729 -3.44842921
[15] -4.30065228 -2.29870383 -2.16584568 -1.89362947 -3.53202167 -2.07865856 -3.11561376
[22] -1.08351361 -2.52809263 -1.64036108 -1.75662066 -0.98729406 -1.77028387 -1.23194878
[29] -2.18225047 -2.24976267 -2.49318704 -2.66987964 -1.62399801 -1.89733870 -1.40642118
[36] -1.89847087 -1.38096669 -1.11905070 -1.49796891 -2.52268490 -2.58081526 -0.66660159
...
In fact, the values of the first principal component are stored in the variable wine.pca$x[,1] that was returned by
the prcomp() function, so we can compare those values to the ones that we calculated, and they should agree:
> wine.pca$x[,1]
[1] -3.30742097 -2.20324981 -2.50966069 -3.74649719 -1.00607049 -3.04167373 -2.44220051
[8] -2.05364379 -2.50381135 -2.74588238 -3.46994837 -1.74981688 -2.10751729 -3.44842921
[15] -4.30065228 -2.29870383 -2.16584568 -1.89362947 -3.53202167 -2.07865856 -3.11561376
[22] -1.08351361 -2.52809263 -1.64036108 -1.75662066 -0.98729406 -1.77028387 -1.23194878
[29] -2.18225047 -2.24976267 -2.49318704 -2.66987964 -1.62399801 -1.89733870 -1.40642118
[36] -1.89847087 -1.38096669 -1.11905070 -1.49796891 -2.52268490 -2.58081526 -0.66660159
...
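The loadings for the second principal component are stored in wine.pca$rotation[,2], and can be printed in the same way:

> wine.pca$rotation[,2]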
This means that the second principal component is a linear combination of the variables: 0.484*Z2 + 0.225*Z3 + 0.316*Z4 - 0.011*Z5 + 0.300*Z6 + 0.065*Z7 - 0.003*Z8 + 0.029*Z9 + 0.039*Z10 + 0.530*Z11 - 0.279*Z12 - 0.164*Z13 + 0.365*Z14, where Z2, Z3...Z14 are the standardised versions of variables V2, V3...V14 that each have mean 0 and variance 1.
Note that the squares of the loadings sum to 1, as above:
> sum((wine.pca$rotation[,2])^2)
[1] 1
The second principal component has highest loadings for V11 (0.530), V2 (0.484), V14 (0.365), V4 (0.316), V6
(0.300), V12 (-0.279), and V3 (0.225). The loadings for V11, V2, V14, V4, V6 and V3 are positive, while the
loading for V12 is negative. Therefore, an interpretation of the second principal component is that it represents a
contrast between the concentrations of V11, V2, V14, V4, V6 and V3, and the concentration of V12. Note that
the loadings for V11 (0.530) and V2 (0.484) are the largest, so the contrast is mainly between the concentrations
of V11 and V2, and the concentration of V12.
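We can make a scatterplot of the first two principal components, with the data points labelled by cultivar, using plot() and text() in the same way as for the labelled scatterplot above (a sketch of the commands):

> plot(wine.pca$x[,1], wine.pca$x[,2])  # scatterplot of the first two principal components
> text(wine.pca$x[,1], wine.pca$x[,2], wine$V1, cex=0.7, pos=4, col="red")  # label the points by cultivar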
The scatterplot shows the first principal component on the x-axis, and the second principal component on the y-axis. We can see from the scatterplot that wine samples of cultivar 1 have much lower values of the first principal component than wine samples of cultivar 3. Therefore, the first principal component separates wine samples of cultivar 1 from those of cultivar 3.
We can also see that wine samples of cultivar 2 have much higher values of the second principal component than
wine samples of cultivars 1 and 3. Therefore, the second principal component separates samples of cultivar 2 from
samples of cultivars 1 and 3.
Therefore, the first two principal components are reasonably useful for distinguishing wine samples of the three
different cultivars.
Above, we interpreted the first principal component as a contrast between the concentrations of V8, V7, V13,
V10, V12, and V14, and the concentrations of V9, V3 and V5. We can check whether this makes sense in terms
of the concentrations of these chemicals in the different cultivars, by printing out the means of the standardised
concentration variables in each cultivar, using the printMeanAndSdByGroup() function (see above):
> printMeanAndSdByGroup(standardisedconcentrations,wine[1])
[1] "Means:"
V1
V2
V3
V4
V5
V6
V7
V8
V9
1 1 0.9166093 -0.2915199 0.3246886 -0.7359212 0.46192317 0.87090552 0.95419225 -0.57735640
2 2 -0.8892116 -0.3613424 -0.4437061 0.2225094 -0.36354162 -0.05790375 0.05163434 0.01452785
3 3 0.1886265 0.8928122 0.2572190 0.5754413 -0.03004191 -0.98483874 -1.24923710 0.68817813
Does it make sense that the first principal component can separate cultivar 1 from cultivar 3? In cultivar 1, the
mean values of V8 (0.954), V7 (0.871), V13 (0.769), V10 (0.539), V12 (0.458) and V14 (1.171) are very high
compared to the mean values of V9 (-0.577), V3 (-0.292) and V5 (-0.736). In cultivar 3, the mean values of V8
(-1.249), V7 (-0.985), V13 (-1.307), V10 (-0.764), V12 (-1.202) and V14 (-0.372) are very low compared to the
mean values of V9 (0.688), V3 (0.893) and V5 (0.575). Therefore, it does make sense that principal component 1
is a contrast between the concentrations of V8, V7, V13, V10, V12, and V14, and the concentrations of V9, V3
and V5; and that principal component 1 can separate cultivar 1 from cultivar 3.
Above, we interpreted the second principal component as a contrast between the concentrations of V11, V2, V14, V4, V6 and V3, and the concentration of V12. In the light of the mean values of these variables in the different cultivars, does it make sense that the second principal component can separate cultivar 2 from cultivars 1 and 3? In cultivar 1, the mean values of V11 (0.203), V2 (0.917), V14 (1.171), V4 (0.325), V6 (0.462) and V3 (-0.292) are not very different from the mean value of V12 (0.458). In cultivar 3, the mean values of V11 (1.009), V2 (0.189), V14 (-0.372), V4 (0.257), V6 (-0.030) and V3 (0.893) are also not very different from the mean value of V12 (-1.202). In contrast, in cultivar 2, the mean values of V11 (-0.850), V2 (-0.889), V14 (-0.722), V4 (-0.444), V6 (-0.364) and V3 (-0.361) are much less than the mean value of V12 (0.432). Therefore, it makes sense that principal component 2 is a contrast between the concentrations of V11, V2, V14, V4, V6 and V3, and the concentration of V12; and that principal component 2 can separate cultivar 2 from cultivars 1 and 3.
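2.8 Linear Discriminant Analysis

The purpose of linear discriminant analysis (LDA) is to find the linear combinations of the original variables (the 13 chemical concentrations here) that give the best possible separation between the groups (wine cultivars here). Linear discriminant analysis can be carried out using the lda() function from the R MASS package; a call consistent with the coefficient names printed below is:

> library("MASS")
> wine.lda <- lda(wine$V1 ~ wine$V2 + wine$V3 + wine$V4 + wine$V5 + wine$V6 + wine$V7 +
                  wine$V8 + wine$V9 + wine$V10 + wine$V11 + wine$V12 + wine$V13 + wine$V14)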
To see the loadings of the discriminant functions, we can print out the wine.lda variable; the output includes the coefficients of the linear discriminants:

> wine.lda
...
Coefficients of linear discriminants:
                  LD1           LD2
wine$V2  -0.403399781  0.8717930699
wine$V3   0.165254596  0.3053797325
wine$V4  -0.369075256  2.3458497486
wine$V5   0.154797889 -0.1463807654
wine$V6  -0.002163496 -0.0004627565
wine$V7   0.618052068 -0.0322128171
wine$V8  -1.661191235 -0.4919980543
wine$V9  -1.495818440 -1.6309537953
wine$V10  0.134092628 -0.3070875776
wine$V11  0.355055710  0.2532306865
wine$V12 -0.818036073 -1.5156344987
wine$V13 -1.157559376  0.0511839665
wine$V14 -0.002691206  0.0028529846
This means that the first discriminant function is a linear combination of the variables: -0.403*V2 + 0.165*V3 - 0.369*V4 + 0.155*V5 - 0.002*V6 + 0.618*V7 - 1.661*V8 - 1.496*V9 + 0.134*V10 + 0.355*V11 - 0.818*V12 - 1.158*V13 - 0.003*V14, where V2, V3, ... V14 are the concentrations of the 13 chemicals found in the wine samples. For convenience, the values for each discriminant function (eg. the first discriminant function) are scaled so that their mean value is zero (see below).
Note that these loadings are calculated so that the within-group variance of each discriminant function for each
group (cultivar) is equal to 1, as will be demonstrated below.
These scalings are also stored in the named element scaling of the variable returned by the lda() function. This
element contains a matrix, in which the first column contains the loadings for the first discriminant function, the
second column contains the loadings for the second discriminant function and so on. For example, to extract the
loadings for the first discriminant function, we can type:
> wine.lda$scaling[,1]
     wine$V2      wine$V3      wine$V4      wine$V5      wine$V6      wine$V7 
-0.403399781  0.165254596 -0.369075256  0.154797889 -0.002163496  0.618052068 
     wine$V8      wine$V9     wine$V10     wine$V11     wine$V12     wine$V13 
-1.661191235 -1.495818440  0.134092628  0.355055710 -0.818036073 -1.157559376 
    wine$V14 
-0.002691206 
To calculate the values of the first discriminant function, we can define our own function calclda():
> calclda <- function(variables, loadings)
  {
     # make sure the input is a data frame, and find the number of samples
     variables <- as.data.frame(variables)
     numsamples <- nrow(variables)
     # make a vector to store the discriminant function
     ld <- numeric(numsamples)
     # find the number of variables
     numvariables <- length(variables)
     # calculate the value of the discriminant function for each sample
     for (i in 1:numsamples)
     {
        valuei <- 0
        for (j in 1:numvariables)
        {
           valueij <- variables[i,j]
           loadingj <- loadings[j]
           valuei <- valuei + (valueij * loadingj)
        }
        ld[i] <- valuei
     }
     # standardise the discriminant function so that its mean value is 0:
     ld <- as.data.frame(scale(ld, center=TRUE, scale=FALSE))
     ld <- ld[[1]]
     return(ld)
  }
The function calclda() simply calculates the value of a discriminant function for each sample in the data set. For example, for the first discriminant function, for each sample we calculate the value using the equation -0.403*V2 + 0.165*V3 - 0.369*V4 + 0.155*V5 - 0.002*V6 + 0.618*V7 - 1.661*V8 - 1.496*V9 + 0.134*V10 + 0.355*V11 - 0.818*V12 - 1.158*V13 - 0.003*V14. Furthermore, the scale() command is used within the calclda() function in order to standardise the value of a discriminant function (eg. the first discriminant function) so that its mean value (over all the wine samples) is 0.
We can use the function calclda() to calculate the values of the first discriminant function for each sample in our
wine data:
> calclda(wine[2:14], wine.lda$scaling[,1])
[1] -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934
[7] -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646
[13] -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262
[19] -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961
[25] -3.32164716 -2.14482223 -3.98242850 -2.68591432 -3.56309464 -3.17301573
[31] -2.99626797 -3.56866244 -3.38506383 -3.52753750 -2.85190852 -2.79411996
...
In fact, the values of the first linear discriminant function can be calculated using the predict() function in R, so
we can compare those to the ones that we calculated, and they should agree:
> wine.lda.values <- predict(wine.lda, wine[2:14])
> wine.lda.values$x[,1] # contains the values for the first discriminant function
          1           2           3           4           5           6 
-4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934 
          7           8           9          10          11          12 
-4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646 
         13          14          15          16          17          18 
-3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262 
         19          20          21          22          23          24 
-5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961 
         25          26          27          28          29          30 
-3.32164716 -2.14482223 -3.98242850 -2.68591432 -3.56309464 -3.17301573 
         31          32          33          34          35          36 
-2.99626797 -3.56866244 -3.38506383 -3.52753750 -2.85190852 -2.79411996 
...
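In order to make the loadings easier to interpret, we can calculate group-standardised versions of the variables, that is, versions scaled so that each variable has within-groups variance 1. This can be done with a function such as groupStandardise() below, a minimal sketch assuming the calcWithinGroupsVariance() function defined earlier (each variable is centred on its overall mean and divided by the square root of its within-groups variance):

> groupStandardise <- function(variables, groupvariable)
  {
     # find out how many variables we have, and their names
     variables <- as.data.frame(variables)
     numvariables <- length(variables)
     variablenames <- colnames(variables)
     # scale each variable to have within-groups variance 1
     for (i in 1:numvariables)
     {
        variablei <- variables[i]
        variablei_name <- variablenames[i]
        variablei_Vw <- calcWithinGroupsVariance(variablei, groupvariable)
        variablei_mean <- mean(variablei[[1]])
        variablei_new <- (variablei - variablei_mean)/(sqrt(variablei_Vw))
        if (i == 1) { variables_new <- data.frame(row.names=seq(1,nrow(variables))) }
        variables_new[variablei_name] <- variablei_new
     }
     return(variables_new)
  }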
For example, we can use the groupStandardise() function to calculate the group-standardised versions of the
chemical concentrations in wine samples:
> groupstandardisedconcentrations <- groupStandardise(wine[2:14], wine[1])
We can then use the lda() function to perform linear discriminant analysis on the group-standardised variables:
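For example (a compact sketch using lda()'s default matrix interface rather than a long formula; the name wine.lda2 is an assumed one):

> wine.lda2 <- lda(groupstandardisedconcentrations, grouping=wine$V1)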
It makes sense to interpret the loadings calculated using the group-standardised variables rather than the loadings
for the original (unstandardised) variables.
In the first discriminant function calculated for the group-standardised variables, the largest loadings (in absolute value) are given to V8 (-0.871), V11 (0.537), V13 (-0.464), V14 (-0.464), and V5 (0.438). The loadings for V8, V13 and V14 are negative, while those for V11 and V5 are positive. Therefore, the discriminant function seems to represent a contrast between the concentrations of V8, V13 and V14, and the concentrations of V11 and V5.
We saw above that the individual variables which gave the greatest separations between the groups were V8
(separation 233.93), V14 (207.92), V13 (189.97), V2 (135.08) and V11 (120.66). These were mostly the same
variables that had the largest loadings in the linear discriminant function (loading for V8: -0.871, for V14: -0.464,
for V13: -0.464, for V11: 0.537).
We found above that variables V8 and V11 have a negative between-groups covariance (-60.41) and a positive
within-groups covariance (0.29). When the between-groups covariance and within-groups covariance for two
variables have opposite signs, it indicates that a better separation between groups can be obtained by using a linear
combination of those two variables than by using either variable on its own.
Thus, given that the two variables V8 and V11 have between-groups and within-groups covariances of opposite
signs, and that these are two of the variables that gave the greatest separations between groups when used individually, it is not surprising that these are the two variables that have the largest loadings in the first discriminant
function.
Note that although the loadings for the group-standardised variables are easier to interpret than the loadings for the unstandardised variables, the values of the discriminant function are the same regardless of whether we standardise the input variables or not. For example, for the wine data, we can calculate the values of the first discriminant function using the unstandardised and group-standardised variables by typing:
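A sketch of such a comparison (ld1 and ld1.gs are assumed names, and wine.lda2 is the group-standardised fit from above):

> ld1 <- predict(wine.lda, wine[2:14])$x[,1]                           # unstandardised variables
> ld1.gs <- predict(wine.lda2, groupstandardisedconcentrations)$x[,1]  # group-standardised variables
> ld1[1:5]     # the first few values of each...
> ld1.gs[1:5]  # ...should agree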
We can see that although the loadings are different for the first discriminant functions calculated using unstandardised and group-standardised data, the actual values of the first discriminant function are the same.
The returned variable has a named element x which is a matrix containing the linear discriminant functions: the
first column of x contains the first discriminant function, the second column of x contains the second discriminant
function, and so on (if there are more discriminant functions).
We can therefore calculate the separations achieved by the two linear discriminant functions for the wine data by using the calcSeparations() function (see above), which calculates the separation as the ratio of the between-groups variance to the within-groups variance:
> calcSeparations(wine.lda.values$x,wine[1])
[1] "variable LD1 Vw= 1 Vb= 794.652200566216 separation= 794.652200566216"
[1] "variable LD2 Vw= 1 Vb= 361.241041493455 separation= 361.241041493455"
As mentioned above, the loadings for each discriminant function are calculated in such a way that the within-group
variance (Vw) for each group (wine cultivar here) is equal to 1, as we see in the output from calcSeparations()
above.
The output from calcSeparations() tells us that the separation achieved by the first (best) discriminant function is
794.7, and the separation achieved by the second (second best) discriminant function is 361.2.
Therefore, the total separation is the sum of these, which is (794.652200566216+361.241041493455=1155.893)
1155.89, rounded to two decimal places. Therefore, the percentage separation achieved by the first discriminant
function is (794.652200566216*100/1155.893=) 68.75%, and the percentage separation achieved by the second
discriminant function is (361.241041493455*100/1155.893=) 31.25%.
The proportion of trace that is printed when you type wine.lda (the variable returned by the lda() function) is
the percentage separation achieved by each discriminant function. For example, for the wine data we get the same
values as just calculated (68.75% and 31.25%):
> wine.lda
...
Proportion of trace:
   LD1    LD2 
0.6875 0.3125 
Therefore, the first discriminant function does achieve a good separation between the three groups (three cultivars), but the second discriminant function does improve the separation of the groups by quite a large amount, so it is worth using the second discriminant function as well. Therefore, to achieve a good separation of the groups (cultivars), it is necessary to use both of the first two discriminant functions.
We found above that the largest separation achieved for any of the individual variables (individual chemical concentrations) was 233.9 for V8, which is quite a lot less than 794.7, the separation achieved by the first discriminant
function. Therefore, the effect of using more than one variable to calculate the discriminant function is that we
can find a discriminant function that achieves a far greater separation between groups than achieved by any one
variable alone.
The variable returned by the lda() function also has a named element svd, which contains the ratio of between- and within-group standard deviations for the linear discriminant variables, that is, the square root of the separation value that we calculated using calcSeparations() above. When we calculate the square of the value stored in
svd, we should get the same value as found using calcSeparations():
> (wine.lda$svd)^2
[1] 794.6522 361.2410
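We can make a stacked histogram of the first discriminant function's values for the samples from the three wine cultivars using the ldahist() function in the MASS package:

> ldahist(data = wine.lda.values$x[,1], g=wine$V1)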
We can see from the histogram that cultivars 1 and 3 are well separated by the first discriminant function, since
the values for the first cultivar are between -6 and -1, while the values for cultivar 3 are between 2 and 6, and so
there is no overlap in values.
However, the separation achieved by the linear discriminant function on the training set may be an overestimate.
To get a more accurate idea of how well the first discriminant function separates the groups, we would need to see
a stacked histogram of the values for the three cultivars using some unseen test set, that is, using a set of data
that was not used to calculate the linear discriminant function.
We see that the first discriminant function separates cultivars 1 and 3 very well, but does not separate cultivars 1
and 2, or cultivars 2 and 3, so well.
We therefore investigate whether the second discriminant function separates those cultivars, by making a stacked
histogram of the second discriminant function's values:
> ldahist(data = wine.lda.values$x[,2], g=wine$V1)
We see that the second discriminant function separates cultivars 1 and 2 quite well, although there is a little overlap
in their values. Furthermore, the second discriminant function also separates cultivars 2 and 3 quite well, although
again there is a little overlap in their values so it is not perfect.
Thus, we see that two discriminant functions are necessary to separate the cultivars, as was discussed above (see
the discussion of percentage separation above).
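We can obtain a scatterplot of the best two discriminant functions, with the data points labelled by cultivar, using plot() and text() (a sketch of the commands, following the earlier labelled-scatterplot examples):

> plot(wine.lda.values$x[,1], wine.lda.values$x[,2])  # scatterplot of LD1 (x-axis) vs LD2 (y-axis)
> text(wine.lda.values$x[,1], wine.lda.values$x[,2], wine$V1, cex=0.7, pos=4, col="red")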
From the scatterplot of the first two discriminant functions, we can see that the wines from the three cultivars are well separated in the scatterplot. The first discriminant function (x-axis) separates cultivars 1 and 3 very well, but does not perfectly separate cultivars 1 and 2, or cultivars 2 and 3.

The second discriminant function (y-axis) achieves a fairly good separation of cultivars 1 and 2, and cultivars 2 and 3, although it is not totally perfect.
To achieve a very good separation of the three cultivars, it would be best to use both the first and second discriminant functions together, since the first discriminant function can separate cultivars 1 and 3 very well, and the
second discriminant function can separate cultivars 1 and 2, and cultivars 2 and 3, reasonably well.
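The group means of the first discriminant function's values can be found using the printMeanAndSdByGroup() function described above, for example:

> printMeanAndSdByGroup(wine.lda.values$x, wine[1])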
We find that the mean value of the first discriminant function is -3.42248851 for cultivar 1, -0.07972623 for cultivar 2, and 4.32473717 for cultivar 3. The mid-way point between the mean values for cultivars 1 and 2 is (-3.42248851-0.07972623)/2 = -1.751107, and the mid-way point between the mean values for cultivars 2 and 3 is (-0.07972623+4.32473717)/2 = 2.122505.
Therefore, we can use the following allocation rule:
if the first discriminant function is <= -1.751107, predict the sample to be from cultivar 1
if the first discriminant function is > -1.751107 and <= 2.122505, predict the sample to be from cultivar 2
if the first discriminant function is > 2.122505, predict the sample to be from cultivar 3
We can examine the accuracy of this allocation rule by using the calcAllocationRuleAccuracy() function below:
> calcAllocationRuleAccuracy <- function(ldavalue, groupvariable, cutoffpoints)
  {
     # find out how many values the group variable can take
     groupvariable2 <- as.factor(groupvariable[[1]])
     levels <- levels(groupvariable2)
     numlevels <- length(levels)
     # calculate the number of true positives and false negatives for each group
     for (i in 1:numlevels)
     {
        leveli <- levels[i]
        levelidata <- ldavalue[groupvariable==leveli]
        # see how many of the samples from this group are classified in each group
        for (j in 1:numlevels)
        {
           levelj <- levels[j]
           if (j == 1)
           {
              cutoff1 <- cutoffpoints[1]
              cutoff2 <- "NA"
              results <- summary(levelidata <= cutoff1)
           }
           else if (j == numlevels)
           {
              cutoff1 <- cutoffpoints[(numlevels-1)]
              cutoff2 <- "NA"
              results <- summary(levelidata > cutoff1)
           }
           else
           {
              cutoff1 <- cutoffpoints[(j-1)]
              cutoff2 <- cutoffpoints[(j)]
              results <- summary(levelidata > cutoff1 & levelidata <= cutoff2)
           }
           trues <- results["TRUE"]
           trues <- trues[[1]]
           print(paste("Number of samples of group",leveli,"classified as group",levelj," : ",
              trues,"(cutoffs:",cutoff1,",",cutoff2,")"))
        }
     }
  }
For example, to calculate the accuracy for the wine data based on the allocation rule for the first discriminant function, we type:

> calcAllocationRuleAccuracy(wine.lda.values$x[,1], wine[1], c(-1.751107, 2.122505))
...
[1] "Number of samples of group 2 classified as group 3 : 1 (cutoffs: 2.122505 , NA )"
[1] "Number of samples of group 3 classified as group 1 : NA (cutoffs: -1.751107 , NA )"
[1] "Number of samples of group 3 classified as group 2 : NA (cutoffs: -1.751107 , 2.122505 )"
[1] "Number of samples of group 3 classified as group 3 : 48 (cutoffs: 2.122505 , NA )"

Note that an "NA" count in the output means that no samples fell in that range (summary() reports no "TRUE" element in that case). The allocations can be summarised in a table:
                      Allocated to group 1  Allocated to group 2  Allocated to group 3
Samples from group 1                    56                     3                     0
Samples from group 2                     5                    65                     1
Samples from group 3                     0                     0                    48
There are 3+5+1=9 wine samples that are misclassified, out of (56+3+5+65+1+48=) 178 wine samples: 3 samples
from cultivar 1 are predicted to be from cultivar 2, 5 samples from cultivar 2 are predicted to be from cultivar
1, and 1 sample from cultivar 2 is predicted to be from cultivar 3. Therefore, the misclassification rate is 9/178,
or 5.1%. The misclassification rate is quite low, and therefore the accuracy of the allocation rule appears to be
relatively high.
However, this is probably an underestimate of the misclassification rate, as the allocation rule was based on this
data (this is the training set). If we calculated the misclassification rate for a separate test set consisting of data
other than that used to make the allocation rule, we would probably get a higher estimate of the misclassification
rate.
2.10 Acknowledgements
Many of the examples in this booklet are inspired by examples in the excellent Open University book, Multivariate Analysis (product code M249/03), available from the Open University Shop.
I am grateful to the UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, for making data sets available
which I have used in the examples in this booklet.
Thank you to the following users for very helpful comments: to Rich O'Hara and Patrick Hausmann for pointing out that sd(<data.frame>) and mean(<data.frame>) are deprecated; to Arnau Serra-Cayuela for pointing out a typo in the LDA section; to John Christie for suggesting a more compact form for my printMeanAndSdByGroup() function; and to Rama Ramakrishnan for suggesting a more compact form for my mosthighlycorrelated() function.
2.11 Contact
I will be grateful if you will send me (Avril Coghlan) corrections or suggestions for improvements to my email
address [email protected]
2.12 License
The content in this book is licensed under a Creative Commons Attribution 3.0 License.
CHAPTER THREE
ACKNOWLEDGEMENTS
Thank you to Noel O'Boyle for helping in using Sphinx, http://sphinx.pocoo.org, to create this document, and GitHub, https://github.com/, to store different versions of the document as I was writing it, and readthedocs, http://readthedocs.org/, to build and distribute this document.
CHAPTER FOUR
CONTACT
I will be very grateful if you will send me (Avril Coghlan) corrections or suggestions for improvements to my
email address [email protected]
CHAPTER FIVE
LICENSE
The content in this book is licensed under a Creative Commons Attribution 3.0 License.