MAS3314/8314:
Multivariate Data Analysis
Prof D J Wilkinson
Module description:
In the 21st Century, statisticians and data analysts typically work with data sets containing
a large number of observations and many variables. This course will consider methods
for making sense of data of this kind, with an emphasis on practical techniques. Consider,
for example, a medical database containing records on a large number of people. Each
person has a height, a weight, an age, a blood type, a blood pressure, and a host of
other attributes, some quantitative, and others categorical. We will look at graphical and
descriptive techniques which help us to visualise a multi-dimensional data set and at
inferential methods which help us to answer more specific questions about the population
from which the individuals were sampled. We will also consider new problems which
arise in the analysis of large and/or multi-dimensional data, such as variable selection
and multiple testing.
Course texts:
T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning: Data mining,
inference, and prediction, 2nd Edition (Springer-Verlag, 2009).
I will refer frequently to these texts in the notes, especially the former, which I will cite
as [HTF]. I will refer to the latter as [Everitt], mainly for R-related information. Note that
the PDF of the full text of [HTF] is available freely on-line, and that [Everitt] should be
available electronically to Newcastle University students via the University Library reading
list web site.
WWW page:
http://www.staff.ncl.ac.uk/d.j.wilkinson/teaching/mas3314/
Last update:
May 29, 2012
These notes correspond roughly to the course as delivered in the Spring of 2012. They
will be revised before I deliver the course again in the first half of 2013.
Use the date above to check when this file was generated.
© 2012, Darren J Wilkinson
1.1 Introduction
1.1.1 A few quotes...
Google’s Chief Economist Hal Varian on Statistics and Data:
I keep saying the sexy job in the next ten years will be statisticians. People
think I’m joking, but who would’ve guessed that computer engineers would’ve
been the sexy job of the 1990s?
The ability to take data - to be able to understand it, to process it, to extract
value from it, to visualize it, to communicate it - that's going to be a hugely important
skill in the next decades, not only at the professional level but even at the
educational level for elementary school kids, for high school kids, for college
kids. Because now we really do have essentially free and ubiquitous data.
So the complementary scarce factor is the ability to understand that data and
extract value from it.
Source: FlowingData.com
The big data revolution’s “lovely” and “lousy” jobs:
The lovely jobs are why we should all enroll our children immediately in statistics
courses. Big data can only be unlocked by shamans with tremendous
mathematical aptitude and training. McKinsey estimates that by 2018 in the
United States alone, there will be a shortfall of between 140,000 and 190,000
graduates with “deep analytical talent”. If you are one of them, you will surely
have a “lovely” well-paying job.
The Search For Analysts To Make Sense Of ’Big Data’ (an article on an NPR programme)
begins:
Businesses keep vast troves of data about things like online shopping behavior,
or millions of changes in weather patterns, or trillions of financial transactions
— information that goes by the generic name of big data.
Now, more companies are trying to make sense of what the data can tell them
about how to do business better. That, in turn, is fueling demand for people
who can make sense of the information — mathematicians — and creating
something of a recruiting war.
Source: NPR.org
Also see this article in the NYT: For Today’s Graduate, Just One Word: Statistics
 0  1  2  3  4  5  6  7  9 10 11 12 13 14 15 16 17 19 20 21 22 23 24 26
 2  6  4  8  4  7  3  3  1  3  3  2  4  4  2  2  4  1  2  2  1  1  1  2
> str(InsectSprays)
’data.frame’: 72 obs. of 2 variables:
$ count: num 10 7 20 14 14 12 10 23 17 20 ...
$ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1
...
∗
This internal storage is different to the way that relational databases typically store tables — database tables are
typically stored internally as collections of rows. However, since R provides mechanisms for accessing data by rows
as well as columns, this difference will not trouble us.
>
Consequently the data frame has n = 72 and p = 2. Formally, we can regard each row as
being a 2-tuple from Z × {A, B, C, D, E, F }. Note that although we can easily embed Z in
R, there is no natural embedding of the unordered factor {A, B, C, D, E, F } in R, and so it
will probably not be sensible to attempt to regard the rows as vectors in R2 .
All of the variables are numeric in the next example. We can create a data frame containing a subset of rows
from the data frame by using a vector of row numbers or names, e.g.
> InsectSprays[3:5,]
count spray
3 20 A
4 14 A
5 14 A
>
We can also subset the rows using a boolean vector of length n which has TRUE elements
corresponding to the required rows, e.g. InsectSprays[InsectSprays$spray=="B",]
will extract the rows where the spray factor is B. If the vector of counts is required, this
can be extracted from the resulting data frame in the usual way, but this could also be
obtained more directly using
> InsectSprays$count[InsectSprays$spray=="B"]
[1] 11 17 21 11 16 14 17 17 19 21 7 13
>
> dim(carsMat)
[1] 32 7
>
[Figure 1.1: scatterplot matrix of the galaxy data (east.west, north.south, angle, radial.position, velocity).]
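The distinct values taken by angle are easily checked with a simple tabulation; a minimal check, assuming the galaxy data frame used throughout these examples is loaded:
table(galaxy$angle)   # frequency of each distinct slit angle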
which shows that angle does indeed take on just 7 distinct values, and hence perhaps
has more in common with an ordered categorical variable than a real-valued variable. We
can use angle as a factor to colour the scatterplot as follows
pairs(galaxy,col=galaxy$angle)
to obtain the plot shown in Figure 1.2.
This new plot indicates that there appear to be rather simple relationships between the
variables, but that those relationships change according to the level of the angle “factor”.
We can focus in on one particular value of angle in the following way
pairs(galaxy[galaxy$angle==102.5,-3])
Note the use of -3 to drop the third (angle) column from the data frame. The new plot is
shown in Figure 1.3.
This plot reveals that (for a given fixed value of angle) there is a simple deterministic
linear relationship between the first three variables (and so these could be reduced to
just one variable), and that there is a largely monotonic “S”-shaped relationship between
velocity and (say) radial.position. So with just a few simple scatterplot matrices and
some simple R commands we have reduced this multivariate data frame with apparently
5 quantitative variables to just 2 quantitative variables plus an ordered categorical factor.
It is not quite as simple as this, as the relationships between the three positional variables
varies with the level of angle, but we have nevertheless found a lot of simple structure
within the data set with some very basic investigation.
Figure 1.2: Scatterplot matrix for the Galaxy data, coloured according to the levels of the angle
variable
Figure 1.3: Scatterplot matrix for the subset of the Galaxy data corresponding to an angle vari-
able of 102.5
Figure 1.4: Image plot of the nci cancer tumor microarray data, using the default colour scheme
Note that image() displays the matrix as its transpose (or rather, rotated anti-clockwise). Using the default “heat” colour scheme, low values
are represented as “cold” red colours, and “hot” bright yellow/white colours represent the
high values. Although many people find this colour scheme intuitive, it is not without
problems, and in particular, can be especially problematic for colour-blind individuals. We
can produce a simple greyscale image using
image(nci,axes=FALSE,xlab="genes",ylab="samples",col=grey((0:31)/31))
which has the lightest colours for the highest values, or using
image(nci,axes=FALSE,xlab="genes",ylab="samples",col=grey((31:0)/31))
to make the highest values coloured dark. Another popular colouring scheme is provided
by cm.colors(), which uses cyan for low values, magenta for high values, and middle
values white. It is good for picking out extreme values, and can be used as
image(nci,axes=FALSE,xlab="genes",ylab="samples",col=cm.colors(32))
The image() function is a fairly low-level function for creating images, which makes it
very flexible, but relatively difficult to produce attractive plots. For imaging multivariate
data there is a higher-level function called heatmap() which produces attractive plots
very simply. A basic plot can be obtained with
heatmap(nci,Rowv=NA,Colv=NA,labRow=NA,col=grey((31:0)/31))
leading to the plot shown in Figure 1.5.
Note that this function keeps the natural orientation of the supplied matrix, and by
default will label rows and columns with the row and column names associated with the
matrix. Note that the options Rowv=NA,Colv=NA are critical, as they disable the default
behaviour of the function to use clustering to reorder the rows and columns to reveal
interesting structure present in the data. We will look again at this behaviour when we
study clustering later in the course. This is a computationally intensive operation, so we
don’t want to attempt this on a full microarray matrix at this stage.
Figure 1.5: Heatmap of the nci cancer tumor microarray data, using a greyscale colour scheme
with darker shades for higher values
Figure 1.6: A sample of images from the zip.train dataset, generated with the command
example(zip.train)
Figure 1.7: Image corresponding to the fifth row of the zip.train dataset — the digit shown is
a “3”
to give a multivariate dataset with n = 7, 291 and p = 256, where each row represents
a vector in R256 . It is perhaps concerning that representing the images as a vector in
R256 loses some important information about the structure of the data — namely that it is
actually a 16 × 16 matrix. So, for example, we know that the 2nd and 18th elements of
the 256-dimensional vector actually correspond to adjacent pixels, and hence are likely to
be highly correlated. This is clearly valuable information that we aren’t explicitly including
into our framework. The idea behind “data mining” is that given a sufficiently large number
of example images, it should be possible to “learn” that the 2nd and 18th elements are
highly correlated without explicitly building this in to our modelling framework in the first
place. We will look at how we can use data on observations to learn about the relationship
between variables once we have the appropriate mathematical framework set up.
Definition 1 The measurement of the jth variable on the ith observation is denoted xij ,
and is stored in the ith row and jth column of an n × p matrix denoted by X, known as the
data matrix.
We denote the column vector representing the jth variable by x^{(j)}, which we can obviously define directly as
$$x^{(j)} = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix}.$$
We can begin to think about summarising multivariate data by applying univariate sum-
maries that we already know about to the individual variables.
We can compute x̄_j for j = 1, 2, . . . , p, and then collect these sample means together into a p-vector that we denote x̄, given by
$$\bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)^T.$$
A moment's thought reveals that this is equivalent to our previous construction. The vector version is more useful however, as we can use it to build an even more convenient
expression directly in terms of the data matrix, X. For this we need notation for a vector
of ones. For an n-vector of ones, we use the notation 1_n, so that
$$\mathbf{1}_n \equiv (1, 1, \ldots, 1)^T.$$
We will sometimes drop the subscript if the dimension is clear from the context. Note that 1_n^T 1_n = n (an inner product), and that 1_n 1_p^T is an n × p matrix of ones (an outer product), sometimes denoted J_{n×p}. Pre- or post-multiplying a matrix by a row or column vector of ones has the effect of summing over rows or columns. In particular, we can now write the sample mean of observation vectors as follows.
Proposition 1 The sample mean of a data matrix can be computed as
$$\bar{x} = \frac{1}{n}X^T\mathbf{1}_n.$$
   Min.   :1409
   1st Qu.:1523
   Median :1586
   Mean   :1594
   3rd Qu.:1669
   Max.   :1775
>
The apply() command can also be used to apply arbitrary functions to the rows or
columns of a matrix. Here we can obtain the mean vector using
> apply(galaxy, 2, mean)
      east.west     north.south           angle radial.position        velocity
     -0.3323685       1.5210889      80.8900929      -0.8427245    1593.6253870
>
The 2 is used to indicate that we wish to apply the function to each column in turn. If
we had instead used 1, the mean of each row of the matrix would have been computed
(which would have been much less interesting). We can also use our matrix expression
to directly compute the mean from the data matrix
> as.vector(t(galaxy) %*% rep(1,323)/323)
[1]   -0.3323685    1.5210889   80.8900929   -0.8427245 1593.6253870
>
where %*% is the matrix multiplication operator in R. Since R helpfully transposes vectors
as required according to the context, we can actually compute this more neatly using
> rep(1,nrow(galaxy))%*%as.matrix(galaxy)/nrow(galaxy)
east.west north.south angle radial.position velocity
[1,] -0.3323685 1.521089 80.89009 -0.8427245 1593.625
>
It is typically much faster to use matrix operations than the apply() command. Note
however, that R includes a convenience function, colMeans() for computing the sample
mean of a data frame:
> colMeans(galaxy)
east.west north.south angle radial.position
-0.3323685 1.5210889 80.8900929 -0.8427245
velocity
1593.6253870
and so it will usually be most convenient to use that.
This example hopefully makes clear that our three different ways of thinking about com-
puting the sample mean are all equivalent. However, the final method based on a matrix
multiplication operation is the neatest both mathematically and computationally, and so
we will make use of this expression, as well as other similar expressions, throughout the
course.
Definition 4
$$H_n \equiv I_{n\times n} - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T$$
is known as the centering matrix.
So we can subtract the mean from a data matrix by pre-multiplying by the centering matrix.
This isn’t a numerically efficient way to strip out the mean, but is mathematically elegant
and convenient. The centering matrix has several useful properties that we can exploit.
Proposition 2 The centering matrix H_n has the following properties:
1. H_n is symmetric, H_n^T = H_n,
2. H_n is idempotent, H_n^2 = H_n,
3. If X is an n × p data matrix, then the n × p matrix W = H_n X has sample mean equal to the zero p-vector.
Proof
These are trivial exercises in matrix algebra. 1. and 2. are left as exercises. We will use symmetry to show 3.
\begin{align*}
\bar{w} &= \frac{1}{n}W^T\mathbf{1}_n \\
&= \frac{1}{n}(H_nX)^T\mathbf{1}_n \\
&= \frac{1}{n}X^TH_n\mathbf{1}_n \\
&= \frac{1}{n}X^T\left(I_{n\times n} - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T\right)\mathbf{1}_n \\
&= \frac{1}{n}X^T\left(\mathbf{1}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T\mathbf{1}_n\right) \\
&= \frac{1}{n}X^T(\mathbf{1}_n - \mathbf{1}_n) \\
&= 0.
\end{align*}
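As a quick numerical illustration of these properties (a sketch only, using an arbitrary small n and a random X):
n=4
H=diag(n)-matrix(1/n,ncol=n,nrow=n)   # the centering matrix H_n
all.equal(H,t(H))                     # symmetry
all.equal(H,H%*%H)                    # idempotency
X=matrix(rnorm(n*3),ncol=3)
colMeans(H%*%X)                       # effectively the zero vector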
Re-writing the sample variance matrix, S, in terms of the centered data matrix W, is useful,
since we can now simplify the expression further. We first write
$$\sum_{i=1}^n w_iw_i^T = (w_1, w_2, \ldots, w_n)\begin{pmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_n^T \end{pmatrix} = W^TW,$$
We can then substitute back in our definition of W using the centering matrix to get
$$S = \frac{1}{n-1}W^TW = \frac{1}{n-1}(H_nX)^TH_nX = \frac{1}{n-1}X^TH_n^TH_nX = \frac{1}{n-1}X^TH_nX,$$
using symmetry and idempotency of Hn . This gives us the rather elegant result:
Proposition 3 The sample variance matrix can be written
$$S = \frac{1}{n-1}X^TH_nX.$$
We shall make considerable use of this result.
> H=diag(323)-matrix(1/323,ncol=323,nrow=323)
> t(galaxy)%*%H%*%as.matrix(galaxy)/(323-1)
east.west north.south angle radial.position
east.west 144.66088 -32.67993 -21.50402 263.93661
north.south -32.67993 523.84971 26.35728 -261.78938
angle -21.50402 26.35728 1462.62686 -49.01139
radial.position 263.93661 -261.78938 -49.01139 670.22991
velocity 451.46551 -1929.95131 37.21646 1637.78301
velocity
east.west 451.46551
north.south -1929.95131
angle 37.21646
radial.position 1637.78301
velocity 8886.47724
This method is clearly mathematically correct. However, the problem is that building the
n × n centering matrix and carrying out a matrix multiplication using it is a very compu-
tationally inefficient way to simply strip the mean out of a data matrix. If we wanted to
implement our own method more efficiently, we could do it along the following lines
> Wt=t(galaxy)-colMeans(galaxy)
> Wt %*% t(Wt)/(323-1)
east.west north.south angle radial.position
east.west 144.66088 -32.67993 -21.50402 263.93661
north.south -32.67993 523.84971 26.35728 -261.78938
angle -21.50402 26.35728 1462.62686 -49.01139
radial.position 263.93661 -261.78938 -49.01139 670.22991
velocity 451.46551 -1929.95131 37.21646 1637.78301
velocity
east.west 451.46551
north.south -1929.95131
angle 37.21646
radial.position 1637.78301
velocity 8886.47724
This uses a couple of R tricks, relying on the fact that R stores matrices in column-major
order and “recycles” short vectors. We can improve on this slightly by directly constructing
the outer product 1 n x̄T .
> W=galaxy-outer(rep(1,323),colMeans(galaxy))
> W=as.matrix(W)
> t(W)%*%W/(323-1)
east.west north.south angle radial.position
east.west 144.66088 -32.67993 -21.50402 263.93661
north.south -32.67993 523.84971 26.35728 -261.78938
angle -21.50402 26.35728 1462.62686 -49.01139
radial.position 263.93661 -261.78938 -49.01139 670.22991
velocity 451.46551 -1929.95131 37.21646 1637.78301
velocity
east.west 451.46551
north.south -1929.95131
angle 37.21646
radial.position 1637.78301
velocity 8886.47724
In fact, we can do even better than this by exploiting the R function sweep(), which is
intended to be used for exactly this sort of centering procedure, where some statistics are
to be “swept” out of a data frame.
> W=as.matrix(sweep(galaxy,2,colMeans(galaxy)))
> t(W)%*%W/(323-1)
east.west north.south angle radial.position
east.west 144.66088 -32.67993 -21.50402 263.93661
north.south -32.67993 523.84971 26.35728 -261.78938
angle -21.50402 26.35728 1462.62686 -49.01139
radial.position 263.93661 -261.78938 -49.01139 670.22991
velocity 451.46551 -1929.95131 37.21646 1637.78301
velocity
east.west 451.46551
north.south -1929.95131
angle 37.21646
radial.position 1637.78301
velocity 8886.47724
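Of course, R can also compute the sample variance matrix directly: the built-in var() function (equivalently cov()), applied to a data frame, returns exactly this matrix, so all of the computations above can be checked with
var(galaxy)
which reproduces the matrix shown above.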
Figure 1.8: An image of the variance matrix for the zip.train dataset
For example, consider the n = 3 observations x_1 = (2, 3)^T, x_2 = (4, 1)^T and x_3 = (6, 2)^T, which have sample mean x̄ = (4, 2)^T. Then we have
\begin{align*}
S &= \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T \\
&= \frac{1}{2}\left[\left(\begin{pmatrix}2\\3\end{pmatrix} - \begin{pmatrix}4\\2\end{pmatrix}\right)\left(\begin{pmatrix}2\\3\end{pmatrix} - \begin{pmatrix}4\\2\end{pmatrix}\right)^T + \left(\begin{pmatrix}4\\1\end{pmatrix} - \begin{pmatrix}4\\2\end{pmatrix}\right)\left(\begin{pmatrix}4\\1\end{pmatrix} - \begin{pmatrix}4\\2\end{pmatrix}\right)^T + \left(\begin{pmatrix}6\\2\end{pmatrix} - \begin{pmatrix}4\\2\end{pmatrix}\right)\left(\begin{pmatrix}6\\2\end{pmatrix} - \begin{pmatrix}4\\2\end{pmatrix}\right)^T\right] \\
&= \frac{1}{2}\left[\begin{pmatrix}-2\\1\end{pmatrix}(-2, 1) + \begin{pmatrix}0\\-1\end{pmatrix}(0, -1) + \begin{pmatrix}2\\0\end{pmatrix}(2, 0)\right] \\
&= \frac{1}{2}\left[\begin{pmatrix}4&-2\\-2&1\end{pmatrix} + \begin{pmatrix}0&0\\0&1\end{pmatrix} + \begin{pmatrix}4&0\\0&0\end{pmatrix}\right] \\
&= \frac{1}{2}\begin{pmatrix}8&-2\\-2&2\end{pmatrix} = \begin{pmatrix}4&-1\\-1&1\end{pmatrix}.
\end{align*}
Next, let's calculate using our formula based around the centering matrix. First calculate
$$H_3 = I_3 - \frac{1}{3}\mathbf{1}_3\mathbf{1}_3^T = \begin{pmatrix} \frac{2}{3} & -\frac{1}{3} & -\frac{1}{3} \\ -\frac{1}{3} & \frac{2}{3} & -\frac{1}{3} \\ -\frac{1}{3} & -\frac{1}{3} & \frac{2}{3} \end{pmatrix},$$
and then
$$S = \frac{1}{n-1}X^TH_3X = \frac{1}{2}\begin{pmatrix}2&4&6\\3&1&2\end{pmatrix}H_3\begin{pmatrix}2&3\\4&1\\6&2\end{pmatrix} = \frac{1}{2}\begin{pmatrix}8&-2\\-2&2\end{pmatrix} = \begin{pmatrix}4&-1\\-1&1\end{pmatrix},$$
as before.
So, the variances of the variables are 4 and 1, and the covariance between them is -1,
which indicates that the variables are negatively correlated.
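We can also let R check the arithmetic of this small worked example, using the three bivariate observations reconstructed above:
X=matrix(c(2,3, 4,1, 6,2),ncol=2,byrow=TRUE)
colMeans(X)   # the sample mean (4, 2)
var(X)        # variances 4 and 1, covariance -1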
Before moving on, it is worth dwelling a little more on the intermediate result
$$S = \frac{1}{n-1}W^TW,$$
which shows us that the matrix S factorises in a very nice way. We will return to matrix factorisation later, but for now it is worth noting that we can re-write the above as
$$S = A^TA$$
where
$$A = \frac{1}{\sqrt{n-1}}W.$$
Matrices which factorise in this way have two very important properties: any matrix of the form
$$S = A^TA$$
is symmetric and positive semi-definite.
Proposition 5 The sample correlation matrix may be computed from the sample variance matrix as
$$R = D^{-1}SD^{-1},$$
and conversely, the sample covariance matrix may be computed from the sample correlation matrix as
$$S = DRD.$$
Proof
This follows since pre-multiplying by a diagonal matrix re-scales the rows and post-
multiplying by a diagonal matrix re-scales the columns.
The sample correlation matrix R is also symmetric and positive semi-definite.
Proof
We have already shown that the sample covariance matrix can be written as S = AT A.
Consequently, R can be written as
R = BT B,
where B = AD−1 . We showed earlier that matrices that can be written in this form must
be symmetric and positive semi-definite.
\begin{align*}
R &= \begin{pmatrix}1/2&0\\0&1\end{pmatrix}\begin{pmatrix}4&-1\\-1&1\end{pmatrix}\begin{pmatrix}1/2&0\\0&1\end{pmatrix} \\
&= \begin{pmatrix}1/2&0\\0&1\end{pmatrix}\begin{pmatrix}2&-1\\-1/2&1\end{pmatrix} \\
&= \begin{pmatrix}1&-1/2\\-1/2&1\end{pmatrix},
\end{align*}
giving a sample correlation coefficient between the two variables of -0.5.
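R provides the convenience function cov2cor() for converting a variance matrix to the corresponding correlation matrix, so this example can be checked with
S=matrix(c(4,-1,-1,1),ncol=2)
cov2cor(S)   # off-diagonal entries equal -0.5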
Figure 1.9: An image of the sample correlation matrix for the zip.train dataset
So, the expectation of a random vector is just defined to be the vector of expected values.
Similarly, the expected value of a random matrix is the matrix of expected values. Also, if
we recall that the covariance between two scalar random variables X and Y is defined by
$$\operatorname{Cov}(X, Y) = \operatorname{E}\left\{[X - \operatorname{E}(X)][Y - \operatorname{E}(Y)]\right\}$$
(so for independent X, Y we will have Cov(X, Y) = 0), we see that the variance matrix for a random vector is defined to be the matrix of covariances between pairs of elements:
Definition 6 The variance matrix of a random p-vector X is defined by
$$\operatorname{Var}(X) \equiv \operatorname{E}\left\{[X - \operatorname{E}(X)][X - \operatorname{E}(X)]^T\right\}.$$
For an affine transformation f(x) = Ax + b, the columns of A may be recovered from f via
$$a^{(j)} = f(e_j) - b.$$
A scalar affine function of x takes the form
$$y = a_1x_1 + a_2x_2 + \cdots + a_px_p + b,$$
which we can write more compactly as
$$y = a^Tx + b.$$
Example: 2d projection
We may view the selection of variables to plot in one window of a scatterplot matrix as a 2d projection of the multivariate data. Selection of components x_k and x_j corresponds to the projection
$$y = \begin{pmatrix} e_k^T \\ e_j^T \end{pmatrix}x,$$
and hence the linear transformation with A = (e_k, e_j)^T.
More generally, we can project onto the linear subspace spanned by any two orthogonal unit p-vectors, v_1, v_2, using
$$A = (v_1, v_2)^T.$$
This follows since we can write x as
$$x = y_1v_1 + y_2v_2 + w,$$
where w is orthogonal to v_1 and v_2. From this it follows that v_i \cdot x = y_i, and the result follows. Note that, as above, A has the property AA^T = I_2. 2d projections are obviously very useful for visualisation of high-dimensional data on a computer screen.
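As an illustration (a sketch only, with an arbitrary choice of the two orthogonal unit vectors), we could project the 5-dimensional galaxy data onto a 2d subspace and plot the result as follows:
v1=c(1,1,0,0,0)/sqrt(2)       # two orthogonal unit vectors in R^5
v2=c(0,0,0,0,1)
A=rbind(v1,v2)
Y=as.matrix(galaxy)%*%t(A)    # n x 2 matrix of projected observations
plot(Y,pch=19,col=galaxy$angle)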
Now consider a random p-vector X transformed as
$$Y = AX + b,$$
where A is a q × p matrix and b is a q-vector. Then the expectation and variance of Y are given by
$$\operatorname{E}(Y) = A\operatorname{E}(X) + b, \qquad \operatorname{Var}(Y) = A\operatorname{Var}(X)A^T.$$
Proof
First consider E(Y):
\begin{align*}
\operatorname{E}(Y) &= \operatorname{E}(AX + b) \\
&= \operatorname{E}\left[\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{q1} & a_{q2} & \cdots & a_{qp} \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix} + \begin{pmatrix} b_1 \\ \vdots \\ b_q \end{pmatrix}\right] \\
&= \operatorname{E}\begin{pmatrix} a_{11}X_1 + a_{12}X_2 + \cdots + a_{1p}X_p + b_1 \\ \vdots \\ a_{q1}X_1 + a_{q2}X_2 + \cdots + a_{qp}X_p + b_q \end{pmatrix} \\
&= \begin{pmatrix} \operatorname{E}(a_{11}X_1 + a_{12}X_2 + \cdots + a_{1p}X_p + b_1) \\ \vdots \\ \operatorname{E}(a_{q1}X_1 + a_{q2}X_2 + \cdots + a_{qp}X_p + b_q) \end{pmatrix} \\
&= \begin{pmatrix} a_{11}\operatorname{E}(X_1) + a_{12}\operatorname{E}(X_2) + \cdots + a_{1p}\operatorname{E}(X_p) + b_1 \\ \vdots \\ a_{q1}\operatorname{E}(X_1) + a_{q2}\operatorname{E}(X_2) + \cdots + a_{qp}\operatorname{E}(X_p) + b_q \end{pmatrix} \\
&= A\operatorname{E}(X) + b.
\end{align*}
Once we have established linearity of expectation for vector random quantities, the variance result is relatively straightforward:
\begin{align*}
\operatorname{Var}(Y) &= \operatorname{E}\left\{[Y - \operatorname{E}(Y)][Y - \operatorname{E}(Y)]^T\right\} \\
&= \operatorname{E}\left\{[AX + b - A\operatorname{E}(X) - b][AX + b - A\operatorname{E}(X) - b]^T\right\} \\
&= \operatorname{E}\left\{A[X - \operatorname{E}(X)][X - \operatorname{E}(X)]^TA^T\right\} \\
&= A\operatorname{E}\left\{[X - \operatorname{E}(X)][X - \operatorname{E}(X)]^T\right\}A^T \\
&= A\operatorname{Var}(X)A^T.
\end{align*}
We are now in a position to establish that the variance matrix of a random vector
shares two important properties with sample covariance matrices.
Proof
For symmetry, note that
$$\operatorname{Var}(X)^T = \left(\operatorname{E}\left\{[X - \operatorname{E}(X)][X - \operatorname{E}(X)]^T\right\}\right)^T = \operatorname{E}\left\{\left([X - \operatorname{E}(X)][X - \operatorname{E}(X)]^T\right)^T\right\} = \operatorname{Var}(X).$$
Now consider a scalar linear combination
$$Y = \alpha^TX + b,$$
where α is a given fixed p-vector and b is a fixed scalar. From our above results it is clear that
$$\operatorname{E}(Y) = \alpha^T\operatorname{E}(X) + b$$
and
$$\operatorname{Var}(Y) = \alpha^T\operatorname{Var}(X)\alpha.$$
Importantly, we note that the positive semi-definiteness of Var(X) corresponds to the fact
that there are no linear combinations of X with negative variance.
Now suppose that Z is a p-vector of independent random quantities, each with mean zero and variance one, so that E(Z) = 0 and Var(Z) = I_p, and put
$$Y = AZ + \mu.$$
Then
$$\operatorname{E}(Y) = A\operatorname{E}(Z) + \mu = \mu$$
and
$$\operatorname{Var}(Y) = A\operatorname{Var}(Z)A^T = A\,I_p\,A^T = AA^T.$$
Notes:
1. Suppose we would like to be able to simulate random quantities with a given ex-
pectation µ and variance Σ. The above transformation provides a recipe to do so,
provided that we can simulate iid random quantities with mean zero and variance
one, and we have a method to find a matrix A such that AAT = Σ. We will examine
techniques for factorising Σ in the next chapter.
2. The special case where the elements of Z are iid N (0, 1) is of particular importance,
as then Y is said to have a multivariate normal distribution. We will investigate this
case in more detail later.
Similarly, if we apply the affine transformation
$$y = Ax + b$$
to every observation (row) of an n × p data matrix X, we obtain the new data matrix
$$Y = XA^T + \mathbf{1}_nb^T.$$
Again, we would like to know how the sample mean ȳ and variance S_Y of the new data matrix Y relate to the sample mean x̄ and variance matrix S_X of X.
Proposition 10 The sample mean and variance of Y are related to the sample mean and variance of X by
$$\bar{y} = A\bar{x} + b, \qquad S_Y = AS_XA^T.$$
Proof
Let’s start with the sample mean,
\begin{align*}
\bar{y} &= \frac{1}{n}Y^T\mathbf{1}_n \\
&= \frac{1}{n}(XA^T + \mathbf{1}_nb^T)^T\mathbf{1}_n \\
&= \frac{1}{n}(AX^T\mathbf{1}_n + b\,\mathbf{1}_n^T\mathbf{1}_n) \\
&= \frac{1}{n}AX^T\mathbf{1}_n + b \\
&= A\bar{x} + b.
\end{align*}
The sample variance matrix result is left as an exercise.
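As a quick numerical check of Proposition 10, we can apply an arbitrary 2 × 5 matrix A and 2-vector b to the galaxy data:
A=matrix(c(1,1,0,0,0, 0,0,0,0,1),ncol=5,byrow=TRUE)
b=c(0,1)
X=as.matrix(galaxy)
Y=X%*%t(A)+outer(rep(1,nrow(X)),b)
max(abs(colMeans(Y)-(A%*%colMeans(X)+b)))   # effectively zero
max(abs(var(Y)-A%*%var(X)%*%t(A)))          # effectively zero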
using the rotation matrix
$$A = \begin{pmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{pmatrix},$$
where θ = π/18. We can then apply this transformation to the data matrix and re-plot with
A=matrix(c(cos(pi/18),-sin(pi/18),sin(pi/18),cos(pi/18)),ncol=2,byrow=TRUE)
subRotated=as.matrix(subCentered) %*% t(A)
plot(subRotated,col=galaxy$angle,pch=19)
giving the plot shown in Figure 1.11.
Figure 1.10: A plot of velocity against radial position for the centered galaxy data
[Figure 1.11: plot of the rotated data, subRotated[,2] against subRotated[,1], coloured by angle.]
Figure 1.12: Histogram of the mean pixel intensity for each image in the zip.train data
We could apply this transformation to our data in a variety of ways. First, directly from the
transformation as
y=zip.train[,-1] %*% rep(1/256,256)
We can histogram the resulting vector of mean image intensities with
hist(y,30,col=2)
giving the image shown in Figure 1.12.
Alternatively, we could have used the apply() function to create the vector using y=apply(zip.train[,-1],1,mean). If we are interested in the variance of this vector, we
can obviously compute it using
> var(y)
[1] 0.0345547
On the other hand, if we still have the 256 × 256 variance matrix for the image data in a
variable v from a previous example, we could have computed this variance directly without
ever explicitly constructing the transformation as
> rep(1/256,256) %*% v %*% rep(1/256,256)
[,1]
[1,] 0.0345547
A random 2-vector constructed as
$$Y = AZ,$$
Figure 1.13: Scatterplot of some simulated correlated samples with correlation 1/√2
where A is a 2 × 2 matrix and Z is a 2-vector of independent random quantities with mean zero and variance one, will have variance matrix AA^T. We can investigate this using R with the matrix
$$A = \begin{pmatrix}1 & 0 \\ 1 & 1\end{pmatrix}$$
as follows
> Z=matrix(rnorm(2000),ncol=2)
> A=matrix(c(1,0,1,1),ncol=2,byrow=TRUE)
> Y=Z %*% t(A)
> plot(Y,col=2,pch=19)
> A %*% t(A)
[,1] [,2]
[1,] 1 1
[2,] 1 2
> var(Y)
[,1] [,2]
[1,] 1.0194604 0.9926367
[2,] 0.9926367 1.9257760
So we see that the sample variance matrix for 1,000 simulated observations of Y is very close to the true variance matrix. The sample correlation is also close to the true correlation of 1/√2. The plot of the samples is shown in Figure 1.13.
Consider the 4 × 3 data matrix
$$X = \begin{pmatrix}1&2&3\\2&3&4\\3&2&1\\1&3&2\end{pmatrix}.$$
Suppose that we wish to construct a new data matrix Y based on the affine transformation
$$y = \begin{pmatrix}1&1&0\\0&0&1\end{pmatrix}x + \begin{pmatrix}0\\1\end{pmatrix}.$$
Construct the 4 × 2 data matrix Y.
\begin{align*}
Y &= XA^T + \mathbf{1}_4b^T = \begin{pmatrix}1&2&3\\2&3&4\\3&2&1\\1&3&2\end{pmatrix}\begin{pmatrix}1&0\\1&0\\0&1\end{pmatrix} + \begin{pmatrix}0&1\\0&1\\0&1\\0&1\end{pmatrix} \\
&= \begin{pmatrix}3&3\\5&4\\5&1\\4&2\end{pmatrix} + \begin{pmatrix}0&1\\0&1\\0&1\\0&1\end{pmatrix} \\
&= \begin{pmatrix}3&4\\5&5\\5&2\\4&3\end{pmatrix}.
\end{align*}
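We can check this in R with a direct transcription of the calculation:
X=matrix(c(1,2,3, 2,3,4, 3,2,1, 1,3,2),ncol=3,byrow=TRUE)
A=matrix(c(1,1,0, 0,0,1),ncol=3,byrow=TRUE)
b=c(0,1)
X%*%t(A)+outer(rep(1,4),b)   # the 4 x 2 data matrix Y computed above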
We now have all of the basic skills we need to be able to begin to think about more
sophisticated methods of multivariate data analysis. See Chapter 1 of [HTF] and Chapters
1 and 2 of [Everitt] for further background and details relating to the material in this
chapter.
Chapter 2
PCA and matrix factorisations
2.1 Introduction
2.1.1 Factorisation, inversion and linear systems
It should be clear by now that linear algebra and matrix computations are central to mul-
tivariate statistics. For the analysis of large and complex datasets it is essential to have
a basic knowledge of computational linear algebra, and this starts with an understanding
of the efficient solution of linear systems, and the role of various different matrix factori-
sations, often known as matrix decompositions, to transform general linear systems into
very special linear systems that are much easier to solve.
We have already seen a couple of instances of matrix factorisation. First, we showed
that the sample covariance matrix, S, could be decomposed as
S = AT A
for some matrix A. We were able to use this fact to deduce important properties of S.
Similarly, we showed that applying a matrix A to a vector of independent random quantities
with unit variance resulted in a vector with covariance matrix AAT . We noted that if we
were interested in simulating random quantities with a given covariance matrix Σ, we
could do this if we were able to find a matrix A such that
Σ = AAT .
to approach the problem. It is hardly ever necessary to directly compute the inverse of
a matrix. There are usually far more efficient ways to solve a linear system, sometimes
by solving the linear system directly (say, using Gaussian elimination), or more usually by
solving in multiple steps using matrix factorisations. By way of motivation, suppose that
we have an efficient way to factorise A as
A = LU,
where L and U are lower and upper triangular matrices, respectively.∗ We can then solve
Ax = b by writing the problem as
LUx = b.
If we define v = Ux, we can first solve
$$Lv = b$$
for v, and then solve
$$Ux = v$$
for x; both of these triangular systems are easy to solve, as we will see below.
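For example, a sketch in R, assuming the triangular factors L and U are already available (the forwardsolve() and backsolve() functions used here are discussed in detail below):
L=matrix(c(2,0, 1,4),ncol=2,byrow=TRUE)   # lower triangular factor
U=matrix(c(1,3, 0,2),ncol=2,byrow=TRUE)   # upper triangular factor
A=L%*%U
b=c(8,14)
v=forwardsolve(L,b)   # solve L v = b for v
x=backsolve(U,v)      # solve U x = v for x
A%*%x-b               # effectively zero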
Definition 7 An n × n matrix A is lower triangular if all elements above the leading diag-
onal are zero, that is aij = 0, ∀j > i. Similarly, a matrix is upper triangular if all elements
below the leading diagonal are zero, that is aij = 0, ∀i > j.
Clearly only diagonal matrices are both upper and lower triangular. Upper and lower
triangular matrices have very similar properties (since one is just a transpose of the other).
We will focus mainly on lower triangular matrices. It should be assumed that there are
exactly analogous properties for upper triangular matrices unless otherwise stated.
Proposition 11
1. The sum of two lower triangular matrices is also lower triangular
2. The product of two lower triangular matrices is also lower triangular
3. The inverse of a (non-singular) lower triangular matrix is also lower triangular
4. The determinant of a lower triangular matrix is the product of the diagonal elements
∗
We will explain what this means in the next section. For now it suffices to know that they are just special matrices.
5. A lower triangular matrix is invertible iff all diagonal elements are non-zero
6. The eigenvalues of a lower triangular matrix are the diagonal elements of the matrix
Proof
2. Left as an exercise
3. Thinking about using the Gaussian elimination method to construct this inverse, it is
clear that it will only be necessary to use row operations which are lower triangular,
leading to a lower triangular inverse
4. Direct expansion of the determinant along the top row in terms of minor determi-
nants, applied recursively, makes this clear
5. The determinant will be non-zero iff all diagonal elements are non-zero, by previous
result
6. The eigenvalues are given by the roots of |A − λ I |. This matrix is lower triangular,
so the determinant is the product of diagonal elements.
In addition to all of the properties of lower triangular matrices, unit lower triangular matri-
ces have some additional useful properties.
Proposition 12
1. The product of two unit lower triangular matrices is unit lower triangular
2. The inverse of a unit lower triangular matrix is unit lower triangular
3. The determinant of a unit lower triangular matrix is 1, and hence all unit lower triangular matrices are invertible
4. The eigenvalues of a unit lower triangular matrix are all equal to 1
Proof
1. Left as an exercise
4. |L − λI| = (1 − λ)^n.
It is clear that a non-singular lower triangular matrix A can be factorised in the form
A = DL
where D is diagonal and L is unit lower triangular, by choosing D = diag {a11 , a22 , . . . , ann }
and L = D−1 A.
For example, for
$$A = \begin{pmatrix}2 & 0 \\ -1 & 4\end{pmatrix},$$
if we put
$$D = \operatorname{diag}\{2, 4\}, \qquad D^{-1} = \operatorname{diag}\{1/2, 1/4\},$$
we get
$$L = D^{-1}A = \begin{pmatrix}1/2 & 0 \\ 0 & 1/4\end{pmatrix}\begin{pmatrix}2 & 0 \\ -1 & 4\end{pmatrix} = \begin{pmatrix}1 & 0 \\ -1/4 & 1\end{pmatrix}$$
and we are done. The resulting factorisation is
$$\begin{pmatrix}2 & 0 \\ -1 & 4\end{pmatrix} = \begin{pmatrix}2 & 0 \\ 0 & 4\end{pmatrix}\begin{pmatrix}1 & 0 \\ -1/4 & 1\end{pmatrix}.$$
Lx = b,
where n × n L and n-dimensional b are given, and a solution for n-dimensional x is re-
quired. We will assume that L is invertible (no zeros on the diagonal), so that the solution
is unique. If we re-write the equation in component form, a simple solution strategy be-
comes apparent:
$$\begin{pmatrix} l_{11} & 0 & \cdots & 0 \\ l_{21} & l_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & \cdots & l_{nn} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}.$$
From the first equation, x_1 = b_1/l_{11}; substituting this into the second gives x_2 = (b_2 - l_{21}x_1)/l_{22}, and in general
$$x_i = \left(b_i - \sum_{j=1}^{i-1}l_{ij}x_j\right)\Big/l_{ii}.$$
Iterative construction of the solution in this way is known as forward substitution. Clearly
the algorithm would fail if any lii = 0, but this corresponds to the rank degenerate case.
For the invertible case we are considering, the algorithm provides a simple and efficient
method for computing a solution.
$$\begin{pmatrix}2 & 0 \\ 1 & 4\end{pmatrix}\begin{pmatrix}x_1 \\ x_2\end{pmatrix} = \begin{pmatrix}4 \\ 6\end{pmatrix},$$
so, starting with the first equation we have
$$2x_1 = 4 \;\Rightarrow\; x_1 = 2.$$
The second equation can then be solved as
$$x_1 + 4x_2 = 6 \;\Rightarrow\; x_2 = \frac{6 - x_1}{4} = \frac{6 - 2}{4} = 1,$$
so our solution is x = (2, 1)^T.
> L=matrix(c(2,0,0,1,3,0,2,3,4),ncol=3,byrow=TRUE)
> L
[,1] [,2] [,3]
[1,] 2 0 0
[2,] 1 3 0
[3,] 2 3 4
> forwardsolve(L,c(6,9,28))
[1] 3 2 4
>
The solution is therefore given by x = (3, 2, 4)T .
The solution of upper triangular systems is analogous, but looks a little different, so
we will consider it explicitly. Suppose that U is an upper triangular matrix, and we want to
solve the system
Ux = b
for x. Again, we write out the system in component form:
$$\begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ 0 & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & u_{nn} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}.$$
We now see that the system is easiest to solve starting with the last equation,
$$u_{nn}x_n = b_n \;\Rightarrow\; x_n = b_n/u_{nn},$$
then
$$u_{n-1,n-1}x_{n-1} + u_{n-1,n}x_n = b_{n-1} \;\Rightarrow\; x_{n-1} = (b_{n-1} - u_{n-1,n}x_n)/u_{n-1,n-1},$$
and so on, with x_i given by
$$x_i = \left(b_i - \sum_{j=i+1}^n u_{ij}x_j\right)\Big/u_{ii}.$$
Example: solve the upper triangular system
$$\begin{pmatrix}2 & 2 & 3\\0 & 2 & -1\\0 & 0 & 3\end{pmatrix}x = \begin{pmatrix}8\\6\\-6\end{pmatrix}$$
for x using R.
This is easily accomplished as the following R session shows
> R=matrix(c(2,2,3,0,2,-1,0,0,3),ncol=3,byrow=TRUE)
> R
[,1] [,2] [,3]
[1,] 2 2 3
[2,] 0 2 -1
[3,] 0 0 3
> backsolve(R,c(8,6,-6))
[1] 5 2 -2
So the solution is x = (5, 2, −2)T .
$$A = LU.$$
Clearly then, since |A| = |LU| = |L||U| = |U|, we have $|A| = \prod_{i=1}^n u_{ii}$, and this is an efficient way to compute the determinant in the rare cases where it is really needed.
The factorisation is just a matrix representation of the procedure of using Gaussian
elimination to zero out the lower triangle of A to create U. L represents the (inverse of
the) row operations required to achieve this. We will not give a formal derivation of the
construction, but instead give a rough outline. First we note that the row operations used
in Gaussian elimination can be represented by matrices. So the row operation which adds
λ times row i to row j can be represented by the matrix
$$M_{ij}(\lambda) = I + \lambda e_je_i^T,$$
and λ is known as the multiplier of the operation. The row operations needed to zero out
the lower diagonal will all have i < j, so Mij (λ) will be unit lower triangular, and therefore
the product of all of the row operations will also be unit lower triangular. Row operations
are easily inverted, as
$$M_{ij}(\lambda)^{-1} = I - \lambda e_je_i^T.$$
This can be verified by multiplication, noting particularly how the final product term van-
ishes. It will require n(n − 1)/2 such row operations to zero out the lower triangle. If
we number them sequentially, so that M1 is the row operation to zero out a21 , M2 is the
row operation to zero out a31 , etc., then the matrix representing the complete set of row
operations is
$$M = M_{n(n-1)/2}\cdots M_2M_1.$$
Then if we denote the upper triangle that we are left with by U we have
$$MA = U \;\Rightarrow\; A = M^{-1}U = LU.$$
Start with
$$A = \begin{pmatrix}1&4&7\\2&5&8\\3&6&10\end{pmatrix}$$
and zero out the first column. The required multipliers are −2 and −3 respectively, so we record their "inverses", 2 and 3, in the locations we have zeroed out:
$$\begin{pmatrix}1&4&7\\2&-3&-6\\3&-6&-11\end{pmatrix}.$$
Now we zero out the second column with the multiplier −2, and so we record the number 2 in the position we zero out:
$$\begin{pmatrix}1&4&7\\2&-3&-6\\3&2&1\end{pmatrix}.$$
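Reading off the unit lower triangle L from the recorded multipliers and the upper triangle U from the eliminated entries, we can check the factorisation in R:
L=matrix(c(1,0,0, 2,1,0, 3,2,1),ncol=3,byrow=TRUE)
U=matrix(c(1,4,7, 0,-3,-6, 0,0,1),ncol=3,byrow=TRUE)
L%*%U   # recovers the original matrix A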
Proposition 15 If
$$A = LDL^T$$
is the LDL^T decomposition of a symmetric positive definite matrix A, then the diagonal elements of D are all strictly positive.
Now since the elements of D are all strictly positive, we can define D^{1/2} to be the matrix whose diagonal elements are the square root of the corresponding elements of D, and then we have the important result that any symmetric positive definite A may be factorised as
$$A = GG^T,$$
where G is lower triangular; this is known as the Cholesky decomposition (or Cholesky factorisation) of A.
Proof
Start with the decomposition A = LDL^T and put G = LD^{1/2}.
The Cholesky decomposition has many important applications in multivariate statistics
and data analysis, and we will examine a few of these shortly, and more later in the course.
Note that if we are interested in the determinant of A, we have |A| = |G||G^T| = |G|^2, and $|G| = \prod_{i=1}^n g_{ii}$, so this provides an efficient way to compute the determinant of a
symmetric positive definite matrix. Obviously we can construct the decomposition starting
from a basic LU factorisation, but that is a very inefficient method. It turns out to be quite
straightforward to derive a fairly efficient algorithm for its construction from first principles.
Direct construction
We can write out the Cholesky decomposition in component form as
Somewhat analogous to the technique of forward substitution, we can work through the
equations one by one, starting from the top left, working row by row through the non-zero
elements of G. The first few equations can be solved as follows:
\begin{align*}
g_{11}^2 &= a_{11} &&\Rightarrow& g_{11} &= \sqrt{a_{11}} \\
g_{21}g_{11} &= a_{21} &&\Rightarrow& g_{21} &= \frac{a_{21}}{g_{11}} \\
g_{21}^2 + g_{22}^2 &= a_{22} &&\Rightarrow& g_{22} &= \sqrt{a_{22} - g_{21}^2} \\
g_{31}g_{11} &= a_{31} &&\Rightarrow& g_{31} &= \frac{a_{31}}{g_{11}} \\
g_{31}g_{21} + g_{32}g_{22} &= a_{32} &&\Rightarrow& g_{32} &= \frac{a_{32} - g_{31}g_{21}}{g_{22}},
\end{align*}
and so on. We see that if we work through the equations in order, we know enough at each stage to solve explicitly for the next unknown. In general we have
$$g_{ii} = \sqrt{a_{ii} - \sum_{k=1}^{i-1}g_{ik}^2}, \qquad g_{ij} = \frac{a_{ij} - \sum_{k=1}^{j-1}g_{ik}g_{jk}}{g_{jj}}, \quad i > j.$$
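A direct (and completely unoptimised) transcription of this construction into R might look as follows; this is a sketch only, since in practice one would use R's built-in chol() function. The test matrix here is the one used in the example below.
mychol=function(A) {
  n=nrow(A)
  G=matrix(0,n,n)
  for (i in 1:n) {
    for (j in seq_len(i-1))
      G[i,j]=(A[i,j]-sum(G[i,seq_len(j-1)]*G[j,seq_len(j-1)]))/G[j,j]
    G[i,i]=sqrt(A[i,i]-sum(G[i,seq_len(i-1)]^2))
  }
  G
}
A=matrix(c(1,-1,2, -1,5,0, 2,0,14),ncol=3,byrow=TRUE)
mychol(A)    # lower triangular Cholesky factor
t(chol(A))   # agrees with base R (chol() returns the upper triangle)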
We now construct the LDM^T decomposition by extracting the diagonal of U and transposing the result to get
$$\begin{pmatrix}1&-1&2\\-1&5&0\\2&0&14\end{pmatrix} = \begin{pmatrix}1&0&0\\-1&1&0\\2&1/2&1\end{pmatrix}\begin{pmatrix}1&0&0\\0&2^2&0\\0&0&3^2\end{pmatrix}\begin{pmatrix}1&0&0\\-1&1&0\\2&1/2&1\end{pmatrix}^T.$$
At this point we note that, due to the symmetry of A, we have L = M, and so this is in fact the LDL^T decomposition. Since the elements of D are all positive, A must be strictly positive definite, and we can construct the Cholesky factor as LD^{1/2} to get
$$\begin{pmatrix}1&0&0\\-1&1&0\\2&1/2&1\end{pmatrix}\begin{pmatrix}1&0&0\\0&2&0\\0&0&3\end{pmatrix} = \begin{pmatrix}1&0&0\\-1&2&0\\2&1&3\end{pmatrix},$$
as before.
> g
east.west north.south angle radial.position
east.west 12.027505 0.0000000 0.00000000 0.000000
north.south -2.717100 22.7259122 0.00000000 0.000000
angle -1.787904 0.9460287 38.19077490 0.000000
radial.position 21.944419 -8.8957576 -0.03564306 10.465974
velocity 37.536090 -80.4351441 4.72421244 9.431724
velocity
east.west 0.00000
north.south 0.00000
angle 0.00000
radial.position 0.00000
velocity 29.94046
> g%*%c
east.west north.south angle radial.position
east.west 144.66088 -32.67993 -21.50402 263.93661
north.south -32.67993 523.84971 26.35728 -261.78938
angle -21.50402 26.35728 1462.62686 -49.01139
radial.position 263.93661 -261.78938 -49.01139 670.22991
velocity 451.46551 -1929.95131 37.21646 1637.78301
velocity
east.west 451.46551
north.south -1929.95131
angle 37.21646
radial.position 1637.78301
velocity 8886.47724
>
Note that R returns the upper triangle of the Cholesky factorisation, which can easily be
transposed to give the lower triangle if required. We next consider why it might be useful
to be able to compute the Cholesky factorisation of a variance matrix.
Given a p-vector Z of independent random quantities, each with mean zero and variance one, the transformed random vector
$$X = \mu + GZ,$$
where G is a matrix satisfying Σ = GG^T, has the required mean µ and variance Σ.
We can also invert this affine transformation, to get
$$Z = G^{-1}(X - \mu),$$
a method for transforming quantities with a given mean and variance to standard uncor-
related form. This kind of standardisation transformation also has many applications, and
we shall look at the symmetric version of this kind of transformation later. Note also that
the above equation does not require us to compute the inverse of G. We are merely using
the equation as shorthand for solving the linear system
GZ = X − µ
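For example, a sketch of this simulation recipe in R, with an arbitrary choice of target mean and variance matrix:
Sigma=matrix(c(4,-1, -1,1),ncol=2)   # target variance matrix (arbitrary example)
mu=c(1,2)                            # target mean (arbitrary example)
G=t(chol(Sigma))                     # lower triangular factor, Sigma = G G^T
Z=matrix(rnorm(2*1000),nrow=2)       # iid variates with mean zero and variance one
X=t(mu+G%*%Z)                        # 1,000 simulated observations
colMeans(X)                          # approximately mu
var(X)                               # approximately Sigma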
Orthonormalising the columns of A (by a procedure such as Gram-Schmidt) can be written in matrix form as
$$AC = Q$$
for some upper triangular matrix C, but then
$$A = QR,$$
where R = C^{-1}. There are various variations on the precise way that this decomposition is computed, but these will not concern us. This decomposition has many potential applications in statistical problems.
We can compute and manipulate QR factorisations using R.
> A=matrix(c(1,2,3,4,5,6),ncol=2)
> A
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> myQR=qr(A)
> qr.Q(myQR)
[,1] [,2]
[1,] -0.2672612 0.8728716
[2,] -0.5345225 0.2182179
[3,] -0.8017837 -0.4364358
> qr.R(myQR)
[,1] [,2]
[1,] -3.741657 -8.552360
[2,] 0.000000 1.963961
> qr.Q(myQR)%*%qr.R(myQR)
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> t(qr.Q(myQR))%*%qr.Q(myQR)
[,1] [,2]
[1,] 1.000000e+00 -5.144539e-17
[2,] -5.144539e-17 1.000000e+00
In the case where A is square (n = p), we have
$$|A| = |QR| = |Q||R| = \pm|R| = \pm\prod_{i=1}^n r_{ii}$$
(since Q is orthogonal, |Q| = ±1), so the QR factorisation also provides an efficient way of computing determinants.
W = Hn X,
where, again, we are just using Hn X as shorthand for X − 1 n x̄T — we are not suggesting
actually constructing the centering matrix Hn .† Next construct the QR decomposition of W
as
W = QR.
Putting
1
G= √ RT
n−1
gives lower triangular G as the Cholesky factor of the sample variance matrix for X. We
can verify this as follows:
S = (1/(n−1)) W^T W
  = (1/(n−1)) (QR)^T QR
  = (1/(n−1)) R^T R
  = G G^T.
Similarly, if we set
Z = √(n−1) Q,
then Z has mean zero and variance the identity, since
S_Z = (1/(n−1)) Z^T Z
    = Q^T Q
    = I_p.
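As a sketch, this construction can be checked numerically on the galaxy data used throughout this chapter (assuming the galaxy data frame is loaded as elsewhere in the notes):

W = sweep(as.matrix(galaxy), 2, colMeans(galaxy))   # centred data matrix
QR = qr(W)
G = t(qr.R(QR)) / sqrt(nrow(galaxy) - 1)            # lower triangular factor
range(G %*% t(G) - var(galaxy))                     # numerically zero: G G^T = S
Z = sqrt(nrow(galaxy) - 1) * qr.Q(QR)
var(Z)                                              # approximately the identity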
Ax = b
QRx = b
⇒ Rx = QT b,
y = Xβ + ε.
Given data y and X we want to find the β which minimises ||ε||_2^2 = ε^T ε. Differentiating with respect to the vector β and equating to zero leads to the so-called "normal equations"
XT Xβ = XT y.
The classical method for solving these equations was via a Cholesky decomposition for
XT X and then backward and forward substitution. However, it turns out to be more efficient
and numerically stable to use the QR factorisation of X, as
|A − λ I | = 0.
Since this is a polynomial in λ of degree n, there are n (not necessarily distinct) eigenval-
ues of A. In general, some of these eigenvalues will be complex, but not if A is symmetric.
G−1 = VD−1/2 VT ,
which can be very useful in several contexts. The symmetry of this square root matrix is
also desirable in some applications. However, the Cholesky factorisation is much faster
than an eigen-solver, so in the many applications where the cheaper Cholesky factor will
be adequate, it is usually to be preferred.
Similarly,
A − λ_2 I = ( 7  7 )
            ( 7  7 ),
so
v_2 = (1/√2) (  1 )
             ( -1 ).
Consequently, our spectral decomposition is A = VDV^T, where
V = (1/√2) ( 1   1 ),    D = ( 16  0 )
           ( 1  -1 )         (  0  2 ).
Now we have the spectral decomposition, it is straightforward to construct the symmetric square root as
G = V D^{1/2} V^T
  = (1/2) ( 1   1 ) ( 4   0 ) ( 1   1 )
          ( 1  -1 ) ( 0  √2 ) ( 1  -1 )
  = (1/2) ( 1   1 ) (  4    4 )
          ( 1  -1 ) ( √2  -√2 )
  = (1/2) ( 4+√2  4-√2 )
          ( 4-√2  4+√2 ).
Our proposed square root is clearly symmetric, and we can verify that it is indeed a square root with
G^2 = (1/4) ( 4+√2  4-√2 ) ( 4+√2  4-√2 ) = (1/4) ( 36  28 ) = A.
            ( 4-√2  4+√2 ) ( 4-√2  4+√2 )         ( 28  36 )
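The example can be checked in R (the matrix here is A = (9 7; 7 9), as implied by the eigenvalues above and by the final verification):

A = matrix(c(9, 7, 7, 9), ncol = 2)
e = eigen(A)
e$values                                                 # 16 and 2
G = e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)
G                                                        # the symmetric square root
G %*% G                                                  # recovers A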
Z = Σ−1/2 (X − µ),
where Σ−1/2 is the inverse of the symmetric square root of Σ. Z will have mean 0 and
variance I.
Similarly, if X is an n × p data matrix, the Mahalanobis transform is
Z = Hn XS−1/2 ,
though again, it must be emphasised that this is a mathematical description of the trans-
formation, and not a recipe for how to construct it numerically. Z will have sample mean
zero and sample variance Ip .
$vectors
[,1] [,2] [,3] [,4] [,5]
[1,] 0.051396600 0.02254908 0.45279536 0.1122205 0.88274167
[2,] -0.208493672 -0.01789604 0.31495251 0.8879502 -0.26183862
[3,] 0.002463535 -0.99826125 0.04727651 -0.0346832 0.00551559
[4,] 0.182722226 0.05000384 0.82508240 -0.3634307 -0.38893366
[5,] 0.959424462 -0.01205694 -0.11307185 0.2562542 -0.03013096
> e$vectors%*%t(e$vectors)
[,1] [,2] [,3] [,4]
[1,] 1.000000e+00 -7.224852e-17 -4.978182e-17 1.115915e-16
[2,] -7.224852e-17 1.000000e+00 1.760090e-16 -8.700722e-18
[3,] -4.978182e-17 1.760090e-16 1.000000e+00 2.170533e-16
[4,] 1.115915e-16 -8.700722e-18 2.170533e-16 1.000000e+00
[5,] 2.232271e-17 -1.676939e-16 -1.400287e-16 8.567399e-17
[,5]
[1,] 2.232271e-17
[2,] -1.676939e-16
[3,] -1.400287e-16
[4,] 8.567399e-17
[5,] 1.000000e+00
> e$vectors%*%diag(e$values)%*%t(e$vectors)
[,1] [,2] [,3] [,4] [,5]
[1,] 144.66088 -32.67993 -21.50402 263.93661 451.46551
[2,] -32.67993 523.84971 26.35728 -261.78938 -1929.95131
[3,] -21.50402 26.35728 1462.62686 -49.01139 37.21646
[4,] 263.93661 -261.78938 -49.01139 670.22991 1637.78301
[5,] 451.46551 -1929.95131 37.21646 1637.78301 8886.47724
> G=e$vectors%*%diag(sqrt(e$values))%*%t(e$vectors)
> G
[,1] [,2] [,3] [,4] [,5]
[1,] 8.6098647 1.8123819 -0.3859399 7.2496158 3.8132008
[2,] 1.8123819 13.3404531 0.7001528 -0.2300872 -18.4947058
[3,] -0.3859399 0.7001528 38.2218104 -0.9113324 0.5003811
[4,] 7.2496158 -0.2300872 -0.9113324 20.2249833 14.4131733
[5,] 3.8132008 -18.4947058 0.5003811 14.4131733 91.2244082
> G%*%G
[,1] [,2] [,3] [,4] [,5]
[1,] 144.66088 -32.67993 -21.50402 263.93661 451.46551
[2,] -32.67993 523.84971 26.35728 -261.78938 -1929.95131
[3,] -21.50402 26.35728 1462.62686 -49.01139 37.21646
[4,] 263.93661 -261.78938 -49.01139 670.22991 1637.78301
[5,] 451.46551 -1929.95131 37.21646 1637.78301 8886.47724
> Ginv=e$vectors%*%diag(1/sqrt(e$values))%*%t(e$vectors)
> G%*%Ginv
[,1] [,2] [,3] [,4]
[1,] 1.000000e+00 -2.478825e-16 -2.061869e-17 1.212003e-16
[2,] -2.428952e-16 1.000000e+00 3.009745e-16 1.170261e-16
[3,] -1.111113e-15 8.248890e-16 1.000000e+00 1.020284e-15
[4,] 6.617055e-16 -3.846478e-16 2.830987e-16 1.000000e+00
[5,] 8.075138e-16 -1.285430e-15 -4.671759e-16 -4.477755e-17
[,5]
[1,] 1.171616e-17
[2,] -2.679606e-16
[3,] -1.481901e-16
[4,] 8.386304e-17
[5,] 1.000000e+00
Now we have Ginv we can use it to construct the Mahalanobis transform of the data
> W=sweep(galaxy,2,colMeans(galaxy))
> Z=as.matrix(W)%*%Ginv
> colMeans(Z)
[1] 4.419302e-16 -2.600506e-15 2.784151e-17 1.075502e-15
[5] -1.821798e-15
> var(Z)
[,1] [,2] [,3] [,4]
[1,] 1.000000e+00 6.811770e-16 -8.034862e-16 1.586732e-15
Mahalanobis distance
Given an observation x of a random vector X with E(X) = µ and Var(X) = Σ, it is
natural to wonder how “unusual” it is, in some appropriate sense. We cannot give a very
formal solution to this problem until we develop some multivariate distribution theory, but
even now we can give an intuitive explanation of the most commonly used summary. In
some sense we want to know how “far” the observation is from the mean, but in general
some components are more variable than others, so it doesn’t make sense to look at
“distance” on the scale of the original variables. However, it does make perfect sense for
a standardised observation, z. So the Mahalanobis distance of x from µ is just
||z|| = \sqrt{ z^T z }
      = \sqrt{ [Σ^{-1/2}(x − µ)]^T Σ^{-1/2}(x − µ) }
      = \sqrt{ (x − µ)^T Σ^{-1} (x − µ) }.
Computationally, the efficient way to compute this distance measure is as \sqrt{z^T z}, where
z = Σ^{-1/2}(x − µ).
Note that it is not necessary to use a symmetric square root for this computation. Stan-
dardisation using the Cholesky factor leads to exactly the same distance, and is much
cheaper to compute and then solve than using a symmetric square root. Confirmation of
this fact is left as an exercise.
It is also worth noting that if the variance is defined empirically, as the sample variance
matrix S of a data matrix X, then the Mahalanobis distance can be calculated yet more
stably and efficiently using the QR factorisation of X.
Computing the Mahalanobis distance for each observation in a data set is the most
commonly used technique for identifying outlying observations in multivariate data.
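The command that computed the vector of distances d referred to in the discussion below is not included in this extract; a minimal sketch consistent with that discussion, using the Mahalanobis-transformed data Z constructed above, is:

d = sqrt(rowSums(Z * Z))   # Mahalanobis distance of each observation from the mean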
Figure 2.1: Scatterplot for the Galaxy data showing outliers identified by large Mahalanobis dis-
tance
Note that regular element-wise multiplication was used to square the elements in Z, rather
than matrix multiplication, which is not even valid here due to Z being non-square. A
histogram of the distances (not shown) can be obtained with hist(d,20), and this shows
a fairly uninteresting distribution, suggesting no particularly outlying values. There are a
few distances larger than 4, so it may be worth looking at those in more detail. We can
inspect the cases individually using
> galaxy[d>4,]
east.west north.south angle radial.position velocity
286 24.85322 49.84784 63.5 55.7 1531
287 23.82696 47.78950 63.5 53.4 1533
288 22.80071 45.73114 63.5 51.1 1539
330 -21.46211 -43.04634 63.5 -48.1 1642
331 -22.48837 -45.10469 63.5 -50.4 1616
and we can highlight them on a scatterplot using
> plot(galaxy$radial.position,galaxy$velocity,pch=19,col=4)
> points(galaxy$radial.position[d>4],galaxy$velocity[d>4],pch=19,col=2)
resulting in the plot shown in Figure 2.1. Although it is difficult to tell much from a single 2d
projection, it does appear as though the 5 values highlighted are indeed some way away
from the centre of the cloud of data points.
Note that once again, R has a built-in function for computing the (squared) Maha-
lanobis distance, so our vector of distances could have been constructed much more
straightforwardly using
d=sqrt(mahalanobis(galaxy,colMeans(galaxy),var(galaxy)))
A = UDVT ,
AT A = VD2 VT ,
UT U = (AVD−1 )T (AVD−1 )
= D−1 VT AT AVD−1
= D−1 VT VD2 VT VD−1
= D−1 D2 D−1
= Ip .
It must be emphasised that this is a mathematical construction to demonstrate the exis-
tence of the factorisation, and its relationship to the spectral decomposition, but this is not
how the factorisation is constructed in practice (and in any case, there are issues with this
construction when D is singular). There are very efficient methods which can be used for
this, often starting from a QR factorisation of A. Typically AT A is not constructed at all, as
this is a numerically unstable computation.
(AAT )U = UD2 ,
$u
[,1] [,2] [,3] [,4] [,5]
[1,] -0.051396600 -0.02254908 -0.45279536 -0.1122205 0.88274167
[2,] 0.208493672 0.01789604 -0.31495251 -0.8879502 -0.26183862
[3,] -0.002463535 0.99826125 -0.04727651 0.0346832 0.00551559
[4,] -0.182722226 -0.05000384 -0.82508240 0.3634307 -0.38893366
[5,] -0.959424462 0.01205694 0.11307185 -0.2562542 -0.03013096
$v
[,1] [,2] [,3] [,4] [,5]
[1,] -0.051396600 -0.02254908 -0.45279536 -0.1122205 0.88274167
[2,] 0.208493672 0.01789604 -0.31495251 -0.8879502 -0.26183862
[3,] -0.002463535 0.99826125 -0.04727651 0.0346832 0.00551559
[4,] -0.182722226 -0.05000384 -0.82508240 0.3634307 -0.38893366
[5,] -0.959424462 0.01205694 0.11307185 -0.2562542 -0.03013096
We see that it is exactly the same as the spectral decomposition obtained earlier, modulo
some sign changes on some of the columns, which are arbitrary.
Σ = VDV^T.
Then, for i ≠ j,
Cov(v_i^T X, v_j^T X) = v_i^T Cov(X, X) v_j = v_i^T Σ v_j = λ_j v_i^T v_j = 0.
So the eigenvectors form a basis for the space of linear combinations of X, and represent a collection of uncorrelated random quantities whose variances are given by their respective eigenvalues.
Formally we characterise principal components as linear combinations of variables
that explain the greatest variation in the random quantity or data set. However, we need
to place a constraint on the linear combination for this to be well defined. For example,
if αT X is a linear combination with variance σ 2 , then it is clear that kαT X has variance
k 2 σ 2 , so we can make the variance of a linear combination as large as we wish simply by
multiplying it by a large enough scalar. However, we are really interested in the direction
with largest variation, and so we impose the constraint kαk = 1, or αT α = 1. To maximise
subject to a non-linear constraint we use a Lagrange multiplier, and begin by optimising
f = αT Σα − λ(αT α − 1)
= αT (Σ − λ I)α + λ.
λ = λ1 , α = v1.
So the first eigenvector of the spectral decomposition represents the linear combination
of variables with the largest variance, and its variance is given by its corresponding eigen-
value. This is the first principal component.
The second principal component is defined to be the linear combination uncorrelated
with the first principal component which has the greatest variance. But since we have
seen that the linear combinations corresponding to the eigenvectors are uncorrelated, it
follows that the set of linear combinations corresponding to quantities uncorrelated with
the first principal component is spanned by the 2nd to pth eigenvectors, and hence the
maximum variance will be obtained at λ2 , corresponding to the 2nd eigenvector, v 2 . The
third principal component is the linear combination with greatest variance uncorrelated
with the first two principal components, and so on. So we see that there is a direct corre-
spondence between the spectral decomposition of the variance matrix and the principal
components of the random vector.
Var(Y) = Var(V^T X)
       = V^T Var(X) V
       = V^T VDV^T V
       = D.
So the components of Y are uncorrelated, and each has the correct variance, as required.
often known as the total variation of the random vector. The total variation has the de-
sirable property that it is preserved by orthogonal rotation. In particular, if we consider
transforming X to its principal components via
Y = VT X
we have
t(X) = Tr(Var(X))
     = Tr(VDV^T)
     = Tr(DV^T V)
     = Tr(D)
     = t(Y),
using the fact that Tr (AB) = Tr (BA) for square matrices.
One motivation for undertaking principal components analysis is to try and find low-
dimensional projections of the original variables which contain most of the information in
the original higher-dimensional vectors. One way of characterising the information is by
its variation. Since the principal components are ordered with most variable first, it makes
sense to consider the first q principal components (q ≤ p). The total variation of the first q principal components is clearly \sum_{i=1}^q λ_i. The proportion of variance explained is therefore
( \sum_{i=1}^q λ_i ) / ( \sum_{i=1}^p λ_i ) = (1/Tr(D)) \sum_{i=1}^q λ_i.
For a data matrix X, we proceed analogously, using the spectral decomposition of the sample variance matrix,
S = VDV^T.
Then all of the properties we described for random variables easily follow. For example,
the eigenvectors in the columns of V correspond to linear combinations of the variables
whose sample variance is the corresponding eigenvalue, and distinct eigenvectors corre-
spond to linear combinations of variables whose sample covariance is zero. So we can
define the first principal component of X to be the linear combination of variables whose
sample variance is largest (subject to the unit norm constraint), and it trivially follows that
this corresponds to the first eigenvector, v 1 , and its sample variance is λ1 . Similarly the
second principal component corresponds to v 2 , with sample variance λ2 , and so on.
We can obviously transform observations to their principal component representation
using
y = VT x
and so we can transform X to Y, the data rotated to the principal component axes, using
Y = XV.
However, in practice the data are usually centered before rotating, so in practice the trans-
formation
Y = Hn XV
is used.§ In the context of principal components analysis, the matrix of eigenvectors, V,
is often known as the loadings matrix, and the transformed observations in Y are often
known as component scores.
§
The usual caveat about not actually using the centering matrix to center the data applies.
Figure 2.2: Scatterplot for the Galaxy data showing the second principal component plotted against
the first
Figure 2.3: Screen grab of a 3d interactive plot of the first three principal components for the
galaxy data
require(rgl)
plot3d(Y[,1:3],col=galaxy$angle)
Since the first 3 principal components explain over 99% of the variation in the data, es-
sentially no information is lost in this 3d projection, and the ability to interact with the plot,
and drag it around to look at it from different angles gives a great deal of insight into the
structure of the data. A screen grab of this plot is shown in Figure 2.3.
We have investigated the principal components for this data by hand, but R has built-in
functions for carrying out principal components analysis. The classic function for carrying
out PCA was the princomp() function, which works similarly to how we have just been
doing things “by hand”.
> pca=princomp(galaxy)
> pca
Call:
princomp(x = galaxy)
Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
98.041939 38.235447 22.053993 8.286072 4.738194
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
CHAPTER 2. PCA AND MATRIX FACTORISATIONS 79
PCA may also be carried out directly from the SVD of the centred data matrix. Writing
W = Hn X,
and forming the SVD
W = UDV^T,
we see that
S = (1/(n−1)) W^T W
  = (1/(n−1)) VDU^T UDV^T
  = (1/(n−1)) VD^2 V^T
  = VΛV^T,
where
Λ = (1/(n−1)) D^2.
We have therefore constructed the spectral decomposition of S directly from the SVD
of W, and the loadings matrix of the principal components corresponds precisely to the
right singular vectors of W. We can use this loadings matrix together with the diagonal
matrix of eigenvalues, Λ just as if we had constructed S and then carried out a spectral
decomposition.
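As a sketch, we can confirm this numerically for the galaxy data by comparing the SVD of the centred data matrix with the spectral decomposition of S obtained earlier:

W = sweep(as.matrix(galaxy), 2, colMeans(galaxy))
s = svd(W)
Lambda = s$d^2 / (nrow(W) - 1)                        # eigenvalues of S
s$v                                                   # loadings (up to sign changes)
range(var(galaxy) - s$v %*% diag(Lambda) %*% t(s$v))  # numerically zero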
Rotation:
PC1 PC2 PC3 PC4
east.west -0.051396600 -0.02254908 0.45279536 -0.1122205
north.south 0.208493672 0.01789604 0.31495251 -0.8879502
angle -0.002463535 0.99826125 0.04727651 0.0346832
radial.position -0.182722226 -0.05000384 0.82508240 0.3634307
velocity -0.959424462 0.01205694 -0.11307185 -0.2562542
PC5
east.west 0.88274167
north.south -0.26183862
angle 0.00551559
radial.position -0.38893366
velocity -0.03013096
> pca$rotation
PC1 PC2 PC3 PC4
east.west -0.051396600 -0.02254908 0.45279536 -0.1122205
north.south 0.208493672 0.01789604 0.31495251 -0.8879502
angle -0.002463535 0.99826125 0.04727651 0.0346832
radial.position -0.182722226 -0.05000384 0.82508240 0.3634307
velocity -0.959424462 0.01205694 -0.11307185 -0.2562542
PC5
east.west 0.88274167
north.south -0.26183862
angle 0.00551559
radial.position -0.38893366
velocity -0.03013096

Figure 2.4: Scatterplot of the first two principal components of the nci microarray data
> dim(pca$x)
[1] 323 5
> plot(pca$x[,1],pca$x[,2],pch=19,col=galaxy$angle)
The final command again gives a plot similar to Figure 2.2. Since the SVD provides a more
numerically stable approach to PCA, use of the command prcomp() is recommended
unless there is a good reason not to.
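The commands used to produce Figure 2.5 are not included in this extract; a sketch along the following lines (assuming the zip.train matrix used later in the notes, with the digit label in the first column) would give a similar plot:

pca = prcomp(zip.train[, -1])
plot(pca$x[, 1], pca$x[, 2], pch = 19, cex = 0.3, col = zip.train[, 1] + 1)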
Figure 2.5: Scatterplot of the first two principal components of the zip.train digit image data
Figure 2.6: Images of the first 4 principal component loading vectors for the zip.train digit
data
2.6 Conclusion
Numerical linear algebra and matrix computations are the key tools needed to make sense
of multivariate data sets. The methods covered in this chapter provide the foundation for
much of the rest of the course.
Note that the standard reference on numerical linear algebra is:
Golub, G.H. and Van Loan, C.F. (1996) Matrix Computations, Johns Hopkins University Press.
Some of the examples in this chapter were based on corresponding examples in the
above text. In addition, see sections 3.2.3 (p.52) and 3.9 (p.93), as well as section 14.5
(p.534) of [HTF] for further details.
Chapter 3
Inference, the MVN and Multivariate Regression
E(S_X) = Σ.
Before demonstrating these results, it should be noted that under some additional (fairly
mild) assumptions it can be shown that the sample variance matrix is also a consistent
estimator of Σ, but we will not pursue this here.
Proof
1.
E(X̄) = E( (1/n) \sum_{i=1}^n X_i )
      = (1/n) \sum_{i=1}^n E(X_i)
      = (1/n) \sum_{i=1}^n µ
      = µ.
2.
Var(X̄) = Var( (1/n) \sum_{i=1}^n X_i )
        = (1/n^2) \sum_{i=1}^n Var(X_i)
        = (1/n^2) \sum_{i=1}^n Σ
        = (1/n) Σ.
3. Consistency is intuitively obvious, since 1. and 2. tell us that E(X̄) = µ and Var(X̄) → 0 as n → ∞. However, we can establish consistency more formally by applying Markov's inequality to ||X̄ − µ||^2. First note that
||X̄ − µ||^2 = (X̄ − µ)^T (X̄ − µ) = \sum_{i=1}^p (X̄_i − µ_i)^2 ≥ 0.
Now we have
E( ||X̄ − µ||^2 ) = E( \sum_{i=1}^p (X̄_i − µ_i)^2 )
                 = \sum_{i=1}^p E( (X̄_i − µ_i)^2 )
                 = \sum_{i=1}^p Var( X̄_i )
                 = \sum_{i=1}^p (1/n) Var(X_i)
                 = (1/n) Tr(Σ).
Then Markov's inequality tells us that for any a > 0 we have
P( ||X̄ − µ||^2 ≥ a ) ≤ Tr(Σ)/(na)
⇒ P( ||X̄ − µ||^2 < a ) ≥ 1 − Tr(Σ)/(na).
In particular, taking a = ε^2 for any ε > 0,
P( ||X̄ − µ||^2 < ε^2 ) ≥ 1 − Tr(Σ)/(nε^2)
⇒ P( ||X̄ − µ|| < ε ) ≥ 1 − Tr(Σ)/(nε^2) → 1 as n → ∞.
Now since
E( X_i − X̄ ) = µ − (1/n) \sum_{i=1}^n µ = 0,
we have
E[ (X_i − X̄)(X_i − X̄)^T ] = Var( X_i − X̄ )
                           = Var(X_i) − Cov(X_i, X̄) − Cov(X̄, X_i) + Var(X̄)
                           = Σ − (1/n)Σ − (1/n)Σ + (1/n)Σ
                           = ((n−1)/n) Σ,
and hence
E(S_X) = (1/(n−1)) \sum_{i=1}^n E[ (X_i − X̄)(X_i − X̄)^T ] = Σ.
These results now make explicit the relationship between the data summaries and
associated population quantities, and give us confidence that, at least in the case of
“large n”, our sample summary statistics should be “good” estimators of the corresponding
summaries for the population from which the data is sampled.
Proposition 22 (Derivatives wrt vectors and matrices) In each of the following results,
the derivative is of a scalar function, and is wrt a vector x or matrix X, where the compo-
nents of the vector or matrix are assumed to be algebraically independent.
1.
∂/∂x (a^T x) = ∂/∂x (x^T a) = a
∗
The results can all be found in The Matrix Cookbook, to which you are referred for further details and many other
interesting results and examples. Note that you are not expected to memorise these results, and their derivations are
not examinable in this course.
2.
∂/∂x (x^T A x) = (A + A^T) x,
and note that this reduces to 2Ax when A is symmetric.
3.
∂/∂X Tr(XA) = A^T
4.
∂/∂X Tr(X^T A X) = AX + A^T X,
and note that this reduces to 2AX when A is symmetric.
5.
∂/∂X Tr(X^T A X B) = ∂/∂X Tr(B X^T A X) = AXB + A^T X B^T,
and note that this reduces to 2AXB when both A and B are symmetric.
6.
∂/∂X |X| = |X| X^{-T}
7.
∂/∂X log|X| = X^{-T}
8.
∂/∂X (a^T X b) = a b^T
9.
∂/∂X (a^T X^{-1} b) = −X^{-T} a b^T X^{-T}
Now we have the necessary technical results, we can think about the regression problem.
yi = xi T β + εi , i = 1, 2, . . . , n.
We assume that the covariates are the rows of an n × p data matrix X, and so β is a
p-dimensional vector. We can write this in matrix form as
y = Xβ + ε.
Since we typically consider the case n > p, the presence of the error term ε is necessary,
otherwise the system would be over-determined, and then typically no β would exist in
order to solve the linear system
y = Xβ.
ε^T ε = (y − Xβ)^T (y − Xβ)
      = y^T y − 2 y^T Xβ + β^T X^T Xβ,
and differentiating with respect to β and equating to zero leads to the normal equations X^T Xβ = X^T y, whose solution is
β̂ = (X^T X)^{-1} X^T y,
but there are very efficient and numerically stable ways to compute this based on matrix decompositions such as the QR factorisation, as discussed in Section 2.4.2.
> lm(velocity ~ angle + radial.position, data=galaxy)
Call:
lm(formula = velocity ∼ angle + radial.position, data = galaxy)
Coefficients:
(Intercept) angle radial.position
1586.9882 0.1076 2.4515
or just using vectors and matrices, as
> lm(galaxy[,5] ∼ as.matrix(galaxy[,3:4]))
Call:
lm(formula = galaxy[, 5] ∼ as.matrix(galaxy[, 3:4]))
Coefficients:
(Intercept)
1586.9882
as.matrix(galaxy[, 3:4])angle
0.1076
as.matrix(galaxy[, 3:4])radial.position
2.4515
Note that lm() includes an intercept term by default, as this is usually what we want.
However, we can tell R not to include an intercept term by using 0+ at the beginning of
the covariate list, either using model notation
> lm(velocity ∼ 0+angle+radial.position,data=galaxy)
Call:
lm(formula = velocity ∼ 0 + angle + radial.position, data = galaxy)
Coefficients:
angle radial.position
16.163 3.261
or matrix notation:
> lm(galaxy[,5] ∼ 0+as.matrix(galaxy[,3:4]))
Call:
lm(formula = galaxy[, 5] ∼ 0 + as.matrix(galaxy[, 3:4]))
Coefficients:
as.matrix(galaxy[, 3:4])angle
16.163
as.matrix(galaxy[, 3:4])radial.position
3.261
We can compute the regression coefficients directly using a QR factorisation with
> QR=qr(as.matrix(galaxy[,3:4]))
> backsolve(qr.R(QR),t(qr.Q(QR))%*%galaxy[,5])
[,1]
[1,] 16.163269
[2,] 3.261159
The intercept term can be found by appending 1 to the front of X, as
> QR=qr(cbind(rep(1,323),galaxy[,3:4]))
> backsolve(qr.R(QR),t(qr.Q(QR))%*%galaxy[,5])
[,1]
[1,] 1586.9881822
[2,] 0.1075920
[3,] 2.4514815
So we see that least squares problems can be solved with a QR factorisation, a rotation,
and a backwards solve. This is the method that the lm() function uses.
y i = BT xi + εi , i = 1, 2, . . . n,
Y = XB + E,
where E is an n × q matrix of “errors”, and this is typically known as the general linear
model. The regression problem here is, given data matrices X and Y, to find the matrix B
which makes the Frobenius norm of the error matrix, ||E||, as small as possible. Note that
||E||^2 = Tr( E^T E ) = \sum_{i=1}^n \sum_{j=1}^q ε_{ij}^2,
and so this is again a least squares problem, where we choose B to minimise the sum of squares of the errors.
Minimising with respect to B leads, exactly as before, to the normal equations
X^T X B̂ = X^T Y.
Writing the QR factorisation X = QR, these can be solved stably and efficiently via
B̂ = R^{-1} Q^T Y.
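The simulation that generated the data for the output below is not shown in this extract. A sketch of the kind of code that would produce output of the same shape (the dimensions, seed and coefficient values here are assumptions, so the numbers will not match exactly) is:

set.seed(1)
n = 100
X = matrix(rnorm(n * 5), ncol = 5)               # five covariates
B = matrix(rnorm(15, 0, 0.2), ncol = 3)          # true 5 x 3 coefficient matrix
Y = X %*% B + matrix(rnorm(n * 3), ncol = 3)     # general linear model Y = XB + E
lm(Y ~ X)                                        # multivariate least squares fit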
Call:
lm(formula = Y ∼ X)
Coefficients:
[,1] [,2] [,3]
(Intercept) 0.005571 0.014090 -0.031221
X1 0.055359 0.011646 0.327704
X2 0.069211 0.163145 -0.031904
X3 -0.059708 -0.028433 -0.025860
X4 0.004978 -0.017083 -0.038912
X5 0.175285 0.160014 0.005762
Call:
lm(formula = Y ∼ 0 + X)
Coefficients:
[,1] [,2] [,3]
X1 0.0550853 0.0109542 0.3292373
X2 0.0694000 0.1636241 -0.0329654
X3 -0.0595292 -0.0279796 -0.0268643
X4 0.0034643 -0.0209107 -0.0304312
X5 0.1762379 0.1624230 0.0004232
We now solve directly using the QR factorisation, first without an intercept, and then with.
Note how the backsolve() function solves multiple RHSs exactly as we would wish.
> QR=qr(X)
> backsolve(qr.R(QR),t(qr.Q(QR))%*%Y)
[,1] [,2] [,3]
∂/∂B Tr( Σ^{-1} E^T E ) = −2 X^T Y Σ^{-1} + 2 X^T X B Σ^{-1},
where here we used results 3. and 5. of Proposition 22. Equating to zero gives the same
normal equations as previously, since the variance matrix cancels out. Therefore the
weighted solution is in fact the same as the unweighted solution, and appropriate irre-
spective of any covariance structure associated with the errors.
Definition 9 Let Z = (Z1 , Z2 , . . . , Zq )T , where the components Zi are iid N (0, 1). Then for
a p × q matrix A and p-vector µ, we say that
X = AZ + µ
has a multivariate normal distribution with mean µ and variance matrix Σ = AAT , and we
write
X ∼ N (µ, Σ).
There are several issues with this definition which require clarification. The first thing to
note is that the expectation and variance of X do indeed correspond to µ and Σ, using
properties of affine transformations that we considered in Chapter 1. The next concern
is whether the distribution is really characterised by its mean and variance, since there
are many possible affine transformations that will lead to the same mean and variance. In
order to keep things simple, we will begin by considering only invertible transformations
A. In this case, we must have p = q.
Proposition 23 The density of a p-dimensional MVN random vector X, with mean µ and
positive definite variance Σ is given by
f(x) = (2π)^{-p/2} |Σ|^{-1/2} exp{ −(1/2)(x − µ)^T Σ^{-1} (x − µ) }.
Proof
We choose any invertible p × p matrix A such that AAT = Σ, and put
X = AZ + µ.
We first need to know the probability density function for Z, which is clearly given by
φ(z) = \prod_{i=1}^p φ(z_i)
     = \prod_{i=1}^p (2π)^{-1/2} exp{ −z_i^2/2 }
     = (2π)^{-p/2} exp{ −(1/2) \sum_{i=1}^p z_i^2 }
     = (2π)^{-p/2} exp{ −(1/2) z^T z }.
Note that the final term is essentially the Mahalanobis distance. We can avoid direct
computation of |Σ| and Σ−1 by using an appropriate matrix decomposition. In the positive
definite case, the Cholesky decomposition usually provides the best solution, so suppose
that Σ has Cholesky decomposition
Σ = GGT ,
where G is lower triangular. To see how this helps, first consider the determinant term
(1/2) log|Σ| = log|Σ|^{1/2} = log|G|.
But since G is lower triangular, its determinant is the product of its diagonal elements, and so the log of this is given by
log|G| = \sum_{i=1}^p log g_{ii}.
Similarly, the quadratic form in the exponent may be written as (x − µ)^T Σ^{-1} (x − µ) = z^T z, where z = G^{-1}(x − µ), and hence may be found by forward solving the triangular system
Gz = (x − µ).
This is an efficient way to compute the log density, as it just requires a Cholesky decom-
position and a single forward solve.
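A minimal sketch of this computation in R (the function name and arguments below are illustrative, not from the notes):

ldmvn = function(x, mu, Sigma) {
  G = t(chol(Sigma))            # lower triangular, Sigma = G %*% t(G)
  z = forwardsolve(G, x - mu)   # solves G z = x - mu
  -0.5 * length(mu) * log(2 * pi) - sum(log(diag(G))) - 0.5 * sum(z * z)
}
ldmvn(c(0, 0), c(0, 0), diag(2))   # equals 2*dnorm(0, log=TRUE), i.e. -log(2*pi)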
Proposition 24 If X ∼ N(µ, Σ) and
Y = AX + b,
for a q × p matrix A and q-vector b, then Y ∼ N(Aµ + b, AΣA^T).
Proof
The mean and variance of Y are clear from our understanding of affine transforma-
tions of random vectors. The key to this result is understanding why Y is MVN. For this,
note that we need to show that Y is an affine transformation of a standard normal vector,
but since X is MVN, there must exist a matrix M such that
X = MZ + µ,
where Z is a standard normal vector. Then
Y = AX + b
= A(MZ + µ) + b
= (AM)Z + (Aµ + b),
and so Y is also an affine transformation of a standard normal vector, and hence MVN.
Now since marginal distributions correspond to linear transformations, it follows that all
marginal distributions of the MVN are (multivariate) normal.
Proposition 25 If X ∼ N(µ, Σ) is partitioned as
( X_1 )     (( µ_1 )  ( Σ_11  Σ_12 ))
( X_2 ) ∼ N (( µ_2 ), ( Σ_21  Σ_22 )),
then the marginal distribution of X_1 is given by
X_1 ∼ N(µ_1, Σ_11).
Proof
The result follows simply by noting that
X_1 = ( I  0 ) ( X_1 )
               ( X_2 ),
and using properties of linear transformations.
Similarly, univariate marginals may be computed using the fact that X_i = e_i^T X, giving
X_i ∼ N(µ_i, σ_ii).
Proof
(X_1 | X_2 = x_2) ∼ N( µ_{1|2}, Σ_{1|2} ),
where
µ_{1|2} = µ_1 + Σ_12 Σ_22^{-1} (x_2 − µ_2)   and   Σ_{1|2} = Σ_11 − Σ_12 Σ_22^{-1} Σ_21.
Note that the simple version of the result presented here assumes that Σ22 is invertible,
but it generalises straightforwardly to the case of singular Σ22 .
Proposition 28 For an MVN model, the maximum likelihood estimator of the mean, µ is
given by
µ̂ = x̄.
Proof
We proceed by differentiating the log-likelihood function wrt µ and equating to zero. First note that
∂l/∂µ = −(1/2) \sum_{i=1}^n ∂/∂µ (x_i − µ)^T Σ^{-1} (x_i − µ).
Now
∂/∂µ (x_i − µ)^T Σ^{-1} (x_i − µ) = ∂/∂µ ( x_i^T Σ^{-1} x_i − 2µ^T Σ^{-1} x_i + µ^T Σ^{-1} µ )
                                  = −2Σ^{-1} x_i + 2Σ^{-1} µ
                                  = 2Σ^{-1}(µ − x_i),
so that
∂l/∂µ = Σ^{-1} \sum_{i=1}^n (x_i − µ).
Equating to zero gives \sum_{i=1}^n (x_i − µ̂) = 0, and hence µ̂ = x̄.
Proposition 29 For an MVN model, the maximum likelihood estimator of the variance
matrix, Σ is given by
n−1
Σ̂ = S.
n
Consequently, the maximum likelihood estimator is biased, but asymptotically unbiased,
and consistent.
Proof
We will proceed by using matrix differentiation to find a matrix Σ which maximises the
likelihood.
∂l/∂Σ = −(n/2) ∂/∂Σ log|Σ| − (1/2) \sum_{i=1}^n ∂/∂Σ (x_i − µ)^T Σ^{-1} (x_i − µ)
      = −(n/2) Σ^{-1} + (1/2) \sum_{i=1}^n Σ^{-1} (x_i − µ)(x_i − µ)^T Σ^{-1}
      = −(n/2) Σ^{-1} + (1/2) Σ^{-1} [ \sum_{i=1}^n (x_i − µ)(x_i − µ)^T ] Σ^{-1}.
Equating to zero gives Σ̂ = (1/n) \sum_{i=1}^n (x_i − µ)(x_i − µ)^T, and plugging in the maximum likelihood estimate µ̂ = x̄ gives
Σ̂ = (1/n) \sum_{i=1}^n (x_i − x̄)(x_i − x̄)^T = ((n−1)/n) S,
as required.
Y = XB + E,
y i = B T x i + εi ,
and we now make the modelling assumption εi ∼ N (0, Σ), and we assume for now that
Σ is known. We have already found the least squares estimate for B, but given the ad-
ditional distribution assumption we are now making, we can now construct the maximum
likelihood estimate of B. We first construct the likelihood as
L(B) = \prod_{i=1}^n (2π)^{-p/2} |Σ|^{-1/2} exp{ −(1/2) ε_i^T Σ^{-1} ε_i }
     = (2π)^{-np/2} |Σ|^{-n/2} exp{ −(1/2) \sum_{i=1}^n ε_i^T Σ^{-1} ε_i }
     = (2π)^{-np/2} |Σ|^{-n/2} exp{ −(1/2) Tr( E Σ^{-1} E^T ) }
     = (2π)^{-np/2} |Σ|^{-n/2} exp{ −(1/2) Tr( Σ^{-1} E^T E ) },
so that the log-likelihood is
l(B) = −(np/2) log 2π − (n/2) log|Σ| − (1/2) Tr( Σ^{-1} E^T E ).
Since the only dependence on B in this log-likelihood is via E, it is clear that maximising
this function wrt B is equivalent to minimising
Tr Σ−1 ET E .
But we have already seen that minimising this function wrt B leads to the normal equa-
tions, and the usual least squares solution
B̂ = (XT X)−1 XT Y.
So again, the natural estimator that we have already derived without making any distri-
butional assumptions may be re-interpreted as a maximum likelihood estimator in cases
where an assumption of multivariate normality is felt to be appropriate. Note further that
since the solution we obtain is independent of Σ, it also represents the maximum likeli-
hood estimate of B in the case of unknown Σ.
See 3.2 (p.44) and 3.7 (p.84) of [HTF] for further discussion of the topics discussed in
this Chapter.
Chapter 4
Cluster Analysis and Unsupervised Learning
4.1 Introduction
4.1.1 Motivation
When we did PCA for the microarray data at the end of Chapter 2, we saw in Figure 2.4
that some of the samples appeared to “cluster” together into groups. This is something
that we can quite naturally do “by eye” in one or two dimensions, by examining the relative
“closeness” of groups of points, but it is not so obvious how to formalise the process,
or understand how it generalises to arbitrary data sets in p-dimensional space. Cluster
analysis is a formalisation of this process which can be applied to any set of observations
where some measure of “closeness” of observations can be defined. If one associates a
label with each group or cluster, then a clustering algorithm assigns a group label to each
observation in a data set. In the context of data mining and machine learning, cluster
analysis is considered to be an example of unsupervised learning. This is because it is
a mechanism for allocating observations to groups that does not require any training set,
which provides examples of labelled observations which can be used to learn about the
relationship between observations and labels. There are different approaches to cluster
analysis. Some approaches assign observations to a pre-specified number of groups,
and others hierarchically combine observations into small clusters and then successively
into bigger clusters until the number of clusters is reduced to the desired number. Of
course, a priori it is not always clear how many clusters will be appropriate, but methods
exist to address this issue, too.
1. dij ≥ 0, ∀i, j;
2. dii = 0, ∀i;
3. dij = 0 ⇒ xi = xj .
Occasionally the l_1 norm, often known as Manhattan distance or the "taxi cab" norm,
d(x_i, x_j) = 1_p^T |x_i − x_j| = \sum_{k=1}^p |x_{ik} − x_{jk}|,
is used. In some applications it also makes sense to use the Mahalanobis distance,
d_{ij} = \sqrt{ (x_i − x_j)^T S^{-1} (x_i − x_j) },
(or the square of this), where S is the sample variance matrix of X.
If not all variables are real-valued, then usually an overall distance function d(·, ·) is
constructed as a weighted sum of distances defined appropriately for each variable, as
d(x_i, x_j) = \sum_{k=1}^p w_k d_k(x_{ik}, x_{jk}),
though there are obviously other possibilities. The weights, wk , could be chosen to be
equal, but this probably only makes sense if the distances for the different variables are
comparable, and all variables are felt to be equally important. Otherwise the weights
should be chosen to ensure that each variable is weighted appropriately in the overall
distance function. See section 14.3 (p.501) of [HTF] for further details.
distance (though in fact, it will work for observations belonging to any inner product space
with distance defined using the induced norm), with the mean of the ith cluster given by
x̄_i = (1/n_i) \sum_{j=1}^{n_i} x_{ij},   i = 1, 2, . . . , k.
These are the k-means referred to in the name of the algorithm. At this point it is helpful
to note that we have a sum-of-squares decomposition for the distances of observations
from their centroid exactly analogous to that used in one-way ANOVA.
The overall sum-of-squares distance for the data can therefore be orthogonally decom-
posed into a “within groups” and “between groups” sum of squares.
Proof
\sum_{i=1}^k \sum_{j=1}^{n_i} ||x_{ij} − x̄||^2
  = \sum_{i=1}^k \sum_{j=1}^{n_i} ||x_{ij} − x̄_i + x̄_i − x̄||^2
  = \sum_{i=1}^k \sum_{j=1}^{n_i} [ ||x_{ij} − x̄_i||^2 + ||x̄_i − x̄||^2 + 2(x̄_i − x̄)^T (x_{ij} − x̄_i) ]
  = \sum_{i=1}^k \sum_{j=1}^{n_i} ||x_{ij} − x̄_i||^2 + \sum_{i=1}^k \sum_{j=1}^{n_i} ||x̄_i − x̄||^2 + 2 \sum_{i=1}^k \sum_{j=1}^{n_i} (x̄_i − x̄)^T (x_{ij} − x̄_i)
  = \sum_{i=1}^k \sum_{j=1}^{n_i} ||x_{ij} − x̄_i||^2 + \sum_{i=1}^k n_i ||x̄_i − x̄||^2 + 2 \sum_{i=1}^k (x̄_i − x̄)^T \sum_{j=1}^{n_i} (x_{ij} − x̄_i)
  = \sum_{i=1}^k \sum_{j=1}^{n_i} ||x_{ij} − x̄_i||^2 + \sum_{i=1}^k n_i ||x̄_i − x̄||^2,
since \sum_{j=1}^{n_i} (x_{ij} − x̄_i) = 0.
An effective clustering algorithm will make the within-group sum-of-squares SS_W as small as possible (conditional on k), and so equivalently, will make SS_B as large as possible (since SS_TOT is fixed). This, then, is precisely the goal of k-means clustering: to
allocate observations to clusters to minimise SSW . This initially seems like a simple task:
just consider every possible partition of the n observations into k clusters and choose the
allocation resulting in the smallest value of SSW . Unfortunately this is not practical for
any data set of reasonable size, as the number of possible partitions of the data grows combinatorially. For example, for n = 100 and k = 5 the number of possible partitions is around 10^68. Clearly a more practical solution is required. The k-means algorithm is one
possible approach.
1. Choose initial values for the k cluster means, m_1, m_2, . . . , m_k (for example, k randomly chosen observations from the data set).
2. Allocate each observation to the cluster whose mean m_i is closest.
3. Set each m_i to be the sample mean of observations allocated to that cluster, that is, set m_i = x̄_i, i = 1, 2, . . . , k.
4. If the allocation of observations to clusters changed in step 2., return to step 2.; otherwise stop.
To understand why the k-means algorithm works, first note that once we have com-
pleted step 2. the first time through the loop, we have a set of means and an allocation of
each observation to a cluster, and so we can consider the quantity
SS = \sum_{i=1}^k \sum_{j=1}^{n_i} ||x_{ij} − m_i||^2,
and note that at the end of step 3., each time through the loop, the value of SS coincides
with SSW for the particular cluster allocation. But the k-means algorithm minimises SS
with respect to both the choice of the mi and the allocation of observations to clusters,
and hence minimises SSW .
To see this, first consider step 3. We have a value of SS from the end of step 2., but
in step 3 we alter the mi by replacing them with the cluster means x̄i . Now note that the
expression
\sum_{j=1}^{n_i} ||x_{ij} − m_i||^2
is minimised with respect to mi by choosing mi = x̄i , and hence step 3 has the effect
of minimising SS with respect to the mi conditional on the allocation of observations to
clusters. So step 3 cannot increase the value of SS, and any change in any mi will
correspond to a decrease in the value of SS.
Now assuming that the algorithm has not converged, the algorithm will return to step
2. Again note that there will be a value of SS from the end of step 3. (corresponding to
SSW for the cluster allocation). Step 2. will have the effect of leaving some observations
assigned to the same cluster, and moving some observations from one cluster to another.
Consider an observation x that is moved from cluster i to cluster j. It is moved because
||x − m_j|| < ||x − m_i||, and so thinking about the effect of this move on SS, it is clear that
the increase in SS associated with cluster j is less than the decrease in SS associated
with cluster i, and so the overall effect of the move is to decrease SS. This is true of every
move, and so the overall effect of step 2 is to decrease SS.
Since both steps 2. and 3. have the effect of decreasing SS, and SS is bounded below
by zero, the algorithm must converge to a local minimum, and this corresponds to a local
minimum of SSW . However, it must be emphasised that the algorithm is not guaranteed
to converge to a global minimum of SSW , and there is always the possibility that the
algorithm will converge to a local minimum that is very poor compared to the true global
minimum. This is typically dealt with by running the whole algorithm to convergence
many times with different random starting means and then choosing the run leading to
the smallest value of SSW .
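The commands behind Figure 4.1 are not included in this extract; a sketch that would produce a similar plot (assuming the nci matrix used elsewhere in the notes, with samples as columns, and taking k = 4 with multiple random restarts) is:

km = kmeans(t(nci), centers = 4, iter.max = 50, nstart = 20)
pca = prcomp(t(nci))
plot(pca$x[, 1], pca$x[, 2], pch = km$cluster, col = km$cluster)
text(pca$x[, 1], pca$x[, 2], colnames(nci), cex = 0.3, pos = 3)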
Figure 4.1: k-means clustering (with k = 4) overlaid on a scatterplot of the first two principal
components of the nci microarray data
Choice of k
So far we haven’t discussed the choice of k. In some applications it will be clear from the
context exactly what k is most appropriate, and in many others, it will be clear “roughly”
what values of k should be considered. But in many cases a particular choice of k will
not be given a priori, and some indication from the data of what value might be most
appropriate can be very useful.
In practice, the choice of k is handled by running the k-means algorithm for several
values of k in the range of interest, and examining the way in which SSW decreases as k
increases (it should be clear that the minimum SSW should decrease monotonically with
increasing k). Ideally we look for some kind of "kink", or at least, levelling off of the sum of squares as k increases, indicating diminishing returns from increasing k.
Figure 4.2: SS_W against k for k-means clustering of the nci microarray data
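The kmeans() command whose run time is discussed below is not included in this extract; it was presumably something along the following lines (k = 10 and the control settings are assumptions, based on the loop shown later):

km = kmeans(zip.train[, -1], 10, iter.max = 30, nstart = 10)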
This command is quite slow to run due to the size of the data set. When it is finished we can examine the
content of the clusters as illustrated in the following R session.
> table(zip.train[,1][km$cluster==1])
0 1 2 3 4 5 6 7 8 9
7 3 46 73 5 71 2 4 433 7
> table(zip.train[,1][km$cluster==2])
1 2 3 4 6 7 8 9
1001 3 1 42 9 3 11 5
> table(Cluster=km$cluster,Digit=zip.train[,1])
Digit
Cluster 0 1 2 3 4 5 6 7 8 9
1 7 3 46 73 5 71 2 4 433 7
2 0 1001 3 1 42 0 9 3 11 5
3 524 0 3 2 0 15 42 0 2 0
4 0 0 13 2 5 3 0 445 1 61
5 512 0 12 1 1 7 29 0 4 0
6 15 0 34 546 0 296 1 0 30 2
7 7 0 565 12 14 9 36 2 10 0
8 10 0 40 8 440 40 10 23 10 125
9 119 0 6 2 6 110 533 0 3 0
10 0 1 9 11 139 5 2 168 38 444
From the results of this analysis, we see that the cluster labelled 1 represents mainly the digit 8, together with a few other digits, especially 3 and 5, which are fairly similar to an 8. The cluster labelled 2 mainly represents the digit 1. Cluster 3 consists mainly of
the digit 0, but also contains a number of 6s. Note that the cluster labels are completely
arbitrary, and re-running the analysis will (hopefully) lead to similar clusters, but typically
with a different permutation of the labels. On the basis of this very superficial analysis, it
does appear that the digits do largely occupy different regions of R256 , and so this gives
some encouragement that one might be able to automatically classify digits given just the
images (we will look at classification in more detail in the next chapter).
The following code fragment shows how to look at SSW as a function of k.
kmax=15
wss=numeric(kmax)
for (i in 1:kmax) {
print(i)
km=kmeans(zip.train[,-1],i,iter.max=30,nstart=10)
wss[i]=km$tot.withinss
}
plot(1:kmax,wss,type="l",xlab="k",ylab="SS_W",lwd=2,col=2)
The resulting plot (not shown), perhaps surprisingly, fails to show any especially notice-
able kink or flattening off at k = 10. In fact, this is just symptomatic of the general difficulty
of fixing k purely on the basis of the data. More sophisticated model-based clustering pro-
cedures provide principled methods for choosing k, but in most real data scenarios the
evidence for one value of k over another is often very weak. See section 14.3.6 (p.509)
of [HTF] for further details.
1. Start by regarding each of the n observations as a cluster of size one, and compute the matrix of pairwise distances, d_ij, between the clusters.
2. Find the minimum distance, d_ij (if there is more than one minimum, pick one at random).
3. Combine clusters C_i and C_j into a single cluster, remove the ith and jth rows and columns from the distance matrix, and add a new row and column corresponding to the new cluster by calculating the distances between the new cluster and the remaining clusters.
4. Repeat steps 2. and 3. until all observations belong to a single cluster.
To put this algorithm into practice, we need ways of computing the distance between
clusters. One of the most commonly used methods, and arguably the most intuitive, is to
define the distance between two clusters to be the minimum distance between observa-
tions in the clusters, that is, for clusters A and B we define
d_AB = min_{i∈A, j∈B} d_ij.
This definition leads to the so-called single-linkage clustering method, often known as
nearest-neighbour clustering. Alternatively, we can define the distance between clusters
to be the maximum distance between observations in the clusters, that is
d_AB = max_{i∈A, j∈B} d_ij.
This definition leads to the complete-linkage clustering method, often known as furthest-
neighbour clustering. Both of these methods have the property that they are invariant to
any monotone transformation of the distance matrix. A further possibility (which does not
have this monotonicity property), is to use group average clustering, where
d_AB = (1/(n_A n_B)) \sum_{i∈A} \sum_{j∈B} d_ij.
As a simple worked example, consider five observations, labelled A to E, whose pairwise distances are given in the following table:
A 0
B 7 0
C 4 1 0
D 6 4 6 0
E 8 9 3 2 0
A B C D E
The minimum (off-diagonal) distance is clearly d(B, C) = 1,
and so we eliminate the rows and columns for B and C, then
add a new final row for the new cluster {B, C} as follows:
A 0
D 6 0
E 8 2 0
{B, C} 4 4 3 0
A D E {B, C}
Here the new distances have been calculated using the single-
linkage minimum distance rule. Looking at the new table, we
see that the smallest off-diagonal distance is d(D, E) = 2, and
so we remove D and E and add the new row {D, E} as fol-
lows:
A 0
{B, C} 4 0
{D, E} 6 3 0
A {B, C} {D, E}
Inspecting this new table, we see that the minimum distance
is given by d({B, C}, {D, E}) = 3, and this leads to the final
table
A 0
{B, C, D, E} 4 0
A {B, C, D, E}
The final step combines the remaining two clusters into a sin-
gle cluster consisting of all observations.
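We can quickly check the worked example using hclust() (the distance matrix below is the one given in the first table above):

m = matrix(0, 5, 5, dimnames = list(LETTERS[1:5], LETTERS[1:5]))
m[lower.tri(m)] = c(7, 4, 6, 8, 1, 4, 9, 6, 3, 2)   # fill the lower triangle, column by column
d = as.dist(m)
hc = hclust(d, method = "single")
hc$height       # merge heights 1, 2, 3, 4, exactly as in the example
plot(hc)        # gives the dendrogram of Figure 4.3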
We can display the results of a clustering process in a tree diagram known as a den-
drogram by successively joining each pair of clusters in the order they were merged in the
clustering algorithm, and using the vertical axis to mark the distance between the clusters
at the stage they were merged. The result of this for the single-linkage clustering is shown
in Figure 4.3.
We now re-do the clustering procedure using the complete-linkage method, based on
maximum distance.
Figure 4.3: Dendrogram for the single-linkage clustering of the worked example (hclust, method "single")
Figure 4.4: Dendrogram for the complete-linkage clustering of the worked example (hclust, method "complete")
Figure 4.5: Dendrogram for a single-linkage hierarchical clustering algorithm applied to the nci
microarray data
A clustering with a specified number of clusters can then be obtained by cutting the tree with the cutree() function, for example
ct=cutree(hc,3)
to obtain 3 clusters. Alternatively, the clustering can be obtained by specifying the height
at which the tree is to be cut, for example
ct=cutree(hc,h=2.5)
Just as for k-means, ad hoc methods are used to justify the number of clusters. Often
people try to look for a “break” in the tree, where there is a big jump in distance before the
next merging, but very often no such obvious gap is present in practice.
Figure 4.6: Dendrogram for a complete-linkage hierarchical clustering algorithm applied to the nci microarray data
Figure 4.7: Dendrogram for an average-linkage hierarchical clustering algorithm applied to the nci microarray data
Figure 4.8: Four clusters obtained from a complete-linkage hierarchical clustering algorithm over-
laid on the first two principal components of the nci microarray data
method tends to give fairly spherical clusters, and is generally to be preferred, which is
why it is the default method in R. Other methods tend to try and interpolate between these
two extremes, including the average-linkage method.
To obtain an actual cluster allocation, we can cut the tree at 4 clusters and overlay the
results on the first two principal components, just as we did for the k-means procedure.
hc=hclust(dist(t(nci)))
ct=cutree(hc,4)
pca=prcomp(t(nci))
plot(pca$x[,1],pca$x[,2],pch=ct,col=ct)
text(pca$x[,1],pca$x[,2],colnames(nci),cex=0.3,pos=3)
This gives rise to the plot shown in Figure 4.8. The clustering perhaps looks unexpected,
but is difficult to interpret, since the clustering was carried out on (a distance matrix de-
rived from) the full observation vectors, but we are looking at the clusters on a simple 2d
projection.
Hierarchical clustering can be used to order a set of observations in such a way that
nearby observations are typically more closely related than observations that are far away.
This is exploited by the heatmap() function that we examined briefly in Chapter 1. Con-
sider now the following command
heatmap(nci,Rowv=NA,labRow=NA,col=grey((15:0)/15),cexCol=0.3)
which produces the plot shown in Figure 4.9. The use of the option Rowv=NA suppresses
the clustering of the rows (which is very time consuming and not particularly useful),
but since we have not explicitly suppressed clustering of the columns, this default action
is executed, and ensures that the columns are ordered meaningfully. Inspection of the
classification labels at the bottom of the columns suggests that the clustering agrees well
with the manually obtained diagnoses.
Figure 4.9: A heatmap of the nci microarray data with columns ordered according to a hierarchical clustering
Chapter 5
Discrimination and Classification
5.1 Introduction
In the previous chapter we considered the rather ambitious unsupervised learning task of
automatically grouping observations together into clusters of “similar” observations. Often
however, we have some a priori idea of what kinds of groups we expect to see (or at
least, some examples of observations that have already been classified into pre-specified
groups), and we want to use our current understanding of the group structure in order
to automatically classify new observations which have not yet been assigned to a group.
This is the statistical classification problem, and relies on partitioning the sample space
so as to be able to discriminate between the different groups. Data that has already been
classified into groups is often known as a training set, and methods which make use of
such data are often known as supervised learning algorithms.
Binary classification
The case k = 2 is of particular interest, as it represents a very commonly encountered
example in practice, and is known as binary classification. In this case one is typically
interested in deciding whether a particular binary variable of interest is “true” or “false”.
Examples include deciding whether or not a patient has a particular disease, or deciding
if a manufactured item should be flagged as potentially faulty.
In this binary case we will have 2 group means, which we can label µ1 and µ2. Under
our simple classification rule we allocate x to group 1 provided that ‖x − µ1‖ < ‖x − µ2‖,
otherwise we allocate to group 2. But then
$\|x - \mu_1\|^2 < \|x - \mu_2\|^2 \iff (\mu_1 - \mu_2)^T\left\{x - \frac{1}{2}(\mu_1 + \mu_2)\right\} > 0,$
so the boundary between the two groups is the hyperplane passing through the mid-point
$\frac{1}{2}(\mu_1 + \mu_2)$ and orthogonal to $\mu_1 - \mu_2$.
It is then clear that the region Ri will be an intersection of half-spaces, and hence a convex
polytope.
Note that we are implicitly relying on linearity of expectation here to appropriately trans-
form the group means. We can write this rule differently by noting that
That is, we will allocate to the group whose mean is closest when measured according to
Mahalanobis distance. Again, this rule simplifies considerably as
which again passes through the mid-point ½(µ1 + µ2), but is now the hyperplane orthogonal to Σ⁻¹(µ1 − µ2). This classification rule is known as Fisher's linear discriminant, and
is one of the most fundamental results in classification theory.
Computational considerations
Computationally, it is often most efficient to Mahalanobis transform the data and then carry
out nearest group mean classification. This can be accomplished using the Cholesky
factorisation of Σ.
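As an illustration (this sketch is not taken from the notes), the idea can be written in a few lines of R, assuming a data matrix X whose rows are observations, a matrix mu whose rows are the k group means, and a common variance matrix Sigma; the function name mah.classify is hypothetical.

# Sketch: Mahalanobis transformation via the Cholesky factorisation Sigma = t(R) %*% R,
# followed by nearest-group-mean classification in the transformed space.
mah.classify <- function(X, mu, Sigma) {
  R <- chol(Sigma)                               # upper triangular Cholesky factor
  Z <- t(backsolve(R, t(X), transpose=TRUE))     # transformed observations
  M <- t(backsolve(R, t(mu), transpose=TRUE))    # group means, transformed the same way
  d2 <- sapply(1:nrow(M), function(i) rowSums(sweep(Z, 2, M[i,])^2))
  max.col(-d2)                                   # allocate each row to its nearest transformed mean
}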
Note that in the high-dimensional case, p ≫ k, there is an additional simplification
which can arise, due to the fact that the k group means must all lie in a (k − 1)-dimensional
subspace of ℝ^p. Since the rule is to allocate to the closest group mean, all data may be
projected down to this (k − 1)-dimensional space, and the classification may be carried out
in ℝ^{k−1}, since any components orthogonal to this subspace contribute the same distance
to each of the group means, and therefore cancel out of the classification rule. When
k ≪ p this dimension reduction can result in significant computational savings.
The natural basis vectors for this (k − 1)-dimensional space are known as discrimina-
tion coordinates, or crimcoords, but their derivation and use is beyond the scope of this
course.
and so the boundary between the two groups will correspond to a contour of a quadratic
form.
Note that although the above approach to constructing a quadratic discrimination rule
is attractive, it does not penalise groups with large variances sufficiently. We will see later
how to improve on this quadratic rule by attempting to take a more principled model-based
approach.
$Q_i(x) = f_i(x), \quad i = 1, 2, \ldots, k.$
5.3.1 LDA
The simplest case of maximum likelihood discrimination arises in the case where it is as-
sumed that observations from all groups are multivariate normal, and all share a common
variance matrix, Σ. That is, observations from group i are assumed to be iid N(µi, Σ)
random variables. In other words,
$Q_i(x) = f_i(x) = (2\pi)^{-p/2}\,|\Sigma|^{-1/2}\exp\left\{-\frac{1}{2}(x - \mu_i)^T\Sigma^{-1}(x - \mu_i)\right\}.$
Since maximising Qi(x) is equivalent (taking logs and dropping terms not depending on i)
to minimising (x − µi)ᵀΣ⁻¹(x − µi), this is just classification by minimising Mahalanobis
distance, and hence corresponds to Fisher's linear discriminant, which allocates to group i if
$(\mu_i - \mu_j)^T\Sigma^{-1}\left\{x - \frac{1}{2}(\mu_i + \mu_j)\right\} > 0, \quad \forall j \neq i.$
So in the case of equal variance matrices, the MLDR corresponds exactly to Fisher’s
linear discriminant.
Note that this is a quadratic form in x, but is different to the quadratic discriminant func-
tion we derived heuristically, due to the presence of the log |Σi | term, which has the effect
of appropriately penalising the distances associated with groups having large variances.
This penalty term, which corresponds to the normalisation constant of the density, gen-
erally improves the performance of the classifier, and hence is the form typically used in
QDA.
$x^T(\Sigma_1^{-1} - \Sigma_2^{-1})x + 2(\mu_2^T\Sigma_2^{-1} - \mu_1^T\Sigma_1^{-1})x + \mu_1^T\Sigma_1^{-1}\mu_1 - \mu_2^T\Sigma_2^{-1}\mu_2 + \log\frac{|\Sigma_1|}{|\Sigma_2|} < 0.$
Here we can see explicitly that the quadratic term does not cancel out, and that the bound-
ary between the two classes corresponds to the contour of a quadratic form.
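As an aside (not part of the original notes), the MASS package provides qda() for fitting a quadratic discriminant rule of this kind from data; a minimal sketch, assuming a data matrix X and a vector of group labels grp:

require(MASS)            # qda() lives in MASS, like the lda() function used later
X.qda <- qda(X, grp)     # estimates group means and separate variance matrices
predict(X.qda, X)$class  # allocations under the fitted quadratic discriminant rule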
For LDA, a single pooled estimate of variance is required. The unbiased pooled estimate
of Σ is
$S_W = \frac{1}{n-k}\sum_{i=1}^{k}(n_i - 1)S_i,$
and this can be plugged in for Σ. The corresponding MLE is
$\hat{\Sigma} = \frac{n-k}{n}\,S_W,$
but note that here it doesn’t matter what divisor is used, since it will cancel out of Fisher’s
linear discriminant. Also note that it is not appropriate to use the overall sample variance
matrix, S, since the means within the groups are different. The discriminant function
becomes
$Q_i(x) = -(x - \bar{x}_i)^T S_W^{-1}(x - \bar{x}_i).$
In this case, Fisher's linear discriminant for k = 2 takes the form: allocate to group 1 if
$(\bar{x}_1 - \bar{x}_2)^T S_W^{-1}\left\{x - \frac{1}{2}(\bar{x}_1 + \bar{x}_2)\right\} > 0.$
Figure 5.1: Simulated data with two different classification boundaries overlaid
shift1=matrix(rep(mu1,each=20),ncol=2)
shift2=matrix(rep(mu2,each=20),ncol=2)
shift=rbind(shift1,shift2)
X=W+shift
Class=rep(1:2,each=20)
We can plot the data, and overlay the decision boundaries we computed earlier as follows.
plot(X,pch=Class,col=Class)
abline(0,3,col=2)
abline(-1/8,13/4,col=3)
This gives the plot shown in Figure 5.1.
We can compute the sample means and pooled variance estimate, and then construct
the normal for Fisher’s discriminant rule using:
> xbar1=colMeans(X[Class==1,])
> xbar2=colMeans(X[Class==2,])
> V1=var(X[Class==1,])
> V2=var(X[Class==2,])
> V=(19*V1+19*V2)/(40-2)
> normal=solve(V,xbar1-xbar2)
> normal
[1] -6.41715 1.76622
We can then also use the lda() function from the MASS package to automate the process,
as
> require(MASS)
> X.lda=lda(X,Class,prior=c(0.5,0.5))
> X.lda
Call:
Group means:
1 2
1 -0.6744606 2.554725
2 2.1881006 1.346963
5.4 Misclassification
Obviously, whatever discriminant functions we use, we will not characterise the group
of interest perfectly, and so some future observations will be classified incorrectly. An
obvious way to characterise a classification scheme is by some measure of the degree of
misclassification associated with the scheme. For a fully specified model, we can define
Ideally we would like this to be Ik , but in practice the best we can hope for is that the
diagonal elements are close to one, and that the off-diagonal elements are close to zero.†
One way of characterising the overall quality of a classification scheme would be using
Tr (P) /k, with larger values of this quality score representing better schemes, and a max-
imum value of 1 indicating a perfect classification scheme.
For simple fully specified models and classification schemes, it may be possible to
compute the pij directly, analytically (see homework assignment). For models where the
parameters have been estimated from data, it may be possible to estimate p̂ij , the prob-
ability obtained from the model by plugging in MLEs for the model parameters. Even
where this is possible, it is likely to lead to over-estimates of the diagonal elements and
under-estimates of the off-diagonal elements, due to over-fitting, and ignoring sampling
variability in the parameter estimates.
Where there is no underlying model, or the model or classification scheme is complex,
it is quite likely that it will not be possible to compute P directly. In this case we can obtain
empirical estimates of the pij directly from data. The plug-in method estimates the pij by
re-using the data originally used to derive the classification rules. So, if we define nij to
be the number of class j observations allocated to class i (so that $n_j = \sum_{i=1}^{k} n_{ij}$), we can estimate $p_{ij}$ by
$\hat{p}_{ij} = \frac{n_{ij}}{n_j}.$
That is, we estimate the probability that a class j observation is classified as class i by
the observed proportion of class j observations that are classified as class i. Although
this method is very simple, and very widely used, it also leads to over-optimistic estimates
of the quality of the classification scheme by testing the classifier on the data used in its
construction (over-fitting).
For example, for the simulated data considered above, the plug-in approach gives the confusion matrix
   1  2
1 20  0
2  0 20
that is, every training observation is allocated to its own class, an over-optimistic assessment of the scheme.
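A table of this form can be obtained by cross-tabulating predicted against actual classes. A minimal sketch for the lda() fit X.lda from earlier, re-predicting on the training data (which is exactly the plug-in approach):

tab <- table(Predicted = predict(X.lda, X)$class, Actual = Class)
tab                                # plug-in confusion counts n_ij
tab2 <- t(t(tab)/colSums(tab))     # column proportions, the estimates of p_ij
sum(diag(tab2))/2                  # the Tr(P)/k quality score, here with k = 2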
A more reliable way of estimating the performance of a classifier in the case of very
large n is to split the data set (randomly) into two, and then use one half of the data
set to train the classifier and the other half to test the performance of the classifier on
independent data.
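For the small simulated (X, Class) example above, a random split might look like the following sketch (the 50/50 split and the seed are arbitrary choices, and with so few observations the resulting estimates would be very noisy):

set.seed(1)                                # arbitrary seed, for reproducibility
idx <- sample(nrow(X), nrow(X)/2)          # indices of a random training half
train.lda <- lda(X[idx,], Class[idx], prior=c(0.5,0.5))
table(Predicted = predict(train.lda, X[-idx,])$class,
      Actual    = Class[-idx])             # performance on the held-out half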
> tab=table(Predicted=predict(zip.lda,zip.test[,-1])$class,Actual=zip.test[,1])
> tab
Actual
Predicted 0 1 2 3 4 5 6 7 8 9
0 342 0 7 3 1 6 1 0 5 0
1 0 251 2 0 4 0 0 1 0 0
2 0 0 157 3 6 0 3 0 2 0
3 4 2 4 142 0 16 0 2 11 0
4 3 5 12 3 174 3 3 7 7 4
5 1 0 2 9 0 125 3 0 4 0
6 5 3 1 0 2 0 157 0 0 0
7 0 0 1 1 2 0 0 129 0 5
8 3 1 12 4 1 5 3 1 135 3
9 1 2 0 1 10 5 0 7 2 165
> tab2=t(t(tab)/colSums(tab))
> round(tab2,digits=2)
Actual
Predicted 0 1 2 3 4 5 6 7 8 9
0 0.95 0.00 0.04 0.02 0.00 0.04 0.01 0.00 0.03 0.00
1 0.00 0.95 0.01 0.00 0.02 0.00 0.00 0.01 0.00 0.00
2 0.00 0.00 0.79 0.02 0.03 0.00 0.02 0.00 0.01 0.00
3 0.01 0.01 0.02 0.86 0.00 0.10 0.00 0.01 0.07 0.00
4 0.01 0.02 0.06 0.02 0.87 0.02 0.02 0.05 0.04 0.02
5 0.00 0.00 0.01 0.05 0.00 0.78 0.02 0.00 0.02 0.00
6 0.01 0.01 0.01 0.00 0.01 0.00 0.92 0.00 0.00 0.00
7 0.00 0.00 0.01 0.01 0.01 0.00 0.00 0.88 0.00 0.03
8 0.01 0.00 0.06 0.02 0.00 0.03 0.02 0.01 0.81 0.02
9 0.00 0.01 0.00 0.01 0.05 0.03 0.00 0.05 0.01 0.93
> sum(diag(tab2))/10
[1] 0.8749542
Here the predictive performance improves only slightly (two additional ’2’s were correctly
classified), due to the fact that the groups are approximately equally likely. However, in
cases where groups are very unbalanced, estimation of prior probabilities can make a
large difference to classification performance.
Using either method we see that we are able to automatically classify around 90% of
digit images correctly using a simple LDA classifier. This is probably not good enough
to be usable in practice, but nevertheless illustrates that simple linear algebraic methods
can be used to solve very high-dimensional classification problems well.
can also be estimated from data. This has the effect of introducing additional terms into
the discriminant functions. Bayesian classifiers typically perform much better than heuristic
methods, or methods based on the MLDR, but are beyond the scope of this course.
See section 4.3 (p.106) of [HTF] for further information about techniques for classifi-
cation.
Chapter 6
Graphical modelling
6.1 Introduction
Variables of multivariate distributions and data sets are often correlated in complex ways.
It would be convenient if many variables were independent, leading to a sparse indepen-
dence structure, as this would lead to many simplifications of much multivariate theory
and methodology. Unfortunately variables tend not to be independent in practice. It is
more often the case that variables are conditionally independent. Although not quite as
strong a property as independence, this too turns out to be enough to lead to consider-
able simplification of a general dependence structure, and is useful for both conceptual
and theoretical understanding, and also for computational efficiency. Sparse conditional
independence structures can be represented using graphs, and this leads to the theory
of graphical models. But before we explore graphical models, we must first ensure that
we understand the notion of conditional independence.
capturing the intuitive notion that the conditional distribution of X given Y = y should not
depend on y. That is, the conditional and marginal distributions should be the same.
It turns out that independence is a special case of the more general notion of con-
ditional independence. For events A, B and C, we say that A and B are conditionally
independent given C, and write A⊥⊥B|C, if
$P(A|B \cap C) = \frac{P(A \cap B|C)}{P(B|C)} = P(A|C),$
and so conditional on C, learning the outcome of B does not change the probability of A.
This notion again generalises easily to (continuous) random quantities, so that for
random variables X, Y and Z, we have that X and Y are conditionally independent given
Z, written X⊥⊥Y |Z, if the conditional density of X and Y given Z = z factorises as
$f_{X,Y|Z}(x,y|z) = f_{X|Z}(x|z)\,f_{Y|Z}(y|z).$
There are numerous other important properties of this factorisation for the joint density of
X, Y and Z, $f_{X,Y,Z}(x,y,z)$.
This factorisation property turns out to be key to understanding directed acyclic graph
(DAG) models.
Proof
Proposition 32 If X⊥⊥Y |Z and fZ(z) > 0, then the joint density factorises as
$f_{X,Y,Z}(x,y,z) = \frac{f_{X,Z}(x,z)\,f_{Y,Z}(y,z)}{f_Z(z)}.$
This factorisation property turns out to be important for understanding undirected graphi-
cal models.
Proof
Proposition 33 Assuming that fZ(z) > 0, we have that X⊥⊥Y |Z if and only if the joint
density factorises in the form
$f_{X,Y,Z}(x,y,z) = h(x,z)\,k(y,z),$
for some functions h(·, ·) and k(·, ·).
Proof
First note that the forward implication follows from Proposition 32. That is, if X⊥⊥Y |Z,
then there exist functions h(·, ·) and k(·, ·) such that
$f_{X,Y,Z}(x,y,z) = h(x,z)\,k(y,z).$
For example, we could choose h(x, z) = fX,Z (x, z) and k(y, z) = fY,Z (y, z)/fZ (z), but there
are obviously many other possibilities.
The reverse direction is less clear. We assume that we have the factorisation $f_{X,Y,Z}(x,y,z) = h(x,z)\,k(y,z)$ for some functions h(·, ·) and k(·, ·).
Proposition 34
1. X⊥⊥Y |Z ⇔ Y ⊥⊥X|Z
2. X⊥⊥Y |Z ⇒ h(X)⊥⊥Y |Z
Proof
The proofs of these results are mainly straightforward, and left as an exercise.
A graph G = (V, E) is said to be complete if every pair of distinct vertices is joined by an edge, that is, if E = V₂, the set of all unordered pairs of distinct vertices.
Example
Consider the graph G = (V, E) where V = {A, B, C} and E = {{A, C}, {B, C}}. We can
draw a pictorial representation of the graph as given in Figure 6.1. Note that this graph is
not complete, since the edge {A, B} is missing. Also note that we can plot this graph in
R using the ggm package, using the following commands
require(ggm)
drawGraph(UG(~A*C+B*C))
Note that the drawGraph() function allows moving nodes around using a very simple
point-and-click interface. The command plotGraph() can also be used, and has a more
sophisticated interface, but may not be available on all systems.
Figure 6.1: The graph G = (V, E) with V = {A, B, C} and E = {{A, C}, {B, C}}
F = {{u, v} ∈ E|u, v ∈ U }.
Example
For the graph considered in the previous example, nodes A and C are adjacent (A ∼ C),
but nodes A and B are not. Similarly, the subgraph G{A,C} is complete, but the subgraph
G{A,B} is not.
Example
Again, considering again the previous example, the graph G has two cliques, {A, C} and
{B, C}.
Example
The previous example is somewhat uninteresting in that the cliques correspond with
edges. So consider now the graph G = (V, E) where V = {D, E, F, G} and
E = {{D, E}, {D, F}, {E, F}, {E, G}}.
Draw the graph and write down the cliques. How many cliques are there?
Figure 6.2: The undirected graph with cliques {V, W, X}, {W, X, Y} and {W, X, Z}
The cliques are {D, E, F } and {E, G}. There are two cliques.
Note that the set of cliques defines the graph. That is, complete knowledge of all
cliques of the graph allows construction of the graph.
Example
Consider the graph G which has three cliques: {V, W, X}, {W, X, Y } and {W, X, Z}. We
can construct and plot this graph in R using
drawGraph(UG(~V*W*X+W*X*Y+W*X*Z))
leading to the plot shown in Figure 6.2.
the edge structure to encode the conditional independence structure. This is useful for
both visualisation and computational analysis. It turns out that there are several different
ways that we can do this, but they all turn out to be equivalent in practice in most cases.
Given a set of random variables X1 , X2 , . . . , Xn , we can define an associated graph
G = (V, E), where V = {X1 , X2 , . . . , Xn }, and E is a set of edges.
Definition 10 We say that G has the factorisation property (F) if the joint density of the
random variables factorises in the form
$f_X(x) = \prod_{c \in C} \phi_c(x_c),$
for some functions φc (·), where C denotes the set of cliques associated with the graph G.
Definition 11 We say that G has the global Markov property (G) if for any disjoint vertex
subsets A, B and C such that C separates A and B in G we have A⊥⊥B|C.
Definition 12 We say that G has the local Markov property (L) if for all v ∈ V we have
v⊥⊥V \ cl(v)|bd(v).
Definition 13 We say that G has the pairwise Markov property (P) if for all u, v ∈ V such
that u ≁ v we have
u⊥⊥v|V \{u, v}.
We have just presented four different ways that one can associate a graph with a con-
ditional independence structure of a set of random variables. If these different interpreta-
tions all turn out to be fundamentally different, this is potentially confusing. Fortunately, it
turns out that these different interpretations are all closely linked.
This important result tells us that if a graph satisfies the factorisation property (F), then
the other properties all follow.
Proof
Let’s start with the simplest result, (G) ⇒ (L): This is trivial, since by definition, bd(v)
separates v from V \ cl(v).
Now let us consider (L) ⇒ (P):
and so Ã⊥⊥B̃|C. Now using property 2 of Proposition 34 we conclude first that A⊥⊥B̃|C,
since A ⊆ Ã and then that A⊥⊥B|C, since B ⊆ B̃. That is, (G) is satisfied.
This important (and difficult) result is stated without proof. However, the implication is that
in most practical cases, all four interpretations of a graph of conditional independence
structures are equivalent, and may be considered interchangeably.
Corollary 1 If all densities are positive, then (F) ⇔ (G) ⇔ (L) ⇔ (P).
Writing the density in this way allows us to partition x, µ and Q, and derive the conditional
distribution of the first part of x given the rest.
Then
$(X_1 | X_2 = x_2) \sim N(\mu_{1|2},\, Q_{11}^{-1}),$
where
$\mu_{1|2} = \mu_1 - Q_{11}^{-1} Q_{12}(x_2 - \mu_2).$
Proof
$f(x_1|x_2) \propto f(x)$
$\propto \exp\left\{-\frac{1}{2}(x - \mu)^T Q (x - \mu)\right\}$
$\propto \exp\left\{-\frac{1}{2}\left[(x_1 - \mu_1)^T Q_{11}(x_1 - \mu_1) + (x_2 - \mu_2)^T Q_{22}(x_2 - \mu_2) + 2(x_1 - \mu_1)^T Q_{12}(x_2 - \mu_2)\right]\right\}$
$\propto \exp\left\{-\frac{1}{2}\left[x_1^T Q_{11} x_1 - 2 x_1^T Q_{11}\mu_1 + 2 x_1^T Q_{12}(x_2 - \mu_2)\right]\right\}$
$\propto \exp\left\{-\frac{1}{2}\left[x_1^T Q_{11} x_1 - 2 x_1^T Q_{11}\mu_{1|2}\right]\right\}$
$\propto \exp\left\{-\frac{1}{2}(x_1 - \mu_{1|2})^T Q_{11}(x_1 - \mu_{1|2})\right\}.$
Using this result, it is straightforward to see why zeroes of Q correspond to pairwise con-
ditional independence statements.
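As a quick numerical illustration (not from the notes), the proposition can be applied directly to a precision matrix in R; the 3 × 3 matrix Q, the zero mean and the conditioning value x2 below are all arbitrary choices:

Q <- matrix(c(2,-1,0, -1,2,-1, 0,-1,2), 3, 3)   # an arbitrary positive definite precision matrix
mu <- c(0,0,0)
p1 <- 1                                         # condition the first p1 components on the rest
Q11 <- Q[1:p1, 1:p1, drop=FALSE]
Q12 <- Q[1:p1, -(1:p1), drop=FALSE]
x2 <- c(1,2)                                    # observed value of the remaining components
mu.cond  <- mu[1:p1] - solve(Q11, Q12 %*% (x2 - mu[-(1:p1)]))   # conditional mean mu_{1|2}
var.cond <- solve(Q11)                                          # conditional variance Q_11^{-1}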
First consider the case q12 = 0 (= q21). This is without loss of generality, since we can always re-order the
variables to ensure that the zero of interest is in this position. Partition X as
$X = \begin{pmatrix} \boldsymbol{X}_1 \\ \boldsymbol{X}_2 \end{pmatrix}, \quad \text{where } \boldsymbol{X}_1 = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \text{ and } \boldsymbol{X}_2 = \begin{pmatrix} X_3 \\ X_4 \\ \vdots \\ X_p \end{pmatrix}.$
Then X 1 |X2 = x2 has variance Q11 −1 , which must be diagonal, since Q11 is. Now this
means that X1 and X2 are conditionally uncorrelated, but for the MVN, uncorrelated and
independent are equivalent. Consequently X1 ⊥⊥X2 |(X3 , . . . , Xp ), and so q12 = 0 leads
directly to this CI statement. In general, we have the following result.
Proposition 38 If X ∼ N(µ, Q⁻¹), then for i ≠ j we have
$q_{ij} = 0\ (= q_{ji}) \iff X_i \perp\!\!\!\perp X_j \,|\, X \setminus \{X_i, X_j\}.$
Therefore, there is a direct correspondence between the zero structure of Q and an undi-
rected graph with the pairwise Markov property (P), where zeroes in Q correspond to
missing edges in the graph. Further, provided that Q (or equivalently, the variance matrix,
Σ) is strictly positive definite, the densities are all positive, and we can conclude that all of
our properties (F), (G), (L) and (P) are satisfied by the associated graph.
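A small hand-rolled sketch of this correspondence in R (assuming a variance matrix Sigma is available; the result should agree with what the ggm function parcor() used below computes):

parcor.from.Q <- function(Q) {
  # partial correlation between i and j given the rest is -q_ij / sqrt(q_ii q_jj)
  P <- -Q / sqrt(diag(Q) %o% diag(Q))
  diag(P) <- 1
  P
}
# e.g. parcor.from.Q(solve(Sigma)); zeroes correspond to missing edges in the CI graph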
Figure 6.3: Pairs plot of the marks data, showing the variables mechanics, vectors, algebra, analysis and statistics
data(marks)     # exam marks data supplied with the ggm package
pairs(marks)    # scatterplot matrix of the five subject marks
giving the plot shown in Figure 6.3. The plot shows a general positive correlation between
marks in different subject areas (unsurprisingly), and this can be confirmed by inspecting
the sample variance matrix for the data.
> Sigma=var(marks)
> round(Sigma,digits=2)
mechanics vectors algebra analysis statistics
mechanics 305.69 127.04 101.47 106.32 117.49
vectors 127.04 172.84 85.16 94.67 99.01
algebra 101.47 85.16 112.89 112.11 121.87
analysis 106.32 94.67 112.11 220.38 155.54
statistics 117.49 99.01 121.87 155.54 297.76
There is clearly no indication that any of these variables are marginally independent. To
investigate conditional independence structure, we inspect the partial correlation matrix.
> PCorr=parcor(Sigma)
> round(PCorr,digits=2)
mechanics vectors algebra analysis statistics
mechanics 1.00 0.33 0.23 0.00 0.03
vectors 0.33 1.00 0.28 0.08 0.02
algebra 0.23 0.28 1.00 0.43 0.36
analysis 0.00 0.08 0.43 1.00 0.25
statistics 0.03 0.02 0.36 0.25 1.00
Figure 6.4: Possible conditional independence graph describing the exam marks data

We see immediately that, to 2dps, the partial correlation between mechanics and analysis
is zero. However, it is also clear that the sample partial correlations between mechanics
and statistics, and between vectors and statistics, are also very small, and probably not
significantly different from zero. In fact, even the partial correlation between vectors and
analysis is smaller than 0.1. If we arbitrarily threshold the partial correlations at a value of
0.1, then we immediately see the clique structure in the adjacency matrix:
> adj=1*(abs(PCorr)>0.1)
> adj
mechanics vectors algebra analysis statistics
mechanics 1 1 1 0 0
vectors 1 1 1 0 0
algebra 1 1 1 1 1
analysis 0 0 1 1 1
statistics 0 0 1 1 1
and so we can plot the corresponding conditional independence graph with
drawGraph(adj)
giving the plot shown in Figure 6.4. What we see is that algebra separates mechanics and
vectors from analysis and statistics. It is not that (say) ability in mechanics and statistics
is uncorrelated, but that they are uncorrelated given ability in algebra. That is, although
knowing how good someone is at mechanics is useful for predicting how good someone
is at statistics, if we already knew how good they were at algebra, learning how good
they are at mechanics will give us no additional information about how good they are at
statistics.
Obviously, there exist a range of statistical methods for estimating the variance and
precision matrix of a GGM conditional on a given CI structure, and also methods for
statistical testing of partial correlations, and methods for searching model space in a
principled way. Unfortunately we do not have time to explore these methods in detail in
this course, but note that the ggm package includes functions such as fitConGraph()
and pcor.test() which can be useful in this context.
Figure 6.5: Possible conditional independence graph describing the galaxy data
See section 17.3 (p.630) of [HTF] for further details of estimating undirected GGMs
from data.
But these factorisations are different, and in the context of conditionally specified statisti-
cal models, one of these choices may be much more natural or convenient than another.
In many applications, a particular choice of factorisation will be needed, and so it is useful
to have a method of encoding a particular factorisation and associated conditional inde-
pendence statements in a formal way. This turns out to be very convenient using directed
acyclic graphs (DAGs).
Figure 6.6: A DAG with vertices X, Y and Z
Definition 14 We say that the DAG G has the recursive factorisation or directed factori-
sation (DF) property if the joint density of the random variables factorises in the form
$f(x) = \prod_{v \in V} f(v \,|\, \mathrm{pa}(v)).$
Models constructed in this way are sometimes referred to as Bayesian networks, due to
their occurrence in many applications of Bayesian inference.
Definition 15 We say that the DAG G has the directed local (DL) Markov property if for
all v ∈ V we have
v⊥⊥ nd(v)|pa(v).
Y ⊥⊥(X, Z)|Z,
or equivalently,
Y ⊥⊥X|Z.
Proof
We assume (DF), and pick an arbitrary u ∈ V . We wish to show that u⊥⊥ nd(u)|pa(u).
First partition V = u ∪ nd(u) ∪ de(u), and write our factorisation in the form
$f(x) = f(u \,|\, \mathrm{pa}(u)) \prod_{v \in \mathrm{nd}(u)} f(v \,|\, \mathrm{pa}(v)) \prod_{v \in \mathrm{de}(u)} f(v \,|\, \mathrm{pa}(v)).$
We are not interested in the descendants of u, so we can marginalise those away to leave
$f(u \cup \mathrm{nd}(u)) = f(u \,|\, \mathrm{pa}(u)) \prod_{v \in \mathrm{nd}(u)} f(v \,|\, \mathrm{pa}(v)).$
Proof
Start with an ordering of the nodes x1, x2, . . . , xp, then factorise the joint density in the form
$f(x) = f(x_1) f(x_2|x_1) \cdots f(x_p|x_1,\ldots,x_{p-1}) = \prod_{i=1}^{p} f(x_i|x_1,\ldots,x_{i-1}).$
Corollary 2 The moral graph G m associated with a DAG G satisfying (DL) or (DF) satisfies
the Markov properties (G), (L) and (P).
Note that none of the above discussion relies on the positivity assumption necessary for
the Hammersley-Clifford theorem. In summary, one way to understand the conditional
independence assumptions associated with a DAG is to form the associated moral graph,
and then read off conditional independence assumptions from the moral graph according to any of the
interpretations (G), (L) or (P). Note that this process loses information, in that not all of
the (conditional) independence statements associated with the original DAG model are
present in the corresponding moral graph, but we will not pursue this issue in detail here.
Example: collider
At the beginning of this section we considered three different factorisations corresponding
to the conditional independence statement X⊥⊥Y |Z. Each of these factorisations corre-
sponds to a graph with 3 nodes and 2 directed edges, with Z in the middle. There is a
graph with two edges with Z in the middle we have not considered, known as a collider.
It is the graph with edges X → Z and Y → Z. We can draw it using R with
drawGraph(DAG(Z~X+Y))
giving the plot shown in Figure 6.7.
Figure 6.7: The collider DAG, with edges X → Z and Y → Z
Note that this DAG corresponds to the factorisation
$f_{X,Y,Z}(x,y,z) = f_X(x)\,f_Y(y)\,f_{Z|X,Y}(z|x,y).$
There is no (non-trivial) CI statement associated with this DAG. In particular, since X and
Y are both parents of Z, they get “married” in forming the moral graph, and so the moral
graph for this DAG is complete. However, it should be noted that this factorisation and the
corresponding DAG do encode the marginal independence of X and Y . So here X and Y
are marginally independent, but not conditionally independent given Z. It is not possible
to encode this information in an undirected graphical model. It is therefore possible to
encode information in DAG models that is not possible to encode in undirected graphs.
Example
Draw the DAG for the random variables X1 , X2 , X3 and X4 given the following factorisation
of the joint density,
Draw the associated moral graph. Is it true that X1 ⊥⊥X2 |X3 ? Is it true that X2 ⊥⊥X4 |X3 ?
Example
Write down the factorisation of the full joint density implied by the following DAG:
[The DAG has vertices X1, . . . , X6 and edges X1 → X4, X2 → X4, X2 → X5, X3 → X5, X4 → X6, X5 → X6.]
$f(x) = f(x_1)\,f(x_2)\,f(x_3)\,f(x_4|x_1,x_2)\,f(x_5|x_2,x_3)\,f(x_6|x_4,x_5)$
graphical Gaussian case we assume that the variables are bivariate normal, and the two
factorisations are indistinguishable from data. In general, the best strategy is often to
estimate an undirected graph structure from data, and then explore the set of directed
graphs consistent with this undirected graph, informally.
6.6 Conclusion
Graphical models are one of the most important tools in modern multivariate statistical
modelling and analysis. In particular, they are central to Bayesian hierarchical modelling,
and many modern Bayesian computational methods. We do not have time to explore
these ideas here, but the basic properties of undirected and directed graphs, and graphical Gaussian models, form a foundation for further study in this area.
Note that the standard reference on the theory of graphical models is:
Some of the material in this chapter was derived from the above text.
Chapter 7
Variable selection and multiple testing
Consider the linear regression model
$y = X\beta + \varepsilon.$
Minimising the least squares loss $L(\beta) = \|y - X\beta\|^2$ leads to the normal equations
$X^T X\hat{\beta} = X^T y.$
When n ≥ p and X has full column rank, there is a unique solution given by
$\hat{\beta} = (X^T X)^{-1} X^T y.$
However, if X does not have full column rank (for example, because some of the variables
are co-linear), or p > n, then XT X will not be invertible, and there will be many solutions
to the normal equations, corresponding to many different minimisers of the loss function
L(β). Even if X is full rank, if it is close to rank degenerate, then solution of the normal
equations may be numerically unstable, and the “optimal” β may turn out to have poor
predictive properties.
Regularisation and variable selection are two ways to deal with the above problem,
which turn out to be closely related. Regularisation is concerned with smoothing the
loss function, ensuring it has a unique minimum, and “shrinking” the solution towards a
sensible default, often zero. Regularisation methods are often termed shrinkage methods
for this reason. We begin by looking at the classical method of regularising regression
problems, often known as ridge regression, or Tikhonov regularisation.
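Although the formal derivation is not reproduced here, it is worth recording the standard form of the ridge criterion, which is consistent with the loss function and the direct solve used in the R session below: for a fixed penalty λ ≥ 0, ridge regression minimises
$L(\beta) = \|y - X\beta\|^2 + \lambda\|\beta\|^2,$
and setting the gradient to zero gives the regularised normal equations
$(X^T X + \lambda I)\hat{\beta}_\lambda = X^T y, \qquad \hat{\beta}_\lambda = (X^T X + \lambda I)^{-1} X^T y,$
which always have a unique solution, since $X^T X + \lambda I$ is strictly positive definite for λ > 0.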
Call:
lm(formula = y ∼ X)
Coefficients:
(Intercept) Xeast.west Xnorth.south Xangle
1589.4229 0.7741 -3.1918 0.1245
Xradial.position
0.9012
As usual, we can append a column of 1s to the front of X and then compute the least
squares solution either using lm() or by solving the normal equations directly.
> X0=cbind(rep(1,length(y)),X)
> lm(y ~ 0+X0)
Call:
lm(formula = y ∼ 0 + X0)
Coefficients:
X0 X0east.west X0north.south
X0angle
1589.4229 0.7741 -3.1918
0.1245
X0radial.position
0.9012
> solve(t(X0)%*%X0,t(X0)%*%y)
[,1]
1589.4229455
east.west 0.7740997
north.south -3.1917877
angle 0.1245414
radial.position 0.9011797
> QR=qr(X0)
> solve(qr.R(QR),t(qr.Q(QR))%*%y)
[,1]
1589.4229455
east.west 0.7740997
north.south -3.1917877
angle 0.1245414
radial.position 0.9011797
We can avoid the complications associated with intercepts by first centering the output
and the predictor matrix.
> y=y-mean(y)
> W=sweep(X,2,colMeans(X))
> solve(t(W)%*%W,t(W)%*%y)
[,1]
east.west 0.7740997
north.south -3.1917877
angle 0.1245414
radial.position 0.9011797
>
> QR=qr(W)
> solve(qr.R(QR),t(qr.Q(QR))%*%y)
[,1]
east.west 0.7740997
north.south -3.1917877
angle 0.1245414
radial.position 0.9011797
We can now carry out ridge regression (with λ = 100) by direct solution of the regularised
normal equations
> solve(t(W)%*%W+100*diag(4),t(W)%*%y)
[,1]
east.west 0.7646416
north.south -3.1881190
angle 0.1244684
radial.position 0.9059122
We see that the predictor is very similar to the usual least squares solution, but the addi-
tion of a diagonal term will ensure that the numerical solution of the equations will be very
stable. Note that we could also compute the optimal predictor by numerically optimising
the loss function.
> loss<-function(beta)
+ {
+ eps=y-W%*%beta      # residual vector for the centred data
+ return(sum(eps*eps)+lambda1*sum(abs(beta))+lambda2*sum(beta*beta))   # RSS + l1 + l2 penalties
+ }
> lambda1=0
> lambda2=0
> optim(rep(0,4),loss,control=list(maxit=10000,reltol=1e-12))
$par
[1] 0.7740986 -3.1917888 0.1245412 0.9011769
$value
[1] 288650.8
$counts
function gradient
537 NA
$convergence
[1] 0
$message
NULL
Note that in the case lambda1=0, the loss is exactly as we require for ridge regression.
The other term in this loss will be explained in the following section. Note that direct
numerical optimisation of multivariate functions is fraught with difficulty, and so whenever
a direct alternative exists, it is almost always to be preferred. In this case, direct solution
of the regularised normal equations is a much better way to compute the solution.
where
$\|\beta\|_1 = \sum_{i=1}^{p} |\beta_i|$
is the l1 -norm of β. The switch from 2-norm to 1-norm seems at first to be unlikely to make
a significant difference to the shrinkage behaviour of the estimator, but in fact it does, and
importantly, encourages “sparsity” in the optimal solution. It also complicates analysis
somewhat, since it is no longer possible to compute a simple closed-form solution for the
optimal β, and since the l1 -norm is not everywhere differentiable (it is not differentiable
on the coordinate axes), methods of optimisation which assume smooth functions cannot
be used. In fact, it is precisely this lack of differentiability of the l1 -norm on the axes
which causes the method to often have minima including components of β which are
exactly zero. Zeroes in the optimal β correspond to variables which are dropped out
of the regression, and similarly, the non-zero elements of β correspond to the selected
variables. Thus, the simple switch from an l2 to an l1 regularisation penalty leads directly
to a method which simultaneously regularises and selects variables for analysis.
Clearly for λ = 0 and for small values of λ > 0, the method will behave very like
ordinary least squares regression. However, as the value of λ > 0 increases, the effect of
the l1 penalty takes effect, and variables of weak predictive effect begin to drop out of the
optimal predictor.
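The output below comes from a call of the same form as the optim() call used above for ridge regression, but with a positive l1 penalty and lambda2=0; the particular value of lambda1 used in the original session is not shown, so the value in this sketch is an arbitrary illustrative choice.

lambda1=10000     # arbitrary illustrative l1 penalty (assumed value)
lambda2=0         # no ridge penalty, so loss() is the lasso objective
optim(rep(0,4),loss,control=list(maxit=10000,reltol=1e-12))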
$value
[1] 713727.1
$counts
function gradient
469 NA
$convergence
[1] 0
$message
NULL
Note that for lambda2=0, the loss is exactly what we need for the Lasso. Note that the first
coefficient is very close to zero, so the first variable has dropped out of the regression. As
already explained, direct numerical optimisation is problematic, especially for non-smooth
objectives such as this. Fortunately there is an R package called elasticnet which has
efficient procedures for optimising loss functions of this form. We can use it to solve this
problem as follows.
> require(elasticnet)
> predict(enet(X,y,lambda=lambda2,normalize=FALSE),s=lambda1,mode="penalty",type="coefficients")$coefficients
east.west north.south angle radial.position
0.000000000 -2.836045617 0.007405521 1.104725175
Note how the value for the ridge loss term (here, 0) is passed into the call to enet(), but
the coefficient of the l1 penalty is only needed for the call to the generic function predict
(). This is due to the way that the algorithm is implemented, in that solutions for all values
of the l1 penalty are computed simultaneously. We can exploit this in order to understand
the effect of varying the l1 penalty over a range of values. The command
> plot(enet(X,y,lambda=lambda2,normalize=FALSE))
gives the plot in Figure 7.1. The left hand side of this plot shows the results of using a
very large shrinkage parameter which shrinks all of the regression coefficients to zero.
Figure 7.1: A graphical illustration of the effect of varying the LASSO shrinkage parameter on the
coefficients of the optimal predictor.
The RHS shows the effect of using a zero shrinkage parameter, leading to the usual co-
efficients for ordinary least squares. Moving from left to right, we see additional variables
being incorporated into the optimal predictor as the l1 penalty is gradually relaxed.
The (naive) elastic net combines the two penalties, minimising the regularised loss
$L(\beta) = \|y - X\beta\|^2 + \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|^2,$
which is exactly the form of the loss() function defined earlier. Clearly in the special case λ1 = 0 we get
ridge regression, and in the case λ2 = 0 we get the lasso. So the elastic net is a generalisation of both
ridge regression and the lasso, and combines the best aspects of both.
> predict(enet(X,y,lambda=lambda2,normalize=FALSE),s=lambda1,mode="penalty",type="coefficients")$coefficients/(1+lambda2)
east.west north.south angle radial.position
0.000000 -2.834275 0.007378 1.104903
Note the division by (1 + λ2 ) at the end. If we do not impose this correction, we get
> predict(enet(X,y,lambda=lambda2,normalize=FALSE),s=lambda1,mode="penalty",type="coefficients")$coefficients
east.west north.south angle radial.position
0.000000 -286.261800 0.745178 111.595172
These coefficients are a correction to those of the naive elastic net we have presented, and the corrected
version is thought to have better predictive performance in some scenarios.
7.1.5 p ≫ n
In some settings we have many more variables than observations, p ≫ n. We have seen
that in this case there is not a unique solution to the ordinary least squares problem. In
this case regularisation is used in order to make the problem well-defined. We have seen
that ridge regression, or Tikhonov regularisation, is very effective at making the problem
solvable, and shrinking the solution towards the origin. Some kind of shrinkage is vital in
the p ≫ n scenario, and loss functions containing an l2 penalty are very often used for
this purpose. Variable selection is very often desirable in high dimensional scenarios, but
the Lasso does not perform especially well for p ≫ n due to the lack of l2 regularisation.
In this case, the elastic net, which combines l1 and l2 regularisation, performs much better,
allowing for a combination of shrinkage and variable selection which can lead to sparse
predictors with good performance.
Figure 7.2: A graphical illustration of the effect of varying the l1 shrinkage parameter on the
coefficients of the optimal predictor for the nci microarray data.
lambda1=10
lambda2=100
nci.enet=enet(X,y,lambda=lambda2,normalize=FALSE)
predict(nci.enet,s=lambda1,mode="penalty",type="coefficients")$coefficients/(1+lambda2)
plot(nci.enet)
This leads to the plot shown in Figure 7.2. We see many predictors being brought in as
the l1 penalty is relaxed. In practice, we will most likely decide on a sensible number of
predictors, and then choose λ1 appropriately. Alternatively, we could keep back some test
data and choose the parameters via cross-validation.
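There is no single recipe for this in the notes, but a minimal hold-out sketch using only enet() and predict() as called above might look as follows (the 50/50 split, the seed and the grid of s values are all arbitrary assumptions):

set.seed(1)
idx <- sample(nrow(X), nrow(X)/2)                    # random training half
fit <- enet(X[idx,], y[idx], lambda=lambda2, normalize=FALSE)
s.grid <- seq(0, 5000, length=21)                    # candidate l1 penalties
rmse <- sapply(s.grid, function(s)
  sqrt(mean((predict(fit, X[-idx,], s=s, mode="penalty", type="fit")$fit - y[-idx])^2)))
s.grid[which.min(rmse)]                              # penalty minimising hold-out RMSE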
See section 3.3 (p.57) and 3.4 (p.61) of [HTF] for further details of regularisation and
variable selection, and Chapter 18 (p.649) of [HTF] for further details of methods for p ≫ n.
Now, for small α, $(1-\alpha)^m \simeq 1 - m\alpha$ (the first two terms in a binomial expansion), and so
$\text{FWER} = 1 - (1-\alpha)^m \simeq m\alpha.$
Note that although we have derived this as an approximation, and assuming independent
tests, it is possible to deduce directly using Boole's inequality that
$\text{FWER} \leq m\alpha,$
so that choosing α = α0/m (the Bonferroni correction) guarantees a FWER of no more than α0. For example, for m = 7,000, if a FWER of α0 = 0.1 is required, then a significance
level of around α = 0.000014 is required. This incredibly stringent significance level is
required in order to control the FWER, but will clearly come at the expense of many more
false negatives. For large m, this leads to a procedure with poor power for identifying the
true positives of primary interest.
$\text{FDR} = \frac{X}{l(\alpha)}.$
But then
$\text{E}(\text{FDR}) = \frac{\text{E}(X)}{l(\alpha)} = \frac{m\alpha}{l(\alpha)}.$
If we want to have E(FDR) < α0 , we need to have
$\frac{m\alpha}{l(\alpha)} < \alpha_0 \quad\Rightarrow\quad \alpha < \frac{\alpha_0\, l(\alpha)}{m}.$
That is, we need to choose α sufficiently small that this inequality is satisfied (but other-
wise as large as possible). However, we need to be a little bit careful, since both sides of
this inequality depend on α.
In order to see how to solve this, it is helpful to invert the relationship between l and α,
where we now regard l as given, and consider $\alpha_{(l)}$ to be the value of α giving rise to a list of
length l. Then we have
$\alpha_{(l)} < \frac{\alpha_0\, l}{m},$
and since $\alpha_{(l)}$ is essentially $p_{(l)}$, the lth smallest p-value,
$p_{(l)} < \frac{\alpha_0\, l}{m}$
is the inequality of interest. But then we can visualise the solution of this problem by
plotting $p_{(l)}$ and $\alpha_0 l/m$ as functions of l, and taking the crossing point as determining l and $p_{(l)}$,
the threshold significance level.
Figure 7.3: The m = 6830 ordered p-values (pval.sort) for the nci microarray data, plotted against index
plot(1:500,pval.sort[1:500],type="l",col=2)   # smallest 500 ordered p-values
abline(0.05/6830,0,col=5)                     # horizontal Bonferroni threshold
abline(0,0.05/6830,col=4)                     # FDR line with slope alpha_0/m
We still can't see exactly where the p-values cross the Bonferroni threshold, but we
now see that the p-values cross the FDR threshold at around 180 (in fact, the threshold is first exceeded
at 187), and so we will choose to look at the smallest 186 p-values (corresponding to a
significance threshold of around 0.0014), if we are only prepared to tolerate a FDR of 5%.
An alternative way to view the solution to the problem, which is also informative, is to
rewrite the inequality as
$\frac{p_{(l)}\, m}{l} < \alpha_0.$
Then defining
$f_{(l)} = \frac{p_{(l)}\, m}{l},$
we want to find the largest l such that
$f_{(l)} < \alpha_0.$
So if we plot $f_{(l)}$ against l, we look for the (last) crossing of the α0 threshold, from below.
Consequently, we can think informally of $f_{(l)}$ as representing the expected FDR associated
with the lth ordered p-value. This is closely related to (but not quite the same as) the
concept of a q-value, which is also a kind of FDR-corrected p-value. Further examination
of such concepts is beyond the scope of this course.
Figure 7.4: First 500 ordered p-values for the nci microarray data.
fdr=6830*pval.sort/1:6830               # f_(l) = m * p_(l) / l for each l
plot(1:500,fdr[1:500],type="l",col=2)
abline(0.05,0,col=3)                    # FDR threshold alpha_0 = 0.05
Notice that this function is not monotonic in l, and this is why it is not quite right to interpret
$f_{(l)}$ as an FDR-corrected p-value, but it is close enough for our purposes.
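A closely related calculation is available through base R's p.adjust() function, which additionally enforces monotonicity; a minimal sketch, assuming the unsorted raw p-values are stored in a (hypothetical) vector pval:

qval <- p.adjust(pval, method="BH")   # Benjamini-Hochberg adjusted p-values
sum(qval < 0.05)                      # length of the gene list at an FDR of 5%
sum(qval < 0.10)                      # and at an FDR of 10%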
Before leaving this example, it is worth emphasising that when working with FDR,
people often use thresholds above the 0.05 conventionally used in classical statistical testing.
A threshold of 0.1 is very often used (tolerating 1 in 10 false positives), and thresholds of
0.15 are also used sometimes. We can see from Figure 7.5 that if we were to increase
our FDR threshold to 0.1, we would get a list containing around 400 genes, and most
scientists would consider that to be a more appropriate compromise.
See section 18.7 (p.683) of [HTF] for further details of multiple testing problems.
Figure 7.5: First 500 f(l) statistics for the nci microarray data.