
Received: 23 February 2023 | Revised: 1 December 2023 | Accepted: 1 December 2023

DOI: 10.1111/test.12363

ORIGINAL ARTICLE

A gentle introduction to principal component analysis using tea-pots, dinosaurs, and pizza

Edoardo Saccenti

Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Wageningen, The Netherlands

Correspondence
Edoardo Saccenti, Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Stippeneng 4, 6708 WE Wageningen, The Netherlands.
Email: [email protected]; [email protected]

Funding information
ZonMw, Grant/Award Number: 456008002

Abstract
Principal Component Analysis (PCA) is a powerful statistical technique for reducing the complexity of data and making patterns and relationships within the data more easily understandable. By using PCA, students can learn to identify the most important features of a data set, visualize relationships between variables, and make informed decisions based on the data. As such, PCA can be an effective tool to increase students' data literacy by providing a visual and intuitive way to understand and work with data. This article outlines a teaching strategy to introduce and explain PCA using basic mathematics and statistics together with visual demonstrations.

KEYWORDS
correlation, covariance, data analysis, data literacy, data reduction, data visualization, teaching statistics, variance

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.
© 2024 The Authors. Teaching Statistics published by John Wiley & Sons Ltd on behalf of Teaching Statistics Trust.

1 | INTRODUCTION

The ability of extracting information from data in an effective way is one of the key components of data literacy.1 Modern data are becoming increasingly complex in both quantity and dimensions: as a consequence, data analysis has also become more complex and more important in all disciplines, from economics to life sciences, from education to politics. It has been suggested that teaching data and statistical science should focus on tools that are suitable for dealing with modern data, rather than on tools designed for old data types.1,2 In view of the high dimensionality of modern data, it is also recommended that students are presented with data analysis tools capable of exploring and analyzing multivariate and high-dimensional data.1,2

In this light, it is interesting to note that the most widely used tool for exploring multivariate and high-dimensional numeric data, Principal Component Analysis (PCA), was introduced over 100 years ago! PCA was first introduced by Pearson in 1901,3 the same Pearson after whom the correlation coefficient is named. Later, Hotelling introduced a different formulation of PCA in 1933.4 Although the two formulations are equivalent, they are not identical in concept.5,6 PCA can also be expressed as a discrete Karhunen-Loève transform.7

PCA is widely used in virtually all disciplines where the structure of high-dimensional data needs to be explored, and the relationships between observations and the variables responsible for observed data patterns (the so-called data structure) need to be understood. Using PCA to explore, analyze and visualize data, students can develop a deeper understanding of data and the ability to work with it effectively.

Although a solid background in mathematics, including linear algebra, geometry, and calculus, is generally required to understand the theory of PCA, it may be necessary to introduce PCA to students in an applied context, focusing on
how and when to apply the method in the practical analysis of a data set or the extraction of relevant information from measured data. In such cases, a practical, computer-based approach may be favored while minimizing mathematical theory. However, to avoid the risk of PCA becoming another "black box" tool8 that produces results and figures with little or no context, students must have a certain understanding of the theoretical foundations of the method.

While PCA is usually introduced as a dimensionality reduction technique, we present here an approach to teach PCA starting from visualization, from which PCA as a dimensionality reduction technique follows at a later stage. PCA is introduced and explained using minimal mathematics, and it is presented as an effective tool to address the problem of visualizing high-dimensional data (in four or more dimensions). The overall strategy is as follows:

1. Students are presented with the problem of visualizing a two-dimensional data set
2. Students are presented with the problem of visualizing a three-dimensional data set
3. Rotation in the plane and in the space is presented as a tool to explore and identify the data structure
4. Rotation is formulated using trigonometry, introducing the concepts of linear combination and weighted sum
5. A link is established between information content and variance
6. Rotation along the direction of maximum variance is presented, which leads to the definition of:
7. Principal components and their meaning
8. The relationship between original variables, correlation and principal components is presented, which leads to:
9. PCA as a data dimensionality reduction technique.

This approach has been developed (and used) for students at both the Bachelor and Master level in the Life Sciences programs at Wageningen University (the Netherlands), with class sizes ranging from 10 to more than 80 students. Classes were very composite, with students from different nationalities and backgrounds. This approach is, for most of the students, the first exposure to a multivariate tool for the analysis of high-dimensional data and informs following lectures and activities dedicated to more advanced tools (like sparse variants of PCA and PCA regression). The material is covered in two lecture classes (45 min each), each followed by hands-on computer practical classes (1.5 h each) in which students solve a set of exercises performing PCA using R and/or Matlab on provided data sets. Each practical class ends with a wrap-up session where results of the exercises are presented and discussed and feedback is provided.

This paper also presents the Principal Component Analysis of several data sets (some synthetic and some from real-life biology and biochemistry experiments) to provide first-hand experience on the interpretation of a PCA model. An appendix is included that presents a linear algebra formulation of the PCA problem and its solution.

We recently (October 2023) used this approach also in a course dedicated to first-year graduate students with an experimental background in biological sciences and limited statistical knowledge who needed to develop basic data analysis skills: the proposed introduction to PCA was evaluated "Very good" or "Excellent" by all the students who filled in the evaluation questionnaire at the completion of the course. We indicate how we use this approach, and hope that others may find part or all of it useful for their teaching.

2 | VISUALIZING AND EXPLORING DATA IN TWO AND THREE DIMENSIONS

2.1 | Visualizing a data set in two dimensions

At the beginning of our lecture, students are presented with the following simple yet fundamental problem: we have a set of 4608 measurements of two variables, x and y, as shown in Figure 1A. The questions to be asked are:

1. How can we visualize this data and understand what it represents?
2. What is the structure of the data?

In the context of PCA, the "structure of the data" refers to the underlying patterns or relationships present in the data set.

A possible approach to explore the data structure is to use/attempt a spatial visualization, as every observation can be considered a point P(x,y) on the plane, where each variable represents a spatial (orthogonal) coordinate in a Cartesian plane xOy (where xOy is a shorthand for an x and y axis reference system with axis origin in O). These points can be visualized by plotting each pair x and y, as shown in Figure 1B: the data set contains the x and y coordinates of a tea-pot! The tea-pot shape can be referred to as the structure of the data (set): the structure of the data set is a tea-pot.
2.2 | After rotation a tea-pot is still a tea-pot

Next we ask the students: "what if we rotate the tea-pot, say if we rotate it 65 degrees clockwise?"
FIGURE 1 (A) Visualization of a two-dimensional data set X: screenshot of a spreadsheet containing 4608 observations of two variables x and y which the students are asked to visualize. (B) Two-dimensional plot of the numerical data set X shown in panel A: it is a tea-pot! (C) You can recognize that it is a tea-pot even if it is rotated to be upside down.

We obtain the plot in Figure 1C. The answer of the students to this question should be immediate. Apart from being now upside down, nothing really changed: we can still recognize 4608 points on the plane representing a tea-pot.

The rotation has not altered the structure of the data, nor our ability of recognizing the structure of the data: stated in other words, it has not made the structure of the data more (or less) clear to our eyes.

2.3 | Visualizing a data set in three dimensions

Extending the visualization approach to three dimensions is straightforward. When three variables are recorded, each observation can be represented in a three-dimensional space. However, this added dimension increases the complexity of understanding the data structure. We now consider a data set X where 36,876 observations are recorded for three variables x, y, and z, resulting in 36,876 points P(x, y, z) that can be plotted in a three-dimensional space. This produces Figure 2A. The question we can ask students is, "What does it represent?" One student answered, "It is a dead cat run over by a truck." My response was, "Not exactly, but you are close—it is definitely a dead animal."

In a three-dimensional space, we have more freedom to rotate the set of points. We can rotate around the x-axis, y-axis, or z-axis, or any combination. By rotating the points and using visual inspection, taking a trial-and-error approach, we can arrive at Figure 2B, where the structure of the data becomes clear: it's a triceratops! The rotation allowed us to discover the true structure of the data set. This is why many statistical software programs provide three-dimensional plotting applications that allow plots to be rotated by the user. Rotations gave us a better perspective on the data set because, since we can visualize the data points in three-dimensional space, we could choose the best rotation, where "best" means a rotation that made us recognize the triceratops-like structure of the data.

FIGURE 2 Rotation as a way to extract maximal information in the case of a 36876 x 3 data set X. (A) Three-dimensional plot of the original data: there is structure in the data, but it is not clear what it is. (It is not a dead cat, as one student initially guessed.) (B) Optimal rotation of the object: in this case the optimality is akin to a rotation that maximizes the possibility of recognizing the true structure of the data set.
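A minimal MATLAB sketch of this exploration follows; X3 is an assumed variable name for an n-by-3 matrix of (x, y, z) coordinates, for example the triceratops coordinates loaded as described in Section 12.

% Minimal sketch: plot a 3-D point cloud and let students rotate it by trial and error.
figure
scatter3(X3(:,1), X3(:,2), X3(:,3), 2, 'filled')
axis equal; xlabel('x'); ylabel('y'); zlabel('z')
rotate3d on                      % interactive rotation, as in Figure 2

% A rotation can also be applied explicitly, here around the z-axis:
theta = 65;                      % angle in degrees, found by trial and error
Rz = [cosd(theta) -sind(theta) 0;
      sind(theta)  cosd(theta) 0;
      0            0           1];
X3rot = X3 * Rz';                % rotated coordinates: same distances, same structure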
This approach is limited, however, to three dimensions, as we can only visualize objects in three dimensions. Also note that this visualization by trial-and-error can work only where the structure, or relationships among the variables, is quite clear (i.e., recognizable by our brain), and hence in a non-statistical (deterministic) situation. In statistics, we are seeking to detect structure or relationships in our data,
allowing for variation around the structure, which may not be recognizable but relevant for the problem at hand. We are also seeking to understand or explain the variation.

What if a data set has 4 variables or 1000 variables? It is not possible to visualize a 4-dimensional or 1000-dimensional object even if there is no random variation in the data. We can speculate about the existence of a tea-pot or a triceratops in a p-dimensional space. In statistics, we are also trying to detect structure in the data while allowing for, and interpreting, variation: so, how can we find the structure of a high-dimensional data set? This is where Principal Component Analysis comes in, as a tool to detect and visualize the structure of high-dimensional data and obtain a low-dimensional representation of complex, high-dimensional data.

FIGURE 3 Variation as information. (A) Three points A(x,y), B(x,y), and C(x,y) are scattered around (0,0) and are distinguishable. The variance var(x) of the x coordinate is larger than zero: the variation (differences) among the three points is here the information that allows to distinguish the three points; the information is quantified by the variance. (B) Three points A(x,y), B(x,y), and C(x,y) overlap on (0,0) and are indistinguishable: since there is no variability among the coordinates, there is also no information that allows to distinguish the three points.

3 | VARIATION AS INFORMATION

The idea of information contained in a data set is fairly readily grasped by our students through the triceratops example: they can see that some directions in space reveal the true form of the object (the structure of the data), while others obscure it. Thus, to identify the dinosaur, the directions of some rotations carry more information than others.
FIGURE 4 Rotation toward maximum variability. (A) Three points A(1,1), B(2,2), and C(3,3) in the reference system xOy are aligned along the θ = 45° direction. The variances var(x) and var(y) of the x and y coordinates are both 1. (B) The coordinates can be rotated by θ = 45° counterclockwise to define the new coordinate system x'Oy', where the variance var(x') of the new x' coordinate is 2 and var(y') = 0. Rotation has maximized the variance along the x' direction. Note that the total variance is conserved: var(x) + var(y) = var(x') + var(y') = 2.

In this context, we now show how information can be understood as variation (variability): the more dispersed the data points are in a certain direction, the greater the information content along that direction. We first consider the simple example in Figure 3. In panel A, three points scattered around x = 0 can be distinguished because their x coordinates (x values) are different. The diversity or dispersion of the three points is described by the variance. In panel B, the three points perfectly overlap at x = 0: there is no variation among the three points, making them indistinguishable and resulting in no information that can be used to determine how many points are present. Note that for the points in panel A, a rotation of 90 degrees clockwise—or anti-clockwise—makes the three points indistinguishable.

4 | ROTATION ON THE PLANE

Mathematically, rotation on a plane or in space can be introduced through the rotation matrix and the algebraic principles of linear transformations. However, for students with no knowledge of linear algebra but with a basic understanding of trigonometry (i.e., the definition of sine and cosine of an angle θ), the rotation formula can simply be presented. A point P of coordinates (x, y) is given in the reference system xOy. If the coordinate system is rotated by an angle θ, the new coordinates of P in the rotated reference system x'Oy' are given by the trigonometric formula:

$$x' = x\cos\theta + y\sin\theta$$
$$y' = -x\sin\theta + y\cos\theta \qquad (1)$$

Equation (1) indicates how the new coordinates (x', y') are derived from the old coordinates (x, y). The transformation is achieved through a linear combination of (x, y), with coefficients determined by the sine and cosine of θ. These coefficients can also be interpreted as weights in a weighted sum, quantifying the importance of each old coordinate in the new coordinate system. This concept of weighting will play a crucial role in explaining the Principal Component Analysis (PCA) model, as will be discussed in Section 5.
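A minimal numerical check of Equation (1), with the point P(1, 1) chosen here only as an example:

% Minimal sketch of Equation (1): the new coordinates are linear combinations
% (weighted sums) of the old ones, with weights given by cos(theta) and sin(theta).
theta = 45;                                    % rotation angle, degrees
x = 1; y = 1;                                  % a point P(x, y) in the xOy system
xp =  x*cosd(theta) + y*sind(theta);           % x' = x cos(theta) + y sin(theta)
yp = -x*sind(theta) + y*cosd(theta);           % y' = -x sin(theta) + y cos(theta)
fprintf('(x'', y'') = (%.2f, %.2f)\n', xp, yp) % prints (1.41, 0.00)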
5 | ROTATION TOWARD MAXIMUM VARIANCE DIRECTION

To demonstrate the concept of rotation toward the direction of maximum variance, we consider the plot shown in Figure 4A, showing the two-dimensional representation of a 3 x 2 data set X given by:

$$X = \begin{array}{c|cc} & x & y \\ \hline A & 1 & 1 \\ B & 2 & 2 \\ C & 3 & 3 \end{array} \qquad (2)$$

In the xOy coordinate system, the points A, B, and C are located at A(1,1), B(2,2), and C(3,3), respectively, and are aligned along the θ = 45° diagonal line. We focus now on the variances var(x) and var(y) of the x and y variables, that is, we want to measure their variability around their mean values. The mean value of the x and
y coordinates (variables) is 2, and the sample variance is, for both variables:

$$\frac{1}{2}\left[(1-2)^2 + (2-2)^2 + (3-2)^2\right] = 1. \qquad (3)$$

The variance along the x and y directions (i.e., the variance of the x and y coordinates) is equal to 1, implying that the amount of information along both axes is the same. Note that in this artificial example, we are treating the three points as sampled data in order to illustrate the concept. Because this is such an unsuitable sample to illustrate statistics, an instructor may wish to treat these three points as a population of three equally likely points, and use population variances; it makes no difference to the illustration of the concept.

It should be visually apparent that the reference system xOy can be rotated by 45° (counterclockwise), aligning all three points along the new x' axis.

The coordinates of point A in the new reference system x'Oy' are now (this can also be left to the students to calculate):

$$x' = 1\cdot\cos(45^\circ) + 1\cdot\sin(45^\circ) = \frac{\sqrt{2}}{2} + \frac{\sqrt{2}}{2} = \sqrt{2} \approx 1.4$$
$$y' = -1\cdot\sin(45^\circ) + 1\cdot\cos(45^\circ) = -\frac{\sqrt{2}}{2} + \frac{\sqrt{2}}{2} = 0. \qquad (4)$$

For points B and C, the new coordinates (rounded to one decimal) are (2.8, 0) and (4.2, 0), as shown in Figure 4B. What about the variance of x' and y'? Simple calculations give 2 and 0 for x' and y', respectively. The variance in the x' coordinates of the three points in the x'Oy' reference system (Figure 4B) is now twice the variance of the x coordinates in the xOy reference system (Figure 4A), while the variance in the y' coordinates is now zero. Note that the total variance, that is, the sum of the variances in the two coordinate systems, has not changed upon rotation and is equal to 2.
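These calculations can be verified with a few lines of MATLAB; this is a minimal sketch, not part of the published course material:

% Minimal sketch: the 45-degree rotation moves all the variance of the data set
% of Equation (2) onto the first rotated coordinate.
X = [1 1; 2 2; 3 3];                 % points A, B, C in the xOy system
theta = 45;
R = [cosd(theta) -sind(theta);
     sind(theta)  cosd(theta)];      % rotation as in Equation (1)
Xrot = X * R;                        % coordinates in the rotated system x'Oy'
var(X)                               % sample variances of x and y:   [1 1]
var(Xrot)                            % sample variances of x' and y': [2 0]
abs(sum(var(X)) - sum(var(Xrot)))    % ~0: the total variance is conserved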
The change of the variance in the coordinates occurred without modifying the relationships between data points: the distance d(A, B) between the points A and B has not changed, as can be readily verified by Pythagoras' theorem:

$$d(A(x,y), B(x,y)) = \sqrt{(2-1)^2 + (2-1)^2} = \sqrt{2} \approx 1.4$$
$$d(A(x',y'), B(x',y')) = 2.8 - 1.4 = 1.4. \qquad (5)$$

The rotation has preserved distances. This we had already seen in the teapot example in Figure 1: rotating the teapot has not affected the distances between the points; in fact, after rotation, the teapot is still a teapot and the data structure has not been changed. However, the amount of information encoded along different dimensions has changed. In the x'Oy' coordinate system, we see that the three points are indistinguishable along the y' direction and carry no information, as indicated by the variance of the y' coordinates being zero.

In contrast, the rotation by 45° has increased the variance in the x coordinate, which has gone from 1 (in xOy) to 2 (in x'Oy'). It is evident that the 45° line is the direction with the maximum spread of points in the xOy plane, and, therefore, it is the direction with the maximum variance in our data set.

We have found the direction of maximum variability for our 3 x 2 data set: the new coordinates x' and y' are called principal components of our data set.

The above example is artificial and not suitable to illustrate statistical data because the points lie exactly on a line in the xOy plane. We now consider a sample of three points as shown in Figure 5A: the direction of maximum variance will run along a diagonal line through the points, inclined at a certain angle θ with respect to the x-axis. How to determine the angle that maximizes the variance? For instance, consider a direction of 40°: after rotation, the sample variances of the transformed x' and y' (represented by the blue dots in Figure 5B) are, respectively, 4.7 and 1.3.

If we consider a rotation along the 57° direction, the sample variances in the new coordinates x' and y' (red dots in Figure 5B) are, respectively, 5.0 and 1.0. In xOy the variance along x is 2.3, or 38.3% of the total variance, since

$$\frac{2.3}{2.3 + 3.7} = 38.3\% \qquad (6)$$

Equation (6) gives the so-called variance explained by direction x (see also Equation A11 in Appendix A). If we rotate by 40°, the variance along x' is 4.7 or 78.3% of the total variance; rotating by 57°, the variance along x' is 5.0 or 83.3% of the total variance. As seen before, the total variance amounts to 6.0 in both coordinate systems.

The direction with an angle θ = 57° is a better representation of the data set as it captures 83.3% of the total variance, while the direction with an angle θ = 40° captures a bit less variance, 78.3% of the total variance. In PCA, this is referred to as "variance explained." The first principal component is defined as the direction that maximizes the variance, or the direction that explains the maximum variance. However, we cannot be certain that the 57° direction is the direction of maximum variance just relying on some level of graphical visualization and intuition.
FIGURE 5 Three points scattered along the diagonal of the xOy plane. The variances of the x and y coordinates are 2.3 and 3.7, respectively (rounded to 1 decimal). (A) The direction of maximum variance is directed along an (unknown) angle θ. Two directions are shown, for θ = 40° (blue line) and θ = 57° (red line). (B) The three points rotated into an x'Oy' reference system, by θ = 40° (blue points) and θ = 57° (red points).

Two key points should be emphasized to the students:

1. Determining the direction(s) of maximum variability becomes complex in high-dimensional data, as the number of angles required to determine the direction of maximum variance increases with dimensionality, that is, with the number of variables.
2. A trial-and-error approach is not efficient.

For this example, the angle θ that maximizes the variance is θ = 54.03°, for which the variance in x' is 5.0 (rounded): this solution is found using the mathematical method outlined in Appendix A, which lies at the core of Principal Component Analysis and can be easily implemented in statistical software.
ical method outlined in the Appendix A, which lies at the nents. See the Appendix A for a linear algebra
core of Principal Component Analysis and that can be explanation.
easily implemented in statistical software. It is not neces-
sary to delve into the technical details of how the princi-
pal components are actually calculated at this stage. 6 | PRINCIPAL COMPONENT
Instead, the focus is be on illustrating and explaining the ANALYSIS AS DIM ENS IONALITY
characteristics of Principal Component Analysis by con- REDUCTION
sidering the other very important result that can be
derived from these two simple examples, namely that in With reference to the three data points shown in
each case, the principal components are uncorrelated. Figure 4A, we observed that rotating along the θ ¼ 45 ∘
How instructors make this point depends on their stu- direction resulted in a new coordinate system, x 0 Oy0 (4B),
dents' backgrounds—whether they have previously seen where the y0 coordinate of each point is zero and the
correlation coefficients and whether instructors wish points are perfectly aligned along the x 0 axis. In the x 0 Oy0
their students to obtain them here by hand or by using a system, only one coordinate (x 0 ) is needed to determine
calculator or dedicated software. the position of a data point, whereas in the original xOy
We can now state the general result for principal system, two coordinates (x and y) were necessary.
components of a numeric data set with n observations on The rotation along the 45 ∘ direction has reduced the
p variables. The principal components are p new dimensionality of the system from two to one. Why is this
possible? This happens because in this case there is a perfect linear relationship between the x and y coordinates. In fact y = x, which is a particular case of the more general linear relationship y = ax + b. The 45° rotation reduced the dimensionality from two to one because the x and y variables were perfectly correlated (Pearson's sample correlation r = 1), as the points lay exactly on a line. In the context of PCA, this means that if two variables are perfectly correlated, one of them is redundant as they carry the same information. By rotating the plane, the two correlated variables x and y in xOy become two new uncorrelated variables x' and y' in x'Oy'. In fact, the two variables can be replaced by a new variable which is a weighted sum (i.e., a linear combination) of the two original variables, giving birth to a principal component.

This concept holds true also when there is no perfect correlation between observed/measured variables, as shown in Figure 6. The 25 points in xOy are spread around the θ = 45° direction and are highly correlated (r = 0.97). The sample variances of x and y are both 1, thus the x and y directions each account for 50% of the total variance, which is 2. The 45° rotation shown in Figure 6B reduces the original y coordinate and reduces the correlation between the two coordinates to r = 0.02. The sample variances in the rotated system are 1.9 for x' and 0.1 for y', whose directions account for 95% and 5% of the total variance (which is still 2).

FIGURE 6 Rotation of correlated variables. (A) Two variables x and y that are correlated in the xOy plane: the sample correlation coefficient (Pearson's) is r = 0.97. (B) The same variables rotated by 45°: the variables are now uncorrelated (r = 0.02). Note that the scale of the y'-axis in panel B is larger than the scale of the y-axis in panel A even if the value range is smaller: this is done to show how the rotated points are now scattered along the x'-axis in a random way, thus they are uncorrelated.

The 45° direction is approximately the direction of maximum variability, and the rotation along this direction maximizes the x coordinate while suppressing the y coordinate, effectively reducing the dimensionality. The stronger the correlation between x and y, the more effective the suppression of the y variable. If x and y are perfectly correlated, the suppression of the y variable is perfect as there is variance only along one direction.

This example shows, in two dimensions, what PCA actually does: it identifies the direction(s) of maximum variability, converting correlated variables into uncorrelated variables, and this extends to any number of variables, that is, dimensions. As a result, the principal components are linear combinations of the original variables: for a two-dimensional case, the principal components, that is, the rotated variables, are given by Equation (1). The direction(s) of maximum variability, as well as the coefficients of the linear combination for each variable, can be found as described in Appendix A. In practice, we can use the principal components with the greatest variances to visualize and interpret the data. That is, we can choose to reduce the dimensionality of the data set. In the above example, we could reduce the dimensionality of the data from 2 to 1, given that the first principal component (the one given by x') summarizes 95% of the variance (or information) of the original data. Also, using the principal components can remove multicollinearity from the data.
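The decorrelation effect can be reproduced with synthetic data; the data below are hypothetical stand-ins generated for illustration, not the actual data of Figure 6:

% Minimal sketch: two highly correlated variables become (nearly) uncorrelated
% after rotation toward the direction of maximum variance.
rng(1)                                   % for reproducibility
x = randn(25,1);
y = x + 0.25*randn(25,1);                % y strongly correlated with x
X = [x - mean(x), y - mean(y)];          % mean-centered 25-by-2 data set
corrcoef(X)                              % off-diagonal entries close to 1

theta = 45;
R = [cosd(theta) -sind(theta); sind(theta) cosd(theta)];
Xrot = X * R;                            % rotate by 45 degrees
corrcoef(Xrot)                           % off-diagonal entries close to 0
var(Xrot)                                % most variance on the first coordinate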
7 | PRINCIPAL COMPONENT ANALYSIS OF A HIGH DIMENSIONAL DATA SET

We now proceed to analyze a high-dimensional data set X containing 36,876 observations of 1000 variables, thus
the data is 1000-dimensional: what is the structure of the data? We cannot visualize a 1000-dimensional data set, and visualization is here not helpful. Before proceeding with the analysis, we give two new definitions:

1. The weights of the linear combinations defining the principal components are called loadings.
2. The representations of the original data in the principal component space (i.e., the rotated data in the PCA space) are called scores.

Let's now perform a PCA on the 36876 x 1000 data set X pictured in Figure 7A using Matlab. The command line is:

[R,T,~,~,VarExplained] = pca(X);

We obtain a 36876 x 1000 matrix T that contains the original 36,876 observations in a rotated coordinate system. There are still 1000 variables in T, so the rotated space is still 1000-dimensional. Each of the new coordinates in T is a linear combination of the original variables in X. The first principal component, that is, the first new variable in T, now accounts for most of the variance, followed by the second, and so on.
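A minimal sketch of how the outputs of this call can be inspected (the two plots correspond to panels B and C of Figure 7):

% Minimal sketch: inspect the outputs of pca().
% R holds the loadings, T the scores, VarExplained the percentage of
% variance explained by each component.
[R, T, ~, ~, VarExplained] = pca(X);

figure
bar(VarExplained(1:5))                 % variance explained, first 5 components
xlabel('Principal component'); ylabel('Variance explained (%)')

figure
scatter(T(:,1), T(:,2), 2, 'filled')   % 2-D view of the 1000-dimensional data
xlabel('PC1'); ylabel('PC2'); axis equal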
Figure 7B shows the variance explained for the first 5 principal components: the first two components account for around 50% of the total variance; the first three for 53%. The remaining variance is distributed over the remaining 997 dimensions. If we now plot the first two components, we obtain the two-dimensional representation of the 1000-dimensional data set which is given in Figure 7C.

The PCA has found rotations in the 1000-dimensional space such that 50% of the variance (information) present in the original data set is contained in the first two principal components. This means that if we only plot the first two components (which are, by definition, linear combinations of the original coordinates), we can represent 50% of the information contained in the 1000-dimensional data on the PC1-PC2 (bidimensional) plane (Figure 7C). By doing that (plotting 2 dimensions rather than 1000), we have reduced the dimensionality of the data set. As the figure shows, we have eliminated 998 variables (i.e., dimensions) but we retained sufficient information to describe the object (the triceratops), removing (i.e., denoising) dimensions that carry little information.

FIGURE 7 Principal Component Analysis of a high dimensional data set. (A) Screenshot (color-coded) representation of a 36876 x 1000 data set X. (B) Variance explained by the first 5 principal components. (C) Plot of the first two principal components, giving a low-dimensional (2-dimensional) representation of the original 1000-dimensional data. The first two components account for 50% of the variance in the original data. This data set is derived from the three-dimensional triceratops data set used in Figure 2 by expanding the original data set with 36,876 observations of 997 uncorrelated variables sampled from a standard normal distribution.

8 | INTERPRETATION OF PRINCIPAL COMPONENT ANALYSIS RESULTS

We now show the analysis and interpretation of the results of a Principal Component Analysis of a multivariate data set containing chemical data measured on 300 samples of
pizza from 10 different commercial pizza brands (labeled A to J), on which seven chemical attributes have been measured: moisture (amount of water in the sample), protein, fat, ash, sodium and carbohydrates content, and amount of calories (per 100 g of pizza). There are 30 samples for each brand, thus the data set X on which PCA is performed has size 300 (observations) x 7 (variables) and is 7-dimensional.

FIGURE 8 Principal Component Analysis of the pizza data set. (A) Score plot (i.e., the representation of the original seven-dimensional data in the principal component space) for the first two principal components. (B) Loadings (variable importance) for the first principal component. (C) Loadings (variable importance) for the second principal component. The loadings represent the contribution of each variable to a given principal component, just as we have seen in the case of the two-dimensional examples in Figures 5 and 6.

Data are standardized to unit variance before analysis: see Section 10, Some Technical Remarks, for more details about why this is (or may be) necessary. The two-dimensional score plot (i.e., the representation of the original seven-dimensional data in the first two principal components space) is given in Figure 8A: each dot represents a pizza sample in the two-dimensional space defined by the first two principal components. Dots are color-coded according to the pizza brand. The first two components account for more than 90% of the total variance: the information that was originally contained in seven variables can now be described with just two variables (principal components). The pizza samples group in five larger groups, each group containing pizza samples that share similar chemical profiles. Note that the information about which brand each sample belongs to is used only for visualization and is not used in PCA: the clustering of the samples according to the brand (i.e., the data structure) is a characteristic of the data that PCA is able to condense in two dimensions. Pizza brand H appears to be quite different from the other brands, since it separates from all other brands, as seen in the lower right quadrant of Figure 8A.
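A sketch of this analysis is given below; Xpizza (the 300-by-7 matrix of chemical measurements) and brand (the 300-by-1 vector of brand labels) are assumed variable names for data loaded from the source in Section 12, and the variable ordering in the tick labels is likewise assumed.

% Minimal sketch of the pizza analysis: standardize, run PCA, and plot
% scores (color-coded by brand) and loadings of the first component.
Z = zscore(Xpizza);                     % standardize: zero mean, unit variance
[R, T, ~, ~, VarExplained] = pca(Z);

figure
gscatter(T(:,1), T(:,2), brand)         % score plot, one color per pizza brand
xlabel('PC1'); ylabel('PC2')

figure
bar(R(:,1))                             % loadings of the first principal component
set(gca, 'XTick', 1:7, 'XTickLabel', ...
    {'moisture','protein','fat','ash','sodium','carb','cal'})
ylabel('Loading on PC1')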
The plots of the loadings of the first and second principal components are shown in Figure 8B,C. The loadings, which are normalized such that the sum of their squared values equals 1 (see Appendix A for more details), give the importance of each original variable (in this case the chemical quantities) to a given component. The loading sign is arbitrary, because we can swap all the loading signs without changing the PCA model: what matters is the absolute value of the loading. In this example, we see that fat, sodium, ash, carbohydrates, and calories are responsible for the difference between the pizza brands that separate along the first component, that is, pizza brands that are different along the x-axis; moisture and calories content are important to differentiate among brands that separate along the second component, that is, along the y-axis. To separate the pizza clusters on the left side of Figure 8A, one needs to trace a diagonal line (indicated with a dashed line): this suggests that one or more variables contribute (i.e., have high loadings) to both components: in fact, calories have a high loading (in absolute value) on both components. The interpretation of a PCA model is context dependent, as well as its relevance to the problem being studied: the importance of each variable should always be evaluated bearing in mind, in this case, its biochemical implications. This example was selected given that the student
target group, for which this approach was initially developed, has a biological background. Other examples could be used to accommodate the needs of students from different disciplines.

9 | WHEN PRINCIPAL COMPONENT ANALYSIS DOES NOT SEEM TO BE HELPFUL?

A real-life example where the application of PCA to understand the structure of data does not result in a straightforward interpretation is shown in Figure 9. Here PCA is applied on a data set containing the gene expression values of 100 genes measured on the skin of 174 subjects, of which 92 are suffering from a skin condition and 82 are healthy. Gene expression data refers to the abundance of the mRNA transcript of a gene. The collection of the mRNA abundances of a set of genes measured on a biological sample is referred to as a transcriptome or gene expression profile. For a primer on this topic, see references 9, 10. The dimension of this data set is 174 (observations) x 100 (variables), thus the data set is 100-dimensional.

FIGURE 9 When PCA is not helpful. (A) Scatter plot of the principal components of a PCA model of a data set containing 100 genes (variables) measured on the skin of 174 subjects, of which 92 (red dots) are suffering from a skin condition (psoriasis) and 82 are healthy (blue dots). (B) Loadings (variable importance) for the first principal component.

The interest here is in assessing whether the gene expression profiles of diseased subjects are different from those of healthy people. If so, one could expect the samples to separate into two (more or less) distinct groups (as happens for some of the pizza samples in Figure 8).

Figure 9A shows a three-dimensional plot of the first three components: dots are color-coded according to the disease status: diseased (red dots) and healthy (blue dots). We can see that there is no separation between the two groups. The variances explained by the three components are not tremendously different, suggesting that in the data there are no privileged directions along which variance (information) is concentrated. This happens because the original variables are poorly correlated, and we have discussed how PCA exploits the correlation among variables: in fact, if the variables are uncorrelated to start with, PCA will return the original variables but ordered by their variances. Basically, if the variables are uncorrelated, there are no directions of maximum variance: reference 11 has nice visual examples of this situation and discusses the problem and its consequences for PCA in a gentle but rigorous way.

This can be seen in this example by the fact that the data points in the three-dimensional PCA space are roughly shaped like a ball. Examining the loadings (as we have done for the pizza example in Figure 8) highlights another limitation of PCA. The loading plot for the first principal component (similar plots are obtained for the other components) shows how all variables contribute almost equally to the definition of the principal component. This also illustrates an inherent disadvantage of PCA for some data sets: the new variables (i.e., the principal components) are linear combinations of all the original variables: as such, they are not necessarily as easy to interpret as the original variables themselves. Their main use is to identify high contributions to variation and to be able to consider fewer variables—that is,
reduced dimensionality—in visualization and/or analysis. Variants of PCA exist that aim to provide simpler models by suppressing the contribution of sets of variables: this is the realm of so-called Sparse PCA, but this is a rather advanced topic. The interested reader can refer to references 12–14 for an overview.

For this example, and based on the PCA analysis, we can conclude that there is no difference, in the three-dimensional space, among the gene expression profiles of diseased and healthy people. It may happen that a difference between the two groups exists in a higher-dimensional space. For this data set, PCA is not helpful or, better stated, this data set has a structure that cannot be exploited using PCA.

To summarize, PCA is useful for data exploration and dimensionality reduction only when there are correlations among the variables. We emphasize again that PCA produces linear combinations that are uncorrelated and ordered by their (sample) variances from maximum to minimum. If the original variables are uncorrelated, or nearly so, PCA gives back the original variables (or close to). Also, correlation (at least the one employed in PCA, which is Pearson's correlation) is only a measure of linear relationship, so curved relationships will be hidden and could even inhibit data understanding and interpretation.

10 | SOME TECHNICAL REMARKS

In real-life applications, (measured) variables may have significantly different ranges, units, or variances. When the variables have different scales or units, those with larger magnitudes can dominate the analysis, leading to biased results, in the sense that the principal components will be dominated by the (few) variables with the larger variance and/or magnitude. To avoid this, variables are often scaled to a common scale, for instance by subtracting the mean and dividing by the standard deviation: in this way, all variables have zero mean and variance equal to 1. This procedure is called standardization, or auto-scaling, or scaling to unit variance (the term "standardization" tends to be preferred in statistics, as data science tends to differentiate between variables "scaled" to a given common scale and variables "standardized" by mean and standard deviation). PCA is extremely sensitive to scaling or data transformations, and interpretation also depends on how the data have been processed. Reference 15 is an excellent introduction to this problem.
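As a minimal sketch, standardization amounts to one line per variable:

% Minimal sketch of standardization (auto-scaling) before PCA.
Z = (X - mean(X)) ./ std(X);   % each column: zero mean, unit variance
% With the Statistics and Machine Learning Toolbox, equivalently: Z = zscore(X);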
PCA, as described here, is applied on numerical data expressed on a ratio scale. Variants of PCA have been proposed to analyze different types of data, like binary data (i.e., data consisting of 0 and 1): see, for instance, references 16–18 and references therein.

Another crucial step in PCA is the determination of the optimal number of principal components. The problem arises from the need to strike a balance between preserving enough information and reducing the dimensionality of the data. Selecting too few components may result in a loss of information, while choosing too many components can lead to overfitting and increased computational complexity. Various methods have been developed to address this problem, including scree plots, cumulative explained variance, cross-validation techniques, and statistical methods. These approaches aim to identify the components beyond which the additional explained variance becomes negligible, allowing for the selection of an optimal number of principal components that effectively represent the data. The literature on the topic is huge, but the mathematics needed is, in most cases, quite advanced. The problem is introduced in a discursive way in reference 5; reference 19 provides a higher-level introduction to the problem and to dimensionality assessment approaches.
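A minimal sketch of one such heuristic (the 90% threshold is an arbitrary choice made here for illustration):

% Minimal sketch: keep enough components to reach a chosen threshold of
% cumulative explained variance.
[~, ~, ~, ~, VarExplained] = pca(X);       % percent variance per component
cumVar = cumsum(VarExplained);             % cumulative explained variance
nComp  = find(cumVar >= 90, 1, 'first');   % smallest number reaching 90%
fprintf('%d components explain %.1f%% of the variance\n', nComp, cumVar(nComp))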
11 | CONCLUSIONS

We have provided a gentle introduction to Principal Component Analysis that uses a limited amount of trigonometry and basic statistical concepts such as variance, covariance, and correlation. This approach is introduced by visual intuition of rotations in the plane and space, making it suitable for introducing PCA to high-school students or first-year university students in programs that require an introduction to data analysis techniques without assuming familiarity with linear algebra.

12 | SOFTWARE AND DATA

The tea-pot coordinates (Figure 1) have been obtained from Matlab using the built-in command: [coordinates, x, y] = teapotGeometry. The triceratops data (Figure 2) has been obtained from https://gitlab.com/e62Lab/medusa/-/blob/dev/examples/poisson_equation/triceratops_domain.h5. The 3-D coordinates have been extracted to Matlab format using the command: coordinates = h5read('triceratops_domain.h5','/domain/pos'). The pizza data set is available at: https://data.world/sdhilip/pizza-datasets.

The psoriasis data (a selected subset of the data published in reference 20) is available in the gitlab repository associated with this article.

Data and Matlab codes are available at https://gitlab.com/esaccenti/dinosaurpca.
All analysis was performed in MATLAB version 9.14.0 (R2023a), Natick, Massachusetts: The MathWorks Inc.; 2022.

ACKNOWLEDGMENTS
This study has received funding from The Netherlands Organization for Health Research and Development (ZonMW) through the PERMIT project (Personalized Medicine in Infections: from Systems Biomedicine and Immunometabolism to Precision Diagnosis and Stratification Permitting Individualized Therapies, project number 456008002) under the PerMed Joint Transnational call JTC 2018 (Research projects on personalized medicine - smart combination of pre-clinical and clinical research with data and ICT solutions). I thank all the students who sat through my classes on data analysis and who provided valuable feedback on lectures and course content (also those who did not share my passion for Principal Component Analysis).

CONFLICT OF INTEREST STATEMENT
The author declares no conflict of interest.

ORCID
Edoardo Saccenti https://orcid.org/0000-0001-8284-4829

REFERENCES
1. Gehrke M, Kistler T, Lübke K, Markgraf N, Krol B, Sauer S. Statistics education from a data-centric perspective. Teach Stat. 2021;43:S201-S215.
2. Ridgway J. Implications of the data revolution for statistics education. Int Stat Rev. 2016;84(3):528-549.
3. Pearson K. On lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos Mag J Sci. 1901;2(11):559-572.
4. Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol. 1933;24(6):417-441.
5. Bro R, Smilde AK. Principal component analysis. Anal Methods. 2014;6(9):2812-2831.
6. ten Berge JMF, Kiers HAL. Are all varieties of PCA the same? Br J Math Stat Psychol. 1997;50:367-368.
7. Gerbrands JJ. On the relationships between SVD, KLT and PCA. Pattern Recogn. 1981;14(1-6):375-381.
8. Rudin C. Why black box machine learning should be avoided for high-stakes decisions, in brief. Nat Rev Methods Primers. 2022;2(1):81.
9. Lowe R, Shirley N, Bleackley M, Dolan S, Shafee T. Transcriptomics technologies. PLoS Comput Biol. 2017;13(5):e1005457.
10. Wang Z, Gerstein M, Snyder M. RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57-63.
11. Björklund M. Be careful with your principal components. Evolution. 2019;73(10):2151-2158.
12. Camacho J, Smilde AK, Saccenti E, Westerhuis JA. All sparse PCA models are wrong, but some are useful. Part I: computation of scores, residuals and explained variance. Chemom Intell Lab Syst. 2020;196:103907.
13. Camacho J, Smilde AK, Saccenti E, Westerhuis JA, Bro R. All sparse PCA models are wrong, but some are useful. Part II: limitations and problems of deflation. Chemom Intell Lab Syst. 2021;208:104212.
14. Trendafilov NT. From simple structure to sparse components: a review. Comput Stat. 2014;29:431-454.
15. van den Berg RA, Hoefsloot HCJ, Westerhuis JA, Smilde AK, van der Werf MJ. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics. 2006;7:1-15.
16. Beh E, Lombardo R. Correspondence Analysis: Theory, Practice and New Strategies. 2014.
17. de Leeuw J. Principal component analysis of binary data by iterated singular value decomposition. Comput Stat Data Anal. 2006;50(1):21-39.
18. Song Y, Westerhuis JA, Smilde AK. Logistic principal component analysis via non-convex singular value thresholding. Chemom Intell Lab Syst. 2020;204:104089.
19. Saccenti E, Camacho J. Determining the number of components in principal components analysis: a comparison of statistical, crossvalidation and approximated methods. Chemom Intell Lab Syst. 2015;149:99-116.
20. Li B, Tsoi LC, Swindell WR, et al. Transcriptome analysis of psoriasis in a large case-control sample: RNA-seq provides insights into disease mechanisms. J Invest Dermatol. 2014;134(7):1828-1838.

How to cite this article: E. Saccenti, A gentle introduction to principal component analysis using tea-pots, dinosaurs, and pizza, Teach. Stat. 46 (2024), 38-52. DOI 10.1111/test.12363
APPENDIX A

In this Appendix, a linear algebra formulation of the PCA solution is given. We organize n observations of p variables in a n x p matrix X. The variables are the columns of X: x_1, x_2, ..., x_p. Each variable is thus a n x 1 vector. We assume each column of X to be mean centered.

A linear combination t of the p variables (columns of X) is a n x 1 vector given by

$$t = w_1 x_1 + w_2 x_2 + \dots + w_p x_p, \qquad (A1)$$

where w_1, w_2, ..., w_p are numerical coefficients (weights). We have stated that principal components are linear combinations of the original variables that maximize variance. Thus the problem is to find the appropriate set of weights that maximize the variance of t.

In the linear algebra formulation, the relationship between t, X and the weights w can be written as

$$t = Xw \qquad (A2)$$

where now w is a p x 1 vector whose elements are the w_1, w_2, ..., w_p weights.

Since the columns of X are centered, also t is centered, that is, its mean value t-bar is zero. The variance var(t) of t is given by

$$\mathrm{var}(t) = \sum_{i=1}^{n} (t_i - \bar{t})^2 = t_1^2 + t_2^2 + \dots + t_n^2 = t^T t. \qquad (A3)$$

From Equations (A1) and (A3), it follows that since t depends on w, also the variance of t depends on the weights. It should be noted that once the optimal weights maximizing the variance of t have been found, we could arbitrarily increase the variance by just multiplying w by any (large) number, because of the general property of the variance var(az) = a^2 var(z). Thus, to have a properly specified problem with a unique solution, we require the weights w to be normalized, that is, we require ||w|| = 1, that is, w^T w = w_1^2 + w_2^2 + ... + w_p^2 = 1. Note that the rotation coefficients in Equation (1) satisfy this condition since sin^2 θ + cos^2 θ = 1. This is true for any orthogonal rotation in a p-dimensional space.

Considering Equations (A2) and (A3), we can write:

$$\mathrm{var}(t) = w^T X^T X w. \qquad (A4)$$

Assume now that w is the set of weights that gives maximum variance λ, so that

$$w^T X^T X w = \lambda. \qquad (A5)$$

Since we impose that w^T w = 1, we can write:

$$w^T X^T X w = \lambda \cdot 1 = \lambda w^T w, \qquad (A6)$$

from which:

$$w^T X^T X w - \lambda w^T w = 0, \qquad (A7)$$

and:

$$w^T \left( X^T X w - \lambda w \right) = 0. \qquad (A8)$$

Since w is non-zero, Equation (A8) is satisfied if and only if

$$X^T X w = \lambda w. \qquad (A9)$$

Equation (A9) is the standard eigenvalue-eigenvector problem: this relationship is satisfied only when the vector w is an eigenvector of the matrix X^T X, and λ is the associated eigenvalue. The p x p matrix C = X^T X is the sample variance-covariance matrix obtained from the data matrix X. It follows that:

a. The vector(s) w that maximize the variance are given by the eigenvectors of the covariance matrix. For instance, the first eigenvector of C gives the coefficients of the rotation of X that gives the linear combination of the original variables defining the first principal component.
b. The variance of each principal component is given by the eigenvalue associated to the corresponding eigenvector.

Using a more precise notation:

$$C = X^T X = U L U^T = \sum_{j=1}^{p} \lambda_j u_j u_j^T \qquad (A10)$$

where λ_1, λ_2, ..., λ_p are the p eigenvalues of C arranged in descending order on the p x p diagonal matrix L, and the orthonormal eigenvectors u_j (i.e., principal components) are collected as columns of U. The eigenvalue λ_j measures the variance of the corresponding j-th principal component, and the eigenvector u_j is the vector of the
loadings, that is, the coefficients needed to define the uncorrelated j-th principal component. The fraction of variance v_j explained by the j-th component is given by

$$v_j = \frac{\lambda_j}{\sum_j \lambda_j}. \qquad (A11)$$

Operatively, given X (centered), the principal components are found by eigendecomposition of the covariance matrix.
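A minimal MATLAB sketch of this construction, following Equations (A9)-(A11):

% Minimal sketch: principal components from the eigendecomposition of C = X'X
% for a data matrix X (n-by-p).
Xc = X - mean(X);                           % mean-center each column
C  = Xc' * Xc;                              % (unscaled) covariance matrix
[U, L] = eig(C);                            % eigenvectors and eigenvalues
[lambda, idx] = sort(diag(L), 'descend');   % order eigenvalues, largest first
U  = U(:, idx);                             % loadings, one column per component
T  = Xc * U;                                % scores: data in the rotated space
vj = lambda / sum(lambda);                  % variance explained, Equation (A11)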
Note that if the number of samples n is larger than or equal to the number of variables p (n ≥ p) and the matrix X is full rank (i.e., X has p linearly independent columns, p - 1 if X is centered as customary in PCA), then the covariance-correlation matrix C has rank p, and has p distinct eigenvalues. If the number of samples n is smaller than the number of variables p (n < p), then C has at most n - 1 nonzero eigenvalues, and thus there are at most n - 1 principal components. What happens if the eigenvalues of C are not distinct is discussed in reference 11.
