Lecture 12 - Unsupervised Learning - PCA
Unsupervised Learning
PCA
Mohamed Elshenawy
Zewail University of Science and Technology
Overview
• Unsupervised Learning
• Principal Components Analysis
Unsupervised Learning
Reading: Section 10.1 from the book An Introduction to Statistical Learning, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, 2013, ISBN 978-1-461-47137-0.
Supervised Learning
Unsupervised Learning
• We have only a set of features 𝑋1, 𝑋2, …, 𝑋𝑝 measured on n observations.
• We do not have an associated target (response) variable. Therefore, we are not interested in prediction.
• The goal is to discover interesting things about 𝑋1, 𝑋2, …, 𝑋𝑝:
  • How can we merge the given features (𝑋1, 𝑋2, …, 𝑋𝑝) to produce a smaller set of attributes that encode most of the information contained in the given features (produce 𝑍1, 𝑍2, …, 𝑍𝑞 where 𝑞 < 𝑝)? Dimensionality Reduction
  • Is there an informative way to visualize the data? Visualization
  • Can we define subgroups of the given observations that have similar characteristics (similar values of 𝑋1, 𝑋2, …, 𝑋𝑝, for instance)? Clustering
  • Can we learn the probability distribution that generates the data? Density Estimation
Example Applications
• A cancer researcher might assay gene expression levels in 100 patients with breast
cancer.
• A possible approach is to look for subgroups among the genes in order to obtain a
better understanding of the disease.
• Recommendation systems (recommend items based on the purchase histories of similar
shoppers):
• to identify groups of shoppers with similar browsing and purchase histories
• to identify items that are of particular interest to the shoppers within each group.
• Search engines:
• choose what search results to display to a particular individual based on the click
histories of other individuals with similar search patterns.
Principal Components Analysis
[Figure: observations plotted on the 𝑋1 and 𝑋2 axes]
• Principal component analysis (PCA) refers to the process by which principal components
are computed.
• Principal components allow us to summarize a dataset with a large set of correlated
variables using a smaller number of representative variables that collectively explain
most of the variability in the original set.
• Each of the dimensions found by PCA is a linear combination of the input features.
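As a concrete illustration of these points, the sketch below fits PCA to synthetic, correlated 2-D data; scikit-learn and the generated data are assumptions for illustration, not part of the lecture. The rows of `components_` are the linear combinations (loadings) that define each principal component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, correlated 2-D data (illustrative assumption, not from the lecture)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=200)
X = np.column_stack([x1, x2])

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                  # scores: the data expressed along the new dimensions

print(pca.components_)                    # each row is a loading vector: a linear combination of X1, X2
print(pca.explained_variance_ratio_)      # fraction of total variance captured by each component
```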
Example – 2-D
You need:
• The direction 𝑍1 along which the observations are highly variable.
• The representation of the data along the new dimension 𝑍1.
[Figure: 2-D scatter of the data on the 𝑋1 and 𝑋2 axes, and its projection onto the new dimension 𝑍1]
2-D Example - 2
PCA - Applications
Assumptions
Example – 3-D
• The first principal component loading vector solves the optimization problem
$$\max_{\phi_{11},\phi_{21},\ldots,\phi_{p1}} \; \frac{1}{n}\sum_{i=1}^{n}\left(z_{i1}-\bar{z}_{1}\right)^{2} \quad \text{subject to} \quad \sum_{j=1}^{p}\phi_{j1}^{2}=1,$$
where $z_{i1}=\sum_{j=1}^{p}\phi_{j1}x_{ij}$ is the score of the $i$-th observation on the first principal component.
• We constrain the loadings so that their sum of squares is equal to one, since otherwise setting
these elements to be arbitrarily large in absolute value could result in an arbitrarily large
variance.
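A minimal numerical check of this point, sketched with numpy and randomly generated data (assumptions, not from the slides): the variance of the scores grows without bound if the loadings are scaled up, which is why they are constrained to unit length, and among unit-length directions the first principal component attains the largest score variance.

```python
import numpy as np

# Random centered data with unequal variance along different directions (assumed example)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)

# First principal component loading vector: top eigenvector of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
phi1 = eigvecs[:, -1]                        # unit length by construction

def score_variance(phi):
    z = X @ phi                              # z_i = sum_j phi_j * x_ij
    return z.var()

print(score_variance(phi1))                  # variance of the scores along the first PC
print(score_variance(10 * phi1))             # 100x larger: why the unit-norm constraint is needed

# No other unit-length direction gives a larger score variance
for _ in range(5):
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    assert score_variance(v) <= score_variance(phi1) + 1e-12
```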
• Singular value decomposition (SVD) of the column-centered data matrix: $X = U S V^{T}$
• U: 𝑛 × 𝑛 orthogonal matrix
• S: 𝑛 × 𝑝 diagonal matrix (the singular values on its diagonal)
• V: 𝑝 × 𝑝 orthogonal matrix
• The principal component (PC) loading vectors are the columns of V.
• The PC scores are the columns of US (equivalently, XV).
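This relationship can be verified directly with numpy (a sketch; it uses the thin SVD and assumes X has already been column-centered):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)                   # PCA works with column-centered data

# Thin SVD: X = U S V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

loadings = Vt.T                          # columns of V = principal component loading vectors
scores = X @ loadings                    # PC scores Z = XV

assert np.allclose(scores, U * s)        # equivalently, the columns of U scaled by the singular values

# The variance explained by each component is proportional to the squared singular values
print(s**2 / np.sum(s**2))
```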
USArrests dataset
• For each of the 50 states in the United States (𝑛 = 50), the data set contains the
number of arrests per 100,000 residents for each of three crimes: Assault,
Murder, and Rape. In addition, the dataset has the UrbanPop attribute, which
indicates the percent of the population in each state living in urban areas.
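A sketch of reproducing this analysis in Python; fetching the data through statsmodels' get_rdataset (which downloads the R USArrests dataset) and the use of scikit-learn are assumptions, not part of the lecture.

```python
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fetch the USArrests data (50 states, 4 variables) from the R datasets collection
usarrests = sm.datasets.get_rdataset("USArrests").data

# Standardize each variable to mean 0 and standard deviation 1 before applying PCA
X = StandardScaler().fit_transform(usarrests[["Murder", "Assault", "UrbanPop", "Rape"]])

pca = PCA()
scores = pca.fit_transform(X)              # PC scores for each state

print(pca.components_)                     # loading vectors (rows): weights on Murder, Assault, UrbanPop, Rape
print(pca.explained_variance_ratio_)       # proportion of variance explained by each component
```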
Principal Components
Biplot
• Overlays a score plot (projecting
the observations onto the span of
the first two PCs, shown in blue)
and a loadings plot (shown in
orange) in a single graph.
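A minimal matplotlib sketch of such a biplot; the helper name `biplot` and the reuse of `scores` and `pca` from the USArrests sketch above are assumptions for illustration.

```python
import matplotlib.pyplot as plt

def biplot(scores, loadings, labels):
    """Overlay the PC1/PC2 score plot (points) with the loading vectors (arrows)."""
    fig, ax = plt.subplots()
    ax.scatter(scores[:, 0], scores[:, 1], s=10, color="tab:blue")
    scale = abs(scores[:, :2]).max()                 # stretch arrows to the range of the scores
    for j, name in enumerate(labels):
        ax.arrow(0, 0, loadings[j, 0] * scale, loadings[j, 1] * scale,
                 color="tab:orange", head_width=0.05 * scale)
        ax.annotate(name, (loadings[j, 0] * scale, loadings[j, 1] * scale), color="tab:orange")
    ax.set_xlabel("First principal component")
    ax.set_ylabel("Second principal component")
    return ax

# e.g. biplot(scores, pca.components_.T, ["Murder", "Assault", "UrbanPop", "Rape"]); plt.show()
```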
Biplot (cont.)
• We can see the first loading vector
places approximately equal weight
on Assault, Murder, and Rape, with
much less weight on UrbanPop.
• The second loading vector places
most of its weight on UrbanPop and
much less weight on the other three
features.
• This indicates that the crime-related
variables are correlated with each
other.
Biplot (cont.)
• States with large positive scores on the
first component, such as California,
Nevada, and Florida, have high crime rates.
• States like North Dakota, with negative
scores on the first component, have low
crime rates.
• California also has a high score on the
second component, indicating a high level
of urbanization.
• States close to zero on both components,
such as Indiana, have approximately
average levels of both crime and
urbanization.
• The results obtained when we perform PCA depend on whether the variables
have been individually scaled (each multiplied by a different constant).
• This is in contrast to some other supervised and unsupervised learning
techniques, such as linear regression.
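The effect is easy to see numerically; the sketch below assumes the `usarrests` data frame loaded in the earlier sketch. Without scaling, Assault, which is recorded in much larger numbers than the other variables, dominates the first loading vector.

```python
import numpy as np
from sklearn.decomposition import PCA

X_raw = usarrests[["Murder", "Assault", "UrbanPop", "Rape"]].to_numpy()

print(X_raw.var(axis=0))                       # Assault's variance dwarfs the other variables

# Unscaled PCA: the first loading vector is dominated by Assault
print(PCA().fit(X_raw).components_[0])

# Standardized PCA: each variable contributes on a comparable scale
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
print(PCA().fit(X_std).components_[0])
```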
Scree Plot
A scree plot helps us decide on the number of principal components required to visualize the data by examining the proportion of variance explained (PVE) by each component. We choose the smallest number of principal components required to explain a sizable amount of the variation in the data.
The proportion of variance explained by the $m$th principal component is
$$\mathrm{PVE\ of\ the\ }m\text{th PC} = \frac{\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{jm}\,x_{ij}\right)^{2}}{\sum_{j=1}^{p}\sum_{i=1}^{n}x_{ij}^{2}}$$
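A sketch of evaluating this formula directly, assuming the standardized matrix `X_std` from the earlier USArrests sketch, followed by the scree plot described above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA().fit(X_std)                  # X_std: column-centered (standardized) data matrix from above
phi = pca.components_                   # rows are the loading vectors phi_m
scores = X_std @ phi.T                  # z_im = sum_j phi_jm * x_ij

# PVE of the m-th PC: sum over i of z_im^2, divided by the total sum of squares of X
pve = (scores ** 2).sum(axis=0) / (X_std ** 2).sum()
print(pve)                              # agrees with pca.explained_variance_ratio_

# Scree plot: PVE against the component index
plt.plot(range(1, len(pve) + 1), pve, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Proportion of variance explained")
plt.show()
```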