Lecture 12 - Unsupervised Learning: PCA

The document provides an overview of unsupervised learning, focusing on Principal Components Analysis (PCA) as a method for dimensionality reduction and data visualization. It discusses the challenges of unsupervised learning compared to supervised learning, and outlines the applications of PCA in various fields. Additionally, it explains the process of finding principal components, including the use of singular value decomposition and the importance of scaling variables.


2020-12-21

Unsupervised Learning
PCA
Mohamed Elshenawy
Zewail University of Science and Technology

Overview

• Unsupervised Learning
• Principal Components Analysis


Unsupervised Learning
Section 10.1 of the book: An Introduction to Statistical Learning. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, 2013, ISBN: 978-1-4614-7137-0

Supervised Learning

• So far, we discussed supervised learning methods.


• In supervised learning, we have access to a set of $p$ features, $X_1, X_2, \ldots, X_p$, measured on $n$ observations, and a target (response) $T$ measured on those same $n$ observations.
• The goal is to predict the target variable $T$ using the input features $X_1, X_2, \ldots, X_p$.


Unsupervised Learning
• We have only a set of features $X_1, X_2, \ldots, X_p$ measured on $n$ observations.
• We do not have an associated target (response) variable $T$. Therefore, we are not interested in prediction.
• The goal is to discover interesting things about $X_1, X_2, \ldots, X_p$:
• How can we merge the given features ($X_1, X_2, \ldots, X_p$) to produce a smaller set of attributes that encode most of the information contained in the given features (produce $Z_1, Z_2, \ldots, Z_q$ where $q < p$)? Dimensionality Reduction.
• Is there an informative way to visualize the data? Visualization.
• Can we define subgroups, using the given observations, that have similar characteristics (similar values of $X_1, X_2, \ldots, X_p$, for instance)? Clustering.
• Can we learn the probability distribution that generates the data? Density Estimation.

The Challenge of Unsupervised Learning


• In supervised learning, the task is clear. We have:
• 1) a clear goal: predict the target variable using the input features;
• 2) a clear understanding of how to assess the performance of the model (using training and test error, cross-validation, etc.).
• In contrast, unsupervised learning is often much more challenging and the task is less well defined. There is no universally accepted mechanism for assessing the results of unsupervised learning (we do not know the true answer).
• It is typically performed as part of an exploratory data analysis.


Example Applications
• A cancer researcher might assay gene expression levels in 100 patients with breast
cancer.
• A possible approach is to look for subgroups among the genes in order to obtain a
better understanding of the disease.
• Recommendation systems (recommend items based on the purchase histories of similar
shoppers):
• to identify groups of shoppers with similar browsing and purchase histories
• to identify items that are of particular interest to the shoppers within each group.
• Search engines:
• choose what search results to display to a particular individual based on the click
histories of other individuals with similar search patterns.

Principal Components Analysis


Section 10.2 of the book: An Introduction to Statistical Learning. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, 2013, ISBN: 978-1-4614-7137-0


Which dimension has more information: $X_1$ or $X_2$?

[Figure: scatter plot of the observations in the $(X_1, X_2)$ plane.]

Principal Components Analysis

• Principal component analysis (PCA) refers to the process by which principal components
are computed.
• Principal components allow us to summarize a dataset with a large set of correlated
variables using a smaller number of representative variables that collectively explain
most of the variability in the original set.
• Each of the dimensions found by PCA is a linear combination of the input features.
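As a quick illustration of these ideas (my own sketch, not code from the lecture; the synthetic data and names below are invented), scikit-learn's PCA computes the loading vectors and the new low-dimensional representation directly:

```python
# Minimal PCA sketch on synthetic data (assumes NumPy and scikit-learn are installed).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: n = 100 observations of p = 3 correlated features.
X = rng.normal(size=(100, 3))
X[:, 2] = 0.8 * X[:, 0] + 0.2 * X[:, 2]   # make the third feature depend on the first

pca = PCA(n_components=2)                 # keep q = 2 < p = 3 representative variables
Z = pca.fit_transform(X)                  # Z holds the new variables Z1, Z2 for each observation

print(pca.components_)                    # each row is a loading vector (a linear combination of the inputs)
print(pca.explained_variance_ratio_)      # fraction of the total variability explained by each component
```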


Example – 2-D

You need:
• The direction of $X'$ (the direction along which the observations are highly variable).
• A representation of the data along the new dimension.

[Figure: 2-D scatter of the observations in the $(X_1, X_2)$ plane, with the new axis $Z_1$ drawn along the direction of greatest variability.]

Visualize the data using 1-D

[Figure: the observations plotted along the single new dimension $Z_1$.]


2-D Example – 2

[Figure omitted.]

PCA - Applications

• Useful for preprocessing (reducing the dimensionality of the dataset).
• Can be used for data visualization: if we can obtain a two-dimensional representation of the data that captures most of the information, then we can plot the observations in this low-dimensional space.


Assumptions

1. Linear relationship between the data and the learned representation
2. Data is assumed to be continuous
3. Variation contains information


Example – 3-D

How to find this plane?

[Figure: 3-D cloud of observations with a 2-D plane fitted through it.]


How to find the first principal component

• The first principal component ($Z_1$) of a set of features $X_1, X_2, \ldots, X_p$ is the normalized linear combination of the features that has the largest variance:
$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p$$
• By normalized, we mean $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.
• We constrain the loadings so that their sum of squares is equal to one.
• $\phi_{11}, \phi_{21}, \ldots, \phi_{p1}$ are referred to as the loadings of $Z_1$ (the first component).
• The loadings make up the principal component loading vector $\phi_1 = (\phi_{11}, \phi_{21}, \ldots, \phi_{p1})^T$.
• To find the loading vector, we choose the values that produce the largest variance (an optimization problem).


How to find the first principal component (cont.)

• The first principal component loading vector solves the optimization problem
$$\max_{\phi_{11}, \phi_{21}, \ldots, \phi_{p1}} \; \frac{1}{n} \sum_{i=1}^{n} (z_{i1} - \bar{z}_1)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1$$

• We refer to $z_{11}, \ldots, z_{n1}$ as the scores of the first principal component.


How to find the first principal component (cont.)

$$\bar{z}_1 = \frac{1}{n} \sum_{i=1}^{n} z_{i1} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} \phi_{j1} x_{ij}$$
• If the data are centered so that $\frac{1}{n} \sum_{i=1}^{n} x_{ij} = 0$ for each feature $j$, then $\bar{z}_1 = 0$ and the problem becomes
$$\max_{\phi_{11}, \phi_{21}, \ldots, \phi_{p1}} \; \frac{1}{n} \sum_{i=1}^{n} z_{i1}^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1$$
$$\max_{\phi_{11}, \phi_{21}, \ldots, \phi_{p1}} \; \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{j1} x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1$$

• We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.


How to find the first principal component (cont.)

$$\max_{\phi_{11}, \phi_{21}, \ldots, \phi_{p1}} \; \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{j1} x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1$$

• The problem can be solved using an eigen decomposition, a standard technique in linear algebra.
• The loading vector $\phi_1$ defines a direction in feature space along which the data vary the most. If we project the $n$ data points $x_1, \ldots, x_n$ onto this direction, the projected values are the principal component scores $z_{11}, \ldots, z_{n1}$ themselves.
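As a concrete sketch of this eigen-decomposition route (my own illustration on invented data, not code from the lecture), the first loading vector can be taken as the eigenvector of the sample covariance matrix with the largest eigenvalue:

```python
# Sketch: first principal component via eigen decomposition of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 0.3]])
X = rng.normal(size=(100, 3)) @ A         # synthetic correlated data, n = 100, p = 3

Xc = X - X.mean(axis=0)                   # center each feature so that its mean is zero
C = Xc.T @ Xc / Xc.shape[0]               # sample covariance matrix (1/n convention, as in the slides)

eigvals, eigvecs = np.linalg.eigh(C)      # eigen decomposition of the symmetric matrix C
phi1 = eigvecs[:, np.argmax(eigvals)]     # loading vector = eigenvector with the largest eigenvalue

z1 = Xc @ phi1                            # scores of the first principal component
print(phi1)
print(z1.var(), eigvals.max())            # the score variance (1/n) equals the largest eigenvalue
```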


Finding the second principal component

• The second principal component is the linear combination of $X_1, X_2, \ldots, X_p$ that has maximal variance out of all linear combinations that are uncorrelated with $Z_1$.
• The second principal component scores $z_{12}, z_{22}, \ldots, z_{n2}$ take the form
$$Z_2 = \phi_{12} X_1 + \phi_{22} X_2 + \cdots + \phi_{p2} X_p$$
• where $\phi_2 = (\phi_{12}, \phi_{22}, \ldots, \phi_{p2})^T$ is the second principal component loading vector.
• It turns out that constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the direction $\phi_2$ to be orthogonal to the direction $\phi_1$.


Finding additional principal components

• We can define additional principal components in an incremental fashion by choosing a new direction that:
• is orthogonal to the principal components already considered;
• maximizes the projected variance amongst all possible directions.


Singular Value Decomposition

• You can perform PCA by using the singular value decomposition of the (column-centered) data matrix:
$$X = U S V^T$$
• $U$: an $n \times n$ orthogonal matrix
• $S$: an $n \times p$ diagonal matrix (the singular values)
• $V$: a $p \times p$ orthogonal matrix
• The principal component directions (loading vectors) are the columns of $V$.
• The principal component scores are the columns of $US$ (equivalently, the projections $XV$).
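A minimal NumPy sketch of the SVD route (my own illustration; it assumes the data matrix has been column-centered, and the toy data is invented):

```python
# Sketch: PCA via the singular value decomposition of the centered data matrix.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))              # toy data: n = 50 observations, p = 4 features
Xc = X - X.mean(axis=0)                   # center each column

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

loadings = Vt.T                           # columns of V: the principal component directions
scores = U * s                            # columns of US: the principal component scores (equals Xc @ Vt.T)

explained_var = s**2 / Xc.shape[0]        # variance captured by each component (1/n convention)
print(loadings[:, 0])
print(explained_var)
```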


USArrests dataset

• For each of the 50 states in the United States (𝑛 = 50), the data set contains the
number of arrests per 100,000 residents for each of three crimes: Assault,
Murder, and Rape. In addition, the dataset has the UrbanPop attribute, which
indicates the percent of the population in each state living in urban areas.
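To reproduce this example in Python, one option (a sketch with assumptions: statsmodels' `get_rdataset` downloads `USArrests` from the public Rdatasets repository, so an internet connection is needed) is:

```python
# Sketch: load the USArrests data and run PCA on the standardized variables.
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

usarrests = sm.datasets.get_rdataset("USArrests").data            # 50 states x 4 variables
usarrests = usarrests[["Murder", "Assault", "UrbanPop", "Rape"]]   # keep the four numeric variables

print(usarrests.var())                     # very different variances (see the scaling slides below)

X = StandardScaler().fit_transform(usarrests)   # scale each variable to mean 0, standard deviation 1
pca = PCA()
scores = pca.fit_transform(X)              # principal component scores for the 50 states

print(pca.components_)                     # the loading vectors, one per row
```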


Principal Components

[Figure/table omitted: the principal components computed for the USArrests data.]

Biplot
• Overlays a score plot (projecting
the observations onto the span of
the first two PCs, shown in blue)
and a loadings plot (shown in
orange) in a single graph.
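A rough matplotlib sketch of such a biplot (my own illustration, continuing from the `usarrests`, `pca`, and `scores` objects in the sketch above; the arrow scaling factor is arbitrary):

```python
# Sketch: biplot overlaying the score plot (blue) and the loadings plot (orange).
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 7))

# Score plot: each state projected onto the first two principal components.
ax.scatter(scores[:, 0], scores[:, 1], color="blue", s=10)
for name, (z1, z2) in zip(usarrests.index, scores[:, :2]):
    ax.annotate(str(name), (z1, z2), fontsize=7, color="blue")

# Loadings plot: one arrow per original variable, scaled up for visibility.
scale = 3.0
for j, var in enumerate(usarrests.columns):
    ax.arrow(0, 0, scale * pca.components_[0, j], scale * pca.components_[1, j],
             color="orange", head_width=0.05)
    ax.annotate(var, (scale * pca.components_[0, j], scale * pca.components_[1, j]), color="orange")

ax.set_xlabel("First principal component")
ax.set_ylabel("Second principal component")
plt.show()
```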


Biplot (cont.)
• We can see the first loading vector
places approximately equal weight
on Assault, Murder, and Rape, with
much less weight on UrbanPop.
• The second loading vector places
most of its weight on UrbanPop and
much less weight on the other three
features.
• This indicates that the crime-related
variables are correlated with each
other.


Biplot (cont.)
• States with large positive scores on the
first component, such as California,
Nevada and Florida, have high crime rates
• States like North Dakota, with negative
scores on the first component, have low
crime rates.
• California also has a high score on the second component, indicating a high level of urbanization.
• States close to zero on both components,
such as Indiana, have approximately
average levels of both crime and
urbanization.


Scaling the variables

• The results obtained when we perform PCA depend on whether the variables
have been individually scaled (each multiplied by a different constant).
• This is in contrast to some other supervised and unsupervised learning
techniques, such as linear regression.


Scaling the variables


• Murder, Rape, and Assault are reported as the number of occurrences per 100,000 people, and UrbanPop is the percentage of the state’s population that lives in an urban area (different units).
• These four variables have variance 18.97, 87.73, 6945.16, and 209.5, respectively.
• If we perform PCA on the unscaled variables, then the first principal component
loading vector will have a very large loading for Assault, since that variable has by
far the highest variance.


Scaling the variables (Cont.)

[Figure omitted: biplot with the variables scaled to have unit standard deviations.]

Scaling the variables (Cont.)

• Because it is undesirable for the principal components obtained to depend on an arbitrary choice of scaling, we typically scale each variable to have standard deviation one before we perform PCA.
• In certain settings, the variables may be measured in the same units. In this case, we might not wish to scale the variables to have standard deviation one before performing PCA.
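To make the effect of scaling concrete, here is a short sketch (continuing from the `usarrests` DataFrame loaded earlier) that compares the first loading vector with and without standardizing the variables:

```python
# Sketch: first loading vector on the raw vs. standardized USArrests variables.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

pca_raw = PCA().fit(usarrests)                                    # unscaled variables
pca_std = PCA().fit(StandardScaler().fit_transform(usarrests))    # each variable scaled to sd 1

# Unscaled: the first loading vector is dominated by Assault, the highest-variance variable.
print(dict(zip(usarrests.columns, pca_raw.components_[0].round(3))))

# Scaled: the weight is spread far more evenly across the four variables.
print(dict(zip(usarrests.columns, pca_std.components_[0].round(3))))
```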


Scree Plot

Helps us to decide on the number of principal components required to visualize the data. By examining the plot, we choose the smallest number of principal components that are required in order to explain a sizable amount of the variation in the data.

[Figure: scree plot showing the proportion of variance explained by each principal component.]
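A minimal sketch of such a plot (continuing from the fitted `pca` object above, and assuming matplotlib is available):

```python
# Sketch: scree plot of the proportion of variance explained (PVE) by each component.
import numpy as np
import matplotlib.pyplot as plt

pve = pca.explained_variance_ratio_            # proportion of variance explained per component
components = np.arange(1, len(pve) + 1)

plt.plot(components, pve, marker="o", label="PVE")
plt.plot(components, np.cumsum(pve), marker="s", label="Cumulative PVE")
plt.xlabel("Principal component")
plt.ylabel("Proportion of variance explained")
plt.ylim(0, 1)
plt.legend()
plt.show()
```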


Proportion of variance explained (PVE)

$$\text{PVE of the } m\text{th PC} = \frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}$$
(assuming the variables have been centered to have mean zero)
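A small sketch that evaluates this formula directly and checks it against scikit-learn (continuing from the standardized matrix `X` and the fitted `pca` object in the USArrests sketch above):

```python
# Sketch: proportion of variance explained, computed directly from the formula.
import numpy as np

Xc = X - X.mean(axis=0)                    # X was standardized earlier; re-center to be safe

total_ss = np.sum(Xc**2)                   # denominator: total sum of squares over all features
pc_scores = Xc @ pca.components_.T         # z_im = sum_j phi_jm * x_ij, for every component m
pve = np.sum(pc_scores**2, axis=0) / total_ss

print(pve)                                 # matches pca.explained_variance_ratio_
```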
