Module 12.01: Unsupervised Learning

Unsupervised Learning
Reference Books

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). New York: Springer.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer.

Johnson, R. A., & Wichern, D. W. (2002). Applied multivariate statistical analysis (5th ed.). Prentice Hall.
Unsupervised Learning

• Unsupervised learning is a class of algorithms that learn patterns from unlabeled data.
• Unsupervised learning is more subjective than supervised
learning, as there is no simple goal for the analysis, such as
prediction of a response.
• We will discuss two unsupervised learning methods:
1. Principal components analysis
2. Clustering
Principal Components Analysis

• PCA produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance and are mutually uncorrelated.
• Apart from producing derived variables for use in supervised
learning problems, PCA also serves as a tool for data
visualization.
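As a concrete illustration, here is a minimal sketch of obtaining a two-dimensional PCA representation of a data set. It assumes the scikit-learn and NumPy libraries, which the slides do not prescribe; the data here are purely synthetic.

```python
# Minimal sketch: project a toy data set onto its first two
# principal components, for use as derived variables or for plotting.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # toy data: 100 observations, 5 features

pca = PCA(n_components=2)          # keep only the first two components
scores = pca.fit_transform(X)      # n x 2 matrix of principal component scores

print(scores.shape)                # (100, 2): the low-dimensional representation
print(pca.components_.shape)       # (2, 5): the two loading vectors
```

The two score columns can be plotted directly or passed to a supervised method as derived features.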
Principal Components Analysis: details

• The first principal component of a set of features $X_1, X_2, \dots, X_p$ is the normalized linear combination of the features
$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p$$
that has the largest variance. By normalized, we mean that $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.
• We refer to the elements $\phi_{11}, \dots, \phi_{p1}$ as the loadings of the first principal component; together, the loadings make up the principal component loading vector $\phi_1 = (\phi_{11}\ \phi_{21}\ \cdots\ \phi_{p1})^T$.
PCA: example

[Figure: scatter plot of ad spending (vertical axis, 0 to 35) versus population (horizontal axis, 10 to 70) for 100 cities.]

The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles. The green solid line indicates the first principal component direction, and the blue dashed line indicates the second principal component direction.
Computation of Principal Components

• Suppose we have an $n \times p$ data set $\mathbf{X}$.
• Assume each variable in $\mathbf{X}$ has been centered to have mean zero.
• We look for the linear combination of the sample feature values of the form
$$z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \dots + \phi_{p1} x_{ip} \qquad (1)$$
for $i = 1, \dots, n$ that has the largest sample variance, subject to the constraint that $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.
• Since each of the $x_{ij}$ has mean zero, so does $z_{i1}$. Hence the sample variance of the $z_{i1}$ can be written as $\frac{1}{n}\sum_{i=1}^{n} z_{i1}^2$.
• Plugging in (1), the first principal component loading vector solves the optimization problem
$$\underset{\phi_{11},\dots,\phi_{p1}}{\text{maximize}} \; \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{p}\phi_{j1}x_{ij}\Big)^{2} \quad \text{subject to} \quad \sum_{j=1}^{p}\phi_{j1}^{2}=1.$$

• This problem can be solved via a singular-value decomposition of the matrix $\mathbf{X}$ (a code sketch follows at the end of this slide).
• We refer to $Z_1$ as the first principal component, with realized values $z_{11}, \dots, z_{n1}$.
• The second principal component is the linear combination of $X_1, \dots, X_p$ that has maximal variance among all linear combinations that are uncorrelated with $Z_1$.
• The second principal component scores $z_{12}, z_{22}, \dots, z_{n2}$ take the form
$$z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \dots + \phi_{p2} x_{ip},$$
where $\phi_2$ is the second principal component loading vector, with elements $\phi_{12}, \phi_{22}, \dots, \phi_{p2}$.
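A minimal NumPy sketch of this computation (assuming the data matrix has already been centered): the loading vectors are the right singular vectors of X, and the scores follow directly from the SVD.

```python
# Sketch: principal components via the singular-value decomposition
# of a centered data matrix X, as described on this slide.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)                   # center: each column now has mean zero

U, d, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt.T                          # columns are phi_1, phi_2, ...
scores = U * d                           # columns are the realized z values

z1, z2 = scores[:, 0], scores[:, 1]
print(z1.var())                          # largest achievable sample variance
print(np.corrcoef(z1, z2)[0, 1])         # ~0: Z1 and Z2 are uncorrelated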
Illustration

• USArrests data: For each of the fifty states in the United States, the data set contains the number of arrests per 100,000 residents for each of three crimes: Assault, Murder, and Other. We also record UrbanPop (the percent of the population in each state living in urban areas).
• The principal component score vectors have length n = 50, and the principal component loading vectors have length p = 4.
• PCA was performed after standardizing each variable to have mean zero and standard deviation one.
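The analysis on this slide can be reproduced with a short script. The sketch below assumes statsmodels (to fetch the R USArrests data set over the network) and scikit-learn; neither library is prescribed by the slides, and any local CSV copy of the data would work as well.

```python
# Sketch: PCA on the USArrests data after standardizing each variable
# to mean zero and standard deviation one.
from statsmodels.datasets import get_rdataset   # downloads the R data set
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

usarrests = get_rdataset("USArrests").data      # 50 states x 4 variables
X = StandardScaler().fit_transform(usarrests)   # standardize each column

pca = PCA()
scores = pca.fit_transform(X)
print(scores.shape)                  # (50, 4): score vectors have length n = 50
print(pca.components_.shape)         # (4, 4): loading vectors have length p = 4
```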
USArrests data: PCA plot

[Figure: biplot of the first two principal components of the USArrests data. State names mark each state's pair of scores; the horizontal axis is the First Principal Component and the vertical axis is the Second Principal Component, each spanning roughly −3 to 3.]

Figure details

The first two principal components for the USArrests data.

• The blue state names represent the scores for the first two principal components.
• The orange arrows indicate the first two principal component loading vectors (with axes on the top and right). For example, the loading for Other on the first component is 0.54, and its loading on the second principal component is 0.17 [the word Other is centered at the point (0.54, 0.17)].
• This figure is known as a biplot, because it displays both the principal component scores and the principal component loadings.
Figure details

• The first loading vector places approximately equal weight on Assault, Murder, and Other.
• This indicates that the first PC roughly measures the overall level of crime.
• The second loading vector places most of its weight on UrbanPop.
• The second PC therefore measures the level of urbanization.
• The crime-related variables are correlated with one another (a high murder rate is associated with a high assault rate).
• The UrbanPop variable is less correlated with the other three.
How to Determine Principal Components

Let $\Sigma$ be the covariance matrix of the random vector $\mathbf{X} = (X_1, X_2, \dots, X_p)^T$.

Let $\Sigma$ have eigenvalue–eigenvector pairs
$$(\lambda_1, e_1), (\lambda_2, e_2), \dots, (\lambda_p, e_p), \qquad \text{where } \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0.$$

The $i$th PC is given by
$$Y_i = e_i^T \mathbf{X} = e_{i1} X_1 + e_{i2} X_2 + \dots + e_{ip} X_p, \qquad i = 1, 2, \dots, p,$$

with the following properties:
$$\operatorname{Var}(Y_i) = e_i^T \Sigma\, e_i = \lambda_i, \qquad \operatorname{Cov}(Y_i, Y_k) = e_i^T \Sigma\, e_k = 0 \ \text{ for } i \neq k.$$
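A NumPy sketch of this recipe, using a small made-up covariance matrix (the matrix is purely illustrative; it is not from the slides):

```python
# Sketch: principal components from the eigen-decomposition of a
# covariance matrix Sigma, following the properties above.
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.0],         # illustrative covariance matrix
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # reorder so lambda_1 >= ... >= lambda_p
lam, E = eigvals[order], eigvecs[:, order]

for i in range(len(lam)):
    print(f"Y_{i+1}: Var = {lam[i]:.3f}, loadings = {E[:, i].round(3)}")

print(lam.sum(), np.trace(Sigma))          # total variance equals trace(Sigma)
```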
Another Interpretation of Principal Components

[Figure: a cloud of observations plotted against the first two principal components; the horizontal axis is the first principal component and the vertical axis is the second, each ranging from −1.0 to 1.0.]
Another Interpretation of Principal Components

• The first principal component loading vector has a very special property: it defines the line in p-dimensional space that is closest to the n observations (using average squared Euclidean distance as a measure of closeness); see the numerical check at the end of this slide.
• The notion of principal components as the dimensions that
are closest to the n observations extends beyond just the
first principal component.
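The "closest line" property can be verified numerically. This sketch (a toy experiment, not from the slides) compares the average squared distance from the observations to the first principal component direction against the distance to a random unit direction:

```python
# Sketch: the first loading vector spans the line with the smallest
# average squared Euclidean distance to the observations.
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0, 0], np.diag([5.0, 2.0, 0.5]), size=300)
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0]                             # first principal component direction

def avg_sq_dist(X, v):
    """Average squared distance from the rows of X to the line spanned by v."""
    proj = np.outer(X @ v, v)            # orthogonal projections onto the line
    return np.mean(np.sum((X - proj) ** 2, axis=1))

u = rng.normal(size=3)
u /= np.linalg.norm(u)                   # a random competing unit direction

print(avg_sq_dist(X, phi1))              # the smallest achievable value
print(avg_sq_dist(X, u))                 # larger for any other direction
```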
Scaling of the variables
• If the variables are in different units, scaling each to have
standard deviation equal to one is recommended.
• The variances of Murder, Other, Assault, and UrbanPop are 18.97, 87.73, 6945.16, and 209.5, respectively.
• If the variables are in the same units, scaling is not mandatory.
[Figure: two biplots of the first two principal components of the USArrests data, with variables scaled to unit standard deviation (left, "Scaled") and left unscaled (right, "Unscaled"). In the unscaled plot, Assault dominates the first principal component.]
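The effect of scaling is easy to demonstrate on synthetic data. This sketch (toy data, scikit-learn assumed) shows the first loading vector with and without standardization when one variable has a far larger variance than the others:

```python
# Sketch: without scaling, the highest-variance variable dominates
# the first loading vector; after scaling, the weights even out.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
t = rng.normal(size=(50, 1))                      # shared latent factor
noise = rng.normal(size=(50, 3)) * 0.5
X = (t + noise) * np.array([1.0, 3.0, 80.0])      # same signal, very different scales

pca = PCA(n_components=1)
pca.fit(X)
print(pca.components_[0].round(3))       # unscaled: weight piles onto column 3

pca.fit(StandardScaler().fit_transform(X))
print(pca.components_[0].round(3))       # scaled: weights are far more balanced
```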
Proportion of Variance Explained

• To understand the strength of each component, we measure the proportion of variance explained (PVE) by each one.
• The total variance present in a data set (assuming the variables have been centered to have mean zero) is defined as
$$\sum_{j=1}^{p} \operatorname{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2,$$
and the variance explained by the $m$th principal component is
$$\frac{1}{n} \sum_{i=1}^{n} z_{im}^2.$$
• Therefore, the PVE of the $m$th principal component is given by the positive quantity between 0 and 1
$$\text{PVE}_m = \frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}.$$
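The PVE formula translates directly into code; a minimal NumPy sketch on toy centered data:

```python
# Sketch: proportion of variance explained, computed from the scores
# z_im and the centered data exactly as in the formula above.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated toy data
X = X - X.mean(axis=0)                                    # center the variables

U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = U * d                                 # all principal component scores

total_var = np.sum(X ** 2)                # sum_j sum_i x_ij^2
pve = np.sum(Z ** 2, axis=0) / total_var  # PVE of each component

print(pve.round(3))                       # each value lies in (0, 1]
print(pve.sum())                          # 1.0: the PVEs sum to one
```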
Scree Plots

Left: the proportion of variance explained by each of the four principal components in the USArrests data.
Right: the cumulative proportion of variance explained by the four principal components in the USArrests data.
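A scree plot like the one described can be drawn with matplotlib. The PVE values below are approximately those reported for the USArrests data in James et al. (2013); treat them as illustrative.

```python
# Sketch: scree plot (left) and cumulative PVE plot (right).
import numpy as np
import matplotlib.pyplot as plt

pve = np.array([0.62, 0.25, 0.09, 0.04])   # approximate USArrests PVEs

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(range(1, 5), pve, "o-")
ax1.set_xlabel("Principal Component")
ax1.set_ylabel("Prop. Variance Explained")
ax2.plot(range(1, 5), np.cumsum(pve), "o-")
ax2.set_xlabel("Principal Component")
ax2.set_ylabel("Cumulative PVE")
plt.tight_layout()
plt.show()
```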
Example

Suppose the random variables $X_1, X_2, \dots, X_p$ have the covariance matrix $\Sigma$.

Determine the principal components.
