
Data Mining

Dr. Alexander Pelaez


Data Mining
Module 2.1: Principal Component Analysis
Why use Dimension Reduction
Large datasets present a unique challenge for analysis

Analyzing one variable at a time can be time consuming; more importantly, it may lead to inaccurate results and loses the richness of multidimensional data.

The difficulty in analyzing large sets stems from computing power as well as interpretation.

Humans can more easily process data in smaller dimensions as opposed to larger
dimensions.

Dimension Reduction - Example
This example uses the housing dataset1.
A cursory look at the data shows that it contains 9 columns, and 20640 observations.
While many datasets in practice will contain more than 9 columns, it still represents a
challenge to reduce the number of dimensions from 9 to something more meaningful.
The descriptive statistics from R's summary function display important characteristics of the dataset:
> summary(houses)

median_house_value median_income housing_median_age total_rooms total_bedrooms


Min. : 14999 Min. : 0.4999 Min. : 1.00 Min. : 2 Min. : 1.0
1st Qu.:119600 1st Qu.: 2.5634 1st Qu.:18.00 1st Qu.: 1448 1st Qu.: 295.0
Median :179700 Median : 3.5348 Median :29.00 Median : 2127 Median : 435.0
Mean :206856 Mean : 3.8707 Mean :28.64 Mean : 2636 Mean : 537.9
3rd Qu.:264725 3rd Qu.: 4.7432 3rd Qu.:37.00 3rd Qu.: 3148 3rd Qu.: 647.0
Max. :500001 Max. :15.0001 Max. :52.00 Max. :39320 Max. :6445.0
population households latitude longitude
Min. : 3 Min. : 1.0 Min. :32.54 Min. :-124.3
1st Qu.: 787 1st Qu.: 280.0 1st Qu.:33.93 1st Qu.:-121.8
Median : 1166 Median : 409.0 Median :34.26 Median :-118.5
Mean : 1425 Mean : 499.5 Mean :35.63 Mean :-119.6
3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.:37.71 3rd Qu.:-118.0
Max. :35682 Max. :6082.0 Max. :41.95 Max. :-114.3

1 - Pace, R. Kelley and Ronald Barry, “Sparse Spatial Autoregressions”, Statistics and Probability Letters, 33 (1997) 291-297.

PCA

Principal components analysis is concerned with explaining the variance-covariance structure of a set of variables. This explanation comes from a few linear combinations of the original variables.

PCA has two objectives:

1. “Data” reduction - moving from many original variables down to a few “composite”
variables (linear combinations of the original variables).
2. Interpretation - which variables play a larger role in the explanation of total variance.

Think of the new variables, called the principal components, as composite variables
consisting of a mixture of the original variables.
PCA and Variability

In PCA, the goal is to find a set of k principal components (composite variables) that:

● Is much smaller than the original set of p variables.


● Accounts for nearly all of the total sample variance

If these two goals can be accomplished, then the set of k principal components contains
almost as much information as the original p variables.

This means that the k principal components can then replace the p original variables.

The original data set is thereby reduced, from n measurements on p variables to n measurements on k variables.
PCA Features

PCA often reveals relationships between variables that were not previously suspected.

Because of such relationships, new interpretations of the data and variables often stem
from PCA.

PCA usually serves more as a means to an end rather than an end in itself. We can use the composite variables in multiple regression or cluster analysis.
PCA Details

Principal components, the composite variables created by combinations of the original variables, are formed by a set of linear combinations.

For p variables, a total of p components can be formed using PCA (although far fewer are usually used).

Letting Y stand for the new composite variables, or principal components, the linear
combinations look like:
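The equation image on the original slide is not reproduced in this text; a standard way to write the combinations, assuming p original variables $X_1, \dots, X_p$ and weights $a_{ij}$, is

$$Y_i = a_{i1}X_1 + a_{i2}X_2 + \cdots + a_{ip}X_p, \qquad i = 1, \dots, p.$$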
Variability

When examining data, it's important to understand the total variability. Recall from
previous models that variability, as measured by variance, is critical to understanding a
particular column of data, and thus the covariance matrix informs the analyst of the overall
variability of the data.

[Diagram: Data → Covariance Matrix]
Variability

When talking about the “joint” variability of a set of variables, the concept of the total sample variance is important.

Essentially, the total sample variance provides a way to describe the sample covariance matrix, S, with a single number.

Another way to characterize the sample variability is with the total variance, which equals the sum of the variances of the individual variables:

Like generalized variance, the total sample variance reflects the overall spread of the data.

Many multivariate techniques use the total sample variance in the computation of variance accounted for.
Variability

Assume we have a sample covariance matrix:

Recall that the variances of the variables are along the diagonal. Therefore, the total sample variance of the covariance matrix S can be calculated as the trace of the matrix. Thus:
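The matrix and equation from the original slide are not reproduced in this text; in standard notation, for a p x p sample covariance matrix S with diagonal entries $s_{11}, \dots, s_{pp}$ and eigenvalues $\lambda_1, \dots, \lambda_p$,

$$\text{total sample variance} = \operatorname{tr}(S) = s_{11} + s_{22} + \cdots + s_{pp} = \lambda_1 + \lambda_2 + \cdots + \lambda_p.$$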

And our aim will be to find principal components which would be able to account for as
much of this variance as possible.
Linear Combinations

In PCA, the linear combinations, each formed by weighting the original variables by a set of coefficients, are constructed so that the following conditions are met:

The variance of each successive component is smaller than the previous component:

The covariance (or correlation) between any two different principal components (i and j) is
zero:

The sum of the variances of the principal components is equal to the total sample variance:
Linear Combinations

Each principal component is formed by taking the values of the elements of the
eigenvectors as the weights of the linear combination.
If each eigenvector has elements eik:

Then the principal components are formed by:
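The formulas on the original slides are not reproduced in this text; a standard way to state the construction, assuming $e_i = (e_{i1}, \dots, e_{ip})'$ is the eigenvector of the covariance (or correlation) matrix associated with eigenvalue $\lambda_i$, is

$$Y_i = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p, \qquad \operatorname{Var}(Y_i) = \lambda_i, \qquad \operatorname{Cov}(Y_i, Y_j) = 0 \ \ (i \neq j),$$

with $\operatorname{Var}(Y_1) \geq \operatorname{Var}(Y_2) \geq \cdots \geq \operatorname{Var}(Y_p)$ and $\sum_i \operatorname{Var}(Y_i)$ equal to the total sample variance.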


Conceptual

The chart to the right shows only two dimensions of a data set.

The first principal component will be in the direction of maximum variance from the origin (shown in red).

Subsequent PCs are orthogonal (perpendicular) to the 1st PC and describe the maximum residual variance (shown in blue).

The goal is to explain/summarize the underlying variance-covariance structure of a large set of variables through a few linear combinations of these variables.
PCA Examination

Our final product might look like the following. Given a point in space from the data set (x1,x2)

(x1 , x2 )
PCA Examination

We examine the direction of the maximum variance

(x1 , x3 )
PCA Examination

Next we find the next perpendicular direction of maximum variance.

(x1 , x3 )
PCA Examination

Thus the new point can be represented as a new set of coordinates in a different "space": (x′1, x′2), along the axes associated with λ1 and λ2.
PCA assumptions

PCA does not specifically presume any type of data for the analysis.

Many people prefer to think of using PCA for only continuous variables (although there are
numerous examples of this not being the case).

If the variables happen to be MVN, then the principal components will also be MVN, with a
zero mean vector, and a covariance matrix that has zero off-diagonal elements and diagonal
elements equal to the eigenvalues of the principal components.
Running PCA

Using the housing dataset, run the PCA function from the FactoMineR package:

library(FactoMineR)
res.pca = PCA(X=housing, scale.unit=TRUE, ncp=5, graph=T)
summary(res.pca)

The result will be a number of different objects including different principal components.

The parameters above are the following:


X - the data set on which to compute the principal components
scale.unit - this option scales the data to unit variance
ncp - the number of components to keep in the results
graph - display the PCA graph for the first two components
PCA (standardized vs non-standardized)

Recall that PCA creates components based on the variability. There are two options to
create PCA, using the correlation matrix (standardized), or covariance matrix
(not-standardized).

[Diagram: Data → Covariance Matrix or Correlation Matrix]
PCA (standardized vs not-standardized)

In general, using the correlation matrix is better than using the covariance matrix
(scale.unit=TRUE)

Using the correlation matrix is equivalent to using the covariance matrix of standardized
observations.

The resulting principal component values (the Yi), are calculated based on the standardized
observations.

The results of such an analysis can produce different interpretations than a PCA with the
original variables.

It is advised that PCA be used with correlations when variables with widely different scales of measurement are included in the analysis.
PCA Results - Housing Data
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9
Variance 3.907 1.922 1.697 0.910 0.293 0.144 0.063 0.045 0.019
% of var. 43.407 21.361 18.854 10.113 3.257 1.601 0.697 0.495 0.215
Cumulative % of var. 43.407 64.767 83.622 93.735 96.992 98.592 99.290 99.785 100.000

Individuals (the 10 first)


Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
1 | 4.168 | -1.969 0.005 0.223 | -0.552 0.001 0.018 | 3.594 0.037 0.743 |
2 | 4.464 | 3.023 0.011 0.459 | -1.312 0.004 0.086 | 2.884 0.024 0.417 |
3 | 3.664 | -1.890 0.004 0.266 | -0.855 0.002 0.054 | 2.672 0.020 0.532 |
4 | 3.252 | -1.853 0.004 0.324 | -1.040 0.003 0.102 | 2.021 0.012 0.386 |
5 | 3.042 | -1.720 0.004 0.320 | -1.231 0.004 0.164 | 1.397 0.006 0.211 |
6 | 3.046 | -2.127 0.006 0.487 | -1.284 0.004 0.178 | 1.035 0.003 0.115 |
7 | 2.657 | -0.750 0.001 0.080 | -1.447 0.005 0.296 | 1.038 0.003 0.153 |
8 | 2.632 | -0.234 0.000 0.008 | -1.670 0.007 0.402 | 0.499 0.001 0.036 |
9 | 2.260 | -0.293 0.000 0.017 | -1.767 0.008 0.611 | -0.010 0.000 0.000 |
10 | 2.683 | 0.162 0.000 0.004 | -1.628 0.007 0.368 | 0.799 0.002 0.089 |

Variables
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
longitude | 0.150 0.572 0.022 | 0.919 43.891 0.844 | -0.322 6.093 0.103 |
latitude | -0.149 0.571 0.022 | -0.958 47.719 0.917 | 0.165 1.605 0.027 |
housing_median_age | -0.428 4.698 0.184 | -0.001 0.000 0.000 | 0.063 0.234 0.004 |
total_rooms | 0.958 23.503 0.918 | -0.084 0.369 0.007 | 0.110 0.717 0.012 |
total_bedrooms | 0.966 23.886 0.933 | -0.099 0.512 0.010 | -0.055 0.177 0.003 |
population | 0.930 22.129 0.864 | -0.063 0.208 0.004 | -0.102 0.619 0.011 |
households | 0.971 24.116 0.942 | -0.099 0.515 0.010 | -0.035 0.073 0.001 |
median_income | 0.111 0.315 0.012 | 0.246 3.150 0.061 | 0.875 45.076 0.765 |
median_house_value | 0.090 0.210 0.008 | 0.264 3.637 0.070 | 0.878 45.405 0.770 |
PCA Results - Principal Component Equations.
Variables
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
longitude | 0.150 0.572 0.022 | 0.919 43.891 0.844 | -0.322 6.093 0.103 |
latitude | -0.149 0.571 0.022 | -0.958 47.719 0.917 | 0.165 1.605 0.027 |
housing_median_age | -0.428 4.698 0.184 | -0.001 0.000 0.000 | 0.063 0.234 0.004 |
total_rooms | 0.958 23.503 0.918 | -0.084 0.369 0.007 | 0.110 0.717 0.012 |
total_bedrooms | 0.966 23.886 0.933 | -0.099 0.512 0.010 | -0.055 0.177 0.003 |
population | 0.930 22.129 0.864 | -0.063 0.208 0.004 | -0.102 0.619 0.011 |
households | 0.971 24.116 0.942 | -0.099 0.515 0.010 | -0.035 0.073 0.001 |
median_income | 0.111 0.315 0.012 | 0.246 3.150 0.061 | 0.875 45.076 0.765 |
median_house_value | 0.090 0.210 0.008 | 0.264 3.637 0.070 | 0.878 45.405 0.770 |

The principal components that are formed are identified by the dimensions above. The values under each dimension are the coefficients of the corresponding principal component:
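The equations from the original slide are not reproduced in this text; reading the coefficients off the Dim.1 column above (and treating the coordinates as the component weights, as these slides do), the first principal component would be written as

$$Y_1 = 0.150\,x_{\text{longitude}} - 0.149\,x_{\text{latitude}} - 0.428\,x_{\text{housing\_median\_age}} + 0.958\,x_{\text{total\_rooms}} + 0.966\,x_{\text{total\_bedrooms}}$$
$$\qquad + 0.930\,x_{\text{population}} + 0.971\,x_{\text{households}} + 0.111\,x_{\text{median\_income}} + 0.090\,x_{\text{median\_house\_value}},$$

with Y2 and Y3 defined analogously from the Dim.2 and Dim.3 columns.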
How many principal components?

One of the main questions of PCA is how many principal components are to be
used to describe the original data set.

There are some guidelines for selecting the number of principal components:

1) Eigenvalue Criterion (Kaiser Rule)


2) Proportion of Variance Explained
3) Scree plot criterion
4) Minimum Communality Criterion
Eigenvalue Criterion (Kaiser Rule)

After the principal components have been calculated, the number of meaningful PCs can be determined by retaining those with an eigenvalue greater than 1. This is synonymous with explaining at least one variable's worth of variability.
Unfortunately, when there are fewer than 20 variables, the eigenvalue criterion tends to select fewer principal components than may be necessary, while if there are more than 50, it tends to select too many.
Further, the analyst must use judgment when an eigenvalue is close to 1, such as .98.
Notice the eigenvalues of the first three dimensions are greater than 1

Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9
Variance 3.907 1.922 1.697 0.910 0.293 0.144 0.063 0.045 0.019
% of var. 43.407 21.361 18.854 10.113 3.257 1.601 0.697 0.495 0.215
Cumulative % of var. 43.407 64.767 83.622 93.735 96.992 98.592 99.290 99.785 100.000
Proportion of Variance Explained Criterion

Again, once the principal components have been calculated, we can use the percentage of variance explained by the individual components.
Generally, we want to keep the first n components whose cumulative proportion of variance explained falls somewhere between 65% and 85%. The actual number is up to the analyst and the domain in which the data is being examined.

Below we choose the first three components based on the cumulative % of variance

Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9
Variance 3.907 1.922 1.697 0.910 0.293 0.144 0.063 0.045 0.019
% of var. 43.407 21.361 18.854 10.113 3.257 1.601 0.697 0.495 0.215
Cumulative % of var. 43.407 64.767 83.622 93.735 96.992 98.592 99.290 99.785 100.000
Screeplot Criterion

A scree plot is a graphical examination of the eigenvalues across the components. Assessing the number of components in a scree plot is fairly straightforward, since the analyst simply needs to examine the point before the line begins to flatten out. Since a majority of the variance is explained by the first component, it should be intuitive; however, sometimes the plot isn't as clear.

The example to the right shows the point where the plot begins to level out, and thus the selection of the number of components. Thus, the scree plot indicates 4 components.

Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9
Variance 3.907 1.922 1.697 0.910 0.293 0.144 0.063 0.045 0.019
% of var. 43.407 21.361 18.854 10.113 3.257 1.601 0.697 0.495 0.215
Cumulative % of var. 43.407 64.767 83.622 93.735 96.992 98.592 99.290 99.785 100.000
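A minimal sketch of how such a scree plot could be produced from the FactoMineR result (assuming the res.pca object created earlier):

# Scree plot: plot the eigenvalue of each principal component
eig <- res.pca$eig[, 1]          # first column of $eig holds the eigenvalues
plot(seq_along(eig), eig, type = "b",
     xlab = "Component", ylab = "Eigenvalue")
abline(h = 1, lty = 2)           # reference line for the Kaiser (eigenvalue > 1) rule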
Minimal Communality Criterion

Communality represents the proportion of variance of a particular variable that is shared with the other variables. Communalities provide an indication of the importance of each variable in the PCA solution.

Variables with smaller communalities contribute less to the PCA solution than other variables and are indicative of a less important variable. Therefore, the analyst should be looking for variables with larger communalities, as these indicate that more of the variable's variance is explained.

In order to calculate the communality for a variable, simply square each of the component weights across the retained principal components and sum the values to obtain the communality.

Generally, a variable with a communality below 40-50% shares less than half of its variability with the other variables.
Minimal Communality Criterion - Example

In order to calculate the communality of a variable, square each weight across all components and sum the results. The table below shows the cumulative communality through each dimension.

> communality(pca_result)
                     Dim.1  Dim.2  Dim.3  Dim.4  Dim.5
longitude           0.0224 0.8662 0.9696 0.9710 0.9736
latitude            0.0223 0.9397 0.9669 0.9736 0.9751
housing_median_age  0.1836 0.1836 0.1875 0.9770 0.9989
total_rooms         0.9181 0.9252 0.9374 0.9382 0.9458
total_bedrooms      0.9331 0.9430 0.9460 0.9601 0.9661
population          0.8645 0.8685 0.8790 0.8922 0.9038
households          0.9421 0.9520 0.9533 0.9724 0.9755
median_income       0.0123 0.0729 0.8378 0.8777 0.9958
median_house_value  0.0082 0.0781 0.8486 0.8740 0.9948

Notice that only 18.7% of the variance of housing_median_age is shared through Dim.3.

Note: the communality function is available on the blog site.

Thus, looking at the table above, at Dim.4 the communalities are all above 50%. Therefore, an analyst may determine that 4 components are needed.
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9
Variance 3.907 1.922 1.697 0.910 0.293 0.144 0.063 0.045 0.019
% of var. 43.407 21.361 18.854 10.113 3.257 1.601 0.697 0.495 0.215
Cumulative % of var. 43.407 64.767 83.622 93.735 96.992 98.592 99.290 99.785 100.000
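The communality function itself is not shown in these slides (it is said to be available on the blog site); a minimal sketch of one way it could be written for a FactoMineR result, assuming pca_result holds the PCA output (res.pca in the earlier code) and that $var$coord contains the loadings, is:

# Cumulative communalities: squared loadings summed across dimensions
communality <- function(pca_res) {
  sq <- pca_res$var$coord^2     # squared loading of each variable on each dimension
  t(apply(sq, 1, cumsum))       # running total across Dim.1, Dim.2, ...
}
communality(pca_result)         # reproduces a table like the one shown above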
PCA Component Criterion - Summary

Given the four criteria, we have the following summary:

1) Eigenvalue Criterion – choose components with an eigenvalue of 1 or greater


2) Proportion of Variance Explained – choose components with a cumulative
proportion of variance greater than a specified amount 65-85%
3) Screeplot Criterion – use the screeplot to determine the point before the
screeplot line flattens.
4) Minimum Communality Criterion – choose the number of components where
each variable has a communality of at least 50%.

In the houses example, the eigenvalue and proportion of variance explained criteria indicate that three components are sufficient; however, the scree plot and communality criteria indicate 4.
PCA Interpretation

From the houses example, it can be shown that the four components chosen correspond to certain latent attributes. Therefore, these components can be classified as:

Dimension 1: Size
Dimension 2: Geography or location
Dimension 3: Income & value
Dimension 4: Age of house
> display_pc(pca_result)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
longitude . 0.9186 . . .
latitude . -0.9578 . . .
housing_median_age . . . 0.8885 .
total_rooms 0.9582 . . . .
total_bedrooms 0.9660 . . . .
population 0.9298 . . . .
households 0.9706 . . . .
median_income . . 0.8746 . .
median_house_value . . 0.8778 . .
PCA Interpretation

The analyst can also choose to plot the first two principal components, which normally provides significant insight into the effect of the individual variables.

The groupings of the variables along these two dimensions can be interpreted in a similar fashion to the previous interpretation, with the only exception being the inclusion of the lesser variables.

Finally, the percentage of variance explained by each dimension is labelled on the chart.
Rotation

In order to interpret the principal components more clearly, rotation of these can be performed.
There are a number of different rotations, but we will focus on varimax in this section.

Suppose we have a population measured on p random variables X1,…,Xp. Note that these random
variables represent the p-axes of the Cartesian coordinate system in which the population resides.

Our goal is to develop a new set of p axes (linear combinations of the original p axes) in the directions of greatest variability. Simply rotate the axes to identify the principal components.

[Figure: scatter of X1 versus X2 with the rotated axes overlaid]
Varimax Rotation

The varimax rotation allows the components to be rotated based on the maximum variance
between the components, thereby representing the data more clearly to each principal component.
Thus, the loadings of each variable become more clear with respect to each principal component.

In R, we will extract the loadings from the result and then apply the varimax rotation.

loadings.pcarot = varimax(pca_result$var$coord)$loadings
# this gets the varimax rotation of the loadings

# the rotated loadings can then be put back into pca_result and the new loadings plotted
pca_result$var$coord = loadings.pcarot
plot(pca_result, choix = "var")
PCA Interpretation - Rotation

The new rotation (right) shows clearly the maximized variance and orthogonality of the
variables on the two dimensions.

[Charts: Unrotated Dimensions (left) versus Varimax Rotated Dimensions (right)]
PCA Interpretation - Rotation

The new rotated loadings clearly separate the variables across 5 dimensions
> display_pc(pca_result)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
longitude . 0.9186 . . .
latitude . -0.9578 . . .
housing_median_age . . . 0.8885 .
total_rooms 0.9582 . . . .
total_bedrooms 0.9660 . . . .
population 0.9298 . . . .
households 0.9706 . . . .
median_income . . 0.8746 . .
median_house_value . . 0.8778 . .

> display_pc(pca_result)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
longitude . 0.9802 . . .
latitude . -0.9794 . . .
housing_median_age . . . 0.9729 .
total_rooms 0.9438 . . . .
total_bedrooms 0.9718 . . . .
population 0.9398 . . . .
households 0.9806 . . . .
median_income . . 0.9207 . .
median_house_value . . 0.9143 . .
PCA - New values

The formulas of the principal components can then be used to create new values. These values can then be used in other analyses, including correlations with other variables. From the previous slides, PC1 (Y1) is formed from the Dim.1 coefficients shown earlier.

Therefore, applying those coefficients to the values of a given observation, a new value (Y, or PC1) can be formed.

The first observation would yield a new value of 42307.166 for PC1, 119411.488 for PC2, and 397388.40 for PC3.
PCA - New values

By using matrix multiplication, the values for every observation can be easily computed. The original data set is multiplied by the PCA results, using only the dimensions desired. Thus, if only three dimensions are wanted, the R code would be:

> as.matrix(cadata_fixed) %*% as.matrix(pca_result$var$coord[,1:3])
           Dim.1      Dim.2     Dim.3
 [1,] 42307.166 119411.488 397388.40
 [2,] 43618.211  93676.707 315172.42
 [3,] 34040.219  92765.192 309213.68
 [4,] 33019.637  89912.699 299700.72
 [5,] 33527.932  90111.639 300523.49
 [6,] 26018.521  71023.451 236828.53
 [7,] 31447.907  78585.414 262804.58
 [8,] 27141.364  63216.888 212109.56
 [9,] 25263.030  59380.077 199142.19
[10,] 29801.267  68355.118 229407.46

It should be noted that the data is a data frame and the PCA coordinates are stored as a data frame as well. Therefore, both must be converted to a matrix to obtain the new values.
PCA Object From FactoMiner

The PCA function returns a number of items from its analysis.


name description
1 "$eig" "eigenvalues"
2 "$var" "results for the variables"
3 "$var$coord" "coord. for the variables"
4 "$var$cor" "correlations variables - dimensions"
5 "$var$cos2" "cos2 for the variables"
6 "$var$contrib" "contributions of the variables"
7 "$ind" "results for the individuals"
8 "$ind$coord" "coord. for the individuals"
9 "$ind$cos2" "cos2 for the individuals"
10 "$ind$contrib" "contributions of the individuals"
11 "$call" "summary statistics"
12 "$call$centre" "mean of the variables"
13 "$call$ecart.type" "standard error of the variables"
14 "$call$row.w" "weights for the individuals"
15 "$call$col.w" "weights for the variables"
PCA Object From FactoMiner

For variables and individual observations, PCA returns the following information

Coord - provides the coordinates of the variable or observation on the particular dimension; if using the correlation matrix, the values for the variables will lie in the unit circle (-1, 1).

Cor - provides the correlation of the variable or observation with the particular dimension; if using the correlation matrix, the values will be identical to Coord.

Cos2 - provides the "Quality of Representation". It is the squared cosine and estimates how well the dimension represents the variable or observation; the closer the value is to 1, the better the representation.

Contrib - provides the percent of contribution the variable or observation has on the
particular dimension.
Data Mining
Module 2.2: Exploratory Factor Analysis
Factor Analysis

Factor analysis, while sharing some similarities to PCA, has a different goal. PCA works to find orthogonal linear combinations for descriptive purposes and the creation of components. Factor analysis is a deeper analysis, since the focus is on the latent factors' relationship to the individual variables. It is easier to see the difference using a graph:

[Diagram: Factor Analysis (latent factor F and its indicators) versus Principal Component Analysis (component PC and its variables)]
Factor Analysis

In factor analysis we represent the variables y1, y2,... ,yp as linear combinations of a few random
variables f1, f2, . . . , fm (m < p) called factors.

The Common Factor Model is very similar to a linear multiple regression model:

Xij is the response of person i (i = 1,...,n) on variable j (j = 1,...,p).


μj is the mean of variable j.
fik is the factor score for person i on factor k (k = 1,...,m)
λjk is the loading of variable j onto factor k.
εij is the random error term for person i and variable j
Factor Analysis

In factor analysis we wish to represent x1, x2,... ,xp as linear combinations of a few random
variables f1, f2, . . . , fm (m < p) called factors.

The Common Factor Model represents each indicator xj as a function of the factors. Therefore:

Xij is the response of person i (i = 1,...,n) on variable j (j = 1,...,p).


μj is the mean of variable j.
fik is the factor score for person i on factor k (k = 1,...,m)
λjk is the loading of variable j onto factor k.
εij is the random error term for person i and variable j
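The equation image from the original slides is not reproduced in this text; using the definitions above, the common factor model is written as

$$X_{ij} = \mu_j + \lambda_{j1}f_{i1} + \lambda_{j2}f_{i2} + \cdots + \lambda_{jm}f_{im} + \varepsilon_{ij}.$$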
Factor Analysis

Thus the common factor model can be rewritten as:

Xi is the response vector of person i (i = 1,...,n) containing the variables j (j = 1,...,p).
μ is the mean vector containing all the means.
Fi is the factor score vector for person i containing the factor scores (k = 1,...,m).
Λ is the factor loading matrix.
εi is the random error vector.
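The matrix equation from the original slide is not reproduced in this text; consistent with the definitions above, it can be written as

$$X_i = \mu + \Lambda F_i + \varepsilon_i.$$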
Factor Analysis

Two types of factor analysis:

Exploratory Factor Analysis is performed when you are looking to determine the number of factors that exist and to understand the relationship of each variable to them. In other words, we are letting the math reveal insight to us.

Confirmatory Factor Analysis is conducted to validate a structure that has already been
presumed, either through theory or some other means, and we wish to measure the
relationship between each factor.

Structural Equation Modeling is a variant of CFA


EFA — Dimensions (Latent Variables)
Latent variables appear in ovals
Latent variables are not observed directly
Latent variables represent the shared variances of a set of indicators

[Diagram: example latent variables shown in ovals - Investment, Satisfaction, Attitude]
EFA — One Dimension

y1 - y7 are called indicators of the latent variable.
y1 - y7 could be 7 observed scores:
  Could be 7 individual items
  Could be 4 items, 2 scales, & 1 observer rating

[Diagram: latent variable Satisfaction with indicators y1-y7]
EFA — One Dimension
e1 - e7 are called errors or unique variances

e1 - e7 sometimes labeled as δ’s or ε’s

Arrow shows the errors explain part of the variances in the indicators

How is this error variance? How is this unique variance?

EFA — One Dimension

The satisfaction factor and the error term (ei) together explain the score on each observed variable.
All arrows go to the observed indicators.
The score on y1 therefore depends on the true level of the latent variable and the error/unique variance.

[Diagram: Satisfaction factor with arrows to indicators y1-y7; each indicator also receives an arrow from its error term e1-e7]
EFA — One Dimension (Satisfaction)
Errors/Unique variances may be correlated
◦ e1 and e6 might be measured by the same method; hence a methods effect
◦ e4 and e5 might both deal with joint satisfaction

The variance of the latent variable is represented by psi (Ψ), and the loadings are represented by lambda (λ).

[Diagram: Satisfaction factor with loadings λ on indicators y1-y7 and error terms e1-e7]
EFA — Two factors

[Diagram: two correlated factors (ϕ), Satisfaction with indicators y1-y7 and Ethical with indicators y1-y5; each indicator has its own error term]
Exploratory Factor Analysis
Due to the model parameterization and assumptions, the Common Factor Model
specifies the following covariance structure for the observable data:

Thus:

The model-specified covariance matrix, Σ, is something that illustrates the background assumptions of the factor model: that variable intercorrelation is a function of the factors in the model.

The Common Factor Model also specifies that the factor loadings give the
covariance between the observable variables and the unobserved factors:
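The equations referenced above are not reproduced in this text; the standard statements, consistent with the Σ = ΛΛ′ + Ψ form quoted later in these slides, are

$$\Sigma = \Lambda\Lambda' + \Psi \qquad \text{and} \qquad \operatorname{Cov}(X, f) = \Lambda.$$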
Exploratory Factor Analysis - Communality

Communalities are a way to express the explained variance of an item by the factors, i.e. how much of the variance of a single item is explained by the m factors.

Looking at the variance of each item, the communality is the part explained by the loadings (the lambdas), as sketched below.
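A sketch of the formula implied above (the equation image on the original slide is not reproduced), using the loading notation $\lambda_{jk}$ for variable $j$ on factor $k$:

$$h_j^2 = \lambda_{j1}^2 + \lambda_{j2}^2 + \cdots + \lambda_{jm}^2, \qquad \operatorname{Var}(X_j) = h_j^2 + \psi_j,$$

where $h_j^2$ is the communality and $\psi_j$ the unique (specific) variance.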


Consider the following example:

Variable        Communality
Climate            .795
Housing            .518
Health             .722
Transportation     .512
Education          .510

Assume our model had three factors. This would mean that, if we perform a multiple regression of Climate against the three common factors, we obtain an R² = 0.795, indicating that about 79% of the variation in Climate is explained by the factor model.

If we add up all the communalities (3.057) and divide by the number of items, we get .61, which means that 61% of the variance is explained by the three factors.
Exploratory Factor Analysis

We have thus far stated:

The proportion of variance of the ith variable contributed by the m common factors is called the communality.

The proportion of variance of the ith variable due to the specific factor is often called the uniqueness, or specific variance.
Exploratory Factor Analysis

Factor loadings are not unique, and thus factor loading matrices (Λ) can be rotated.

It can be shown that the same factor representation can be obtained by Λ or by Λ* where
Λ* = ΛT where T is an orthogonal matrix such that TT’ = I

We will show that rotation is done to make interpretation easier and more meaningful while preserving the factor model, since the implied covariance matrix is the same.
EFA-Principal Component Method

The principal component method for EFA takes a routine PCA and rescales the
eigenvalue weights to be factor loadings.

Recall that in PCA we created a set of new variables, Y1, . . . , Ym, that were called the
principal components.

These variables had variances equal to the eigenvalues of the covariance matrix, for example Var(Y1) = λ1 (where λ1 represents the largest eigenvalue from the PCA).

Now, we must rescale the eigenvalue weights so that they are now factor loadings (which
correspond to factors that have unit variances).

Estimated factor loadings are computed by:

And unique variances are found by:
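The formulas from the original slide are not reproduced in this text; one common way to write them, assuming the analysis uses the correlation matrix with eigenvalue/eigenvector pairs $(\lambda_k, e_k)$, is

$$\hat{\ell}_{jk} = \sqrt{\lambda_k}\, e_{jk} \qquad \text{(estimated loading of variable } j \text{ on factor } k\text{)}$$
$$\hat{\psi}_j = 1 - \sum_{k=1}^{m} \hat{\ell}_{jk}^{\,2} \qquad \text{(estimated unique variance; replace 1 with } s_{jj} \text{ when a covariance matrix is used)}$$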


EFA-Principal Factor Method

The Principal Factor Method uses an iterative procedure to arrive at the final solution of
estimates.

To begin, the procedure picks a set of communality values (h2) and places these values along
the diagonal of the correlation matrix Rc.

The method then iterates between the following two steps until the change in communalities
becomes negligible:

1. Using Rc, find the Principal Components method estimates of communalities.

2. Replace the communalities in Rc with the current estimates.


EFA-Maximum Likelihood

Similar to the Principal Factor Method, ML proceeds iteratively.

The ML method uses the density function for the normal distribution as the function to
optimize (find the parameter estimates that lead to the maximum likelihood value).

Recall that this was a function of:


The data (X).
The mean vector (μ).
The covariance matrix (Σ).

Here, Σ is formed by the model predicted matrix equation: Σ = ΛΛ′ + Ψ (although some
uniqueness conditions are specified).
EFA-Issues and Comparison

With iterative algorithms sometimes solutions do not exist. When this happens, it is typically
caused by what is called a “Heywood” case - an instance where the unique variance becomes
less than or equal to zero (communalities are greater than one). There are some methods to deal
with this, but we aren’t concerned with that right now.

If you run a sample analysis using both methods you will notice some very small differences.

What you may discover when fitting the PCA method and the ML method is that the ML factors sometimes account for less variance than the factors extracted through PCA.

This is because of the optimality criterion used for PCA, which attempts to maximize the variance
accounted for by each factor.

The ML, however, has an optimality criterion that minimizes the differences between predicted and
observed covariance matrices, so the extraction will better resemble the observed data.
EFA-Choosing Factors

The number of factors to extract is somewhat arbitrary.


We sometimes use a scree plot to check for the number of factors. In this case, the analyst looks for the point where the graph levels off, or where the change is no longer as significant as for the previous factors.

[Scree plot with the levelling-off point marked]
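A minimal sketch of how this could be examined in R (assuming the bfi correlation matrix, cormat, and the bfi_data data frame created later in these slides; fa.parallel comes from the psych package):

library(psych)
ev <- eigen(cormat)$values                                   # eigenvalues of the correlation matrix
plot(ev, type = "b", xlab = "Factor", ylab = "Eigenvalue")   # scree plot
fa.parallel(cormat, n.obs = nrow(bfi_data), fa = "fa")       # parallel analysis as a further check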
Factor Rotations

We rotate factors in order to interpret the results better.

Orthogonal rotations are rotations which preserve orthogonality (perpendicular components) and include:

Varimax
Equamax
Quartimax

Oblique rotations are rotations that will preserve the correlation between components and
include:
Promax
Procrustes
Harris-Kaiser
Factor Rotations – Orthogonal Rotation

An orthogonal rotation graphically allows us to see the factor loadings more clearly. We can perform a graphical rotation when the number of factors is two (m = 2). If m > 2, then we must use one of the mathematical techniques.

For the graphical representation, let's look at a set of factor loadings. Although we can make out which variables belong to which factors, it's not crystal clear.

Rencher, A. C. (2003). Methods of multivariate analysis (Vol. 492). John Wiley & Sons.
Factor Rotations – Orthogonal Rotation

Graphical rotation means we rotate the axes in a way, that will best summarize the
differences between the two factors (along the axes).
Factor Rotations – Orthogonal Rotation (Varimax)

Varimax is the most widely used rotation, which seeks rotated loadings that maximize the
variance of the squared loadings in each column. The rotation is performed in almost every
statistical application. Below is the result of a varimax rotation. Notice the loadings.

Rencher, A. C. (2003). Methods of multivariate analysis (Vol. 492). John Wiley & Sons.
Factor Rotations – Oblique Rotation (Promax)

An oblique rotation is not exactly a rotation, but it is popular convention to describe it that way. Basically, the axes are not kept perpendicular (orthogonal), which allows for more of a correlation between the factors.

Rencher, A. C. (2003). Methods of multivariate analysis (Vol. 492). John Wiley & Sons.
Factor Analysis – Example

Using the BFI dataset in the psych package, the following example demonstrates some of
the techniques discussed in factor analysis.

The BFI dataset has 25 personality self-report items taken from the International Personality Item Pool (ipip.ori.org), included as part of the Synthetic Aperture Personality Assessment (SAPA) web-based personality assessment project.

The dataset has 2800 observations; however, removing observations with NA values leaves 2236 observations.

The 25 variables are already categorized as A (Agreeableness), C (Conscientiousness), E (Extraversion), N (Neuroticism) and O (Openness).
Factor Analysis – Example

Develop a correlation matrix and test to ensure that the correlations are significant, i.e. not zero. This is accomplished using the Bartlett test. The null hypothesis is that the correlations are zero.

# Remove rows with missing values and keep only complete cases
bfi_data = bfi[complete.cases(bfi),]

cormat = cor(bfi_data[,c(-26,-27,-28)])
cortest.bartlett(cormat, n=nrow(bfi_data))

> cortest.bartlett(cormat, n=nrow(bfi_data))
$chisq
[1] 16484.78

$p.value
[1] 0

$df
[1] 300
Factor Analysis – Example

Once certain that the correlations are significant, the analyst can conduct the factor analysis. In R there are a number of different ways to run factor analysis:

fa() - psych package
factanal() - stats package
cfa() - lavaan package

This example uses factanal, since it standardizes the variables and provides an excellent summarization.

# run factanal on the bfi data without the last three columns;
# these are excluded since the aim is to identify the factors of the core attributes
f = factanal(bfi_data[,c(-26,-27,-28)], factors=5)
Factor Analysis – Example

The output of the factanal function first provides the uniqueness values. Uniqueness is the variance of a variable that is not shared with the other variables. Recall that communality is the variance shared with other variables, and thus uniqueness is equal to 1 - communality. The greater the uniqueness, the lower the relevance of the variable in the model.
Thus, below, A1 has a uniqueness of .84, meaning the variable's relevance in the factor model is low, since the shared variance (communality) is low.

Call:
factanal(x = bfi_data[, c(-26, -27, -28)], factors = 5)

Uniquenesses:
A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4
0.843 0.602 0.485 0.694 0.525 0.669 0.579 0.675 0.516 0.561 0.640 0.454 0.543 0.461
E5 N1 N2 N3 N4 N5 O1 O2 O3 O4 O5
0.585 0.277 0.341 0.474 0.502 0.657 0.676 0.725 0.516 0.758 0.714

Factor Analysis – Example

Next, the function returns the loadings.

The loadings allow the analyst to determine how the individual variables load on (correlate with) a factor.

Generally, loadings above .3 are considered good. Sometimes it may be lower. However, this assessment should be done after rotation.

Loadings:
   Factor1 Factor2 Factor3 Factor4 Factor5
A1 -0.375
A2  0.195  0.143  0.579
A3  0.280  0.113  0.649
A4  0.172  0.226  0.453 -0.132
A5 -0.118  0.337  0.581
C1  0.528  0.215
C2  0.617  0.137  0.125
C3  0.556  0.120
C4  0.222 -0.647
C5  0.266 -0.193 -0.572
E1 -0.578 -0.139
E2  0.227 -0.675 -0.100 -0.157
E3  0.498  0.326  0.311
E4 -0.123  0.602  0.390
E5  0.498  0.314  0.128  0.224
N1  0.814 -0.208
N2  0.783 -0.203
N3  0.717
N4  0.563 -0.374 -0.191
N5  0.521 -0.183  0.109 -0.150
O1  0.176  0.112  0.523
O2  0.173 -0.115  0.119 -0.467
O3  0.273  0.149  0.619
O4  0.211 -0.221  0.130  0.360
O5 -0.524
Factor Analysis – About Rotation

Much of the literature on factor analysis provides varying assessments of the rotation. However, a few
notes:
1) Generally, orthogonal rotation (like varimax) may be easier to interpret, but not always and there is no
consensus on this.
2) Oblique rotations maintain the correlations between factors, which may be useful in analysis, since orthogonality (i.e., forcing the covariance between factors to 0) is not imposed.
3) Orthogonal rotations will lose information (if factors have correlation) since the method forces
orthogonality, i.e. it loses the information related to the correlation between factors
4) If factors are not believed to be correlated, the orthogonal rotations will produce the same general
results.

Costello, Anna B. & Jason Osborne (2005). Best practices in exploratory factor analysis: four recommendations for getting the most from your analysis.
Practical Assessment Research & Evaluation, 10(7). Available online: http://pareonline.net/getvn.asp?v=10&n=7

Factor Analysis – Example

The first set of loadings in the previous slide uses the varimax rotation as the default. The numbers to the right also use the varimax rotation.

Using the print command with an associated cutoff will yield a much more readable result.

To the right, it can easily be seen that, by using the cutoff to not print values below .4, the factors become very clear.

f = factanal(bfi_data[,c(-26,-27,-28)], factors=5, rotation="varimax")
print(f, cutoff=.4)

Loadings:
   Factor1 Factor2 Factor3 Factor4 Factor5
A1
A2  0.579
A3  0.649
A4  0.453
A5  0.581
C1  0.528
C2  0.617
C3  0.556
C4 -0.647
C5 -0.572
E1 -0.578
E2 -0.675
E3  0.498
E4  0.602
E5  0.498
N1  0.814
N2  0.783
N3  0.717
N4  0.563
N5  0.521
O1  0.523
O2 -0.467
O3  0.619
O4
O5 -0.524
Factor Analysis – Example (No Rotation)

If the rotation is removed, the results of the factors are clearly not as interpretable.

However, the orthogonal rotation from the previous slide creates the separation in the loadings that makes the assessment clearer.

> f = factanal(bfi_data[,c(-26,-27,-28)], factors=6, rotation="none")
> print(f, cutoff=.4)

Loadings:
   Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
A1
A2
A3 -0.456
A4
A5 -0.534
C1 -0.458
C2 -0.504
C3
C4  0.451  0.532
C5  0.481
E1
E2  0.587
E3 -0.465  0.439
E4 -0.566
E5 -0.421  0.427
N1  0.589  0.577
N2  0.573  0.566
N3  0.519  0.492
N4  0.590
N5  0.420
O1 -0.405
O2  0.407
O3 -0.503
O4
O5  0.457
Factor Analysis – Example (Oblique Rotation)

If the rotation is changed to an oblique rotation, where the correlations between the factors are preserved, the general factors haven't changed.

The oblique and orthogonal rotations should generally preserve the factor structure in most cases, and when there is no correlation between the factors, the results should be very similar.

> f = factanal(bfi_data[,c(-26,-27,-28)], factors=5, rotation="promax")
> print(f, cutoff=.4)

Loadings:
   Factor1 Factor2 Factor3 Factor4 Factor5
A1
A2  0.582
A3  0.646
A4  0.453
A5  0.558
C1  0.549
C2  0.658
C3  0.593
C4 -0.675
C5 -0.581
E1 -0.632
E2 -0.715
E3  0.468
E4  0.605
E5  0.473
N1  0.909
N2  0.860
N3  0.682
N4
N5  0.433
O1  0.525
O2 -0.473
O3  0.629
O4
O5 -0.533
Factor Analysis – Example

Next, the analyst must determine the number of factors, using the methods described in the previous slides and analyzing the difference between 4, 5 and 6 factors.

In this example:
1) All eigenvalues (SS loadings) are greater than 1; at 7 factors, the eigenvalue drops below 1.
2) Cumulative variances are below .5, but the 5- and 6-factor solutions are above .4.
3) Analysis of the scree plot confirms 5 or 6 factors are necessary.
4) Only the 5-factor solution, however, has clear factors, with no overlapping of variables and no variables that are isolated in their own factor.

> f = factanal(bfi_data[,c(-26,-27,-28)], factors=4, rotation="promax")
> print(f, cutoff=.4)
               Factor1 Factor2 Factor3 Factor4
SS loadings      3.218   2.579   2.051   1.563
Proportion Var   0.129   0.103   0.082   0.063
Cumulative Var   0.129   0.232   0.314   0.376

> f = factanal(bfi_data[,c(-26,-27,-28)], factors=5, rotation="promax")
> print(f, cutoff=.4)
               Factor1 Factor2 Factor3 Factor4 Factor5
SS loadings      2.617   2.293   2.038   1.807   1.576
Proportion Var   0.105   0.092   0.082   0.072   0.063
Cumulative Var   0.105   0.196   0.278   0.350   0.413

> f = factanal(bfi_data[,c(-26,-27,-28)], factors=6, rotation="promax")
> print(f, cutoff=.4)
               Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
SS loadings      2.788   2.611   2.255   1.516   1.396   1.186
Proportion Var   0.112   0.104   0.090   0.061   0.056   0.047
Cumulative Var   0.112   0.216   0.306   0.367   0.423   0.470
Factor Analysis – Example

Since the previous example used oblique rotation, the correlations between the factors can be analyzed.

Factor Correlations:
         Factor1 Factor2 Factor3 Factor4 Factor5
Factor1   1.000  0.3698   0.376  0.1253   0.234
Factor2   0.370  1.0000   0.247 -0.0245  -0.088
Factor3   0.376  0.2468   1.000  0.2205   0.198
Factor4   0.125 -0.0245   0.221  1.0000   0.183
Factor5   0.234 -0.0880   0.198  0.1826   1.000

In this example, Factor 2 (E) and Factor 1 (N) are slightly correlated, and Factor 3 (C) and Factor 1 (N) are slightly correlated as well.

Factor one (Neuroticism) therefore is slightly correlated with Factor 2 (Extraversion).

Factor one (Neuroticism) therefore is slightly correlated with Factor 3 (Conscientiousness).
Factor Analysis – Example

The final piece to the output includes a chi-square test statistic. The chi-square test statistic
determines whether or not the model has a good fit.

The Null Hypothesis is that the model has perfect fit, or that the number of factors is sufficient. Thus if
the p-value > .05 the null hypothesis cannot be rejected.

In the previous example, the chi-square indicates that the model does not fit well, and indicates more
factors are needed. However, there are other ways to analyze the model.

Test of the hypothesis that 5 factors are sufficient.


The chi square statistic is 1357.5 on 185 degrees of freedom.
The p-value is 1.88e-177

The actual interpretation of this is that the covariance matrix implied by the factor model is different from the covariance matrix of the dataset.

Factor Analysis – Approximate Fit

There are some additional measures output by the other factor analysis function, fa() in the psych package, and also by cfa() in the lavaan package. These provide additional ways of assessing fit, especially when the chi-square test fails.

The mean item complexity is also known as the average of the Hoffman Index of complexity for each
item, which measures the average number of latent variables needed to account for the manifest
variables.
Mean item complexity = 1.4
Test of the hypothesis that 5 factors are sufficient.

The degrees of freedom for the null model are 300 and the objective function was 7.41
The degrees of freedom for the model are 185 and the objective function was 0.63
The root mean square of the residuals (RMSR) is 0.03
The df corrected root mean square of the residuals is 0.04
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
PA2 PA1 PA3 PA5 PA4
Correlation of (regression) scores with factors 0.93 0.91 0.88 0.86 0.84
Multiple R square of scores with factors 0.86 0.83 0.78 0.74 0.70
Minimum correlation of possible factor scores 0.72 0.67 0.57 0.48 0.40

Factor Analysis – Approximate Fit

RMSR – Root Mean Square of the Residuals. The smaller the better; .01 is generally good.

Fit based upon the off diagonal values - This is just (1 – Rmdiag): one minus the relative magnitude of the squared off-diagonal residuals to the squared off-diagonal original values. Closer to 1 is better.

Correlation of scores with factors provides an assessment in a similar manner to R².

Mean item complexity = 1.4


Test of the hypothesis that 5 factors are sufficient.

The degrees of freedom for the null model are 300 and the objective function was 7.41
The degrees of freedom for the model are 185 and the objective function was 0.63
The root mean square of the residuals (RMSR) is 0.03
The df corrected root mean square of the residuals is 0.04
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
PA2 PA1 PA3 PA5 PA4
Correlation of (regression) scores with factors 0.93 0.91 0.88 0.86 0.84
Multiple R square of scores with factors 0.86 0.83 0.78 0.74 0.70
Minimum correlation of possible factor scores 0.72 0.67 0.57 0.48 0.40

Factor Analysis – Approximate Fit

Other measures of Goodness of Fit (GOF) include:

RMSEA

Tucker Lewis Index

Comparative Fit Index

Factor Analysis – Final Diagram

[Path diagram of the final five-factor solution: the factors F1-F5 (Neuroticism, Extraversion, Conscientiousness, Agreeableness, Openness), each measured by five indicators; the promax factor correlations (e.g., .370, .376, .247, .221, .183), the rotated loadings (e.g., .909, -.715, .658, .646, .629) and the uniquenesses from the earlier factanal output are shown on the paths]
Data Mining
Module 2.3: Confirmatory Factor Analysis
Confirmatory Factor Analysis

Rather than trying to determine the number of factors, and subsequently, what the factors
mean (as in EFA), if you already know the structure of your data, you can use a confirmatory
approach.

Confirmatory factor analysis (CFA) is a way to specify which variables load onto which factors.

The loadings of all variables not related to a given factor are then set to zero.

For a reasonable number of parameters, the factor correlation can be estimated directly from
the analysis (rotations are not needed).
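As an illustration only (not shown in the original slides), a minimal sketch of how such a model could be specified with the lavaan package, using two factors from the bfi data introduced earlier:

library(lavaan)
# Hypothetical CFA sketch: agreeableness measured by A1-A5,
# conscientiousness measured by C1-C5; variables not listed under a
# factor are simply not loaded on it (their loadings are zero)
model <- '
  agree =~ A1 + A2 + A3 + A4 + A5
  consc =~ C1 + C2 + C3 + C4 + C5
'
fit <- cfa(model, data = bfi_data)
summary(fit, fit.measures = TRUE, standardized = TRUE)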
CFA — Two factors

[Diagram: two correlated factors (ϕ), Satisfaction with indicators y1-y7 and Ethical with indicators y1-y5; each indicator has its own error term]
Confirmatory Factor Analysis

Using an optimization routine (and some type of criterion function, such as ML), the
parameter estimates that minimize the function are found.

To assess the fit of the model, the predicted covariance matrix is subtracted from the
observed covariance matrix, and the residuals are summarized into fit statistics.

Based on the goodness-of-fit of the model, the result is taken as-is, or modifications are
made to the structure.

CFA is a measurement model - the factors are measured by the data.

SEM is a model for the covariance between the factors.


Visit, Follow, Share

Please subscribe to our YouTube channel

Visit our blog site for news on analytics and code samples
http://blogs.5eanalytics.com
