Chapter 3
Hank Roark
Senior Data Scientist at Boeing
Two methods of clustering
Two methods of clustering - finding groups of homogeneous items
Aid in visualization
UNSUPERVISED LEARNING IN R
Dimensionality reduction
A popular method is principal component analysis (PCA)
PCA intuition
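One way to make the intuition concrete, offered as a minimal sketch and not as the course's own derivation: the principal components of centered data are the eigenvectors of its covariance matrix, and the component variances are the eigenvalues. The names `x_centered`, `eig`, and `pr_check` are illustrative. Note that `scale.` is the full name of the `prcomp()` argument that these slides abbreviate as `scale`.

```r
# Numeric measurements only (drop the Species factor in column 5)
x <- as.matrix(iris[-5])

# Center each column, then eigendecompose the covariance matrix
x_centered <- scale(x, center = TRUE, scale = FALSE)
eig <- eigen(cov(x_centered))

# prcomp() on the same data, for comparison
pr_check <- prcomp(x, center = TRUE, scale. = FALSE)

# Eigenvalues equal the component variances (squared standard deviations)
all.equal(eig$values, pr_check$sdev^2)

# Eigenvectors match the loadings up to the sign of each column
all.equal(abs(eig$vectors), abs(unname(pr_check$rotation)))
```

The sign of each component is arbitrary, which is why the loadings are compared by absolute value.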
Visualization of high dimensional data
Visualization
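As a rough illustration of using PCA for visualization, the four iris measurements can be projected onto the first two principal components and drawn as an ordinary scatter plot. This is a sketch under the assumption that coloring by `Species` is what we want to see; the species labels are not used in fitting the PCA itself.

```r
# Fit PCA on the four numeric columns of iris
pr.iris <- prcomp(iris[-5], center = TRUE, scale. = FALSE)

# Scores: each observation's coordinates on the principal components
scores <- pr.iris$x

# Scatter plot of the first two components, colored by species
plot(scores[, 1], scores[, 2],
     col = as.integer(iris$Species),
     xlab = "PC1", ylab = "PC2",
     main = "iris on the first two principal components")
legend("topright", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 1)
```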
PCA in R
pr.iris <- prcomp(x = iris[-5],
                  scale = FALSE,
                  center = TRUE)
summary(pr.iris)
Importance of components:
                          PC1     PC2    PC3     PC4
Standard deviation     2.0563 0.49262 0.2797 0.15439
Proportion of Variance 0.9246 0.05307 0.0171 0.00521
Cumulative Proportion  0.9246 0.97769 0.9948 1.00000
Visualizing and interpreting PCA results
Biplot
Scree plot
Biplots in R
# Creating a biplot
pr.iris <- prcomp(x = iris[-5],
                  scale = FALSE,
                  center = TRUE)
biplot(pr.iris)
Scree plots in R
# Getting proportion of variance for a scree plot
pr.var <- pr.iris$sdev^2
pve <- pr.var / sum(pr.var)
# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
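A companion view that is often drawn next to a scree plot is the cumulative proportion of variance explained. The following is a self-contained sketch that refits the same iris model and reuses the `pr.var` and `pve` names from the snippet above; the axis label is an illustrative choice.

```r
# Refit the PCA so this snippet runs on its own
pr.iris <- prcomp(iris[-5], center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component
pr.var <- pr.iris$sdev^2
pve <- pr.var / sum(pr.var)

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
```

The values of `cumsum(pve)` correspond to the "Cumulative Proportion" row printed by `summary(pr.iris)`.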
Practical issues with PCA
Scaling the data
Missing values: drop observations with missing values
Categorical data: do not use categorical data features
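The two data-preparation points above (dropping incomplete observations and leaving out categorical features) might look like this in practice. This is only a sketch: `iris_na` is a hypothetical data set made by blanking a few iris values, and `numeric_cols`, `x_complete`, and `pr_clean` are illustrative names.

```r
# Hypothetical example data: iris with a few values set to NA
iris_na <- iris
iris_na[c(3, 57, 120), "Sepal.Length"] <- NA

# Keep only numeric features (this drops the categorical Species column)
numeric_cols <- sapply(iris_na, is.numeric)
x <- iris_na[, numeric_cols]

# Drop observations (rows) that contain any missing value
x_complete <- x[complete.cases(x), ]

# PCA on the cleaned, numeric-only data
pr_clean <- prcomp(x_complete, center = TRUE, scale. = FALSE)
summary(pr_clean)
```

`prcomp()` stops with an error when its input contains `NA` values, so the `complete.cases()` step has to come before the fit.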
mtcars dataset
data(mtcars)
head(mtcars)
Scaling
# Means and standard deviations vary a lot
round(colMeans(mtcars), 2)
round(apply(mtcars, 2, sd), 2)
Importance of scaling data
Scaling and PCA in R
prcomp(x, center = TRUE, scale = FALSE)
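To see why the scaling argument matters, one option is to fit `mtcars` both ways and compare. This is a sketch with illustrative names (`pr.unscaled`, `pr.scaled`, `pve.unscaled`, `pve.scaled`); `scale.` is the full name of the argument written as `scale` above.

```r
data(mtcars)

# Without scaling, variables with large variances (such as disp and hp)
# dominate the first component
pr.unscaled <- prcomp(mtcars, center = TRUE, scale. = FALSE)

# With scaling, each variable is standardized to unit variance first
pr.scaled <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component in the two fits
pve.unscaled <- pr.unscaled$sdev^2 / sum(pr.unscaled$sdev^2)
pve.scaled <- pr.scaled$sdev^2 / sum(pr.scaled$sdev^2)
round(c(unscaled = pve.unscaled[1], scaled = pve.scaled[1]), 3)

# Variables with the largest absolute loadings on the unscaled PC1
head(sort(abs(pr.unscaled$rotation[, 1]), decreasing = TRUE), 3)
```

In the unscaled fit the first component mostly tracks the variables measured in the largest units, which is usually an argument for `scale. = TRUE` when features are on different scales.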
Additional uses of PCA and wrap-up
Dimensionality reduction
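One common way to use PCA for dimensionality reduction is to keep only as many components as are needed to retain a chosen share of the variance. The sketch below assumes a 95% threshold, which is an arbitrary illustrative choice, and the names `threshold`, `k`, and `iris.reduced` are hypothetical.

```r
# Fit PCA and compute the proportion of variance explained
pr.iris <- prcomp(iris[-5], center = TRUE, scale. = FALSE)
pve <- pr.iris$sdev^2 / sum(pr.iris$sdev^2)

# Assumed threshold: retain at least 95% of the total variance
threshold <- 0.95
k <- which(cumsum(pve) >= threshold)[1]

# Reduced data: the first k columns of the scores matrix
iris.reduced <- pr.iris$x[, seq_len(k), drop = FALSE]
dim(iris.reduced)
```

The reduced matrix can then stand in for the original features in later steps such as clustering.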
Data visualization
Interpreting PCA results
Importance of data scaling
Up next
# URL to cancer dataset hosted on DataCamp servers
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1903/datasets/WisconsinCancer.csv"