Chapter 3

This document introduces principal component analysis (PCA), an unsupervised machine learning technique for dimensionality reduction. PCA finds a lower-dimensional representation of the data while retaining as much of the variance as possible, expressing high-dimensional data in a way that highlights similarities and differences. The document covers PCA intuition, visualization, implementation in R, and practical issues such as scaling the data.


Introduction to PCA


Hank Roark
Senior Data Scientist at Boeing
Two methods of clustering
Two methods of clustering - finding groups of homogeneous items

Next up, dimensionality reduction


Find structure in features

Aid in visualization

Dimensionality reduction
A popular method is principal component analysis (PCA)

Three goals when finding a lower-dimensional representation of the features (all three are illustrated in the sketch after this list):


Find linear combination of variables to create principal components

Maintain most variance in the data

Principal components are uncorrelated (i.e. orthogonal to each other)
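
A quick way to see all three goals at once is to run PCA on a small dataset and inspect the result. A minimal sketch using base R's prcomp() on the built-in iris data (the object name pr is illustrative):

pr <- prcomp(iris[-5], center = TRUE, scale. = FALSE)

# 1. Each component is a linear combination of the original variables
pr$rotation[, 1]                  # loadings (weights) that define PC1

# 2. Most of the variance is captured by the leading components
summary(pr)$importance["Proportion of Variance", ]

# 3. The components are uncorrelated with each other
round(cor(pr$x), 10)              # off-diagonal entries are ~0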

PCA intuition

[Three figure-only slides building the intuition for PCA.]
Visualization of high-dimensional data

[Figure-only slides on visualizing high-dimensional data.]
PCA in R
pr.iris <- prcomp(x = iris[-5],
                  scale = FALSE,
                  center = TRUE)
summary(pr.iris)

Importance of components:
                          PC1     PC2    PC3     PC4
Standard deviation     2.0563 0.49262 0.2797 0.15439
Proportion of Variance 0.9246 0.05307 0.0171 0.00521
Cumulative Proportion  0.9246 0.97769 0.9948 1.00000
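
The Cumulative Proportion row shows that the first two components explain about 98% of the variance, so a two-dimensional plot of the component scores loses little information. A hedged sketch (the species column, dropped from the PCA input, is used here only for coloring):

# Scores for each observation are stored in pr.iris$x
plot(pr.iris$x[, 1:2], col = iris$Species,
     xlab = "PC1 (92.5% of variance)",
     ylab = "PC2 (5.3% of variance)")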

Let's practice!
Visualizing and interpreting PCA results
Biplot

[Figure-only slide.]

Scree plot

[Figure-only slide.]
Biplots in R
# Creating a biplot
pr.iris <- prcomp(x = iris[-5],
                  scale = FALSE,
                  center = TRUE)
biplot(pr.iris)
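
In the resulting biplot, each observation appears as a point in PC1-PC2 space and each original variable as an arrow of loadings; arrows that point in nearly the same direction, such as Petal.Length and Petal.Width here, indicate positively correlated variables.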

Scree plots in R
# Getting proportion of variance for a scree plot
pr.var <- pr.iris$sdev^2
pve <- pr.var / sum(pr.var)

# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
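
A companion plot of the cumulative proportion makes it easier to choose how many components to keep for a target such as 90% of the variance; a minimal sketch reusing pve from above:

# Cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")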

Let's practice!
Practical issues with PCA
Scaling the data

Missing values:
  Drop observations with missing values
  Impute / estimate missing values (both options are sketched below)

Categorical data:
  Do not use categorical data features
  Encode categorical features as numbers
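
A hedged sketch of handling missing values before PCA, assuming a numeric data frame with a few NAs (the mtcars corruption and the mean-imputation loop below are illustrative, not from the course):

# Illustrative data: mtcars with two values knocked out
df <- mtcars
df[c(3, 7), "hp"] <- NA

# Option 1: drop observations with missing values
complete <- na.omit(df)

# Option 2: impute each missing value with its column mean
imputed <- df
for (j in seq_along(imputed)) {
  miss <- is.na(imputed[[j]])
  imputed[[j]][miss] <- mean(imputed[[j]], na.rm = TRUE)
}

# Then run PCA on the cleaned data, scaling the variables
pr <- prcomp(imputed, center = TRUE, scale. = TRUE)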

mtcars dataset
data(mtcars)
head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0
Valiant           18.1   6  225 105 2.76 3.460 20.22  1

UNSUPERVISED LEARNING IN R
Scaling
# Means and standard deviations vary a lot
round(colMeans(mtcars), 2)

   mpg    cyl   disp     hp  drat    wt  qsec    vs
 20.09   6.19 230.72 146.69  3.60  3.22 17.85  0.44

round(apply(mtcars, 2, sd), 2)

  mpg   cyl   disp     hp  drat    wt  qsec    vs
 6.03  1.79 123.94  68.56  0.53  0.98  1.79  0.50

Importance of scaling data

[Figure-only slide.]
Scaling and PCA in R
prcomp(x, center = TRUE, scale = FALSE)   # the defaults; set scale = TRUE so high-variance variables do not dominate
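
A minimal sketch of the difference on mtcars, where disp and hp would otherwise dominate the first component because of their large variances (object names are illustrative):

# Without scaling, the high-variance columns (disp, hp) dominate PC1
pr.raw <- prcomp(mtcars, center = TRUE, scale = FALSE)
round(pr.raw$rotation[, 1], 2)

# With scaling, each variable contributes on an equal footing
pr.scaled <- prcomp(mtcars, center = TRUE, scale = TRUE)
round(pr.scaled$rotation[, 1], 2)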

Let's practice!
Additional uses of PCA and wrap-up
Recap (figure-only slides): dimensionality reduction, data visualization, interpreting PCA results, importance of data scaling.
Up next
# URL to cancer dataset hosted on DataCamp servers
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1903/datasets/WisconsinCancer.csv"

# Download the data: wisc.df
wisc.df <- read.csv(url)

# Convert the features to a matrix; the slide omits this step, so the
# assumed layout (id, diagnosis, then 30 numeric features) is inferred
wisc.data <- as.matrix(wisc.df[, 3:32])
row.names(wisc.data) <- wisc.df$id

wisc.data[1:6, 1:5]

         radius_mean texture_mean perimeter_mean area_mean smoothness_mean
842302         17.99        10.38         122.80    1001.0         0.11840
842517         20.57        17.77         132.90    1326.0         0.08474
84300903       19.69        21.25         130.00    1203.0         0.10960
84348301       11.42        20.38          77.58     386.1         0.14250
84358402       20.29        14.34         135.10    1297.0         0.10030
843786         12.45        15.70          82.57     477.1         0.12780
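
These features differ in scale by orders of magnitude (compare area_mean with smoothness_mean), so scaling will matter here too; a hedged preview of the workflow the exercises build up to:

pr.wisc <- prcomp(wisc.data, center = TRUE, scale. = TRUE)
summary(pr.wisc)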

Let's practice!
