
Detailed Explanation of Module 2 Lab 2: Principal Components Analysis (PCA) – Updated with All Your Questions


This guide explains every concept and step in the PCA lab, integrating your follow-up questions
and using beginner-friendly language and examples. Each section is structured for clarity and
depth, with practical context and simple analogies.

Section 1: What is Principal Component Analysis (PCA) and Why Use It?
PCA is a technique for simplifying complex datasets by reducing the number of features
(dimensions) while preserving as much important information (variance) as possible [1] [2] .
Why use PCA?
To visualize high-dimensional data in 2D or 3D.
To speed up machine learning and reduce overfitting.
To remove noise and redundancy from data.
To find patterns or groupings that are hard to see in the original data.

Section 2: Step-by-Step PCA Process

Step 1: Standardization (Normalization)


Why?
Features may have different units or scales (e.g., height in cm, weight in kg).
Standardization rescales all features to have mean = 0 and standard deviation = 1, so
each feature contributes equally [2] [3] .
How?
For each value, subtract the mean of that feature and divide by its standard deviation: z = (x − μ) / σ.

Example:
If one feature ranges from 1–1000 and another from 0–1, the first would dominate the
analysis unless standardized.
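
A minimal sketch of this step in NumPy (the example values are made up for illustration):

```python
import numpy as np

# Toy data: feature 0 spans roughly 1-1000, feature 1 spans 0-1
X = np.array([[100.0, 0.2],
              [550.0, 0.5],
              [1000.0, 0.9]])

# Standardize: subtract each feature's mean and divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~0 for every feature
print(X_std.std(axis=0))   # 1 for every feature
```
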
Step 2: Covariance Matrix Calculation
What is Covariance?
It measures how two features vary together.
Positive covariance: features increase together; negative: one increases as the other
decreases.
Covariance Matrix:
A table showing the covariance between every pair of features.
For 3 features, it’s a 3x3 matrix.
Why?
It helps find relationships between features and is the foundation for finding principal
components [2] [3] .
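
A small NumPy sketch of computing the covariance matrix (the standardized values below are made up):

```python
import numpy as np

# X_std: standardized data, shape (n_samples, n_features)
X_std = np.array([[-1.2,  0.3,  0.9],
                  [ 0.0, -1.1,  0.2],
                  [ 1.2,  0.8, -1.1]])

# np.cov treats rows as variables by default, so set rowvar=False
# for the usual (samples x features) layout
cov_matrix = np.cov(X_std, rowvar=False)

print(cov_matrix.shape)  # (3, 3) for 3 features
print(cov_matrix)        # positive entry: the pair rises together; negative: opposite
```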

Step 3: Eigenvectors and Eigenvalues


Eigenvectors:
Directions (axes) in the data space along which variance is maximized.
Eigenvalues:
Tell how much variance is along each eigenvector.
Why?
The eigenvector with the largest eigenvalue points in the direction of the greatest
variance in the data [2] [3] .
How?
Solve the equation C v = λ v, where C is the covariance matrix, v is the eigenvector, and λ is the eigenvalue [2].
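
A sketch of the eigendecomposition in NumPy (the covariance matrix here is invented for illustration):

```python
import numpy as np

# C: covariance matrix from the previous step (symmetric)
C = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.5],
              [0.3, 0.5, 1.0]])

# Solve C v = lambda v. eigh is used because C is symmetric;
# it returns the eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(C)

print(eigenvalues)          # variance along each eigenvector
print(eigenvectors[:, -1])  # eigenvector with the largest eigenvalue (last column)
```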

Step 4: Computing the Principal Components (Selection and Construction)


What are Principal Components?
New axes (directions) created from the original features, capturing the most variance.
The first principal component (PC1) captures the most variance; the second (PC2)
captures the next most, and so on [2] [3] .
How are they computed?
1. Pair eigenvectors and eigenvalues.
2. Sort eigenvectors by their eigenvalues in descending order.
3. Select the top k eigenvectors (those with the highest eigenvalues); these are your
principal components [2] [3] .
4. Project the standardized data onto these new axes (multiply the data by the selected
eigenvectors), creating new features (PC1, PC2, ...) [2] [3] .
Simple Example:
Imagine you have two features (height and weight).
After standardization and covariance calculation, you find two eigenvectors.
The first points in the direction where the data is most spread out (maybe a diagonal
through the data cloud).
The second is perpendicular to the first.
You keep the first one (or both) as your principal components and project your data
onto these axes for easier analysis [2] [3] .
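
Putting steps 1-3 together, a minimal NumPy sketch of selecting and constructing the principal components (the standardized values are made up):

```python
import numpy as np

# Standardized data from Step 1 (illustrative values)
X_std = np.array([[-1.2,  0.3,  0.9],
                  [ 0.0, -1.1,  0.2],
                  [ 1.2,  0.8, -1.1]])

cov_matrix = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# 1-2. Pair eigenvalues with their eigenvectors (columns) and sort, largest first
order = np.argsort(eigenvalues)[::-1]
top_vectors = eigenvectors[:, order]

# 3. Keep the top k eigenvectors as the projection matrix W
k = 2
W = top_vectors[:, :k]   # shape: (n_features, k)

# 4. Project the standardized data onto the new axes
X_pca = X_std @ W        # columns are PC1 and PC2
print(X_pca)
```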

Step 5: Transforming Data to the New Subspace


How?
Multiply the original standardized data by the matrix of selected eigenvectors.
Each data point now has coordinates in terms of the principal components, not the
original features [2] [3] .
Why?
This transformation gives you a new dataset with fewer features but most of the
important information preserved.
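
In practice the whole pipeline can be run in a few lines; a hedged sketch using scikit-learn's StandardScaler and PCA (assuming scikit-learn is installed; the data values are invented):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X: original data, shape (n_samples, n_features)
X = np.array([[170.0, 65.0, 30.0],
              [180.0, 80.0, 35.0],
              [160.0, 55.0, 28.0],
              [175.0, 72.0, 40.0]])

X_std = StandardScaler().fit_transform(X)   # Step 1: standardization
pca = PCA(n_components=2)                   # Steps 2-4 happen inside fit
X_new = pca.fit_transform(X_std)            # Step 5: coordinates in PC space

print(X_new.shape)   # (4, 2): fewer features, most of the variance kept
```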

Section 3: Interpreting and Using Principal Components

Explained Variance and Choosing Number of Components


Explained Variance:
The proportion of the dataset’s total variance captured by each principal component.
The first few PCs often capture most of the variance.
Cumulative Explained Variance:
Add up the explained variance for the top PCs to see how much total information you
keep.
Rule of Thumb: Keep enough PCs to explain 90% of the variance [3] .
Example:
If PC1 explains 70% and PC2 explains 20%, the first two together explain 90% of the
data’s variance.
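
A short sketch of checking explained variance with scikit-learn (the data here is random, purely to make the snippet runnable):

```python
import numpy as np
from sklearn.decomposition import PCA

# Any standardized (n_samples, n_features) matrix will do for illustration
rng = np.random.default_rng(0)
X_std = rng.normal(size=(100, 5))

pca = PCA().fit(X_std)                     # keep all components for inspection
ratios = pca.explained_variance_ratio_     # fraction of total variance per PC
cumulative = np.cumsum(ratios)

# Rule of thumb: keep enough PCs to reach ~90% cumulative explained variance
n_keep = int(np.argmax(cumulative >= 0.90)) + 1
print(ratios, cumulative, n_keep)
```
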
Visualization
2D or 3D Scatter Plots:
Plot data using the first two or three PCs as axes.
Color points by class (e.g., benign/malignant).
If classes separate well, PCA has revealed useful structure [3] .
Loadings:
Show how much each original feature contributes to each principal component.
High positive or negative values mean strong influence.
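
A sketch of both ideas, using scikit-learn's built-in breast-cancer dataset as a stand-in for the lab data (assumes scikit-learn and matplotlib are installed):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 30 features, two classes (benign/malignant)
data = load_breast_cancer()
X_std = StandardScaler().fit_transform(data.data)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

# 2D scatter plot of PC1 vs PC2, coloured by class
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=data.target, cmap="coolwarm", s=10)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

# Loadings: how much each original feature contributes to each PC
loadings = pca.components_   # shape (2, 30)
print(loadings[0])           # large positive/negative values = strong influence on PC1
```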

Section 4: Practical Applications of PCA


Data Visualization:
Make complex, high-dimensional data visible and understandable [3] .
Noise Reduction:
Remove less important components (those with low variance) to reduce noise (a short sketch follows this list).
Feature Engineering:
Use PCs as new features for machine learning models.
Real-World Examples:
Image compression, facial recognition, anomaly detection, recommendation systems,
and healthcare data analysis [1] [3] .
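
For the noise-reduction idea above, a minimal sketch: keep only the high-variance components and map back to the original space (the signal/noise setup is invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# A strong low-dimensional signal plus small random noise
rng = np.random.default_rng(1)
signal = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))
X = signal + 0.1 * rng.normal(size=(200, 10))

# Keep the top components, then reconstruct; the low-variance (mostly noise)
# directions are discarded in the round trip
pca = PCA(n_components=2)
X_denoised = pca.inverse_transform(pca.fit_transform(X))

print(np.abs(X_denoised - signal).mean())  # reconstruction stays close to the signal
```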

Section 5: PCA in Action – Simple Example


Suppose you have three features (A, B, C) for each sample.

Sample A B C

1 2 3 4

2 3 4 5

3 4 5 6

Step-by-step:
1. Standardize A, B, C.
2. Compute covariance matrix (3x3).
3. Find eigenvectors/eigenvalues.
4. Sort and select top 2 eigenvectors (PC1, PC2).
5. Project data onto PC1 and PC2 to get new values for each sample.
6. Plot samples on a 2D graph using PC1 and PC2.
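
A compact NumPy sketch of these six steps on the tiny table above (plotting is left as a comment, since only three samples are involved):

```python
import numpy as np

# The three samples with features A, B, C from the table above
X = np.array([[2.0, 3.0, 4.0],
              [3.0, 4.0, 5.0],
              [4.0, 5.0, 6.0]])

# 1. Standardize A, B, C
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Compute covariance matrix (3x3)
C = np.cov(X_std, rowvar=False)

# 3. Find eigenvectors/eigenvalues
eigenvalues, eigenvectors = np.linalg.eigh(C)

# 4. Sort and select top 2 eigenvectors (PC1, PC2)
order = np.argsort(eigenvalues)[::-1]
W = eigenvectors[:, order[:2]]

# 5. Project data onto PC1 and PC2
X_pca = X_std @ W
print(X_pca)

# 6. Plot the samples on a 2D graph using the two columns of X_pca as axes
```
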
Section 6: Summary Table
Step | What Happens? | Why It Matters
Standardization | Rescale features to mean 0, std 1 | Prevents features with large values from dominating
Covariance Matrix | Measures how features vary together | Finds relationships between features
Eigenvectors/Eigenvalues | Find new axes (directions) of max variance | Basis for principal components
Principal Components | New axes capturing most information | Reduce data size, keep important info
Explained Variance | Shows how much info each component keeps | Helps choose how many components to keep
Data Projection | Transform data onto new axes | Enables visualization and better modeling

Section 7: Key Takeaways


PCA is a powerful tool for simplifying complex data and improving analysis.
It works by finding new axes (principal components) that capture the most variance.
The process involves standardization, covariance calculation, finding
eigenvectors/eigenvalues, selecting top components, and projecting data.
PCA is widely used for visualization, noise reduction, and as a preprocessing step for
machine learning.

If you want a deeper explanation of any step, or a code example for a specific part, just ask!

1. https://www.pickl.ai/blog/a-step-by-step-complete-guide-to-principal-component-analysis-pca-for-beginners/
2. https://www.turing.com/kb/guide-to-principal-component-analysis
3. https://www.datacamp.com/tutorial/pca-analysis-r
