PROGRAM - 3
Develop a program to implement Principal Component Analysis (PCA) for
reducing the dimensionality of the Iris dataset from 4 features to 2.
Objective
To implement Principal Component Analysis (PCA) to reduce the dataset's
dimensionality from 4 features to 2 principal components, enabling
visualization of the data in a lower-dimensional space.
3. Introduction
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms
high-dimensional data into a lower-dimensional space while preserving as much variance as possible.
In this implementation, we apply PCA to the classic Iris dataset, reducing its 4 features
(sepal length, sepal width, petal length, and petal width) to just 2 principal components. This
transformation allows us to visualize the natural structure of the data and observe the separation
between the three Iris species (setosa, versicolor, and virginica), demonstrating how effective
dimensionality reduction can simplify data analysis while maintaining the most important patterns
and relationships in the dataset.
• Principal Component Analysis (PCA) is a technique that reduces the dimensionality of large
datasets by transforming many variables into a smaller set while preserving most of the
original information.
• While reducing variables inevitably sacrifices some accuracy, PCA strategically trades
minor precision for significant simplification. This creates datasets that are more
manageable to explore and visualize, enabling machine learning algorithms to process data
more efficiently by eliminating unnecessary variables.
• In essence, PCA aims to minimize the number of variables in a dataset while maximizing
the retention of important information.
Principal Components
Principal components are newly constructed variables formed as linear combinations of the
original variables. These combinations are designed with two key properties:
1. The new variables (principal components) are uncorrelated with each other
2. Information from the original dataset is distributed optimally, with the first component
capturing the maximum possible variance, the second component capturing the maximum
remaining variance, and so on
In practice, this means that when analyzing 10-dimensional data, PCA will generate 10 principal
components, but the information is redistributed so that earlier components contain more
information than later ones. This approach allows analysts to focus on the first few components
that contain most of the dataset's information, effectively achieving dimensionality reduction while
minimizing information loss.
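As a quick illustration of this ordering, the sketch below (assuming scikit-learn is installed) fits PCA to the 4-feature Iris data and prints each component's share of the total variance; the earlier components dominate:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data              # 150 samples, 4 features

pca = PCA()                       # keep all 4 components
pca.fit(X)

# Earlier components capture more variance than later ones
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: {ratio:.2%} of total variance")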
The PCA computation process follows these steps, showing how principal components are
calculated and how they relate to the original data:
1. Standardize the Data: Rescale each variable to zero mean and unit variance so that
variables measured on different scales contribute equally:
Z = (X − 𝜇) / 𝜎
where:
X is the original value
𝜇 is the mean of the variable
𝜎 is the standard deviation
2. Compute the Covariance Matrix: The covariance matrix measures how variables are
correlated with each other. If two variables have a high covariance, it means they are highly
correlated. For the standardized data matrix Z with n samples,
C = (1 / (n − 1)) Zᵀ Z
3. Compute the Eigenvectors and Eigenvalues of the covariance matrix, i.e., solve C v = 𝜆 v.
The eigenvectors form the principal components, and the corresponding eigenvalues show
the importance of each component.
4. Project the Data onto the top k principal components:
Z_proj = Z V_k
where,
V_k is the matrix of the top k eigenvectors.
To choose k, we often use a scree plot (a plot of eigenvalues) or keep components that capture
a certain percentage (e.g., 95%) of the variance.
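A minimal from-scratch sketch of these four steps in NumPy; variable names such as Z, C, and V_k mirror the formulas above and are illustrative, not part of any library API:

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                        # (150, 4) original Iris data

# Step 1: standardize each variable, Z = (X - mu) / sigma
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix, C = Z^T Z / (n - 1)
n = Z.shape[0]
C = (Z.T @ Z) / (n - 1)

# Step 3: eigen-decomposition (eigh handles symmetric matrices),
# then sort components by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project onto the top k = 2 eigenvectors
V_k = eigvecs[:, :2]
Z_proj = Z @ V_k                            # (150, 2) reduced data

# Variance retained by the first two components (roughly 96% for Iris)
print(eigvals[:2].sum() / eigvals.sum())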
3.4 Program
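One possible listing for the program, sketched with scikit-learn's PCA and matplotlib; standardizing the features first is a common (though optional) choice:

# Program 3: PCA on the Iris dataset, reducing 4 features to 2 components
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target             # 150 samples, 4 features, 3 species

# Standardize, then reduce to 2 principal components
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print("Explained variance ratio:", pca.explained_variance_ratio_)

# Scatter plot of the 2D projection, one color per species
for label, name in enumerate(iris.target_names):
    plt.scatter(X_pca[y == label, 0], X_pca[y == label, 1], label=name)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA of Iris Dataset (4 features to 2 components)")
plt.legend()
plt.show()

Running the program prints the variance captured by each of the two components and displays a scatter plot in which setosa separates cleanly from versicolor and virginica.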
Viva Questions