
Name: Arpit Chauhan

Roll No: 40
Practical: Implementation of Principal Component Analysis (PCA)

Setting up the environment:

corrr package in R

This is an R package for correlation analysis. It focuses on creating and working with data frames of correlations.
Below are the steps to install and load the library.

install.packages("corrr")
library(corrr)

ggcorrplot package in R

The ggcorrplot package builds on ggplot2 and provides functions that make it easy to visualize a correlation matrix. As with the package above, the installation is straightforward.

install.packages("ggcorrplot")
library(ggcorrplot)

FactoMineR package in R

Mainly used for multivariate exploratory data analysis, the FactoMineR package gives access to the PCA module used to perform principal component analysis.

install.packages("FactoMineR")
library("FactoMineR")

factoextra package in R

This last package provides all the relevant functions to visualize the outputs of the principal component analysis. These functions include, among others, the scree plot and the biplot, two of the visualization techniques covered later in this practical.

install.packages("factoextra")
library(factoextra)
1) Exploring the data:

df=read.csv("C:/ProgramData/Microsoft/Windows/Start Menu/Programs/RStudio/apple_quality.csv")
str(df)
Output

Interpretation:
We can see that the data set has 50 observations of 12 variables, and every variable is numerical except A_id, Acidity and Quality.
2) Check for null values
colSums(is.na(df))
Output

Interpretation:
As we can see above, there are missing values in all the columns except Quality, so let us remove them.
df=na.omit(df)
colSums(is.na(df))
Output

Interpretation:
Now, as we can see above, none of the columns has missing values.
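
As an optional sanity check (a small sketch, not part of the original steps; n_before is a variable introduced here only for illustration), the number of dropped rows can be counted by recording the row count before calling na.omit:

n_before <- nrow(df)                              # row count before dropping missing values
df <- na.omit(df)
cat("Rows removed:", n_before - nrow(df), "\n")   # how many incomplete rows were discarded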

3) Normalizing the data

As stated earlier, PCA only works with numerical values, so we need to get rid of the Quality column. The data type of Acidity is character, so we need to convert it from character to numeric, and there is also no need for A_id.

df$Acidity=as.numeric(df$Acidity)
df=df[,2:8]
data_normalized =scale(df)
head(data_normalized)
Output

Interpretation:
The data has been normalized.
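
As an optional check (a small sketch, not part of the original output), scaled columns should have a mean of roughly 0 and a standard deviation of 1:

round(colMeans(data_normalized), 3)        # column means, should all be ~0
round(apply(data_normalized, 2, sd), 3)    # column standard deviations, should all be ~1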
4) Compute the correlation matrix:

corr_matrix <- cor(data_normalized)
corr_matrix
Output

ggcorrplot(corr_matrix)
Output

Interpretation:
The result of the correlation matrix can be interpreted as follows:
• The closer the value is to 1, the more positively correlated the two variables are, for example Acidity & Size, Acidity & Juiciness and Crunchiness & Size.
• The closer the value is to -1, the more negatively correlated they are, for example Juiciness & Size and Crunchiness & Sweetness.
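
If you prefer to read the strongest pairs as numbers rather than from the heatmap, a small base-R sketch (not part of the original practical) is to flatten the matrix and sort the pairs by absolute correlation:

corr_long <- as.data.frame(as.table(corr_matrix))                         # one row per variable pair
names(corr_long) <- c("var1", "var2", "r")
corr_long <- subset(corr_long, as.character(var1) < as.character(var2))   # drop the diagonal and duplicate pairs
head(corr_long[order(-abs(corr_long$r)), ])                               # strongest correlations first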

5) Applying PCA
data.pca <- princomp(corr_matrix)
summary(data.pca)
Output

Interpretation:
• Each component explains a percentage of the total variance in the data set.
• In the Cumulative Proportion row, the first principal component explains about 27.86% of the total variance.
• The first two components together explain 53.13% of the total variance.
• The first three explain 72.67% of the total variance.
• The first four explain 89.58% of the total variance.
• The cumulative proportion of Comp.1 to Comp.4 is therefore nearly 90% of the total variance, which means that the first four principal components can represent the data quite accurately.
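
To double-check these figures (an optional sketch), the explained-variance proportions can be recomputed from the component standard deviations returned by princomp:

var_explained <- data.pca$sdev^2 / sum(data.pca$sdev^2)   # proportion of variance per component
round(cumsum(var_explained), 4)                           # cumulative proportion, as shown in summary(data.pca)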
6) Loading matrix:

data.pca$loadings[, 1:4]

Output

Interpretation:

• The loading matrix shows that the first principal component has high positive values for Size and Crunchiness, whereas the values for Weight, Sweetness, Juiciness and Acidity are negative.

• The second principal component has a high negative value for Ripeness and high positive values for Weight, Acidity, Sweetness and Juiciness.

• The third principal component has high positive values for Size, Juiciness and Ripeness, whereas the values for Sweetness, Crunchiness and Weight are negative.

• The fourth principal component has high positive values for Size, Juiciness and Weight, whereas the values for Sweetness, Crunchiness and Ripeness are negative.
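
To make these sign patterns easier to scan (an optional sketch, using the first component as an example), the loadings of a single component can be ordered by absolute magnitude while keeping their signs:

pc1 <- data.pca$loadings[, 1]                        # loadings of the first principal component
round(pc1[order(abs(pc1), decreasing = TRUE)], 3)    # largest contributors first, signs preserved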
7) Visualization of the principal components:

fviz_eig(data.pca, addlabels = TRUE)

Output

Interpretation:

This plot shows the eigenvalues in a downward curve, from highest to lowest.
The first four components can be considered the most significant, since together they contain nearly 90% of the total information in the data.
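
The numbers behind the scree plot can also be printed as a table; assuming factoextra is loaded, its get_eig helper returns the eigenvalues together with the individual and cumulative variance percentages:

get_eig(data.pca)   # eigenvalue, variance.percent and cumulative.variance.percent per component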
8) Biplot of the attributes:

With the biplot, it is possible to visualize the similarities and dissimilarities between the samples; it also shows the impact of each attribute on each of the principal components.

fviz_pca_var(data.pca, col.var = "black")

Output

Interpretation:

Three main pieces of information can be observed from the previous plot.

• First, all the variables that are grouped together are positively correlated with each other; that is the case, for instance, for Crunchiness and Size. This is consistent with the loading matrix, where they have the highest values with respect to the first principal component.

• Then, the higher the distance between a variable and the origin, the better represented that variable is. In this biplot, Acidity, Sweetness and Juiciness have a higher magnitude than Weight, and are hence better represented than Weight.

• Variables that are negatively correlated are displayed on opposite sides of the biplot's origin, as is the case for Ripeness.
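
The coordinates drawn by fviz_pca_var can also be inspected directly; a small sketch, assuming factoextra's get_pca_var accessor:

var_info <- get_pca_var(data.pca)   # coordinates, cos2 and contributions of the variables
round(var_info$coord[, 1:2], 3)     # position of each variable on the first two components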
9) Contribution of each variable:

The goal of this third visualization is to determine how much each variable is represented in a given component. Such a quality of representation is called the cos2 (squared cosine), and it is computed using the fviz_cos2 function.
fviz_cos2(data.pca, choice = "var", axes = 1:3)

Output

Interpretation:

• A low value means that the variable is not well represented by that component, for example Acidity and Juiciness.
• A high value, on the other hand, means that the variable is well represented on that component, for example Ripeness and Sweetness.
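
The values ranked by fviz_cos2 can be reproduced by summing each variable's cos2 over the first three components (a sketch, assuming the same get_pca_var accessor as above):

cos2_vals <- get_pca_var(data.pca)$cos2[, 1:3]   # cos2 of each variable on components 1 to 3
sort(rowSums(cos2_vals), decreasing = TRUE)      # total quality of representation, largest first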
10) Biplot combined with cos2:
The last two visualization approaches, the biplot and the attribute importance plot, can be combined into a single biplot where attributes with similar cos2 scores have similar colors.
This is achieved by fine-tuning the fviz_pca_var function as follows:

fviz_pca_var(data.pca, col.var = "cos2",
             gradient.cols = c("black", "orange", "green"),
             repel = TRUE)

Output

Interpretation:
From the biplot above:

• High cos2 attributes are colored in green: Ripeness
• Mid cos2 attributes have an orange color: Acidity, Crunchiness, Juiciness
• Finally, low cos2 attributes have a black color: Weight

Conclusion
This practical has covered what principal component analysis is and its importance in data analytics, using the correlation matrix together with the corrr package. It has also walked through a PCA example with different visualization strategies, from the standard plotting functions to a fine-tuned biplot combined with cos2, for a better understanding of the relationship between the principal components and the attributes.
We hope it provides you with the relevant skills to efficiently visualize and understand the hidden insights
from your data.
