Principal Component Analysis
and Clustering
Professor Daymond
27-Nov-2016
UNDERSTANDING BORROWER SEGMENTS
The borrowers fall into six segments:

Credit based accounts: The majority of accounts belong to credit-based borrowers with high revolving utilization and the most revolving accounts and bankcards.

Fixed instalment accounts: Accounts held mostly as fixed instalments, such as car loans and student loans; the number of instalment accounts and the instalment utilization are the major factors of this segment.

Past due accounts: Borrowers with past-due records and most of the late fees on credit and loan amounts; given the recent history of delinquency, this segment is medium risk.

Highly inquired accounts: Borrowers with the most loan inquiries, who exhibit the most credit card purchase behaviour and attempt every possible loan.

Debt collections accounts: Borrowers holding the largest number of public records, such as tax liens; collections money owed and tax liens are the major factors of this segment.

High risk delinquent accounts: The highest delinquency, usage exceeding the credit limit, and multiple accounts opened recently make this segment high risk.
IDENTIFYING THE PRINCIPAL COMPONENTS
With the given dataset (N = 27,000) and 77 variables, it is important to reduce the data to a smaller set of variables before drawing any feasible conclusion. Because of multicollinearity, two or more variables can share the same plane in those dimensions. Each row of the data can be envisioned as a point in a 77-dimensional space, and when the data are projected onto orthonormal axes, certain characteristics of the data are expected to cluster together as principal components. To identify these principal components, PROC PRINCOMP is executed with all the variables except the constant ones (recoveries and collection fees), which produces a plot of the eigenvalues of all the principal components.

The variance of each principal component is given by its eigenvalue: the greater the eigenvalue, the more of the variance that component explains. Hence the cut-off criteria for retaining components are that the eigenvalue must be greater than 1 and the cumulative variance explained should be at least 75%.

From the results (Appendix 1), it is observed that there are 18 components with eigenvalues greater than 1, contributing approximately 76% of the total variance. The coefficients of the principal components are the eigenvectors (Appendix 2), i.e. the weights of the linear combination of the inputs, which determine the axis length and direction of each principal component.

From Figure 1, the scree plot, the curve is almost flat once the eigenvalues fall below 1, implying that the further components contribute very little to the variance. Hence a total of 18 principal components capture a significant share of the variance in the data.
Figure 1 Figure 2
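A minimal SAS sketch of this step is shown below. The dataset name (work.loans) is a placeholder and it is assumed the two constant variables have already been dropped; PROC PRINCOMP works on the correlation matrix by default, so the inputs are standardized.

   ods graphics on;
   proc princomp data=work.loans
                 out=work.pca_scores      /* component scores per account     */
                 outstat=work.pca_stats   /* means, std devs and eigenvectors */
                 plots=scree;             /* scree plot of the eigenvalues    */
      var _numeric_;                      /* constant variables dropped beforehand */
   run;

The OUTSTAT= dataset is kept because it is reused later when the new data are scored.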
INTERPRETING THE PRINCIPAL COMPONENTS
To interpret the principal components, the eigenvector (loading) matrix is examined for the variables most strongly correlated with each component. Because PRINCOMP standardizes the data, the loadings have absolute values less than 1; values closest to 1 in absolute terms, whether positive or negative, indicate the original variables most strongly associated with a component.
PRINCIPAL COMPONENT 1
From Figure 4, it is observed that the highest coefficients correspond to the various counts of accounts, i.e. how valuable the customers are in terms of usage, while the lowest coefficients correspond to the time since the most recent account was opened, i.e. how credible the customers are.
Similarly, each principal component is analysed for its highest and lowest coefficients and the results are tabulated for reference.
Figure 3
Figure 4
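One way to tabulate these loadings is sketched below; the Eigenvectors ODS table is standard PRINCOMP output, but the dataset names are placeholders and this is only one possible approach.

   ods output Eigenvectors=work.evec;      /* capture the loading matrix */
   proc princomp data=work.loans plots=none;
      var _numeric_;
   run;

   /* Rank the variables by absolute loading on the first component. */
   data work.pc1_loadings;
      set work.evec(keep=Variable Prin1);
      abs_loading = abs(Prin1);
   run;

   proc sort data=work.pc1_loadings;
      by descending abs_loading;
   run;

The same sort, repeated per Prin variable, gives the highest and lowest coefficients tabulated for each component.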
IDENTIFYING THE CLUSTERS
Once the principal components are identified, the next step is to standardize the component scores with PROC STDIZE and feed them into the FASTCLUS procedure with MAXCLUSTERS settings ranging from 3 to 20. FASTCLUS uses k-means clustering, an iterative approach that helps identify approximately equal-sized clusters with a decent spread. A set of observations is selected as initial seeds (reference means); the nearest observations form temporary clusters, the seeds are replaced with the means of the new clusters, and this is repeated until the cluster assignments no longer change. The note that convergence is satisfied implies that the final seeds equal the cluster means.
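A sketch of one standardize-and-cluster run is shown below, assuming the retained scores Prin1-Prin18 sit in work.pca_scores; the dataset names are placeholders, and MAXCLUSTERS is varied from 3 to 20 across runs (a macro for that follows the goodness-of-fit notes).

   /* Standardize the retained component scores. */
   proc stdize data=work.pca_scores
               out=work.pca_std
               outstat=work.stdize_stats   /* reused later to score new data */
               method=std;
      var Prin1-Prin18;
   run;

   /* One k-means run; the cluster assignment is written to OUT=. */
   proc fastclus data=work.pca_std
                 maxclusters=14
                 maxiter=100
                 out=work.clus_out          /* CLUSTER and DISTANCE per account */
                 outseed=work.clus_seeds    /* final seeds, reused for scoring  */
                 outstat=work.clus_stats;
      var Prin1-Prin18;
   run;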
Summary
The cluster summary statistics display the frequency of observations in each cluster and the root-mean-square standard deviation. The next column displays the largest distance from the seed to an observation, i.e. approximately the total spread of the cluster. The last column displays the distance from the centre of the cluster to the centre of the nearest cluster.
With the 14-cluster solution, six appropriately sized clusters are obtained at the 35th iteration. Clusters 1, 4, 6, 9, 12 and 14 are the identified clusters, and Cluster 1 is observed to be the nearest cluster for all the others.
Goodness-of-fit metrics
Higher values of the pseudo F statistic are preferred when choosing the number of clusters.
R-square measures the proportion of variance accounted for by the clusters.
Higher CCC values indicate good clustering; values above 2 or 3 are generally expected.
A higher pseudo F statistic together with a higher CCC implies that the clustering solution is good.
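The search over cluster counts mentioned in the Learnings section can be driven by a small macro such as the sketch below; the pseudo F and CCC printed by each run are then compared across the batch. All names are placeholders.

   %macro try_clusters(from=3, to=20);
      %local k;
      %do k = &from %to &to;
         title "FASTCLUS with MAXCLUSTERS=&k";
         proc fastclus data=work.pca_std
                       maxclusters=&k
                       maxiter=100
                       out=work.clus_out_&k
                       outseed=work.clus_seeds_&k;
            var Prin1-Prin18;
         run;
      %end;
      title;
   %mend try_clusters;

   %try_clusters(from=3, to=20);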
IDENTIFYING THE CLUSTERS
Cluster means and standard deviations of the variables are displayed as part of the FASTCLUS output. As with the principal components, each cluster is analysed for its higher and lower coefficients in order to understand the relation between the principal components and the cluster segments.
Figure 4
The clusters are analysed and interpreted with respect to the loan data variables. Figure 4 displays the customer segments identified after analysing the coefficient matrix. These are the major segments of the loan data:
• Credit based, revolving accounts
• Fixed instalment based loan accounts
• Accounts that are mostly past due on credit and late fees
• Accounts with a high number of inquiries
• Accounts with utilization above 75% that open many new accounts
Further, PROC UNIVARIATE is executed on the new cluster dataset and the outputs are approximately the same with respect to the box plots (a sketch of this check follows the figure captions below). Hence the segments appear to be broadly correct.
Figure 5: Box plot of instalment accounts over all clusters. Figure 6: Box plot of "percentage greater than 75" over all clusters.
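A sketch of the PROC UNIVARIATE check is shown below, assuming the FASTCLUS output has been merged back with the original loan variables; the dataset and variable names are placeholders. With a CLASS variable, the PLOT option produces side-by-side box plots per cluster.

   proc univariate data=work.clus_with_vars plot;
      class cluster;                        /* cluster id from the FASTCLUS OUT= dataset */
      var instalment_accounts pct_gt_75;    /* placeholder variable names                */
   run;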
SCORING THE NEW DATA
The new data are then scored with the statistics from the old data and the segments are identified. Scoring the new dataset consists of the following steps (a SAS sketch follows the figure):
• The output statistics from PRINCOMP are used to score the new dataset
• The output from STDIZE is used to standardize the newly scored dataset
• The output statistics (final seeds) from FASTCLUS are used to assign clusters in the new dataset
Figure 7 displays the cluster frequency distributions across the new and old datasets for comparison. It is observed that the clusters are approximately the same and the segments have been identified correctly.
Figure 7: old data vs. new data
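A sketch of the three scoring steps is shown below, reusing the statistics datasets from the earlier steps; the new input dataset work.loans_new and all other names are placeholders.

   /* Step 1: score the new data with the PRINCOMP OUTSTAT= statistics (eigenvectors). */
   proc score data=work.loans_new
              score=work.pca_stats
              out=work.new_pca;
      var _numeric_;
   run;

   /* Step 2: standardize the new scores with the location/scale saved by STDIZE. */
   proc stdize data=work.new_pca
               out=work.new_pca_std
               method=in(work.stdize_stats);
      var Prin1-Prin18;
   run;

   /* Step 3: assign clusters with the final seeds held fixed. */
   proc fastclus data=work.new_pca_std
                 seed=work.clus_seeds
                 maxclusters=14
                 maxiter=0 replace=none     /* no reseeding, no iteration: pure assignment */
                 out=work.new_clus_out;
      var Prin1-Prin18;
   run;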
LEARNINGS
Identifying the principal components is complex, and clustering them afterwards gives a much clearer picture.
With very little business knowledge, identifying the clusters and verifying the segments was difficult.
Learnt how to write a macro to run the clustering for 3 to 20 clusters and then identify the best solution from the batch.
The use of PROC UNIVARIATE was a revelation when the segments matched the box plots, even though I am not sure the segments are correct as such.
APPENDIX 1 – EIGENVALUES WHEN THE CURVE CHANGES
APPENDIX 2 – EIGENVECTORS OF FIRST 10 PRINCIPAL COMPONENTS