Principal Component Analysis
and Clustering
Professor Daymond
27-Nov-2016
UNDERSTANDING BORROWER SEGMENTS
The borrowers fall into six segments:

Credit based accounts: The majority of accounts belong to credit-based borrowers with high revolving utilization and the most revolving accounts and bankcards.

Fixed instalment accounts: Accounts held mostly as fixed instalments, such as car loans and student loans; the number of instalment accounts and the instalment utilization are the major factors of this segment.

Past due accounts: Borrowers with past-due records and most of the late fees on credit and loan amounts; given the recent history of delinquency, this segment is medium risk.

Highly inquired accounts: Borrowers with the most loan inquiries, who exhibit the most credit card purchase behaviour and attempt every possible loan.

Debt collections accounts: Borrowers holding the largest number of public records, such as tax liens; collections money owed and tax liens are the major factors of this segment.

High risk delinquent accounts: The highest delinquency, usage exceeding the credit limit, and multiple accounts opened recently make this segment high risk.
IDENTIFYING THE PRINCIPAL COMPONENTS
With the given dataset (N = 27,000) and 77 variables, it is important to reduce the data to a smaller set of variables before drawing any feasible conclusion. Because of multicollinearity, two or more variables can share the same plane in those dimensions. Each row of the data can be envisioned as a point in a 77-dimensional space, and when the data are projected onto orthonormal axes, certain characteristics of the data are expected to cluster together as principal components. To identify these principal components, PROC PRINCOMP is executed with all the variables except the constant ones (recoveries and collection fees), which produces a plot of the eigenvalues of all the principal components.

The variance of each principal component is given by its eigenvalue: the greater the eigenvalue, the more of the variance that component explains. Hence the cut-off criteria for retaining components are that the eigenvalue must be greater than 1 and the cumulative variance explained should be at least 75%.

From the results (Appendix 1), it is observed that there are 18 components with eigenvalues greater than 1, contributing approximately 76% of the total variance. The coefficients of the principal components are the eigenvectors (Appendix 2), i.e. the weights of the linear combination of the inputs, which determine the axis length and direction of each principal component.

From Figure 1, the scree plot, the curve is almost flat once the eigenvalues fall below 1, implying that the further components contribute very little to the variance. Hence a total of 18 principal components capture a significant share of the variance in the data.
Figure 1 Figure 2
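A minimal SAS sketch of this step is shown below. The dataset name (work.loans) is a placeholder and it is assumed the two constant variables have already been dropped; PROC PRINCOMP works on the correlation matrix by default, so the inputs are standardized.

   ods graphics on;
   proc princomp data=work.loans
                 out=work.pca_scores      /* component scores per account     */
                 outstat=work.pca_stats   /* means, std devs and eigenvectors */
                 plots=scree;             /* scree plot of the eigenvalues    */
      var _numeric_;                      /* constant variables dropped beforehand */
   run;

The OUTSTAT= dataset is kept because it is reused later when the new data are scored.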
INTERPRETING THE PRINCIPAL COMPONENTS
To interpret the principal components, the eigenvector (loading) matrix is examined for the variables most strongly correlated with each component. Because PRINCOMP standardizes the data, the loadings have absolute values less than 1; values closest to 1 in absolute terms, whether positive or negative, indicate the original variables most strongly associated with a component.
PRINCIPAL COMPONENT 1
From Figure 4, it is observed that the highest coefficients correspond to the various counts of accounts, i.e. how valuable the customers are in terms of usage, while the lowest coefficients correspond to the time since the most recent account was opened, i.e. how credible the customers are.
Similarly, each principal component is analysed for its highest and lowest coefficients and the results are tabulated for reference.
Figure 3
Figure 4
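One way to tabulate these loadings is sketched below; the Eigenvectors ODS table is standard PRINCOMP output, but the dataset names are placeholders and this is only one possible approach.

   ods output Eigenvectors=work.evec;      /* capture the loading matrix */
   proc princomp data=work.loans plots=none;
      var _numeric_;
   run;

   /* Rank the variables by absolute loading on the first component. */
   data work.pc1_loadings;
      set work.evec(keep=Variable Prin1);
      abs_loading = abs(Prin1);
   run;

   proc sort data=work.pc1_loadings;
      by descending abs_loading;
   run;

The same sort, repeated per Prin variable, gives the highest and lowest coefficients tabulated for each component.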
IDENTIFYING THE CLUSTERS
Once the principal components are identified, the next step is to standardize the component scores with PROC STDIZE and feed them into the FASTCLUS procedure with MAXCLUSTERS settings ranging from 3 to 20. FASTCLUS uses k-means clustering, an iterative approach that helps identify approximately equal-sized clusters with a decent spread. A set of observations is selected as initial seeds (reference means); the nearest observations form temporary clusters, the seeds are replaced with the means of the new clusters, and this is repeated until the cluster assignments no longer change. The note that convergence is satisfied implies that the final seeds equal the cluster means.
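A sketch of one standardize-and-cluster run is shown below, assuming the retained scores Prin1-Prin18 sit in work.pca_scores; the dataset names are placeholders, and MAXCLUSTERS is varied from 3 to 20 across runs (a macro for that follows the goodness-of-fit notes).

   /* Standardize the retained component scores. */
   proc stdize data=work.pca_scores
               out=work.pca_std
               outstat=work.stdize_stats   /* reused later to score new data */
               method=std;
      var Prin1-Prin18;
   run;

   /* One k-means run; the cluster assignment is written to OUT=. */
   proc fastclus data=work.pca_std
                 maxclusters=14
                 maxiter=100
                 out=work.clus_out          /* CLUSTER and DISTANCE per account */
                 outseed=work.clus_seeds    /* final seeds, reused for scoring  */
                 outstat=work.clus_stats;
      var Prin1-Prin18;
   run;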
Summary
The cluster summary statistics display the frequency of observations in each cluster and the root-mean-square standard deviation. The next column displays the largest distance from the seed to an observation, i.e. approximately the total spread of the cluster. The last column displays the distance from the centre of the cluster to the centre of the nearest cluster.
With the 14-cluster solution, six appropriately sized clusters are obtained at the 35th iteration. Clusters 1, 4, 6, 9, 12 and 14 are the identified clusters, and Cluster 1 is observed to be the nearest cluster for all the others.
Goodness-of-fit metrics
Higher values of the pseudo F statistic are preferred when choosing the number of clusters.
R-square measures the proportion of variance accounted for by the clusters.
Higher CCC values indicate good clustering; values above 2 or 3 are generally expected.
A higher pseudo F statistic together with a higher CCC implies that the clustering solution is good.
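The search over cluster counts mentioned in the Learnings section can be driven by a small macro such as the sketch below; the pseudo F and CCC printed by each run are then compared across the batch. All names are placeholders.

   %macro try_clusters(from=3, to=20);
      %local k;
      %do k = &from %to &to;
         title "FASTCLUS with MAXCLUSTERS=&k";
         proc fastclus data=work.pca_std
                       maxclusters=&k
                       maxiter=100
                       out=work.clus_out_&k
                       outseed=work.clus_seeds_&k;
            var Prin1-Prin18;
         run;
      %end;
      title;
   %mend try_clusters;

   %try_clusters(from=3, to=20);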
IDENTIFYING THE CLUSTERS
Cluster means and standard deviations of the variables are displayed as part of the FASTCLUS output. As with the principal components, each cluster is analysed for its higher and lower coefficients in order to understand the relation between the principal components and the cluster segments.
Figure 4
The clusters are analysed and interpreted with respect to the loan data variables. Figure 4 displays the customer segments identified after analysing the coefficient matrix. These are the major segments of the loan data:
• Credit based, revolving accounts
• Fixed instalment based loan accounts
• Accounts that are mostly past due on credit and late fees
• Accounts with a high number of inquiries
• Accounts with utilization above 75% that open many new accounts
Further, PROC UNIVARIATE is executed on the new cluster dataset and the outputs are approximately the same with respect to the box plots (a sketch of this check follows the figure captions below). Hence the segments appear to be broadly correct.
Figure 5: Box plot of instalment accounts over all clusters. Figure 6: Box plot of "percentage greater than 75" over all clusters.
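A sketch of the PROC UNIVARIATE check is shown below, assuming the FASTCLUS output has been merged back with the original loan variables; the dataset and variable names are placeholders. With a CLASS variable, the PLOT option produces side-by-side box plots per cluster.

   proc univariate data=work.clus_with_vars plot;
      class cluster;                        /* cluster id from the FASTCLUS OUT= dataset */
      var instalment_accounts pct_gt_75;    /* placeholder variable names                */
   run;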
SCORING THE NEW DATA
The new data are then scored with the statistics from the old data and the segments are identified. Scoring the new dataset consists of the following steps (a SAS sketch follows the figure):
• The output statistics from PRINCOMP are used to score the new dataset
• The output from STDIZE is used to standardize the newly scored dataset
• The output statistics (final seeds) from FASTCLUS are used to assign clusters in the new dataset
Figure 7 displays the cluster frequency distributions across the new and old datasets for comparison. It is observed that the clusters are approximately the same and the segments have been identified correctly.
Figure 7: old data vs. new data
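A sketch of the three scoring steps is shown below, reusing the statistics datasets from the earlier steps; the new input dataset work.loans_new and all other names are placeholders.

   /* Step 1: score the new data with the PRINCOMP OUTSTAT= statistics (eigenvectors). */
   proc score data=work.loans_new
              score=work.pca_stats
              out=work.new_pca;
      var _numeric_;
   run;

   /* Step 2: standardize the new scores with the location/scale saved by STDIZE. */
   proc stdize data=work.new_pca
               out=work.new_pca_std
               method=in(work.stdize_stats);
      var Prin1-Prin18;
   run;

   /* Step 3: assign clusters with the final seeds held fixed. */
   proc fastclus data=work.new_pca_std
                 seed=work.clus_seeds
                 maxclusters=14
                 maxiter=0 replace=none     /* no reseeding, no iteration: pure assignment */
                 out=work.new_clus_out;
      var Prin1-Prin18;
   run;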
LEARNINGS
Identifying the principal components is complex, and clustering them afterwards gives a much clearer picture.
With very little business knowledge, identifying the clusters and verifying the segments was difficult.
Learnt how to write a macro to run the clustering for 3 to 20 clusters and then identify the best solution from the batch.
The use of PROC UNIVARIATE was a revelation when the segments matched the box plots, even though I am not sure the segments are correct as such.
APPENDIX 1 – EIGENVALUES WHEN THE CURVE CHANGES
APPENDIX 2 – EIGENVECTORS OF FIRST 10 PRINCIPAL COMPONENTS