Discriminant analysis
• Discriminant analysis is an appropriate statistical technique when the
dependent variable is a categorical (nominal or nonmetric) variable
and the independent variables are metric variables.
• In many cases, the dependent variable consists of two groups or
classifications, for example, good versus bad or high versus low. In
other circumstances, more than two groups are involved, such as low,
medium, and high classifications.
• Discriminant analysis is capable of handling either two groups or
multiple groups.
• When two classifications are involved, the technique is referred to as
two-group discriminant analysis. When three or more classifications
are identified, the technique is referred to as Multiple Discriminant
Analysis (MDA).
• Discriminant analysis involves deriving a variate. The discriminant
variate is the linear combination of the two (or more) independent
variables that will discriminate best between the objects in the
groups defined a priori. Discrimination is achieved by calculating the
variate’s weight for each independent variable to maximize the
differences between the groups (i.e., the between-group variance
relative to the within-group variance).
Objectives of Discriminant Analysis
1. Development of discriminant functions, or linear combinations of the
predictor or independent variables, that best discriminate between the
categories of the criterion or dependent variable (groups).
2. Examination of whether significant differences exist among the groups,
in terms of the predictor variables.
3. Determination of which predictor variables contribute to most of the
inter-group differences.
4. Classification of cases to one of the groups based on the values of the
predictor variables.
5. Evaluation of the accuracy of classification
Assumptions
1. Cases or the individuals should be independent
2. Predictor variables should have a multivariate normal distribution
3. Within-group variance-covariance matrices should be equal across
the groups
4. Group membership is assumed to be mutually exclusive, i.e., no
case belongs to more than one group
5. Group membership should be collectively exhaustive, i.e., all cases
are members of a group
Discriminant analysis model
• The variate for a discriminant analysis, also known as the discriminant
function, is derived from an equation much like that seen in multiple
regression. It takes the following form:
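$$Z_{jk} = a + W_1 X_{1k} + W_2 X_{2k} + \dots + W_n X_{nk}$$
Where
$Z_{jk}$ = discriminant Z score of discriminant function j for object k
$a$ = intercept
$W_i$ = discriminant weight for independent variable i
$X_{ik}$ = independent variable i for object k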
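As a minimal sketch of how one discriminant Z score is computed from this equation, the snippet below evaluates the linear combination for a single observation. The intercept, weights, and predictor values are hypothetical numbers chosen for illustration; they do not come from any fitted model.

```python
def discriminant_score(intercept, weights, values):
    """Return Z = a + W1*X1 + ... + Wn*Xn for one observation."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return intercept + sum(w * x for w, x in zip(weights, values))


# Hypothetical intercept, weights, and predictor values (illustration only).
a = -4.0
W = [0.5, 1.0]
x_k = [6.0, 2.0]

z_k = discriminant_score(a, W, x_k)
print(z_k)  # -4.0 + 0.5*6.0 + 1.0*2.0 = 1.0
```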
• The coefficients or weights (Wi) are estimated so that the groups
differ as much as possible on the values of the discriminant function.
This occurs when the ratio of between-group sum of squares to
within-group sum of squares for the discriminant scores is at a
maximum. Any other linear combination of the predictors will result in
a smaller ratio.
• If the dependent variable consists of more than two groups,
discriminant analysis will calculate more than one discriminant
function. In fact, it will calculate NG-1 functions, where NG is the
number of groups. Each discriminant function will calculate a separate
discriminant Z score.
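The criterion described above can be sketched in a few lines: for a candidate weight vector, compute the discriminant scores, then divide the between-group sum of squares by the within-group sum of squares. The two-group data set and the candidate weights are made up for illustration. The intercept is omitted because it shifts every score equally and leaves the ratio unchanged.

```python
def ss_ratio(weights, groups):
    """Between-group SS divided by within-group SS of the discriminant scores.

    groups maps a group label to a list of observations, each a list of
    predictor values. Scores are the weighted sums of the predictors.
    """
    scores = {
        label: [sum(w * x for w, x in zip(weights, obs)) for obs in rows]
        for label, rows in groups.items()
    }
    all_scores = [z for zs in scores.values() for z in zs]
    grand_mean = sum(all_scores) / len(all_scores)

    between = 0.0
    within = 0.0
    for zs in scores.values():
        centroid = sum(zs) / len(zs)  # group mean of the scores
        between += len(zs) * (centroid - grand_mean) ** 2
        within += sum((z - centroid) ** 2 for z in zs)
    return between / within


# Hypothetical toy data: two groups, two predictors (illustration only).
toy = {
    "A": [[1, 5], [2, 4], [3, 6]],
    "B": [[4, 5], [5, 4], [6, 6]],
}

for w in ([0, 1], [1, 0], [2, -1]):
    print(w, ss_ratio(w, toy))
# [0, 1] 0.0
# [1, 0] 3.375
# [2, -1] 4.5
```

Of these three candidates, the weights (2, -1) separate the toy groups best, even though the second predictor has identical group means. It helps because it is correlated with the first predictor within each group.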
Discriminant analysis is the appropriate statistical technique for testing
the hypothesis that the group means of a set of independent variables
for two or more groups are equal.
By averaging the discriminant scores for all the individuals within a
particular group, we arrive at the group mean. This group mean is
referred to as a centroid.
When the analysis involves two groups, there are two centroids; with
three groups, there are three centroids; and so forth. The centroids
indicate the most typical location of any member from a particular
group, and a comparison of the group centroids shows how far apart
the groups are in terms of that discriminant function.
Assessing overall model fit
This assessment involves three tasks:
1. Calculating discriminant Z scores for each observation
2. Evaluating group differences on the discriminant Z scores
3. Assessing group membership prediction accuracy
1. Calculating discriminant Z scores for each observation
2. Evaluating group differences on the discriminant Z scores
• Once the discriminant Z scores are calculated, the first assessment of overall
model fit is to determine the magnitude of differences between the members
of each group in terms of the discriminant Z scores.
• A summary measure of the group differences is a comparison of the group
centroids. If the normality assumption holds, each group's discriminant Z scores
are approximately normally distributed. The degree of overlap between the
discriminant score distributions can then be used as a measure of the success
of the technique.
• The difference between group centroids is measured with the
Mahalanobis D² statistic, a generalized squared distance that
takes the variances and covariances of the independent
variables into account. Applied to a single case, D² measures
how far the case's values on the independent variables lie from
the average of all cases; a large value identifies a case with
extreme values on one or more of the independent variables.
• Another test is based on the likelihood ratio and is known as
Wilks' lambda. It is the ratio of the determinant of the
within-group sums-of-squares-and-cross-products matrix to the
determinant of the total sums-of-squares-and-cross-products
matrix.
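To make the Mahalanobis distance between two centroids concrete, here is a minimal sketch restricted to two predictors, so that the 2x2 matrix inverse can be written out by hand. The centroids and the pooled within-group covariance matrix are hypothetical inputs, and the function name is an illustrative choice.

```python
def mahalanobis_d2(centroid_a, centroid_b, pooled_cov):
    """Squared Mahalanobis distance between two 2-variable centroids.

    pooled_cov is the 2x2 pooled within-group covariance matrix,
    given as ((s11, s12), (s21, s22)).
    """
    (s11, s12), (s21, s22) = pooled_cov
    det = s11 * s22 - s12 * s21
    if det == 0:
        raise ValueError("pooled covariance matrix is singular")
    d1 = centroid_a[0] - centroid_b[0]
    d2 = centroid_a[1] - centroid_b[1]
    # d' S^{-1} d, using the explicit inverse of a 2x2 matrix.
    return (s22 * d1 * d1 - (s12 + s21) * d1 * d2 + s11 * d2 * d2) / det


# Hypothetical centroids and covariance matrices (illustration only).
identity = ((1.0, 0.0), (0.0, 1.0))
print(mahalanobis_d2((0.0, 0.0), (3.0, 4.0), identity))  # 25.0

wider_x1 = ((4.0, 0.0), (0.0, 1.0))
print(mahalanobis_d2((0.0, 0.0), (3.0, 4.0), wider_x1))  # 18.25
```

With an identity covariance matrix, D² reduces to the squared Euclidean distance. A larger variance on the first predictor shrinks that predictor's contribution to the distance.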
3. Assessing group membership prediction accuracy
• To determine the predictive ability of a discriminant function, the researcher
must construct classification matrices.
• The classification matrix procedure provides a perspective on practical
significance. With multiple discriminant analysis, the percentage correctly
classified, also termed the hit ratio, reveals how well the discriminant
function classified the objects.
• Classifying Individual observations
• The basic formula for computing the optimal cutting score
between any two groups is:
$$Z_{CS} = \frac{N_A Z_B + N_B Z_A}{N_A + N_B}$$
Where
$Z_{CS}$ = cutting score between groups A and B
$N_A$ = number of observations in group A
$N_B$ = number of observations in group B
$Z_A$ = centroid for group A
$Z_B$ = centroid for group B
• If the groups are specified to be of equal size, then the optimum
cutting score will be halfway between the two group centroids
and becomes simply the average of the two centroids:
$$Z_{CS} = \frac{Z_A + Z_B}{2}$$
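The cutting-score rule can be sketched as follows. The discriminant scores are hypothetical, and the helper names (`centroid`, `cutting_score`, `classify`) are illustrative choices rather than any package's API. Each centroid is the mean discriminant score of its group, and a new case is assigned to the group whose centroid lies on the same side of the cutting score.

```python
def centroid(scores):
    """Group centroid: the mean discriminant score of the group."""
    return sum(scores) / len(scores)


def cutting_score(n_a, z_a, n_b, z_b):
    """Z_CS = (N_A * Z_B + N_B * Z_A) / (N_A + N_B)."""
    return (n_a * z_b + n_b * z_a) / (n_a + n_b)


def classify(z, z_cs, z_a, z_b):
    """Assign a score to the group on the same side of the cutting score."""
    if z_a < z_b:
        return "A" if z < z_cs else "B"
    return "A" if z > z_cs else "B"


# Hypothetical discriminant scores for two groups (illustration only).
scores_a = [-2.0, -1.0, 0.0]
scores_b = [1.0, 2.0, 3.0, 2.0]

z_a, z_b = centroid(scores_a), centroid(scores_b)   # -1.0 and 2.0
z_cs = cutting_score(len(scores_a), z_a, len(scores_b), z_b)
print(round(z_cs, 4))                  # 0.2857
print(classify(0.0, z_cs, z_a, z_b))   # A
print(classify(0.4, z_cs, z_a, z_b))   # B

# With equal group sizes the formula reduces to the midpoint.
print(cutting_score(5, z_a, 5, z_b))   # 0.5
```

In this toy case the weighted cutting score (about 0.29) sits closer to the centroid of the smaller group A than the simple midpoint (0.5) does.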
The output of discriminant analysis consists of the following statistics:
7. Pooled within-group correlation matrix. The pooled within-group correlation
matrix is computed by averaging the separate covariance matrices for all the
groups.
8. Box's M test. It checks the assumption of homogeneity of covariance matrices.
The null hypothesis for this test is that the covariance matrices of the
predictor variables are equal across groups.
9. Standardised discriminant function coefficients. These are the discriminant
weights used as multipliers when the predictor variables have been standardised
to a mean of 0 and a variance of 1.
10. Structure Matrix: this gives the correlation of predictor variables with the
discriminant function.
11. Wilks' lambda. Wilks' λ for each predictor is the ratio of the within-group sum
of squares to the total sum of squares. It can be interpreted as the proportion of the
total variance in the discriminant scores not explained by differences among the
groups. Its value varies between 0 and 1. Large values of λ (near 1) indicate that
the group means do not seem to be different. Small values of λ (near 0) indicate
that the group means seem to be different.
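For a single predictor, the ratio just described is easy to compute directly. The sketch below uses two hypothetical data sets, one with well-separated group means and one with identical group means, to show λ near 0 and λ equal to 1. The function name and the data are illustrative assumptions.

```python
def wilks_lambda(groups):
    """Univariate Wilks' lambda: within-group SS / total SS.

    groups maps a group label to a list of values of one predictor.
    """
    values = [v for vs in groups.values() for v in vs]
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)

    within_ss = 0.0
    for vs in groups.values():
        group_mean = sum(vs) / len(vs)
        within_ss += sum((v - group_mean) ** 2 for v in vs)
    return within_ss / total_ss


# Hypothetical predictor values (illustration only).
separated = {"A": [1, 2, 3], "B": [7, 8, 9]}
same_means = {"A": [1, 5, 9], "B": [2, 5, 8]}

print(round(wilks_lambda(separated), 3))  # 0.069
print(wilks_lambda(same_means))           # 1.0
```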
Validation of Results
Classification results (counts of original cases)
Actual group    Predicted No    Predicted Yes    Total
No              384             133              517
Yes             48              135              183
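The hit ratio for these classification results can be checked with a short calculation. The counts are taken from the table above; the dictionary layout and the function names are illustrative choices.

```python
# Counts from the classification table:
# outer keys = actual group, inner keys = predicted group.
matrix = {
    "No": {"No": 384, "Yes": 133},
    "Yes": {"No": 48, "Yes": 135},
}


def hit_ratio(m):
    """Share of all cases that fall on the diagonal (correctly classified)."""
    correct = sum(m[g][g] for g in m)
    total = sum(sum(row.values()) for row in m.values())
    return correct / total


def group_accuracy(m):
    """Share of each actual group that was classified correctly."""
    return {g: m[g][g] / sum(m[g].values()) for g in m}


print(f"{hit_ratio(matrix):.1%}")  # 74.1%  (519 of 700 cases)
for group, acc in group_accuracy(matrix).items():
    print(group, f"{acc:.1%}")
# No 74.3%   (384 of 517)
# Yes 73.8%  (135 of 183)
```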