Dimensionality Reduction: Linear Discriminant Analysis (LDA)
We use LDA when we want better separability for classification problems. It is also a
dimensionality reduction technique.
Even with a 3D representation, it's hard for us to tell whether the data is separated properly,
because depth is not perceived on a 2D graph.
Linear Discriminant Analysis (LDA) is like PCA, but it focuses on maximizing the separability
among known categories.
Using a simple example, we will try to reduce a 2D graph into a 1D graph.
LDA creates the new axis according to two criteria, considered simultaneously:
1st criterion: maximize the distance between the means of the two categories, $\mu_1$ (green) and $\mu_2$ (red).
2nd criterion: minimize the variation within each category (which LDA calls 'scatter', denoted $s^2$).
- $s_1^2$ is the scatter for the green category and $s_2^2$ is the scatter for the red category.
And both criteria are considered simultaneously using the formula:

$$\frac{(\mu_1 - \mu_2)^2}{s_1^2 + s_2^2} \quad \longrightarrow \quad \frac{\text{ideally large}}{\text{ideally small}}$$
Here, the numerator is squared because the difference between the two means may be negative
(depending on whether the green mean or the red mean is larger); squaring it keeps the criterion positive.
Also, in the above formula, we can call $(\mu_1 - \mu_2)$ the distance, $d$. Hence the formula becomes:

$$\frac{d^2}{s_1^2 + s_2^2}$$
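To make this ratio concrete, here is a minimal numeric sketch; the green and red values below are made-up 1-D projections, not data from the example:

import numpy as np

# Made-up 1-D projections of the two categories (purely illustrative numbers).
green = np.array([1.0, 1.5, 2.0, 2.5])
red = np.array([5.0, 5.5, 6.0, 6.5])

d_squared = (green.mean() - red.mean()) ** 2                              # (mu_1 - mu_2)^2, ideally large
scatter = ((green - green.mean()) ** 2).sum() + ((red - red.mean()) ** 2).sum()  # s_1^2 + s_2^2, ideally small

print(d_squared / scatter)                                                # the ratio LDA tries to maximize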
We shall now see why the distance b/w the two means and the scatters are important.
Consider this dataset:
Here, the data is spread out pretty well along the x-axis, but there is an overlap along
the y-axis. In this case, if we only maximize the distance b/w the means, then
we will get something like this:
And this separation isn’t great as we’ll have a lot of overlap in the middle.
However, if we optimize the distance b/w the means and the scatter, we’d get:
In this case, we’d get a nice separation b/w the categories.
Although the means in this graph are a little closer than they were in the middle graph,
the scatter is much smaller, and therefore the separation is good.
When there are three categories, the same idea generalizes: LDA finds a point central to all the data, maximizes the (squared) distances $d_1, d_2, d_3$ between each category's mean and that central point, and minimizes the scatter within each category:

$$\frac{d_1^2 + d_2^2 + d_3^2}{s_1^2 + s_2^2 + s_3^2}$$
When we only use 2 genes, this is no big deal. The data started out on an X/Y plot, and plotting
them on a new X/Y plot doesn't change much.
But what if we used 10,000 genes? That would mean we’d need 10,000 dimensions to draw
the data. And here is where being able to create 2 axes that maximize the separation of 3
categories becomes important.
In the LDA graph, although the separation isn't perfect, it is still easy to see the 3 categories
(using the 3 colored circles).
The PCA graph does not separate the categories nearly as well as LDA; we can see a lot
of overlap b/w the black and blue points. However, PCA wasn't even trying to separate the
categories; it was just looking for the genes with the most variation.
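As a rough sketch of this idea (using synthetic data in place of the gene measurements shown in the figures; the make_classification call, sizes, and random seed below are illustrative assumptions), both projections can be computed with scikit-learn:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for a high-dimensional "gene" dataset with 3 known categories
# (smaller than the 10,000 genes mentioned above, to keep the sketch light).
X, y = make_classification(n_samples=300, n_features=1000, n_informative=20,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# LDA uses the labels y and can create at most (3 - 1) = 2 discriminant axes here.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# PCA ignores y and simply keeps the two directions of largest variance.
X_pca = PCA(n_components=2).fit_transform(X)

# Plotting X_lda vs X_pca (colored by y) typically shows clearer class separation for LDA.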
These are a few differences b/w LDA & PCA; now let’s talk about some similarities:
o Both methods rank the new axes that they create in order of importance.
PC1 (the first new axis that PCA creates) accounts for the most variation in
the data.
PC2 (the second new axis) does the second-best job.
LD1 (the first new axis that LDA creates) accounts for the most variation
between the categories.
LD2 (the second new axis) does the second-best job.
o Also, both methods let you dig in and see which genes are driving the new axes.
In PCA, this means looking at the loading scores.
In LDA, we can see which genes (or variables) correlate with the new
axes (a sketch follows below).
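One way to inspect this in scikit-learn is sketched below; the wine dataset is just a convenient stand-in for the gene data, and the two attributes shown are real scikit-learn attributes for these estimators:

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_wine(return_X_y=True)

pca = PCA(n_components=2).fit(X)
print(pca.components_)      # loading scores: weight of each original feature on PC1 and PC2

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
print(lda.scalings_)        # weight of each original feature on the discriminant axes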
SUMMARY
LDA is like PCA — both try to reduce dimensions.
— PCA looks at the genes with the most variation.
— LDA tries to maximize the separation of known categories.
More Notes
LDA is also closely related to principal component analysis (PCA) and factor analysis in that
all three look for linear combinations of variables which best explain the data.
LDA explicitly attempts to model the difference between the classes of data.
PCA, in contrast, does not take into account any difference in class.
Factor Analysis builds the feature combinations based on differences rather than similarities.
LDA works when the measurements made on independent variables for each observation
are continuous quantities.
o When dealing with categorical independent variables, the equivalent
technique is Discriminant Correspondence Analysis.
Consider the image on the right: we see that there are two categories.
We need to enhance the separability of these different categories, i.e., make the
difference b/w the classes clearer.
Lastly, the datapoints below will be projected as follows after LDA:
QDA (Quadratic Discriminant Analysis): the idea in QDA is to move beyond a straight-line boundary
so that we can still classify.
Thus, a boundary that looks like a convex curve in 2D will look like a plane in 3D, and it will easily
separate the classes in higher dimensions.
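A small sketch of this point; the ring-shaped toy data below is an assumption, chosen so that no straight line can separate the classes, while scikit-learn's QuadraticDiscriminantAnalysis can fit a curved boundary:

import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

# Toy data: one class clustered at the origin, the other on a ring around it.
rng = np.random.default_rng(0)
inner = rng.normal(0.0, 0.5, size=(200, 2))
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
outer = np.column_stack([3.0 * np.cos(angles), 3.0 * np.sin(angles)])

X = np.vstack([inner, outer])
y = np.array([0] * 200 + [1] * 200)

print(LinearDiscriminantAnalysis().fit(X, y).score(X, y))     # near 0.5: a linear boundary fails
print(QuadraticDiscriminantAnalysis().fit(X, y).score(X, y))  # near 1.0: a curved boundary works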
PCA vs LDA

Purpose:
- PCA: to capture maximum variance.
- LDA: to provide a maximally separable boundary.

Objective:
- PCA is an unsupervised technique that aims to maximize the variance of the data along the principal components. The goal is to identify the directions that capture the most variation in the data.
- LDA, on the other hand, is a supervised technique that aims to maximize the separation between different classes in the data. The goal is to identify the directions that capture the most separation between the classes.

Dimensionality Reduction:
- PCA reduces the dimensionality of the data by projecting it onto a lower-dimensional space.
- LDA reduces the dimensionality of the data by creating a linear combination of the features that maximizes the separation between the classes.

Output:
- PCA outputs principal components, which are linear combinations of the original features. These principal components are orthogonal to each other and capture the most variation in the data.
- LDA outputs discriminant functions, which are linear combinations of the original features that maximize the separation between the classes.

Interpretation:
- PCA is often used for exploratory data analysis, as the principal components can be used to visualize the data and identify patterns.
- LDA is often used for classification tasks, as the discriminant functions can be used to separate the classes.

Performance:
- PCA is generally faster and more computationally efficient than LDA, as it does not require labeled data.
- LDA, however, may be more effective at capturing the most important information in the data when class labels are available.
PCA:
- It's an unsupervised ML technique (meaning no numeric target or label is involved).
- Used mostly to reduce dimensionality.
- Some information is lost, as we only keep the PCs that provide >90% of the variance.

LDA:
- It's a supervised ML technique used for classification tasks (a target is involved).
- It focuses on finding features such that the separability b/w groups is maximized.
- Used to enhance classification by increasing separability.
- While some finer details are lost, LDA retains the most discriminative information for classification purposes; it aims to preserve the information that helps distinguish between the different classes.

Code:
- PCA: pca.fit(x) followed by pca.transform(x). No 'y' (target) is passed because it is an unsupervised ML technique.
- LDA: lda.fit_transform(x, y). Here 'y' (the target) is passed because it is a supervised ML technique.
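A runnable version of the calls above, assuming scikit-learn; the Iris dataset and the choice of two components are illustrative assumptions, not part of the original notes:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

x, y = load_iris(return_X_y=True)

pca = PCA(n_components=2)
pca.fit(x)                         # unsupervised: no target passed
x_pca = pca.transform(x)

lda = LinearDiscriminantAnalysis(n_components=2)
x_lda = lda.fit_transform(x, y)    # supervised: the target y is required

print(x_pca.shape, x_lda.shape)    # both reduce 4 features down to 2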
LDA vs PCA: When to use which method?
PCA is an unsupervised learning algorithm while LDA is a supervised learning
algorithm.
This means that PCA finds directions of maximum variance regardless of class labels
while LDA finds directions of maximum class separability.
So now that you know how each method works, when should you use PCA vs LDA for
dimensionality reduction?
In general, you should use LDA when your goal is classification – that is, when you
have labels for your data points and want to predict which label new points will have
based on their feature values.
On the other hand, if you don’t have labels for your data or if your goal is simply to
find patterns in your data (not classification), then PCA will likely work better.
That said, there are some situations where LDA may outperform PCA even when you’re not
doing classification.
For example, imagine that your data has 100 features but only 10% of those features
are actually informative (the rest are noise).
If you run PCA on this dataset, it will identify all 100 components since its goal is
simply to maximize variance.
However, because only 10% of those components are actually informative, 90% of
them will be useless.
If you were to run LDA on this same dataset, it would keep far fewer directions (at most the
number of classes minus one), since its goal of capturing class separability is better served by
discarding the noisy features (a sketch follows after this list).
Thus, if noise dominates your dataset, then LDA may give better results even if your
goal isn’t classification!
Because LDA makes stronger assumptions about the structure of your data, it will
often perform better than PCA when your dataset satisfies those assumptions but
worse when it doesn’t.
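As a sketch of the 100-feature scenario above (the make_classification call, the sample size, and the two-class setup are illustrative assumptions): PCA will happily rank up to 100 directions, while LDA keeps at most one here because there are only two classes.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 100 features, only 10 of which actually carry class information; the rest are noise.
X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           n_redundant=0, n_classes=2, random_state=0)

pca = PCA().fit(X)
print(pca.explained_variance_ratio_.shape)   # (100,): PCA ranks all directions by variance

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.explained_variance_ratio_.shape)   # (1,): at most n_classes - 1 discriminant axes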
Assumptions of LDA:
o LDA assumes that the data has a Gaussian distribution and that the covariance
matrices of the different classes are equal.
o It also assumes that the data is linearly separable, meaning that a linear decision
boundary can accurately classify the different classes.
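Under these assumptions (a shared covariance matrix $\Sigma$, class means $\mu_k$, and class priors $\pi_k$), one standard way to write the LDA decision rule is to assign $x$ to the class with the largest score:

$$\delta_k(x) = x^{\top}\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^{\top}\Sigma^{-1}\mu_k + \log \pi_k$$

Because every $\delta_k$ is linear in $x$, the boundary between any two classes (where $\delta_j(x) = \delta_k(x)$) is a hyperplane; if the true class covariances differ, this assumption breaks down and QDA's quadratic boundaries are more appropriate.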
Advantages of LDA:
o It is a simple and computationally efficient algorithm.
o It can work well even when the number of features is much larger than the number
of training samples.
o It can handle multicollinearity (correlation between features) in the data.
Disadvantages of LDA:
o It assumes that the data has a Gaussian distribution, which may not always be the
case.
o It assumes that the covariance matrices of the different classes are equal, which may
not be true in some datasets.
o It assumes that the data is linearly separable, which may not be the case for some
datasets.
o It may not perform well in high-dimensional feature spaces.
o LDA is used specifically for solving supervised classification problems with multiple
classes, something that plain (binary) logistic regression cannot do directly. However, LDA
does not work when the class distributions share the same mean.
o In such a situation, LDA cannot produce a new axis that linearly separates the
classes. To solve this problem, non-linear discriminant analysis is used in machine
learning.