
Dimensionality Reduction

Linear Discriminant Analysis (LDA)


 Linear discriminant analysis (LDA) is a method used in statistics to find a linear combination
of features that separates two or more classes of objects or events.
The resulting combination may be used as a linear classifier, or, more commonly, for
dimensionality reduction before classification.

 We use LDA when we want better separability between classes in classification problems. It is
also a dimensionality reduction technique.
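
Below is a minimal usage sketch of this idea (not from the original notes) using scikit-learn's
LinearDiscriminantAnalysis on synthetic two-class data; the dataset and variable names are
purely illustrative.

# Minimal sketch: LDA as a classifier and as a dimensionality-reduction step.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the "effective / ineffective drug" example.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Used as a linear classifier:
print("test accuracy:", lda.score(X_test, y_test))

# Used for dimensionality reduction: with 2 classes there is at most 1 discriminant axis.
X_1d = lda.transform(X_train)
print("reduced shape:", X_1d.shape)   # (n_samples, 1)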

 Consider the following example:

- Here, Green means that the drug is effective.
- Here, Red means that the drug is ineffective.

In this case, we have two dimensions of data (Gene X & Gene Y). To separate these two
categories, we can draw a dotted line. However, there are misclassifications, so the
separation isn't perfect.

Can we improve the classification by using 3D data (a third gene)?

Even with a 3D representation, it's hard for us to tell if the data is separated properly because
depth is not perceived on a 2D graph.

What if we need 4 or more genes to separate the data?


There are problems with this:
- We can't draw a graph with 4 or more dimensions on a flat surface.
- The same issue was faced in PCA as well:
- In PCA, we reduce the dimensions by focusing on the genes (variables) with the most
variation.
This is useful for plotting data with a lot of dimensions (or a lot of genes) onto a
simple X/Y plot.
- In LDA, we are not interested in the genes with the most variation, but in maximizing the
separability between the two groups so we can make the best classification decisions.

 Linear Discriminant Analysis (LDA) is like PCA, but it focuses on maximizing the separability
among known categories.

 Using a simple example, we will try to reduce a 2D graph into a 1D graph.

We can start with the bad ways of reducing dimensions:


- One bad option would be to ignore Gene Y and project the data entirely onto the X-axis.
This way is bad because it ignores the useful information that Gene Y provides;
likewise, projecting the data onto the Y-axis (i.e., ignoring Gene X) isn't any better.

To overcome these issues, LDA provides a better solution.

 What LDA does is use the information from both genes (variables) to create a new axis,
and it projects the data onto this new axis in a way that maximizes the separation of the
two categories.

 Now, how does LDA create the new axis?


The new axis is created according to two criteria (considered simultaneously).

1st criterion:
Once the data is projected onto the new axis, we want to maximize the distance between
the two means.
- Here, μ₁ is the mean of the green category and μ₂ is the mean of the red category.

2nd criterion:
We want to minimize the variation within each category (which LDA calls 'scatter',
denoted by S²).
- S₁² is the scatter of the green category and S₂² is the scatter of the red category.

Both criteria are considered simultaneously using the formula:

    (μ₁ − μ₂)² / (S₁² + S₂²)     ← numerator ideally large, denominator ideally small

Here, the numerator is squared because the mean of the green category may be larger or
smaller than the mean of the red category, so the difference could be negative; squaring
removes the sign.

Also, we can call (μ₁ − μ₂) the distance 'd'. Hence the formula becomes:

    d² / (S₁² + S₂²)
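
As a rough illustration (my own NumPy sketch with made-up data and hand-picked directions,
not the actual LDA solution), this is the quantity being computed for a candidate axis:

# Sketch of the criterion LDA maximizes: project both classes onto a direction,
# then divide the squared distance between the projected means by the sum of
# the within-class scatters.
import numpy as np

def fisher_score(X_green, X_red, direction):
    w = direction / np.linalg.norm(direction)    # unit vector for the candidate axis
    g = X_green @ w                               # 1D projections of each class
    r = X_red @ w
    d2 = (g.mean() - r.mean()) ** 2               # squared distance between the means
    scatter = ((g - g.mean()) ** 2).sum() + ((r - r.mean()) ** 2).sum()
    return d2 / scatter                           # large distance, small scatter -> large score

rng = np.random.default_rng(0)
X_green = rng.normal([0, 0], 0.5, size=(50, 2))   # made-up "effective" points
X_red = rng.normal([3, 1], 0.5, size=(50, 2))     # made-up "ineffective" points

# Ignoring Gene Y (projecting onto the X-axis) vs. a tilted axis:
print(fisher_score(X_green, X_red, np.array([1.0, 0.0])))
print(fisher_score(X_green, X_red, np.array([1.0, 0.4])))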
We shall now see why the distance b/w the two means and the scatters are important.
 Consider this dataset:
Here, the data is spread out well along the x-axis, but there is overlap along the y-axis.
In this case, if we only maximize the distance b/w the means, we will get something like this:

And this separation isn't great, as we'll have a lot of overlap in the middle.

However, if we optimize both the distance b/w the means and the scatter, we'd get a nice
separation b/w the categories.
Although the means in this graph are a little closer than they were in the middle graph,
the scatter is much smaller and therefore the separation is good.
 Now, what if we have more than 2 genes (more than 2 dimensions)?


The process is the same:
- Create an axis that maximizes the distance between the means of the two categories
while minimizing the scatter.

Let's take an example: here is a dataset with 3 genes.
- LDA will create a new axis.
- Then the data are projected onto the new axis.
The axis was chosen to maximize the distance between the two means (between the two
categories) while minimizing the "scatter".

 Now, what if we have 3 categories?


In this case, two things change:
1) The 1st difference is how we measure the distances among the means.
2) The 2nd difference is that LDA creates 2 axes instead of 1.
Consider a dataset where we have 2 genes but 3 categories.

For the 1st point:
- We calculate the centroid of the overall dataset.
- Then we measure the distances between the central point of each category and the
overall central point.
- And then we maximize the distance between each category and the central point while
minimizing the scatter within each category.
The equation that we want to optimize is:

    (d₁² + d₂² + d₃²) / (S₁² + S₂² + S₃²)
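
A NumPy sketch of that multi-class ratio (my own helper, not from the notes; it scores
whatever representation of the data you give it, e.g. the points after projection onto the
new axes):

# Sum of squared distances from each category centroid to the overall centroid,
# divided by the sum of the within-category scatters.
import numpy as np

def multiclass_criterion(X, y):
    overall = X.mean(axis=0)                       # central point of the whole dataset
    num, den = 0.0, 0.0
    for label in np.unique(y):
        Xc = X[y == label]
        centroid = Xc.mean(axis=0)
        num += np.sum((centroid - overall) ** 2)   # d_k squared
        den += np.sum((Xc - centroid) ** 2)        # scatter S_k squared
    return num / den

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.6, size=(40, 2)) for c in ([0, 0], [3, 0], [1.5, 3])])
y = np.repeat([0, 1, 2], 40)
print(multiclass_criterion(X, y))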

For the 2nd point:

LDA creates 2 axes in the case of 3 categories because the 3 central points (one per
category) define a plane.
(Remember: 2 points define a line; 3 points define a plane.)

Therefore, we create new x & y axes, and these are optimized to separate the categories
better.

When we only use 2 genes, this is no big deal. The data started out on an X/Y plot, and
plotting them on a new X/Y plot doesn't change much.

But what if we used 10,000 genes? That would mean we’d need 10,000 dimensions to draw
the data. And here is where being able to create 2 axes that maximize the separation of 3
categories becomes important.
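
A hedged scikit-learn sketch of this 3-category case (synthetic data standing in for the
genes, with 200 features instead of 10,000 to keep it quick): with 3 classes, LDA returns at
most 3 − 1 = 2 discriminant axes.

# Reduce many features to the 2 axes (LD1 and LD2) that best separate 3 categories.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=600, n_features=200, n_informative=20,
                           n_classes=3, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)
print(X_2d.shape)   # (600, 2): the data now live on the LD1/LD2 plane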

 Now, let’s compare LDA & PCA!


To do that, let's consider a dataset:

Here, we've got 3 categories that we're trying to separate, and we've got 10,000 genes.

Plotting this as raw data would require 10,000 axes, but we used LDA to reduce this
number to 2.

In the LDA graph, although the separation isn't perfect, it is still easy to see the 3
categories (using the 3 colored circles).

The PCA graph does not separate the categories nearly as well as LDA: we can see a lot of
overlap b/w the black and blue points. However, PCA wasn't even trying to separate the
categories; it was just looking for the genes with the most variation.
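
To make the comparison concrete, here is a sketch (synthetic labeled data, far fewer than
10,000 genes) that projects the same dataset with PCA and with LDA and scores each 2D
projection with the between-class / within-class ratio from earlier; the LDA projection
typically scores much higher.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           n_classes=3, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)                             # ignores y
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # uses y

def separation(Z, y):
    # Between-class spread over within-class spread in the 2D projection Z.
    overall = Z.mean(axis=0)
    num = sum(np.sum((Z[y == c].mean(axis=0) - overall) ** 2) for c in np.unique(y))
    den = sum(np.sum((Z[y == c] - Z[y == c].mean(axis=0)) ** 2) for c in np.unique(y))
    return num / den

print("PCA separation:", separation(X_pca, y))
print("LDA separation:", separation(X_lda, y))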

These are a few differences b/w LDA & PCA; now let’s talk about some similarities:
o Both methods rank the new axes that they create in order of importance.
 PC1 (the first new axis that PCA creates) accounts for the most variation in
the data.
 PC2 (the second new axis) does the second-best job.
 LD1 (the first new axis that LDA creates) accounts for the most variation
between the categories.
 LD2 (the second new axis) does the second-best job.

o Also, both methods can let you dig in and see which genes are driving the new axes.
 In PCA, this means looking at the loading scores.
 In LDA, we can see which genes or which variables correlate with the new
axes.
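
A sketch of digging into which variables drive the new axes: for PCA the loading scores live
in components_, and for scikit-learn's LDA the per-feature weights of the discriminant axes
are exposed as scalings_ (attribute names as I recall them for the default solver; worth
double-checking against the scikit-learn docs).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           n_classes=3, random_state=0)

pca = PCA(n_components=2).fit(X)
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

# Features with the largest-magnitude loading scores on PC1:
print(np.argsort(np.abs(pca.components_[0]))[::-1][:5])
# Features with the largest weight on LD1:
print(np.argsort(np.abs(lda.scalings_[:, 0]))[::-1][:5])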
SUMMARY
LDA is like PCA — both try to reduce dimensions.
— PCA looks at the genes with the most variation.
— LDA tries to maximize the separation of known categories.
More Notes

 LDA is also closely related to principal component analysis (PCA) and factor analysis in that
they all look for linear combinations of variables which best explain the data.

 LDA explicitly attempts to model the difference between the classes of data.
 PCA, in contrast, does not take into account any difference in class.
 Factor Analysis builds the feature combinations based on differences rather than similarities.

 Discriminant analysis is also different from factor analysis in that it is not an


interdependence technique: a distinction between independent variables and dependent
variables (also called criterion variables) must be made.

 LDA works when the measurements made on independent variables for each observation
are continuous quantities.
o When dealing with categorical independent variables, the equivalent
technique is Discriminant Correspondence Analysis.
 Consider the image on the right:
We see that there are two categories, and we need to enhance the separability of these
different categories.

To do this, we create a unit vector from the origin and project the data points onto this
unit vector.

Once done, we observe that the variance within each category of the projected datapoints
on the unit vector is smaller than the variance within the original dataset.

We also observe that, relative to that spread, the separation b/w the classes along the
unit vector is larger after the projection than before it.

Therefore, LDA increases the inter-cluster distance and decreases the intra-cluster variance.
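
A small NumPy sketch of that projection argument (synthetic two-category data and a
hand-picked unit vector rather than the true LDA direction):

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal([0, 0], [1.0, 2.0], size=(100, 2))   # category 1
B = rng.normal([2, 1], [1.0, 2.0], size=(100, 2))   # category 2

w = np.array([1.0, 0.2])
w /= np.linalg.norm(w)                 # unit vector from the origin

a, b = A @ w, B @ w                    # projected (1D) datapoints

print("within-class variance before:", A.var(axis=0).sum() + B.var(axis=0).sum())
print("within-class variance after: ", a.var() + b.var())
# Separation relative to spread along the unit vector (the LDA-style ratio):
print("separation on the unit vector:", (a.mean() - b.mean()) ** 2 / (a.var() + b.var()))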

 Lastly, the datapoints below will be projected as follows after LDA:

Before LDA → After LDA

 In the 3rd example, LDA will not work because the classes overlap. In this case, we apply
QDA (Quadratic Discriminant Analysis). The idea in QDA is to increase the dimensions so that
we can classify.
Thus the convex curve in 2D will look like a plane in 3D, and QDA can easily separate the
classes in higher dimensions.
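
A hedged sketch of that fallback using scikit-learn's QuadraticDiscriminantAnalysis on
synthetic data where one class sits inside a ring of the other, so no single straight line
can separate them:

from sklearn.datasets import make_circles
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

# One class surrounded by the other: not linearly separable.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

print("LDA accuracy:", LinearDiscriminantAnalysis().fit(X, y).score(X, y))    # near chance
print("QDA accuracy:", QuadraticDiscriminantAnalysis().fit(X, y).score(X, y)) # much higher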

PCA vs LDA

Purpose:
- PCA: to capture maximum variance.
- LDA: to provide the maximum separable boundary.

Objective:
- PCA is an unsupervised technique that aims to maximize the variance of the data along
  the principal components. The goal is to identify the directions that capture the most
  variation in the data.
- LDA, on the other hand, is a supervised technique that aims to maximize the separation
  between different classes in the data. The goal is to identify the directions that capture
  the most separation between the classes.

Dimensionality reduction:
- PCA reduces the dimensionality of the data by projecting it onto a lower-dimensional space.
- LDA reduces the dimensionality of the data by creating a linear combination of the features
  that maximizes the separation between the classes.

Output:
- PCA outputs principal components, which are linear combinations of the original features.
  These principal components are orthogonal to each other and capture the most variation
  in the data.
- LDA outputs discriminant functions, which are linear combinations of the original features
  that maximize the separation between the classes.

Interpretation:
- PCA is often used for exploratory data analysis, as the principal components can be used
  to visualize the data and identify patterns.
- LDA is often used for classification tasks, as the discriminant functions can be used to
  separate the classes.

Performance:
- PCA is generally faster and more computationally efficient than LDA, as it does not
  require labeled data.
- However, LDA may be more effective at capturing the most important information in the
  data when class labels are available.

Supervision:
- PCA is an unsupervised ML technique (no target or label is involved).
- LDA is a supervised ML technique used for classification tasks (a target is involved); it
  focuses on finding features such that the separability b/w groups is maximum.

Typical use:
- PCA is used mostly to reduce dimensionality.
- LDA is used to enhance classification by increasing separability.

Information loss:
- In PCA, some information is lost because we keep only the PCs that explain most of the
  variance (e.g., >90%).
- In LDA, some finer details are lost too, but LDA retains the most discriminative
  information for classification; it aims to preserve the information that helps distinguish
  between the classes.

Code:
- PCA: pca.fit(x) then pca.transform(x). No 'y' (target) is passed because it is an
  unsupervised ML technique.
- LDA: lda.fit_transform(x, y). Here, 'y' (the target) is passed because it is a supervised
  ML technique.

LDA vs PCA: When to use which method?
 PCA is an unsupervised learning algorithm while LDA is a supervised learning
algorithm.
 This means that PCA finds directions of maximum variance regardless of class labels
while LDA finds directions of maximum class separability.

So now that you know how each method works, when should you use PCA vs LDA for
dimensionality reduction?
 In general, you should use LDA when your goal is classification – that is, when you
have labels for your data points and want to predict which label new points will have
based on their feature values.
 On the other hand, if you don’t have labels for your data or if your goal is simply to
find patterns in your data (not classification), then PCA will likely work better.

That said, there are some situations where LDA may outperform PCA even when you’re not
doing classification.
 For example, imagine that your data has 100 features but only 10% of those features
are actually informative (the rest are noise).
 If you run PCA on this dataset, it will identify all 100 components since its goal is
simply to maximize variance.
 However, because only 10% of those components are actually informative, 90% of
them will be useless.
 If you were to run LDA on this same dataset, it would only identify 10 components,
since its goal of capturing class separability is better served by discarding noisy
features.
 Thus, if noise dominates your dataset, then LDA may give better results even if your
goal isn’t classification!
 Because LDA makes stronger assumptions about the structure of your data, it will
often perform better than PCA when your dataset satisfies those assumptions but
worse when it doesn’t.

 Assumptions of LDA:
o LDA assumes that the data has a Gaussian distribution and that the covariance
matrices of the different classes are equal.
o It also assumes that the data is linearly separable, meaning that a linear decision
boundary can accurately classify the different classes.

 Advantages of LDA:
o It is a simple and computationally efficient algorithm.
o It can work well even when the number of features is much larger than the number
of training samples.
o It can handle multicollinearity (correlation between features) in the data.

 Disadvantages of LDA:
o It assumes that the data has a Gaussian distribution, which may not always be the
case.
o It assumes that the covariance matrices of the different classes are equal, which may
not be true in some datasets.
o It assumes that the data is linearly separable, which may not be the case for some
datasets.
o It may not perform well in high-dimensional feature spaces.
o LDA is used specifically for solving supervised classification problems with multiple
classes, something plain (binary) logistic regression cannot do directly. But LDA does not
work when the class distributions share the same mean.
o In such a situation, LDA cannot produce a new axis that linearly separates the
classes. To solve this problem, non-linear discriminant analysis is used in machine
learning.
