Pca Smote

Uploaded by kobaya7455

Principal Component Analysis, or PCA

• Is a dimensionality-reduction method

• It is often used to reduce the dimensionality of large data sets

How?
• By transforming a large set of variables into a smaller one that still
contains most of the information in the large set.
Principal Component Analysis, or PCA
• Reducing the number of variables of a data set naturally comes at the
expense of accuracy.

• The trick in dimensionality reduction is to trade a little accuracy
for simplicity.

• Smaller data sets are easier to explore and visualize, and they make
analysis much faster for machine learning algorithms because there are
fewer extraneous variables to process.
Idea of PCA
• Reduce the number of variables of a data set, while preserving as
much information as possible.
Principal Component Analysis (PCA)

• Given a set of points, how do we know if they can be compressed like in
the previous example?
– The answer is to look at the correlation between the points
– The tool for doing this is called PCA
PCA
• By finding the eigenvalues and eigenvectors of the covariance matrix,
we find that the eigenvectors with the largest eigenvalues correspond
to the directions along which the data varies most (the strongest
correlations in the dataset).
• These directions are the principal components.
• PCA is a useful statistical technique that has found application in:
– fields such as face recognition and image compression
– finding patterns in data of high dimension.
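The eigendecomposition step above can be sketched in a few lines of NumPy. This is a minimal illustration on made-up correlated 2-D data (the data set and variable names are invented for the example), not a production implementation:

```python
import numpy as np

# Toy 2-D data where the second variable is strongly correlated with the first
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

# Center the data, then eigendecompose its covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # returned in ascending order

# Sort descending: the eigenvector with the largest eigenvalue is PC1
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the top component: 2-D -> 1-D while keeping most information
X_reduced = Xc @ eigvecs[:, :1]
print(eigvals[0] / eigvals.sum())  # fraction of total variance kept by PC1
```

Because the two variables are almost perfectly correlated, a single principal component retains nearly all of the variance, which is exactly the compression idea described above.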
Imbalanced Data Set
Imbalanced Data Set
• Classification predictive modeling involves predicting a class label for
a given observation.

• An imbalanced classification problem is an example of a classification
problem where the distribution of examples across the known classes
is biased or skewed.

• The distribution can vary from a slight bias to a severe imbalance
where there is one example in the minority class for hundreds,
thousands, or millions of examples in the majority class or classes.
Example
• Cancer Prediction

No Cancer – 900 --- Majority Class
Yes Cancer – 100 --- Minority Class

If 1000 records are given, a model biased towards "No Cancer" still
achieves 90% accuracy.

Most algorithms work towards the majority class.

In many business problems the minority class is the focus class,
e.g. Spam / Non-Spam.

If accuracy is taken as the metric, algorithms tend to be biased towards
the majority class.

Methods to handle
• Under-sampling

100 – NC
100 – C
====
200 -- perfectly balanced
========
• In ML, data is very important; losing data is not recommended.
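The under-sampling step can be sketched in pure Python. This is a minimal sketch with hypothetical record lists (the `majority`/`minority` data are invented for illustration): keep every minority record and a random sample of the majority records of the same size.

```python
import random

random.seed(0)
majority = [("no_cancer", i) for i in range(900)]   # 900 majority records
minority = [("cancer", i) for i in range(100)]      # 100 minority records

# Randomly keep only as many majority records as there are minority records
sampled_majority = random.sample(majority, len(minority))
balanced = sampled_majority + minority

print(len(balanced))  # 200 -- perfectly balanced, but 800 records were discarded
```

The balance is achieved at the cost of throwing away 800 real majority-class records, which is exactly the data-loss drawback the slide warns about.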
Methods to handle
• Over-sampling

900 – NC
900 – C
===================================
Cancer: take minority records at random and duplicate them
until the class reaches 900.
Random duplication: some records may be duplicated more often,
others less.
Of the 900 minority records, 800 are duplicates.
===================================
1800 -- perfectly balanced --- focus is on minority class
===================================
• In ML, data is very important; losing data is not recommended.
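Random over-sampling by duplication can be sketched the same way (again with invented data): sample minority records with replacement until the minority class matches the majority class size.

```python
import random

random.seed(0)
majority = ["no_cancer"] * 900
minority = ["cancer"] * 100

# Duplicate randomly chosen minority records (with replacement)
# until the minority class also has 900 records
duplicates = random.choices(minority, k=900 - len(minority))
oversampled_minority = minority + duplicates
balanced = majority + oversampled_minority

print(len(balanced))             # 1800 -- perfectly balanced
print(len(oversampled_minority)) # 900, of which 800 are duplicates
```

No data is lost here, but because `random.choices` samples with replacement, some records are duplicated many times and others rarely, just as the slide notes.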
Under Sampling vs Over Sampling
Methods to handle
• SMOTE (Synthetic Minority Oversampling Technique)

SMOTE
• Calculate the vector between two minority-class points,
multiply it by a random number between 0 and 1, and plot the
new data point at the result.

• This new point is the synthetic data point.

SMOTE – Repeat the process until you reach
the desired number of points.
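The interpolation step above can be sketched in pure Python. This is a simplified sketch, not the full algorithm from the SMOTE paper (which interpolates towards one of the k nearest neighbours); here each synthetic point is placed between a random minority sample and its single nearest minority neighbour, and the toy 2-D points are invented for illustration:

```python
import math
import random

random.seed(0)

def nearest_neighbour(p, points):
    """Return the closest other point to p (Euclidean distance)."""
    others = [q for q in points if q is not p]
    return min(others, key=lambda q: math.dist(p, q))

def smote_sketch(minority, n_new):
    """Generate n_new synthetic points by interpolating between
    a random minority point and its nearest minority neighbour."""
    synthetic = []
    for _ in range(n_new):
        p = random.choice(minority)
        q = nearest_neighbour(p, minority)
        t = random.random()  # random number between 0 and 1
        synthetic.append(tuple(pi + t * (qi - pi) for pi, qi in zip(p, q)))
    return synthetic

minority = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 1.8)]
new_points = smote_sketch(minority, 5)
print(len(new_points))  # 5 synthetic minority points
```

Unlike plain duplication, every synthetic point is new: it lies somewhere on the line segment between two real minority points, which is why SMOTE tends to generalize better than random over-sampling.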
