CSE-1-PPT-MiniTest-12feb24-Correlation (3)

The document discusses data redundancy, its causes, and its impact on machine learning, emphasizing the importance of eliminating redundant data for improved model performance. It also covers correlation analysis, its types, and applications in feature selection and engineering, highlighting how correlated features can lead to redundancy. Additionally, it explains Pearson's and Spearman's correlation coefficients, providing formulas and examples for calculating and interpreting correlation values.

Uploaded by

Sanjay Mehta

Data Redundancy

Data redundancy means that identical or highly similar information is stored multiple times within a dataset:

 Exact copies of data entries
 Data entries that are not exact duplicates but are very similar
 Features that provide overlapping information (derived attributes)
 Independent variables that are highly correlated (multicollinearity)
 Repeated or unchanging data points over time
 Repeated phrases or sentences

Causes of data redundancy

1. Poor Database Design
2. Lack of Data Normalization
3. Inefficient Data Integration Processes
4. Intentional Duplication for Backup
5. De-normalization for Performance Optimization
6. Human Error
7. Software or Application Errors
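The kinds of redundancy listed above can be detected programmatically. A minimal sketch (the records and the normalization rule are invented here for illustration, not taken from the slides) that drops exact copies and trivially near-duplicate entries:

```python
# Sample dataset of (name, age, city) tuples with deliberate redundancy.
records = [
    ("Asha", 30, "Pune"),
    ("Asha", 30, "Pune"),      # exact copy
    ("Asha ", 30, "Pune"),     # near duplicate (trailing space)
    ("Ravi", 25, "Delhi"),
]

def normalize(rec):
    """Canonical form used to catch near duplicates: trimmed, lower-cased text."""
    return tuple(v.strip().lower() if isinstance(v, str) else v for v in rec)

seen = set()
unique = []
for rec in records:
    key = normalize(rec)
    if key not in seen:        # keep the first occurrence, drop later copies
        seen.add(key)
        unique.append(rec)

print(len(unique))  # 2 distinct records remain
```

Real pipelines would use a library routine (e.g. a dataframe's duplicate-dropping method) and a domain-specific similarity measure, but the idea is the same: map each record to a canonical key and keep one record per key.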

Impact of Data Redundancy in Machine Learning

 In machine learning, such redundancy can lead to increased computational costs, overfitting, and decreased model performance.
 Identifying and eliminating redundant data is crucial for efficient data processing and for enhancing the performance of machine learning models.
 Correlation analysis is used to detect data redundancy.

Questions that correlation analysis answers:

 Are two variables related?
 Does one increase as the other increases? (e.g., Web_Page_hits and Web_Log_size)
 Does one decrease as the other increases? (e.g., Memory_Space and No_of_Process)
 How can we get a numerical measure of the degree of relationship?

Meaning of Correlation

• Correlation is the degree of inter-relationship among two or more variables.
• Correlation analysis is a process to find out the degree of relationship between two or more variables by applying various statistical tools and techniques.

Uses of Correlation Analysis

 It is used to derive the degree and direction of the relationship between variables.
 It is used to reduce the range of uncertainty in prediction.
 It is used to present the average relationship between any two variables through a single value, the coefficient of correlation.
Contd..

 Correlation is used in data preprocessing for:
  Feature Selection
  Dimensionality Reduction
  Model Interpretation
  Exploratory Data Analysis (EDA)
  Feature Engineering
  Improving model performance and efficiency

Feature Selection

 High correlation between features can lead to redundancy.
 Removing correlated features reduces overfitting.
 Example: Height in cm vs. Height in inches – keeping both is unnecessary.

Dimensionality Reduction

 Redundant features increase computation and complexity.
 PCA (Principal Component Analysis) uses correlation to create uncorrelated components.
 Reduces model overfitting while preserving essential information.

Model Interpretability

 Detecting multicollinearity: multicollinearity occurs when two or more independent variables in a regression model are highly correlated with each other.
 Solution: remove or combine highly correlated variables.
 Helps interpret how features influence predictions.
 Example: if vehicle detection probability is strongly correlated with shadow intensity, either remove one or combine them.

Understanding Data Relationships

 Helps in exploratory data analysis (EDA) to uncover relationships between features.
 For example, in road extraction, correlation between pixel intensity and edge features can reveal useful patterns.

Feature Engineering

 Correlated variables may be combined (e.g., creating interaction terms).
 Examples:
 If "Total Income" and "Savings" are highly correlated, a new feature like Savings Ratio (Savings / Income) may be more useful.
 If "Hours Studied" and "Number of Practice Tests Taken" are correlated, a combined feature like Total Study Effort (Hours * Tests) may better capture student engagement.
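The two derived features above are one line of arithmetic each. A small sketch (the numbers and row layout are invented for illustration):

```python
# Rows of raw features; values are made up.
rows = [
    {"income": 50000, "savings": 10000, "hours": 20, "tests": 5},
    {"income": 80000, "savings": 24000, "hours": 10, "tests": 2},
]

for row in rows:
    row["savings_ratio"] = row["savings"] / row["income"]  # Savings / Income
    row["study_effort"] = row["hours"] * row["tests"]      # Hours * Tests

print(rows[0]["savings_ratio"], rows[0]["study_effort"])  # 0.2 100
```

Whether a ratio or a product is the right combination depends on the domain; the point is that a single engineered feature can replace a correlated pair.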

Types of correlation

 On the basis of degree:
  • Positive correlation
  • Negative correlation
 On the basis of linearity:
  • Linear correlation
  • Non-linear correlation
 On the basis of number of variables:
  • Simple correlation
  • Partial correlation
  • Multiple correlation

Correlation : On the basis of degree

Positive Correlation
If one variable increases and, with its impact, the other variable also increases, that is positive correlation.

For example:
Log file Size (MB): 20 21 22 24
Time stamp (AM): 7:00 7:02 7:05 7:10

Income (Rs): 350 360 370 380
Weight (Kg): 40 50 60 70

Correlation : On the basis of degree

Negative Correlation
If one variable increases and, with its impact, the other variable decreases, that is negative correlation.

For example (consider a maximum log file size of 50 MB):
Log file Size (MB): 20 25 30 35
Remaining size in log file (MB): 30 25 20 15

Income (Rs): 350 360 370 380
Weight (Kg): 80 70 60 50

Correlation : On the basis of number of variables

Simple Correlation
Correlation is said to be simple when only two variables are analyzed, for example, correlation calculated between demand and supply, or between income and expenditure.
Correlation : On the basis of number of variables

Partial Correlation
When three or more variables are considered for analysis but only two influencing variables are studied and the remaining influencing variables are kept constant.

For example: correlation analysis done with demand, supply, and income, where income is kept constant.

Multiple Correlation
In multiple correlation, three or more variables are studied simultaneously.

For example: rainfall, production of rice, and price of rice studied simultaneously is multiple correlation.

Correlation : On the basis of linearity

Linear Correlation
If a change in the amount of one variable tends to produce a change in the amount of the other variable bearing a constant changing ratio, it is said to be linear correlation.

For example:
Income (Rs.): 350 360 370 380
Weight (Kg.): 30 40 50 60

Non-linear Correlation
If a change in the amount of one variable tends to produce a change in the amount of the other variable, but not bearing a constant changing ratio, it is said to be non-linear correlation.

For example:
Income (Rs.): 320 360 410 490
Weight (Kg.): 21 33 49 56

KARL PEARSON'S COEFFICIENT OF CORRELATION

 It is a measure of the strength of a linear association between two variables and is denoted by r or rxy (x and y being the two variables involved).
 This method of correlation attempts to draw a line of best fit through the data of two variables, and the value of the Pearson correlation coefficient, r, indicates how far away all these data points are from this line of best fit.

Formula to calculate r

r = [nΣxy − (Σx)(Σy)] / √([nΣx² − (Σx)²][nΣy² − (Σy)²])

n = number of observations
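The raw-sums formula above translates line-for-line into code. A sketch (the function and variable names are chosen here, not from the slides); a perfectly linear relationship gives r = +1, and reversing one variable gives r = −1:

```python
import math

def pearson_r(x, y):
    """Karl Pearson's coefficient of correlation via the raw-sums formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# The linear-correlation example from earlier: constant changing ratio.
income = [350, 360, 370, 380]
weight = [30, 40, 50, 60]
print(round(pearson_r(income, weight), 6))        # 1.0
print(round(pearson_r(income, weight[::-1]), 6))  # -1.0
```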

Interpretation of 'r'

 The value of the coefficient of correlation will always lie between −1 and +1, i.e., −1 ≤ r ≤ 1.
 When r = +1, there is perfect positive correlation between the variables.
 When r = −1, there is perfect negative correlation between the variables.
 When r = 0, there is no linear relationship between the two variables.

 The coefficient of correlation describes the magnitude of correlation.
 r = +0.8 indicates that the correlation is positive because the sign of r is plus, and the degree of correlation is high because the numerical value of r (0.8) is close to 1.
 If r = −0.4, there is a low degree of negative correlation because the sign of r is negative and the numerical value of r is less than 0.5.

Assumptions

 While calculating the Pearson's correlation coefficient, we make the following assumptions:
  There is a linear relationship between the two variables.
  Both variables are normally distributed.
  Outliers are kept to a minimum or removed entirely.

Properties of the Pearson's Correlation Coefficient

1. r lies between −1 and +1, i.e., −1 ≤ r ≤ 1; the numerical value of r cannot exceed one (unity).
2. The correlation coefficient is independent of the change of origin and scale.

Example 1

Solution 1

Substituting the values in the formula:

r = (6×293 − 36×60) / √([6×266 − 36²][6×706 − 60²])
  = (1758 − 2160) / (√(1596 − 1296) × √(4236 − 3600))
  = −402 / 436.81
  = −0.92

High degree of negative correlation.
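The arithmetic in Solution 1 can be double-checked directly; the six summary values below are the ones appearing in the substitution above:

```python
import math

# Summary values from the substitution in Solution 1.
n, sum_xy, sum_x, sum_y, sum_x2, sum_y2 = 6, 293, 36, 60, 266, 706

numerator = n * sum_xy - sum_x * sum_y                    # 1758 - 2160 = -402
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) *
                        (n * sum_y2 - sum_y ** 2))        # sqrt(300 * 636)
r = numerator / denominator

print(round(denominator, 2), round(r, 2))  # 436.81 -0.92
```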

Example 2

Solution 2

Example 3
Solution 3

 If the resulting value is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase.
 The higher the value, the more each attribute implies the other.
 Hence, a high value may indicate that A (or B) may be removed as a redundancy.
 If the resulting value is equal to 0, then A and B are independent and there is no correlation between them.
 If the resulting value is less than 0, then A and B are negatively correlated: the values of one attribute increase as the values of the other attribute decrease.

Spearman correlation coefficient

 The Spearman's rank coefficient of correlation is a nonparametric measure of rank correlation (statistical dependence of the ranking between two variables).
 Named after Charles Spearman, it is often denoted by the Greek letter ρ (rho).
 It measures the strength of the association between two ranked variables.

Formula

ρ = 1 − (6Σdᵢ²) / (n(n² − 1))

Here,
n = number of data points of the two variables
dᵢ = difference in ranks of the "ith" element

The Spearman coefficient, ρ, can take a value between +1 and −1, where:
 A ρ value of +1 means a perfect association of ranks
 A ρ value of 0 means no association of ranks
 A ρ value of −1 means a perfect negative association between ranks
The closer the ρ value is to 0, the weaker the association between the two ranks.
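The rank-difference formula can be sketched directly in code. The ranking helper below follows the slides' convention (rank 1 for the biggest value) and, like the formula itself, assumes no tied values; the data is invented for illustration:

```python
def ranks(values):
    """Rank 1 for the largest value, as in the slides; assumes no ties."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Spearman's rho via the rank-difference formula: 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - (6 * d2) / (n * (n * n - 1))

# Identical orderings give rho = +1; exactly opposite orderings give rho = -1.
history = [62, 75, 58, 90, 70]
geography_same = [60, 80, 55, 95, 72]      # same ordering as history
geography_opposite = [80, 60, 95, 55, 72]  # reversed ordering of ranks
print(spearman_rho(history, geography_same))      # 1.0
print(spearman_rho(history, geography_opposite))  # -1.0
```

With ties present, tied values share an average rank and a corrected formula (or the Pearson correlation of the ranks) should be used instead.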

Example-1

 The scores of 9 students in History and Geography are mentioned in the table below.
 Step 1 – Create a table of the data obtained.
 Step 2 – Start by ranking the two data sets. Data ranking can be achieved by assigning the rank "1" to the biggest number in a column, "2" to the second biggest number, and so forth. The smallest value will get the lowest ranking. This should be done for both sets of measurements.

Step 3 – Add a third column, d, to your data set; d denotes the difference between ranks. For example, if the first student's History rank is 3 and the Geography rank is 5, then the difference in rank is 2. In the fourth column, square your d values.

Step 4 – Add up all your d² values, which is 12 (Σd²).

 Insert these values into the formula:

ρ = 1 − (6 × 12) / (9 × (9² − 1)) = 1 − 72/720 = 1 − 0.1 = 0.9

The Spearman's rank correlation for this data is 0.9. (The ρ value is close to +1, indicating a strong positive association of ranks.)
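The substitution above can be verified in a couple of lines, with n and Σd² as given in Step 4:

```python
# From Example-1: 9 students, sum of squared rank differences = 12.
n, sum_d2 = 9, 12
rho = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))  # 1 - 72/720
print(rho)  # 0.9
```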

Example-2

Let us consider the following example data regarding the marks achieved in a maths and English exam:

Ranking the marks

8
ρ (or rs) is 0.67. This indicates
a strong positive relationship
between the ranks individuals
obtained in the maths and
English exam. That is, the
higher you ranked in maths,
the higher you ranked in
English also, and vice versa.

49 50
