CSE-1-PPT-MiniTest-12feb24-Correlation (3)

The document discusses data redundancy, its causes, and its impact on machine learning, emphasizing the importance of eliminating redundant data for improved model performance. It also covers correlation analysis, its types, and applications in feature selection and engineering, highlighting how correlated features can lead to redundancy. Additionally, it explains Pearson's and Spearman's correlation coefficients, providing formulas and examples for calculating and interpreting correlation values.

Uploaded by

Sanjay Mehta

Data Redundancy

Data redundancy means that identical or highly similar information is stored multiple times within a dataset:

 Exact copies of data entries
 Data entries that are not exact duplicates but are very similar
 Features that provide overlapping information (derived attributes)
 Independent variables that are highly correlated (multicollinearity)
 Repeated or unchanging data points over time
 Repeated phrases or sentences

Causes of data redundancy

1. Poor Database Design
2. Lack of Data Normalization
3. Inefficient Data Integration Processes
4. Intentional Duplication for Backup
5. De-normalization for Performance Optimization
6. Human Error
7. Software or Application Errors
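The kinds of redundancy listed above can be detected programmatically. A minimal sketch (the records and the normalization rule are invented here for illustration, not taken from the slides) that drops exact copies and trivially near-duplicate entries:

```python
# Sample dataset of (name, age, city) tuples with deliberate redundancy.
records = [
    ("Asha", 30, "Pune"),
    ("Asha", 30, "Pune"),      # exact copy
    ("Asha ", 30, "Pune"),     # near duplicate (trailing space)
    ("Ravi", 25, "Delhi"),
]

def normalize(rec):
    """Canonical form used to catch near duplicates: trimmed, lower-cased text."""
    return tuple(v.strip().lower() if isinstance(v, str) else v for v in rec)

seen = set()
unique = []
for rec in records:
    key = normalize(rec)
    if key not in seen:        # keep the first occurrence, drop later copies
        seen.add(key)
        unique.append(rec)

print(len(unique))  # 2 distinct records remain
```

Real pipelines would use a library routine (e.g. a dataframe's duplicate-dropping method) and a domain-specific similarity measure, but the idea is the same: map each record to a canonical key and keep one record per key.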

Impact of Data Redundancy in Machine Learning

 In machine learning, such redundancy can lead to increased computational costs, overfitting, and decreased model performance.
 Identifying and eliminating redundant data is crucial for efficient data processing and for enhancing the performance of machine learning models.
 Correlation analysis is used to detect data redundancy.

Questions that correlation analysis answers:

 Are two variables related?
 Does one increase as the other increases? (e.g., Web_Page_hits and Web_Log_size)
 Does one decrease as the other increases? (e.g., Memory_Space and No_of_Process)
 How can we get a numerical measure of the degree of relationship?

Meaning of Correlation

• Correlation is the degree of inter-relationship among two or more variables.
• Correlation analysis is a process to find out the degree of relationship between two or more variables by applying various statistical tools and techniques.

Uses of Correlation Analysis

 It is used to derive the degree and direction of the relationship between variables.
 It is used to reduce the range of uncertainty in prediction.
 It is used to present the average relationship between any two variables through a single value, the coefficient of correlation.
Contd..

 Correlation is used in data preprocessing for:
  Feature Selection
  Dimensionality Reduction
  Model Interpretation
  Exploratory Data Analysis (EDA)
  Feature Engineering
  Improving model performance and efficiency

Feature Selection

 High correlation between features can lead to redundancy.
 Removing correlated features reduces overfitting.
 Example: Height in cm vs. Height in inches – keeping both is unnecessary.

Dimensionality Reduction

 Redundant features increase computation and complexity.
 PCA (Principal Component Analysis) uses correlation to create uncorrelated components.
 Reduces model overfitting while preserving essential information.

Model Interpretability

 Detecting multicollinearity: multicollinearity occurs when two or more independent variables in a regression model are highly correlated with each other.
 Solution: remove or combine highly correlated variables.
 Helps interpret how features influence predictions.
 Example: if vehicle detection probability is strongly correlated with shadow intensity, either remove one or combine them.

Understanding Data Relationships

 Helps in exploratory data analysis (EDA) to uncover relationships between features.
 For example, in road extraction, correlation between pixel intensity and edge features can reveal useful patterns.

Feature Engineering

 Correlated variables may be combined (e.g., creating interaction terms).
 Examples:
 If "Total Income" and "Savings" are highly correlated, a new feature like Savings Ratio (Savings / Income) may be more useful.
 If "Hours Studied" and "Number of Practice Tests Taken" are correlated, a combined feature like Total Study Effort (Hours * Tests) may better capture student engagement.
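The two derived features above are one line of arithmetic each. A small sketch (the numbers and row layout are invented for illustration):

```python
# Rows of raw features; values are made up.
rows = [
    {"income": 50000, "savings": 10000, "hours": 20, "tests": 5},
    {"income": 80000, "savings": 24000, "hours": 10, "tests": 2},
]

for row in rows:
    row["savings_ratio"] = row["savings"] / row["income"]  # Savings / Income
    row["study_effort"] = row["hours"] * row["tests"]      # Hours * Tests

print(rows[0]["savings_ratio"], rows[0]["study_effort"])  # 0.2 100
```

Whether a ratio or a product is the right combination depends on the domain; the point is that a single engineered feature can replace a correlated pair.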

Types of correlation

 On the basis of degree:
  • Positive correlation
  • Negative correlation
 On the basis of linearity:
  • Linear correlation
  • Non-linear correlation
 On the basis of number of variables:
  • Simple correlation
  • Partial correlation
  • Multiple correlation

Correlation : On the basis of degree

Positive Correlation
If one variable increases and, with its impact, the other variable also increases, that is positive correlation.

For example:
Log file Size (MB): 20 21 22 24
Time stamp (AM): 7:00 7:02 7:05 7:10

Income (Rs): 350 360 370 380
Weight (Kg): 40 50 60 70

Correlation : On the basis of degree

Negative Correlation
If one variable increases and, with its impact, the other variable decreases, that is negative correlation.

For example (consider a maximum log file size of 50 MB):
Log file Size (MB): 20 25 30 35
Remaining size in log file (MB): 30 25 20 15

Income (Rs): 350 360 370 380
Weight (Kg): 80 70 60 50

Correlation : On the basis of number of variables

Simple Correlation
Correlation is said to be simple when only two variables are analyzed, for example, correlation calculated between demand and supply, or between income and expenditure.
Correlation : On the basis of number of variables

Partial Correlation
When three or more variables are considered for analysis but only two influencing variables are studied and the remaining influencing variables are kept constant.

For example: correlation analysis done with demand, supply, and income, where income is kept constant.

Multiple Correlation
In multiple correlation, three or more variables are studied simultaneously.

For example: rainfall, production of rice, and price of rice studied simultaneously is multiple correlation.

Correlation : On the basis of linearity

Linear Correlation
If a change in the amount of one variable tends to produce a change in the amount of the other variable bearing a constant changing ratio, it is said to be linear correlation.

For example:
Income (Rs.): 350 360 370 380
Weight (Kg.): 30 40 50 60

Non-linear Correlation
If a change in the amount of one variable tends to produce a change in the amount of the other variable, but not bearing a constant changing ratio, it is said to be non-linear correlation.

For example:
Income (Rs.): 320 360 410 490
Weight (Kg.): 21 33 49 56

KARL PEARSON'S COEFFICIENT OF CORRELATION

 It is a measure of the strength of a linear association between two variables and is denoted by r or rxy (x and y being the two variables involved).
 This method of correlation attempts to draw a line of best fit through the data of two variables, and the value of the Pearson correlation coefficient, r, indicates how far away all these data points are from this line of best fit.

Formula to calculate r

r = [nΣxy − (Σx)(Σy)] / √([nΣx² − (Σx)²][nΣy² − (Σy)²])

n = number of observations
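The raw-sums formula above translates line-for-line into code. A sketch (the function and variable names are chosen here, not from the slides); a perfectly linear relationship gives r = +1, and reversing one variable gives r = −1:

```python
import math

def pearson_r(x, y):
    """Karl Pearson's coefficient of correlation via the raw-sums formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# The linear-correlation example from earlier: constant changing ratio.
income = [350, 360, 370, 380]
weight = [30, 40, 50, 60]
print(round(pearson_r(income, weight), 6))        # 1.0
print(round(pearson_r(income, weight[::-1]), 6))  # -1.0
```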

Interpretation of 'r'

 The value of the coefficient of correlation will always lie between −1 and +1, i.e., −1 ≤ r ≤ 1.
 When r = +1, there is perfect positive correlation between the variables.
 When r = −1, there is perfect negative correlation between the variables.
 When r = 0, there is no linear relationship between the two variables.

 The coefficient of correlation describes the magnitude of correlation.
 r = +0.8 indicates that the correlation is positive because the sign of r is plus, and the degree of correlation is high because the numerical value of r (0.8) is close to 1.
 If r = −0.4, there is a low degree of negative correlation because the sign of r is negative and the numerical value of r is less than 0.5.

Assumptions

 While calculating the Pearson's correlation coefficient, we make the following assumptions:
  There is a linear relationship between the two variables.
  Both variables are normally distributed.
  Outliers are kept to a minimum or removed entirely.

Properties of the Pearson's Correlation Coefficient

1. r lies between −1 and +1, i.e., −1 ≤ r ≤ 1; the numerical value of r cannot exceed one (unity).
2. The correlation coefficient is independent of the change of origin and scale.

Example 1

Solution 1

Substituting the values in the formula:

r = (6×293 − 36×60) / √([6×266 − 36²][6×706 − 60²])
  = (1758 − 2160) / (√(1596 − 1296) × √(4236 − 3600))
  = −402 / 436.81
  = −0.92

High degree of negative correlation.
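The arithmetic in Solution 1 can be double-checked directly; the six summary values below are the ones appearing in the substitution above:

```python
import math

# Summary values from the substitution in Solution 1.
n, sum_xy, sum_x, sum_y, sum_x2, sum_y2 = 6, 293, 36, 60, 266, 706

numerator = n * sum_xy - sum_x * sum_y                    # 1758 - 2160 = -402
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) *
                        (n * sum_y2 - sum_y ** 2))        # sqrt(300 * 636)
r = numerator / denominator

print(round(denominator, 2), round(r, 2))  # 436.81 -0.92
```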

Example 2

Solution 2

Example 3
Solution 3

 If the resulting value is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase.
 The higher the value, the more each attribute implies the other.
 Hence, a high value may indicate that A (or B) may be removed as a redundancy.
 If the resulting value is equal to 0, then A and B are independent and there is no correlation between them.
 If the resulting value is less than 0, then A and B are negatively correlated: the values of one attribute increase as the values of the other attribute decrease.

Spearman correlation coefficient

 The Spearman's rank coefficient of correlation is a nonparametric measure of rank correlation (statistical dependence of the ranking between two variables).
 Named after Charles Spearman, it is often denoted by the Greek letter ρ (rho).
 It measures the strength of the association between two ranked variables.

Formula

ρ = 1 − (6Σdᵢ²) / (n(n² − 1))

Here,
n = number of data points of the two variables
dᵢ = difference in ranks of the "ith" element

The Spearman coefficient, ρ, can take a value between +1 and −1, where:
 A ρ value of +1 means a perfect association of ranks
 A ρ value of 0 means no association of ranks
 A ρ value of −1 means a perfect negative association between ranks
The closer the ρ value is to 0, the weaker the association between the two ranks.
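The rank-difference formula can be sketched directly in code. The ranking helper below follows the slides' convention (rank 1 for the biggest value) and, like the formula itself, assumes no tied values; the data is invented for illustration:

```python
def ranks(values):
    """Rank 1 for the largest value, as in the slides; assumes no ties."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Spearman's rho via the rank-difference formula: 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - (6 * d2) / (n * (n * n - 1))

# Identical orderings give rho = +1; exactly opposite orderings give rho = -1.
history = [62, 75, 58, 90, 70]
geography_same = [60, 80, 55, 95, 72]      # same ordering as history
geography_opposite = [80, 60, 95, 55, 72]  # reversed ordering of ranks
print(spearman_rho(history, geography_same))      # 1.0
print(spearman_rho(history, geography_opposite))  # -1.0
```

With ties present, tied values share an average rank and a corrected formula (or the Pearson correlation of the ranks) should be used instead.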

Example-1

 The scores of 9 students in History and Geography are mentioned in the table below.
 Step 1 – Create a table of the data obtained.
 Step 2 – Start by ranking the two data sets. Data ranking can be achieved by assigning the rank "1" to the biggest number in a column, "2" to the second biggest number, and so forth. The smallest value will get the lowest ranking. This should be done for both sets of measurements.

Step 3 – Add a third column, d, to your data set; d denotes the difference between ranks. For example, if the first student's History rank is 3 and the Geography rank is 5, then the difference in rank is 2. In the fourth column, square your d values.

Step 4 – Add up all your d² values, which is 12 (Σd²).

 Insert these values into the formula:

ρ = 1 − (6 × 12) / (9 × (9² − 1)) = 1 − 72/720 = 1 − 0.1 = 0.9

The Spearman's rank correlation for this data is 0.9. (The ρ value is close to +1, indicating a strong positive association of ranks.)
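The substitution above can be verified in a couple of lines, with n and Σd² as given in Step 4:

```python
# From Example-1: 9 students, sum of squared rank differences = 12.
n, sum_d2 = 9, 12
rho = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))  # 1 - 72/720
print(rho)  # 0.9
```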

Example-2

Let us consider the following example data regarding the marks achieved in a maths and English exam:

Ranking the marks

8
ρ (or rs) is 0.67. This indicates
a strong positive relationship
between the ranks individuals
obtained in the maths and
English exam. That is, the
higher you ranked in maths,
the higher you ranked in
English also, and vice versa.

49 50
