CSE-1-PPT-MiniTest-12feb24-Correlation (3)
Data redundancy means that identical or highly similar information is stored multiple times within a dataset.
• Exact copies of data entries
• Data entries that are not exact duplicates but are very similar
Causes of data redundancy:
1. Poor Database Design
2. Lack of Data Normalization
3. Inefficient Data Integration Processes
4. Intentional Duplication for Backup
In machine learning, such redundancy can lead to increased computational costs, overfitting, and decreased model performance.
Identifying and eliminating redundant data is crucial for efficient data processing and for enhancing the performance of machine learning models.
Are two variables related?
• Does one increase as the other increases? e.g. Web_Page_hits and Web_Log_size
• Does one decrease as the other increases? e.g. Memory_Space and No_of_Process
• Correlation is the degree of inter-relationship among two or more variables.
• Correlation analysis is a process to find out the degree of relationship between two or more variables by applying various statistical tools and techniques.
• It is used in deriving the degree and direction of the relationship between the variables.
• It is used in reducing the range of uncertainty in prediction.
• It is used in presenting the average relationship between any two variables through a single value, the coefficient of correlation.
Contd..
Used in data preprocessing for
• Feature Selection
• Dimensionality Reduction
• Model Interpretation
• Exploratory Data Analysis (EDA)
• Feature Engineering
Feature Selection
• High correlation between features can lead to redundancy.
• Removing highly correlated features can reduce overfitting.
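The feature-selection idea can be sketched as follows: compute the correlation between each pair of features and keep only one feature from every highly correlated pair. This is a minimal sketch, not a prescribed method. The feature names, the values, the 0.9 threshold and the helper `drop_correlated` are all hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold.

    `features` maps a feature name to its list of values (hypothetical layout).
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson_r(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: height in cm and in inches carry the same information.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [40, 38, 44, 41, 43],
}
kept = drop_correlated(features)
print(kept)  # ['height_cm', 'shoe_size']
```

Which member of a correlated pair is dropped here depends only on dictionary order. In practice that choice would normally be guided by domain knowledge.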
Exploratory Data Analysis (EDA)
Helps in exploratory data analysis (EDA) to uncover relationships and patterns between features.
Feature Engineering
Correlated variables may be combined (e.g., by creating interaction terms).
If "Hours Studied" and "Number of Practice Tests Taken" are correlated, a combined feature like Total Study Effort (Hours * Tests) may better capture student engagement.
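The combined feature described above can be sketched in a few lines. The numbers are made up for illustration, and the variable names are hypothetical stand-ins for the two columns.

```python
# Hypothetical values for four students.
hours_studied = [2, 4, 6, 8]
practice_tests = [1, 2, 2, 4]

# Interaction term: Total Study Effort = Hours * Tests, computed row by row.
total_study_effort = [h * t for h, t in zip(hours_studied, practice_tests)]
print(total_study_effort)  # [2, 8, 12, 32]
```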
Types of correlation
• Positive correlation, Negative correlation
• Simple correlation, Partial correlation, Multiple correlation
• Linear correlation, Non-linear correlation
Positive Correlation
If one variable increases and the other variable increases along with it, the correlation is positive.
For example:
Log file size (MB): 20 21 22 24
Time stamp (AM): 7:00 7:02 7:05 7:10
Correlation: On the basis of number of variables
Partial correlation:
When three or more variables are considered for analysis, but only two influencing variables are studied and the remaining influencing variables are kept constant.
For example: correlation analysis is done with demand, supply and income, where income is kept constant.
Multiple correlation:
In the case of multiple correlation, three or more variables are studied simultaneously.
For example: when rainfall, production of rice and price of rice are studied simultaneously, this is known as multiple correlation.
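KARL PEARSON’S COEFFICIENT OF CORRELATION
Formula to calculate r
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$
n = number of observations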
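A minimal sketch of the computation, assuming the common raw-sum form of Pearson's coefficient, r = (nΣxy − ΣxΣy) / √([nΣx² − (Σx)²][nΣy² − (Σy)²]). The sample values for `page_hits` and `log_size` are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r from raw sums; x and y are equal-length lists of numbers."""
    n = len(x)  # number of observations
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data in which log size grows exactly in step with page hits.
page_hits = [1, 2, 3, 4, 5]
log_size = [2, 4, 6, 8, 10]
r = pearson_r(page_hits, log_size)
print(r)  # 1.0, a perfect positive correlation
```

The function divides by zero if either variable is constant, because r is undefined in that case.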
Interpretation of ‘r’
The value of the coefficient of correlation always lies between −1 and +1, i.e., −1 ≤ r ≤ 1.
When r = +1, there is perfect positive correlation between the variables.
When r = −1, there is perfect negative correlation between the variables.
The coefficient of correlation describes the magnitude of correlation.
r = +0.8 indicates that the correlation is positive because the sign of r is plus, and that the degree of correlation is high because the numerical value of r (0.8) is close to 1.
If r = −0.4, it indicates that there is a low degree of negative correlation.
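Example 1
Solution 1
Formula to use:
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$
Substituting the values in the formula (n = 6, Σxy = 293, Σx = 36, Σy = 60, Σx² = 266, Σy² = 706):
$$ r = \frac{6 \times 293 - 36 \times 60}{\sqrt{\left[6 \times 266 - 36^2\right]\left[6 \times 706 - 60^2\right]}} = \frac{1758 - 2160}{\sqrt{1596 - 1296}\,\sqrt{4236 - 3600}} = \frac{-402}{436.81} = -0.92 $$
High degree of negative correlation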
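The substitution can be checked numerically. The sketch below uses only the sums that appear in the worked solution (n = 6, Σxy = 293, Σx = 36, Σy = 60, Σx² = 266, Σy² = 706). The variable names are illustrative.

```python
from math import sqrt

# Sums taken from the substitution step of the worked solution.
n, sum_x, sum_y = 6, 36, 60
sum_xy, sum_x2, sum_y2 = 293, 266, 706

numerator = n * sum_xy - sum_x * sum_y  # 1758 - 2160 = -402
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
print(round(r, 2))  # -0.92
```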
Example 2
Solution 2
Solution 3
If the resulting value is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase.
The higher the value, the more strongly each attribute implies the other.
Hence, a high value may indicate that A (or B) may be removed as a redundancy.
If the resulting value is equal to 0, then there is no linear correlation between A and B. Independent attributes always give 0, but a value of 0 alone does not prove independence.
If the resulting value is less than 0, then A and B are negatively correlated, meaning that the values of one attribute increase as the values of the other attribute decrease.
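A coefficient of 0 rules out only a linear relationship. As a small illustration with hypothetical data, B below is fully determined by A (B = A²), yet the coefficient is exactly 0, so a zero value is not enough to conclude that two attributes are unrelated.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical attributes: B depends entirely on A, but not linearly.
A = [-2, -1, 0, 1, 2]
B = [a * a for a in A]

r_ab = pearson_r(A, B)
print(r_ab)  # 0.0
```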
Example-1
The scores of 9 students in History and Geography are mentioned in the table below.
Step 1- Create a table of the data obtained.
Step 2- Start by ranking the two data sets. Data ranking can be achieved by assigning the rank “1” to the biggest number in the column, “2” to the second biggest number and so forth. The smallest value will get the lowest ranking. This should be done for both sets of measurements.
Step 3- Add a third column d to your data set; d here denotes the difference between ranks.
For example, if the first student’s physics rank is 3 and the math rank is 5, then the difference in rank is 3 − 5 = −2.
In the fourth column, square your d values.
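The ranking steps above can be sketched as follows. The final line uses the standard Spearman formula for data without tied values, ρ = 1 − 6Σd² / (n(n² − 1)), which is an assumption here because the formula is not shown in the surrounding text. The scores are hypothetical and `rank_descending` is an illustrative helper.

```python
def rank_descending(values):
    """Rank 1 goes to the biggest value; assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Spearman's rank correlation for two equal-length lists without ties."""
    rank_x, rank_y = rank_descending(x), rank_descending(y)
    d = [a - b for a, b in zip(rank_x, rank_y)]  # Step 3: rank differences
    d_squared = [v * v for v in d]                # next column: d squared
    n = len(x)
    return 1 - 6 * sum(d_squared) / (n * (n * n - 1))

# Hypothetical scores for five students.
physics = [35, 23, 47, 17, 10]
maths = [30, 33, 45, 23, 8]

rho = spearman_rho(physics, maths)
print(round(rho, 2))  # 0.9
```

With tied scores, the tied values usually receive the average of their ranks, and this shortcut formula is then only approximate.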
ρ (or r_s) is 0.67. This indicates a strong positive relationship between the ranks individuals obtained in the maths and English exams. That is, the higher you ranked in maths, the higher you ranked in English also, and vice versa.