Correlation and Regression: An Overview
1. Definition of Correlation
Correlation is a statistical measure that describes the extent to which two variables are related. It indicates the direction and strength of a relationship between variables. The most common correlation coefficient is Pearson's correlation coefficient, denoted r, which ranges from -1 to 1.
Positive Correlation: When one variable increases, the other variable tends to increase as well (e.g., height and weight).
Negative Correlation: When one variable increases, the other variable tends to decrease (e.g., temperature and the number of hot drinks sold).
No Correlation: No discernible relationship exists between the two variables.
2. Calculating Correlation
The formula for Pearson's correlation coefficient is:

r = [n(∑xy) − (∑x)(∑y)] / √([n∑x² − (∑x)²][n∑y² − (∑y)²])
Where:
n is the number of data points.
x and y are the two variables being analyzed.
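As a concrete sketch, the formula above can be computed directly in Python; the sample data here is made up purely for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient, computed term by term from the formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Perfectly linear data, so r should be exactly 1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```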
3. Interpretation of Correlation Coefficient
1: Perfect positive correlation
0: No correlation
-1: Perfect negative correlation
Values closer to 1 or -1 indicate a stronger relationship, while values near 0 indicate a weak relationship.
4. Limitations of Correlation
Causation vs. Correlation: Correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other.
Outliers: Outliers can significantly affect correlation results, leading to misleading interpretations.
Linear Relationships: Pearson’s correlation measures only linear relationships. Non-linear relationships may not be well represented by this coefficient.
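The linearity limitation can be demonstrated with a small made-up example: y = x² is fully determined by x, yet Pearson's r is exactly 0 because the relationship is non-linear and symmetric about x = 0:

```python
import math

# y = x^2: a perfect (but non-linear) relationship that Pearson's r misses.
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]

n = len(x)
num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
den = math.sqrt((n * sum(v * v for v in x) - sum(x) ** 2)
                * (n * sum(v * v for v in y) - sum(y) ** 2))
print(num / den)  # 0.0
```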
5. Definition of Regression
Regression analysis is a statistical method used to understand the relationship between a dependent variable and one or more independent variables. The primary goal of regression is to model this relationship and make predictions based on the input data.
6. Types of Regression
Simple Linear Regression: Involves two variables—one independent (predictor) and one dependent (outcome). The relationship is modeled as a straight line.
The equation for simple linear regression is:
Y = a + bX
Where:
Y is the dependent variable.
X is the independent variable.
a is the intercept.
b is the slope of the line.
Multiple Linear Regression: Involves two or more independent variables. The model is extended as:
Y = a + b₁X₁ + b₂X₂ + ... + bₙXₙ
Non-linear Regression: Used when the relationship between variables is not linear, utilizing various mathematical forms to capture the relationship.
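A multiple linear regression fit can be sketched by solving the least-squares normal equations (XᵀX)β = XᵀY directly. The data below is hypothetical, generated without noise from Y = 1 + 2X₁ + 3X₂ so the fit recovers the known coefficients:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2 (no noise).
rows = [(1, 1), (2, 1), (1, 2), (3, 2), (2, 3), (4, 1)]
X = [[1.0, x1, x2] for x1, x2 in rows]          # leading 1 column for the intercept a
Y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

# Normal equations: (X^T X) beta = X^T Y, with beta = (a, b1, b2).
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
XtY = [sum(r[i] * y for r, y in zip(X, Y)) for i in range(3)]
a, b1, b2 = solve(XtX, XtY)
print(round(a, 6), round(b1, 6), round(b2, 6))  # 1.0 2.0 3.0
```

In practice a library routine (e.g. `numpy.linalg.lstsq`) would be used instead of hand-rolled elimination; the point here is only to show what the model is solving for.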
7. Calculating Regression Coefficients
The coefficients a and b in simple linear regression can be calculated using the least squares method, which minimizes the sum of the squared residuals (the differences between the observed and predicted values of Y).
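The closed-form least-squares estimates follow directly from the sums used earlier: b = [n∑xy − (∑x)(∑y)] / [n∑x² − (∑x)²] and a = (∑y − b∑x)/n. A minimal sketch, with made-up data lying exactly on Y = 2 + 3X:

```python
def fit_line(x, y):
    """Least-squares estimates of intercept a and slope b for Y = a + bX."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n  # equivalently: mean(y) - b * mean(x)
    return a, b

# Data lying exactly on Y = 2 + 3X, so the fit recovers a = 2, b = 3.
a, b = fit_line([0, 1, 2, 3], [2, 5, 8, 11])
print(a, b)  # 2.0 3.0
```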