Linear Regression
Linear regression is a statistical method that models the relationship
between a dependent (target) variable and one or more independent (predictor) variables by
fitting a linear equation to observed data. It is one of the most fundamental and widely used
statistical techniques for predictive analysis and understanding data relationships.
Conceptual Overview
The core idea behind linear regression is straightforward: it assumes that the relationship
between the independent variable(s) and the dependent variable is linear in nature. In the
simplest scenario, known as Simple Linear Regression, the relationship is modeled
between a single predictor and the target variable. When multiple predictors are involved, it
becomes a Multiple Linear Regression problem.
Linear regression attempts to discover the best-fitting straight line (or hyperplane, in multiple
regression) that summarizes the observed data points. This best-fitting line is chosen such
that the overall prediction errors (typically measured as squared differences between
observed and predicted values) are minimized. The most commonly used method for
minimizing this error is called the Ordinary Least Squares (OLS) method.
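As a concrete illustration, here is a minimal sketch of fitting a simple linear regression by OLS, assuming NumPy and scikit-learn are available; the synthetic data (true intercept 3, true slope 2) is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3 + 2x plus Gaussian noise (illustrative values).
rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 + 2.0 * X[:, 0] + rng.normal(0.0, 1.0, size=100)

# Fit by ordinary least squares; scikit-learn's LinearRegression uses OLS.
model = LinearRegression()
model.fit(X, y)

print("intercept:", model.intercept_)  # should be close to 3
print("slope:", model.coef_[0])        # should be close to 2
```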
Mathematical Formulation
In general, the linear regression model takes the form:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon$$

where:
- $y$ is the dependent (target) variable,
- $x_1, \dots, x_n$ are the independent (predictor) variables,
- $\beta_0$ is the intercept and $\beta_1, \dots, \beta_n$ are the regression coefficients,
- $\varepsilon$ is the error term capturing variation not explained by the predictors.

OLS chooses the coefficients that minimize the residual sum of squares (RSS) over the $m$ observations:

$$RSS = \sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2$$

where $\hat{y}_i$ denotes the model's prediction for the $i$-th observation.
Using calculus (taking derivatives of RSS with respect to each parameter and setting them to
zero), analytical solutions for the coefficients are obtained. Specifically, for simple linear
regression:
$$\beta_1 = \frac{\sum_{i=1}^{m}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m}(x_i - \bar{x})^2}, \quad \beta_0 = \bar{y} - \beta_1 \bar{x}$$
where $\bar{x}$ and $\bar{y}$ denote the means of $x$ and $y$, respectively.
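These closed-form formulas translate directly into code. The sketch below assumes NumPy arrays x and y of equal length; the helper name simple_ols is just illustrative.

```python
import numpy as np

def simple_ols(x, y):
    """Closed-form OLS estimates for simple linear regression."""
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: covariance-like numerator over variance-like denominator.
    beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: forces the fitted line through the point of means.
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1
```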
In multiple linear regression, the coefficients are typically computed using matrix algebra via the normal equations:

$$\boldsymbol{\beta} = (X^{\top} X)^{-1} X^{\top} y$$

Here, $X$ is the design matrix of input data, $y$ is the vector of observed outputs, and $\boldsymbol{\beta}$ is the vector of regression coefficients.
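A minimal sketch of this computation in NumPy follows; solving the linear system $(X^{\top}X)\boldsymbol{\beta} = X^{\top}y$ directly is numerically preferable to forming the inverse explicitly. The function name is illustrative.

```python
import numpy as np

def ols_normal_equation(X, y):
    """Estimate coefficients via the normal equations (X^T X) beta = X^T y."""
    # Prepend a column of ones so beta[0] plays the role of the intercept.
    X_design = np.column_stack([np.ones(len(X)), X])
    # Solve the system instead of inverting X^T X, for numerical stability.
    beta = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
    return beta
```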
Evaluation Metrics
To assess the quality and accuracy of the linear regression model, several evaluation
metrics are commonly used:
Mean Squared Error (MSE):

$$MSE = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$$

Root Mean Squared Error (RMSE):

$$RMSE = \sqrt{MSE}$$

Coefficient of determination ($R^2$), the fraction of variance in $y$ explained by the model:

$$R^2 = 1 - \frac{\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{m}(y_i - \bar{y})^2}$$
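All three metrics follow directly from their formulas; here is a sketch in plain NumPy (the function name is illustrative):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, and R^2 from observed and predicted values."""
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    # R^2: one minus the ratio of residual error to variance around the mean.
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2
```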
Assumptions
For its estimates and inference to be reliable, linear regression rests on several assumptions (an informal way to check some of them is sketched after this list):
1. Linearity: There must be a linear relationship between the independent and dependent variables.
2. Independence: Observations are assumed independent of each other.
3. Homoscedasticity: The variance of residuals should be constant across all levels of
the predictors.
4. Normality: Residuals should be approximately normally distributed.
5. No multicollinearity: Independent variables in multiple regression should not be
highly correlated with each other.
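The sketch below shows informal checks for two of these assumptions (normality and homoscedasticity of residuals), assuming SciPy is available; formal diagnostics such as residual plots or the Breusch-Pagan test are more thorough.

```python
import numpy as np
from scipy import stats

def check_residuals(y_true, y_pred):
    """Informal diagnostics on residuals: normality and constant variance."""
    residuals = y_true - y_pred
    # Normality: Shapiro-Wilk test (small p-value suggests non-normality).
    _, shapiro_p = stats.shapiro(residuals)
    # Rough homoscedasticity check: compare residual spread between the
    # lower and upper halves of the predicted values.
    order = np.argsort(y_pred)
    half = len(order) // 2
    spread_low = residuals[order[:half]].std()
    spread_high = residuals[order[half:]].std()
    return {
        "shapiro_p": shapiro_p,                    # evidence on normality
        "spread_ratio": spread_high / spread_low,  # far from 1 suggests heteroscedasticity
    }
```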
Linear regression can be extended and adapted to more complex situations, for example by adding polynomial features to capture nonlinear relationships, or by regularizing the coefficients as in ridge and lasso regression; a brief sketch follows.
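As one possible sketch of such an extension, the pipeline below combines polynomial feature expansion with ridge regularization using scikit-learn; the data and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic data: nonlinear in x, but still linear in the coefficients
# once the features are expanded.
rng = np.random.default_rng(seed=1)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 - 0.5 * X[:, 0] + 0.8 * X[:, 0] ** 2 + rng.normal(0.0, 0.3, size=200)

# Polynomial feature expansion followed by ridge-regularized least squares.
model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
model.fit(X, y)
```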
Applications