Linear Regression

Linear regression is a supervised machine learning algorithm that models the relationship between a dependent (target) variable and one or more independent (predictor) variables by fitting a linear equation to observed data. It is one of the most fundamental and widely used statistical techniques for predictive analysis and understanding data relationships.

Conceptual Overview

The core idea behind linear regression is straightforward: it assumes that the relationship
between the independent variable(s) and the dependent variable is linear in nature. In the
simplest scenario, known as Simple Linear Regression, the relationship is modeled
between a single predictor and the target variable. When multiple predictors are involved, it
becomes a Multiple Linear Regression problem.

Linear regression attempts to discover the best-fitting straight line (or hyperplane, in multiple
regression) that summarizes the observed data points. This best-fitting line is chosen such
that the overall prediction errors (typically measured as squared differences between
observed and predicted values) are minimized. The most commonly used method for
minimizing this error is called the Ordinary Least Squares (OLS) method.

Mathematical Formulation

For simple linear regression, the mathematical formulation is:

y = \beta_0 + \beta_1 x + \varepsilon

Where:

● y is the dependent (response) variable.
● x is the independent (predictor) variable.
● β₀ is the intercept (the predicted value of y when x = 0).
● β₁ is the slope of the line (the expected change in y per unit increase in x).
● ε is the error term representing noise or variability not captured by the model.

In multiple linear regression, the equation generalizes to:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon

Here, multiple independent variables (x₁, x₂, …, xₙ) influence the dependent variable y, each with its corresponding coefficient (β₁, β₂, …, βₙ).
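
To make the formulation concrete, here is a small Python sketch (the coefficient and predictor values are invented for illustration, not taken from the text): a prediction is simply the intercept plus a weighted sum of the predictors.

```python
import numpy as np

# Hypothetical fitted coefficients: intercept beta_0 and slopes beta_1..beta_n
# (values invented purely for illustration)
beta_0 = 2.0
betas = np.array([0.5, -1.2, 3.0])   # one coefficient per predictor

# A single observation with n = 3 predictor values
x = np.array([1.0, 0.0, 2.5])

# y_hat = beta_0 + beta_1*x1 + ... + beta_n*xn
# (the error term epsilon is unobserved at prediction time)
y_hat = beta_0 + betas @ x
print(y_hat)   # 2.0 + 0.5*1.0 - 1.2*0.0 + 3.0*2.5 = 10.0
```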

The Ordinary Least Squares Method (OLS)

The goal of linear regression is to estimate the parameters β₀, β₁, …, βₙ that minimize the residual sum of squares (RSS):

RSS = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2

where:

● yᵢ is the actual value for the i-th observation.
● ŷᵢ is the predicted value for the i-th observation based on the regression model.
● m is the total number of observations.

Using calculus (taking derivatives of RSS with respect to each parameter and setting them to
zero), analytical solutions for the coefficients are obtained. Specifically, for simple linear
regression:

\beta_1 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m} (x_i - \bar{x})^2}, \quad \beta_0 = \bar{y} - \beta_1 \bar{x}

Where x̄ and ȳ represent the means of x and y, respectively.
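
As a minimal sketch, these closed-form estimates can be computed directly with NumPy; the small data set below is invented purely for illustration.

```python
import numpy as np

# Toy data, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

x_bar, y_bar = x.mean(), y.mean()

# beta_1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# beta_0 = y_bar - beta_1 * x_bar
beta_0 = y_bar - beta_1 * x_bar

print(beta_0, beta_1)   # roughly 0.30 and 1.94 for this toy data
```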

In multiple linear regression, coefficients are typically calculated using matrix algebra:

\boldsymbol{\beta} = (X^T X)^{-1} X^T y

Here, X is a matrix representing the input data, y is a vector representing the observed outputs, and β is a vector of regression coefficients.
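
A minimal sketch of the normal equation in NumPy follows, assuming X already includes a leading column of ones for the intercept (the toy values are invented); in practice, a numerically stabler routine such as np.linalg.lstsq is usually preferred over forming the inverse explicitly.

```python
import numpy as np

# Toy design matrix: the first column of ones corresponds to the intercept
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.0],
              [1.0, 3.0, 1.0],
              [1.0, 4.0, 3.0]])
y = np.array([6.0, 5.0, 8.0, 13.0])

# Normal equation: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y

# A more stable alternative that avoids computing the inverse directly
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta)
print(beta_lstsq)   # both approaches give the same coefficient vector
```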

Evaluation Metrics

To assess the quality and accuracy of the linear regression model, several evaluation
metrics are commonly used:

1.​ Mean Squared Error (MSE):

MSE = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2

2.​ Root Mean Squared Error (RMSE):

RMSE = \sqrt{MSE}

3. Coefficient of Determination (R² score):

R^2 = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2}

● R² represents the proportion of variance in the dependent variable explained by the independent variables. It ranges between 0 and 1 (though it can be negative if the model fits worse than a horizontal line), where a higher R² indicates a better fit.
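
As an illustrative sketch, all three metrics can be computed in a few lines of NumPy; the actual and predicted values below are placeholders, not results from the text.

```python
import numpy as np

# Placeholder actual and predicted values, purely for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4])

m = len(y_true)

# Mean Squared Error
mse = np.sum((y_true - y_pred) ** 2) / m
# Root Mean Squared Error
rmse = np.sqrt(mse)
# Coefficient of determination: 1 - RSS / TSS
rss = np.sum((y_true - y_pred) ** 2)
tss = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - rss / tss

print(mse, rmse, r2)
```
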
Assumptions of Linear Regression

Linear regression relies on several assumptions for its results to be valid:

1.​ Linearity: There must be a linear relationship between the independent and
dependent variables.
2.​ Independence: Observations are assumed independent of each other.
3.​ Homoscedasticity: The variance of residuals should be constant across all levels of
the predictors.
4.​ Normality: Residuals should be approximately normally distributed.
5.​ No multicollinearity: Independent variables in multiple regression should not be
highly correlated with each other.

Violations of these assumptions might lead to incorrect conclusions or inaccurate predictions. Diagnostic plots (residual plots, Q-Q plots) and tests are often used to check these assumptions.
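
As a rough sketch of such checks, assuming SciPy is available and using placeholder residuals, one might test residual normality and compare the residual spread across the fitted range; a residual plot and Q-Q plot would normally accompany these numbers.

```python
import numpy as np
from scipy import stats

# Placeholder fitted values and residuals; in practice these come from the fitted model
y_pred = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
residuals = np.array([0.1, -0.1, 0.2, -0.2, 0.1, -0.1])

# Normality check: Shapiro-Wilk test on the residuals
# (a very small p-value suggests the residuals are not normally distributed)
stat, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", p_value)

# Crude homoscedasticity check: compare residual variance in the lower and
# upper halves of the fitted values (a large ratio hints at non-constant variance)
order = np.argsort(y_pred)
half = len(order) // 2
low_var = residuals[order[:half]].var()
high_var = residuals[order[half:]].var()
print("Variance ratio:", high_var / low_var)
```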

Extensions and Variations

Linear regression can be extended and adapted to various scenarios and more complex
situations:

● Polynomial Regression: Adds polynomial terms to account for nonlinear relationships.
● Ridge and Lasso Regression: Introduce regularization to reduce overfitting and handle multicollinearity (a brief sketch follows this list).
●​ Elastic Net: Combines ridge and lasso regularization to balance between their
benefits.
●​ Generalized Linear Models (GLM): Extends regression to handle response
variables that have different distributions (e.g., binary outcomes with logistic
regression).
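
As a brief sketch of the regularized variants mentioned above, assuming scikit-learn is available (the data is synthetic and the alpha values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data with two nearly collinear predictors, to mimic multicollinearity
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=100)

# alpha controls the regularization strength
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_, model.intercept_)
```

Larger alpha values shrink the coefficients more strongly, and Lasso can drive some coefficients exactly to zero, which also acts as a form of feature selection.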

Applications

Linear regression has broad applicability in numerous domains:

● Economics: Predicting consumer spending, estimating demand.
● Finance: Forecasting stock prices, investment returns, risk assessment.
●​ Healthcare: Predicting medical outcomes, dosage-effect relationships.
●​ Marketing: Sales forecasting, analyzing pricing strategies.
●​ Social sciences: Understanding relationships between variables like education level
and income.

Linear regression’s simplicity, interpretability, and flexibility make it a fundamental and valuable technique, despite its limitations when modeling more complex, nonlinear relationships or datasets with substantial noise.

In summary, linear regression provides an intuitive yet powerful statistical framework for
modeling and analyzing linear relationships between variables, forming a cornerstone
technique of machine learning and statistical modeling.
