Linear Regression (1)
Dr. Ishita Chatterjee, Associate Professor, Department of Applied
Psychology, University of Calcutta, Rajabazar Science College
(Rashbehari Sikhsha Prangan), Kolkata, West Bengal, India
INTRODUCTION
• The term regression was first used as a statistical concept in 1877 by Sir Francis
Galton.
• Galton made a study that showed the height of children born to tall parents
tends to move back, or “regress”, towards the mean height of the population. He
designated the word regression as the name of the general process of predicting
one variable (the height of the children) from another (the height of the parent).
Later statisticians coined the term multiple regression to describe the process by
which several variables are used to predict another.
• In regression analysis, researchers develop an estimating equation, that is,
a mathematical formula that relates the known variables to the unknown variable. Then, after learning the
pattern of relationship, the researchers can apply correlation analysis to
determine the degree to which the variables are related. Correlation analysis,
then, describes how well the estimating equation actually describes the
relationship.
IMPORTANCE
• The goal of the Linear Regression analysis is to be able to predict when different
behaviors will occur. This translates into predicting when someone has one score
on a variable and when they have a different score.
• It is the statistical procedure for using a relationship to predict scores.
• Linear regression is commonly used in basic and applied research, particularly in
educational, industrial and clinical settings.
• For example: students take the Scholastic Aptitude Test (SAT) when
applying to some colleges because previous research has shown
that SAT scores are somewhat positively correlated with college grades. Therefore,
through regression techniques, the SAT scores of applying students are used to
predict their future college performance. If the predicted grades are too low, the
student is not admitted to the college.
• This approach is also used when people take a test when applying for a job so that
the employer can predict who will be better workers, or when clinical patients are
tested to identify those at risk of developing emotional problems.
CONTD…
• Regression procedures center around drawing the linear regression line, the
summary line drawn through a scatterplot. Regression procedures are used in
conjunction with the Pearson correlation, where r is the statistic that summarizes
the linear relationship.
• The regression line is the line on the scatterplot that summarizes the linear
relationship.
• It is always necessary to compute r first to determine whether a relationship
exists or not.
• If the correlation coefficient is not 0 and passes the inferential test, then linear
regression can be performed to further summarize the relationship.
• The linear regression line is the straight line that summarizes the linear
relationship in a scatterplot by, on average, passing through the center of the Y
scores at each X.
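As a rough illustration of the "compute r first" step, the sketch below computes Pearson's r with the raw-score formula and then a t statistic with n − 2 degrees of freedom. The slides do not name the inferential test, so the t test is an assumption here; the function name `pearson_r` and the scores are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation using the raw-score (computational) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical scores, used only for illustration.
hours = [1, 2, 3, 4, 5]
marks = [2, 4, 5, 4, 5]

r = pearson_r(hours, marks)

# Assumed inferential test: t statistic for r with n - 2 degrees of freedom.
n = len(hours)
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

print(f"r = {r:.4f}, t({n - 2}) = {t_stat:.4f}")
```

The t value would then be compared with a critical value from a t table; only if it is significant would one go on to fit the regression line.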
DEFINITION
• Regression analysis is a statistical tool used to model the relationship between a dependent
variable (DV) and one or more independent variables (IV). Specifically, regression analysis
describes how the typical value of the dependent variable changes when any one of the
independent variables increases or decreases, while holding the other independent variables
constant.
For example, regression analysis has been utilized to demonstrate that sensory processing
dysfunction significantly influences externalizing behavior among preschool-aged children with
autism (Tseng, Fu, Lu & Shieh, 2011).
• Independent or ‘Predictor’ variable, refers to the variable used as a predictor, denoted by X.
• Dependent or ‘Outcome’ variable, refers to the variable used as a predicted variable, denoted
by Y.
• Here, Y ‘depends’ on X. In other words, if X is predicting Y then typically it is said that ‘Y’ is
regressed on ‘X’.
• It is important to remember that relationships found by regression are relationships of
association but not necessarily of cause and effect. Unless there is a specific reason for
believing that the values of the dependent variable are caused by the values of the independent
variable(s), do not infer causality from the relationships found by regression.
REGRESSION EQUATION
• The key relationship in a regression is the regression equation. A regression
equation contains regression parameters whose values are estimated using data.
The estimated parameters define the line that summarizes the relationship between
X and Y.
• The equation helps to predict the score value of Y variable in correspondence with
any value of the X variable.
Y = X1 + X2 + X3
Y = Dependent/ Predicted/Response/Outcome – Variable
X = Independent/Predictor/Explanatory/Input - Variable
ASSUMPTIONS FOR REGRESSION
• Assumption of Normality: The values of the dependent variable Y
should be normally distributed for each value of the independent
variable X.
• Assumption of Homoscedasticity: The variability of Y (variance or
standard deviation) should be the same for each value of X.
• Assumption of Linearity: The relationship between the two
variables should be linear.
• Assumption of Independence: The observations should be
independent.
LINEAR REGRESSION
• The simplest form of regression is simple linear regression (bivariate regression).
• Simple linear regression analysis is a common way to discover a relationship between one dependent variable and one
independent variable.
• Linear regression attempts to draw a line that comes closest to the data by finding the slope and intercept that
define the line and minimize regression errors.
• Linear regression is used for finding linear relationships between target and one or more predictors.
Ŷ = a + bX
Ŷ = the predicted value of the dependent (criterion) variable Y
a = Y-intercept (the value of Ŷ when X = 0) of the best-fitting line
b = the slope of the line. The slope is the change in Ŷ as X increases by one unit. When b is positive there is a
positive association; when b is negative there is a negative association
X = the value of the independent (predictor) variable.
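The idea that the fitted slope and intercept "minimize regression errors" can be sketched as follows. This is a minimal illustration that assumes the usual least-squares criterion (sum of squared errors); the helper names `fit_line` and `sum_squared_errors` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for Y-hat = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares, in deviation-score form.
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = sp_xy / ss_x
    a = mean_y - b * mean_x
    return a, b

def sum_squared_errors(a, b, xs, ys):
    """Total squared distance between observed Y and predicted Y-hat."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, used only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
best = sum_squared_errors(a, b, xs, ys)
# Nudging the slope away from the fitted value increases the error.
worse = sum_squared_errors(a, b + 0.1, xs, ys)

print(f"Y-hat = {a:.2f} + {b:.2f}X, SSE = {best:.2f}, perturbed SSE = {worse:.2f}")
```

For these scores the fitted line is Ŷ = 2.2 + 0.6X, and any other slope or intercept gives a larger sum of squared errors.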
• The main limitation is that simple linear regression only works when we have TWO variables. Most
real-world phenomena are multi-factorial in nature, meaning there is more than
one factor that impacts on, or causes changes in, the dependent variable.
Multiple regression adds one term per predictor; with two predictors:
Ŷ = a + b₁X₁ + b₂X₂
Ŷ = the predicted value of the dependent variable Y
a = Y-intercept (the value of Ŷ when X₁ = 0 and X₂ = 0)
b₁ = the slope for X₁: the change in Ŷ as X₁ increases by one unit while X₂ is held constant.
When b₁ is positive there is a positive association; when b₁ is negative there is a negative association
b₂ = the slope for X₂: the change in Ŷ as X₂ increases by one unit while X₁ is held constant.
When b₂ is positive there is a positive association; when b₂ is negative there is a negative association
X₁, X₂ = the values of the independent variables
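A two-predictor fit can be sketched by solving the least-squares normal equations directly. The slides do not give the estimation formulas for b₁ and b₂, so the approach below is an assumption, and the function name `fit_two_predictors` and the noise-free data are hypothetical.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares a, b1, b2 for Y-hat = a + b1*X1 + b2*X2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Deviation scores around each mean.
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(u * u for u in d1)
    s22 = sum(u * u for u in d2)
    s12 = sum(u * v for u, v in zip(d1, d2))
    s1y = sum(u * v for u, v in zip(d1, dy))
    s2y = sum(u * v for u, v in zip(d2, dy))
    # Solve the 2x2 normal equations; det is zero if X1 and X2 are perfectly collinear.
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2

# Hypothetical data built from Y = 1 + 2*X1 + 3*X2 with no noise.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y = [9, 8, 19, 18, 26]

a, b1, b2 = fit_two_predictors(x1, x2, y)
print(f"Y-hat = {a:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")
```

Because the data were generated without noise, the fit recovers the generating coefficients; with real scores the estimates would only approximate the underlying relationship.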
SUMMARY OF STEPS IN LINEAR REGRESSION
• Compute r.
• Compute the slope, b, where b = [N(ΣXY) – (ΣX)(ΣY)] ÷ [N(ΣX²) – (ΣX)²]
• The definitional formula for the standard error of the estimate is: SŶ = √[Σ(Y – Ŷ)² ÷ N]
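The slope and standard-error formulas above can be sketched in code. The intercept step, a = Ȳ − bX̄, is not in the list above and is assumed here as the standard companion formula; the function name `regression_summary` and the scores are hypothetical.

```python
import math

def regression_summary(xs, ys):
    """Return slope b, intercept a, and the standard error of the estimate."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Computational formula for the slope.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Assumed step: intercept from the two means and the slope.
    a = sum_y / n - b * (sum_x / n)
    predicted = [a + b * x for x in xs]
    # Definitional formula: square root of the mean squared prediction error.
    s_est = math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(ys, predicted)) / n)
    return b, a, s_est

# Hypothetical scores, used only for illustration.
xs = [2, 4, 6, 8]
ys = [3, 7, 5, 10]

b, a, s_est = regression_summary(xs, ys)
print(f"b = {b:.2f}, a = {a:.2f}, standard error of estimate = {s_est:.4f}")
```

Note that this divides by N, as in the definitional formula above; some textbooks and software divide by N − 2 instead, which gives a slightly larger value.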
PROBLEM ON LINEAR REGRESSION
A group of individuals obtained the following scores on the two tests A
& B. Calculate the regression equations for both the tests.
Individual    1    2    3    4    5
Test A (X)    10   11   12   9    8
Test B (Y)    12   18   20   10   10
STEP-1: Computation of Data and Square the Computed Data
Individuals   Test Scores
              X     Y     X²     Y²     XY
1             10    12    100    144    120
2             11    18    121    324    198
3             12    20    144    400    240
4             9     10    81     100    90
5             8     10    64     100    80
Total         50    70    510    1068   728
Here,
n = 5, X̄ = ΣX ÷ n = 50 ÷ 5 = 10, Ȳ = ΣY ÷ n = 70 ÷ 5 = 14
(ΣX)² = (50)² = 2500, (ΣY)² = (70)² = 4900
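The column totals and means for the five individuals can be checked with a few lines of code. This is only a verification sketch of the arithmetic; the variable names are hypothetical.

```python
# Scores of the five individuals on Test A (X) and Test B (Y).
test_a = [10, 11, 12, 9, 8]
test_b = [12, 18, 20, 10, 10]

n = len(test_a)
sum_x = sum(test_a)
sum_y = sum(test_b)
sum_x2 = sum(x * x for x in test_a)
sum_y2 = sum(y * y for y in test_b)
sum_xy = sum(x * y for x, y in zip(test_a, test_b))

mean_x = sum_x / n
mean_y = sum_y / n

print(f"n = {n}, sum X = {sum_x}, sum Y = {sum_y}")
print(f"sum X^2 = {sum_x2}, sum Y^2 = {sum_y2}, sum XY = {sum_xy}")
print(f"mean X = {mean_x}, mean Y = {mean_y}")
print(f"(sum X)^2 = {sum_x ** 2}, (sum Y)^2 = {sum_y ** 2}")
```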
STEP-3: Computation of Regression Equation
Regression Equation – ‘X’ on ‘Y’:
SX = ΣX² – (ΣX)² ÷ n
= 510 – (2500 ÷ 5)
= 510 – 500 = 10
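One way to carry the calculation through for both regression equations is sketched below. It assumes the standard deviation-score formulas (slope = sum of cross-products ÷ sum of squares of the predictor, intercept = mean of the predicted variable − slope × mean of the predictor), which may differ in layout from the hand calculation; names such as `ss_x` and `sp_xy` are hypothetical.

```python
# Scores of the five individuals on Test A (X) and Test B (Y).
test_a = [10, 11, 12, 9, 8]
test_b = [12, 18, 20, 10, 10]

n = len(test_a)
mean_x = sum(test_a) / n
mean_y = sum(test_b) / n

# Sums of squares and cross-products in raw-score form.
ss_x = sum(x * x for x in test_a) - sum(test_a) ** 2 / n
ss_y = sum(y * y for y in test_b) - sum(test_b) ** 2 / n
sp_xy = sum(x * y for x, y in zip(test_a, test_b)) - sum(test_a) * sum(test_b) / n

# Regression of Y on X: predicts Test B from Test A.
b_yx = sp_xy / ss_x
a_yx = mean_y - b_yx * mean_x

# Regression of X on Y: predicts Test A from Test B.
b_xy = sp_xy / ss_y
a_xy = mean_x - b_xy * mean_y

print(f"Y on X: Y-hat = {a_yx:.3f} + {b_yx:.3f}X")
print(f"X on Y: X-hat = {a_xy:.3f} + {b_xy:.3f}Y")
```

With these scores the sketch gives approximately Ŷ = −14 + 2.8X and X̂ = 5.545 + 0.318Y. The product of the two slopes equals r², so the two lines coincide only when the correlation is perfect.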