Regression Analysis - SSB
In regression analysis we have two types of variables: i) dependent (or explained) variable, and ii)
independent (or explanatory) variable. As the names (explained and explanatory) suggest, the
dependent variable is explained by the independent variable.
In the simplest case of regression analysis there is one dependent variable and one independent
variable. Let us assume that consumption expenditure of a household is related to the household
income. For example, it can be postulated that as household income increases, expenditure also
increases. Here consumption expenditure is the dependent variable and household income is the
independent variable.
Usually, we denote the dependent variable as Y and the independent variable as X. Suppose we took up
a household survey and collected n pairs of observations on X and Y. The next step is to find out the
nature of the relationship between X and Y.
The relationship between X and Y can take many forms. The general practice is to express the
relationship in terms of some mathematical equation. The simplest of these equations is the linear
equation. This means that the relationship between X and Y is in the form of a straight line and is termed
linear regression. When the equation represents curves (not a straight line) the regression is called non-
linear or curvilinear.
Now the question arises, 'How do we identify the equation form?' There is no hard and fast rule as such.
The form of the equation depends upon the reasoning and assumptions made by us. However, we may
plot the X and Y variables on a graph paper to prepare a scatter diagram. From the scatter diagram, the
location of the points on the graph paper helps in identifying the type of equation to be fitted. If the
points are more or less in a straight line, then a linear equation is assumed. On the other hand, if the
points are not in a straight line and are in the form of a curve, a suitable non-linear equation (which
resembles the scatter) is assumed.
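A scatter diagram of this kind can be prepared with a few lines of code. The sketch below uses the matplotlib library; the income and expenditure figures are made up purely for illustration.

```python
# A minimal sketch of a scatter diagram used to judge the form of the
# relationship between X and Y. All data values are hypothetical.
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt

income = [120, 150, 180, 210, 260, 300, 340, 400]        # X (hypothetical)
expenditure = [110, 130, 150, 175, 205, 230, 255, 290]   # Y (hypothetical)

plt.scatter(income, expenditure)
plt.xlabel("Household income (X)")
plt.ylabel("Consumption expenditure (Y)")
plt.title("Scatter diagram")
plt.savefig("scatter.png")
# Here the cloud of points lies roughly along a straight line, so a linear
# equation would be assumed; a curved cloud would suggest a non-linear form.
```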
We have to take another decision, that is, the identification of dependent and independent variables.
This again depends on the logic put forth and purpose of analysis: whether 'Y depends on X' or 'X
depends on Y'. Thus, there can be two regression equations from the same set of data. These are i) Y is
assumed to be dependent on X (this is termed 'Y on X' line), and ii) X is assumed to be dependent on Y
(this is termed 'X on Y' line).
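The fact that one data set yields two regression lines can be verified numerically. The sketch below fits both lines by least squares using numpy; the data points are hypothetical.

```python
# Sketch: the same data set yields two different regression lines depending
# on which variable is treated as dependent. Data are hypothetical.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

b_yx, a_yx = np.polyfit(x, y, 1)   # 'Y on X' line: Y = a + b*X
b_xy, a_xy = np.polyfit(y, x, 1)   # 'X on Y' line: X = a' + b'*Y

print(f"Y on X: Y = {a_yx:.3f} + {b_yx:.3f} X")
print(f"X on Y: X = {a_xy:.3f} + {b_xy:.3f} Y")
# The two lines coincide only when the points lie exactly on a straight line;
# the product of the two slopes equals the squared correlation coefficient.
```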
Regression analysis can be extended to cases where one dependent variable is explained by a number of
independent variables. Such a case is termed multiple regression. In advanced regression models there
can be a number of both dependent as well as independent variables.
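A multiple regression of this kind can be sketched with ordinary least squares. In the example below, expenditure is assumed to depend on two hypothetical explanatory variables, income and family size; all numbers are illustrative only.

```python
# Sketch: multiple regression with two independent variables, fitted by
# ordinary least squares using numpy. All numbers are hypothetical.
import numpy as np

income = np.array([100, 150, 200, 250, 300], dtype=float)      # X1
family_size = np.array([2, 3, 4, 4, 5], dtype=float)           # X2
expenditure = np.array([90, 130, 170, 200, 245], dtype=float)  # Y

# Design matrix with a column of ones for the intercept term
X = np.column_stack([np.ones_like(income), income, family_size])
coef, *_ = np.linalg.lstsq(X, expenditure, rcond=None)
a, b1, b2 = coef
print(f"Y = {a:.2f} + {b1:.3f}*income + {b2:.3f}*family_size")
```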
You may by now be wondering why the term 'regression', which literally means 'a going back', is used.
This name is associated with a phenomenon that was observed in a study on the relationship between
the stature of fathers (X) and sons (Y). It was observed that the average stature of sons of the tallest fathers has a tendency to be
less than the average stature of these fathers. On the other hand, the average stature of sons of the
shortest fathers has a tendency to be more than the average stature of these fathers. This phenomenon
was called regression towards the mean. Although this appeared somewhat strange at that time, it was
found later that this is due to natural variation within subgroups of a group and the same phenomenon
occurred in most problems and data sets. The explanation is that many tall men come from families with
average stature due to vagaries of natural variation and they produce sons who are shorter than them
on the whole. A similar phenomenon takes place at the lower end of the scale.
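The phenomenon can be reproduced in a small simulation. The sketch below assumes that a son's height is pulled halfway back towards the population mean plus random variation; the pull factor of 0.5 and all the numbers are illustrative assumptions, not estimates from any real study.

```python
# Sketch: simulating 'regression towards the mean'. The model below
# (sons pulled halfway back to the mean, plus noise) is an assumption
# made purely for illustration.
import random

random.seed(42)
mean_height = 170.0
fathers = [random.gauss(mean_height, 7.0) for _ in range(10000)]
sons = [mean_height + 0.5 * (f - mean_height) + random.gauss(0.0, 5.0)
        for f in fathers]

# Sons of the tallest fathers: shorter than their fathers on average
tall = [(f, s) for f, s in zip(fathers, sons) if f > 185.0]
avg_tall_fathers = sum(f for f, _ in tall) / len(tall)
avg_tall_sons = sum(s for _, s in tall) / len(tall)

# Sons of the shortest fathers: taller than their fathers on average
short = [(f, s) for f, s in zip(fathers, sons) if f < 155.0]
avg_short_fathers = sum(f for f, _ in short) / len(short)
avg_short_sons = sum(s for _, s in short) / len(short)

print(avg_tall_fathers, avg_tall_sons)
print(avg_short_fathers, avg_short_sons)
```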
The simplest relationship between X and Y could perhaps be a linear deterministic function given by

Yi = a + bXi    ... (1)

In the above equation X is the independent variable or explanatory variable and Y is the dependent
variable or explained variable. You may recall that the subscript i represents the observation number; i
ranges from 1 to n. Thus Y1 is the first observation of the dependent variable, Xi is the i-th observation of
the independent variable, and so on. Equation (1) implies that Y is completely determined by X and the
parameters a and b. Suppose we have parameter values a = 3 and b = 0.75, then our linear equation is Y
= 3 + 0.75 X. From this equation we can find out the value of Y for given values of X. For example, when
X = 8, we find that Y = 9. Thus if we have different values of X then we obtain corresponding Y values on
the basis of (1). Again, if Xi is the same for two observations, then the value of Yi will also be identical
for both the observations. A plot of Y on X will show no deviation from the straight line with intercept 'a'
and slope 'b'.
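The deterministic function with the parameter values used above can be written out directly in code:

```python
# The deterministic linear function of equation (1) with the parameter
# values from the text: a = 3, b = 0.75.
def y_deterministic(x, a=3.0, b=0.75):
    """Return Y = a + b*X; identical X always yields identical Y."""
    return a + b * x

print(y_deterministic(8))   # 9.0, as in the text
print(y_deterministic(8))   # same X, same Y: no deviation from the line
```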
If we look into the deterministic model given by (1) we find that it may not be appropriate for describing
economic interrelationship between variables. For example, let Y = consumption and X = income of
households. Suppose you record your income and consumption for successive months. For the months
when your income is the same, does your consumption remain the same? The point we are trying to make is
that economic relationships involve certain randomness.
Therefore, we assume the relationship between Y and X to be stochastic and add one error term in (1).
Thus our stochastic model is

Yi = a + bXi + ei    ... (2)

where ei is the error term. In real life situations ei represents randomness in human behavior and
excluded variables, if any, in the model. Remember that the right-hand side of (2) has two parts, viz., i) the
deterministic part (that is, a + bXi), and ii) the stochastic or random part (that is, ei). Equation (2) implies
that even if Xi remains the same for two observations, Yi need not be the same because of different ei.
Thus, if we plot (2) on a graph paper the observations will not remain on a straight line.
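The stochastic model can be simulated in a few lines. The sketch below keeps a = 3 and b = 0.75 from the earlier example; drawing the error term from a normal distribution with mean zero is an illustrative assumption.

```python
# Sketch of the stochastic model of equation (2): Yi = a + b*Xi + ei.
# The normal error distribution and sigma value are assumptions made
# for illustration.
import random

random.seed(0)

def y_stochastic(x, a=3.0, b=0.75, sigma=1.0):
    """Return one observation of Y = a + b*X + e, with random error e."""
    return a + b * x + random.gauss(0.0, sigma)

# Even with the same X, repeated observations of Y differ because of ei:
obs = [y_stochastic(8) for _ in range(3)]
print(obs)  # three different values scattered around the line value 9
```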