Lecture 2.2: Simple Regression Model-Linear Equation With One Independent Variable
• Regression line: The line that best fits a set of data points according to the least-squares criterion.
Module: Statistical Modeling
Lecture 2.2: Simple Linear Regression Model- linear equation with one
independent variable
We look at scatter plots to get an initial idea of the relationship between two random
variables. Is there an evident pattern to the data? Is the pattern linear or nonlinear? Are there
data points that are not part of the overall pattern? We would characterize the fuel price
relationship as linear (although not perfectly linear) and positive (as premium prices increase,
so do regular unleaded prices). We see one pair of values set slightly apart from the rest, above
and to the right. This happens to be the state of Hawaii.
Figure 2.2: An example of a visual display for bivariate data
The Sample Correlation Coefficient (r)
The product (xi − x̄)(yi − ȳ) will be negative, that is, a negative correlation, when xi tends to be above its mean while the associated yi is below its mean (or vice versa). Conversely, the correlation coefficient will be positive when xi and the associated yi tend to be above their means at the same time or below their means at the same time. To simplify the notation, we define three terms called sums of squares:
Sxx = Σ(xi − x̄)², Syy = Σ(yi − ȳ)², Sxy = Σ(xi − x̄)(yi − ȳ)
where each sum runs over the n observations. Using this notation, the formula for the sample correlation coefficient can be written as:
r = Sxy / √(Sxx · Syy)
Figure 2.3 shows scatter plots for different values of the correlation coefficient "r" (n = 100).
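The sums-of-squares formulas above translate directly into a short Python sketch, using made-up x and y values rather than the lecture's data:

```python
# A minimal sketch of r = Sxy / sqrt(Sxx * Syy); the data are made up.
def correlation(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)                     # Sxx
    syy = sum((yi - y_bar) ** 2 for yi in y)                     # Syy
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # Sxy
    return sxy / (sxx * syy) ** 0.5

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]            # perfectly linear and increasing
print(correlation(x, y))        # 1.0
```

Reversing y gives a perfect negative relationship and r = −1, matching the sign discussion above.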
• Financial planners study correlations between asset classes over time in order to help their clients diversify their portfolios.
• Marketing analysts study correlations between customer online purchases in order to
develop new web advertising strategies.
• Human resources experts study correlations between measures of employee performance
in order to devise new job-training programs.
Problem 2.1: Use the data in Table 2.1 to:
(a) Make a scatter plot of the data.
(b) Describe what the plot suggests about the correlation between X and Y.
(c) Calculate the correlation coefficient "r".
Table 2.1: Hours Worked and Weekly Pay for a sample of 5 college students
Yi = B1 + B2Xi + ui    (2.2)
The variable Y is known as the dependent variable, and the X variables are known as the
independent variables, and u is known as a random, or stochastic, error term. The subscript i
denotes the ith observation.
Equation (2.1) is known as the population, or true, model. It consists of two components: (1) a deterministic component, (B1 + B2X2i + B3X3i + … + BkXki), and (2) a nonsystematic, or random, component (ui). As shown below, (B1 + B2X2i + B3X3i + … + BkXki) can be interpreted as the conditional mean of Yi, E(Yi | X), conditional upon the given X values.
Therefore, Eq. (2.1) states that an individual Yi value is equal to the mean value of the
population of which it is a member plus or minus a random term. The concept of population is
general and refers to a well-defined entity (people, firms, cities, states, countries, and so on)
that is the focus of a statistical analysis.
For example, if Y represents family expenditure on food and X represents family income,
Eq. (2.2) states that the food expenditure of an individual family is equal to the mean food
expenditure of all the families with the same level of income, plus or minus a random
component that may vary from individual to individual and that may depend on several factors.
In Eq. (2.1) B1 is known as the intercept and B2 to Bk are known as the slope coefficients.
Collectively, they are called regression coefficients or regression parameters. In regression
analysis our primary objective is to explain the mean, or average, behavior of Y in relation to
the regressors, that is, how mean Y responds to changes in the values of the X variables. An
individual Y value will hover around its mean value.
Each slope coefficient measures the (partial) rate of change in the mean value of Y for a unit
change in the value of a regressor, holding the values of all other regressors constant, hence
the adjective partial. How many regressors are included in the model depends on the nature of
the problem and will vary from problem to problem.
Table 2.2: Weekly Family Income X, $ (unconditional mean = $121.20 = $7,272/60)
The dark circled points in Figure 2.4 show the conditional mean values of Y against the
various X values. If we join these conditional mean values, we obtain what is known as the
population regression line (PRL), or more generally, the population regression curve. More
simply, it is the regression of Y on X. The adjective “population” comes from the fact that we
are dealing in this example with the entire population of 60 families. Of course, in reality a population may have many families.
Geometrically, then, a population regression curve is simply the locus of the conditional
means of the dependent variable for the fixed values of the explanatory variable(s). More
simply, it is the curve connecting the means of the subpopulations of Y corresponding to the
given values of the regressor X. It can be depicted as in Figure 2.5.
Figure 2.5: Population regression line (data of Table 2.2)
This figure shows that for each X (i.e., income level) there is a population of Y values (weekly
consumption expenditures) that are spread around the (conditional) mean of those Y values.
For simplicity, we are assuming that these Y values are distributed symmetrically around their
respective (conditional) mean values. And the regression line (or curve) passes through these
(conditional) mean values.
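This idea can be sketched in a short simulation. The PRL coefficients below (17 and 0.6) are made up for illustration, not the Table 2.2 values: for each fixed X, a subpopulation of Y values is drawn symmetrically around E(Y|X), and its average lands close to the line.

```python
import random

def conditional_mean(income, n=1000, seed=0):
    """Average of a simulated subpopulation of Y values scattered
    symmetrically (normally) around the assumed PRL E(Y|X) = 17 + 0.6*X."""
    rng = random.Random(seed)
    ys = [17 + 0.6 * income + rng.gauss(0, 5) for _ in range(n)]
    return sum(ys) / n

for income in [80, 100, 120, 140, 160]:
    # each printed mean sits close to 17 + 0.6 * income, tracing out the PRL
    print(income, round(conditional_mean(income), 1))
```

Joining the printed conditional means reproduces (up to sampling noise) the straight population regression line.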
In the LRM it is assumed that the regression coefficients are some fixed numbers and not
random, even though we do not know their actual values. It is the objective of regression
analysis to estimate their values on the basis of sample data.
For our purposes, the term "linear" in the linear regression model refers to linearity in the regression coefficients, the Bs, and not linearity in the Y and X variables. For instance, the Y and X variables can be logarithmic (e.g. ln X2), reciprocal (e.g. 1/X3), or raised to a power (e.g. X3²), where ln stands for natural logarithm, that is, logarithm to the base e.
Linearity in the B coefficients means that they are not raised to any power (e.g. B2²), not divided by other coefficients (e.g. B2/B3), and not transformed, such as ln B4. There are occasions where we may have to consider regression models that are not linear in the regression coefficients.
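A short sketch of this point (all numbers made up): a model with ln X as the regressor is nonlinear in X but still linear in the Bs, so ordinary least squares applies directly.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)
# Y = 2 + 3*ln(x) + u: nonlinear in x, linear in the coefficients
# (the values 2 and 3 are chosen arbitrarily for the demo)
y = 2 + 3 * np.log(x) + rng.normal(0, 0.1, 200)

# OLS still applies: regress y on a constant and ln(x)
X = np.column_stack([np.ones_like(x), np.log(x)])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # estimates close to [2, 3]
```

The transformed regressor is simply treated as another column of data; linearity in the parameters is all OLS requires.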
Figure 2.6: Linear-in-parameter functions
1-Ratio scale: A ratio scale variable has three properties: (1) ratio of two variables, (2) distance between two
variables, and (3) ordering of variables. On a ratio scale if, say, Y takes two values, Y1 and Y2, the ratio Y1/Y2 and
the distance (Y2 – Y1) are meaningful quantities, as are comparisons or orderings such as Y1 ≤ Y2 or Y1 ≥ Y2. Most
economic variables belong to this category. Thus we can talk about whether GDP is greater this year than the last
year, or whether the ratio of GDP this year to the GDP last year is greater than or less than one.
2-Interval scale: Interval scale variables do not satisfy the first property of ratio scale variables. For example,
the distance between two time periods, say, 2007 and 2000 (2007 – 2000) is meaningful, but not the ratio
2007/2000.
3-Ordinal scale: Variables on this scale satisfy the ordering property of the ratio scale, but not the other two
properties. For example, grading systems, such as A, B, C, or income classifications, such as low income, middle
income, and high income, are ordinal scale variables, but quantities such as grade A divided by grade B are not
meaningful.
4-Nominal scale: Variables in this category do not have any of the features of the ratio scale variables.
Variables such as gender, marital status, and religion are nominal scale variables. Such variables are often called
dummy or categorical variables. They are often “quantified” as 1 or 0, 1 indicating the presence of an attribute and
0 indicating its absence. Thus, we can “quantify” gender as male = 1 and female = 0, or vice versa.
As a result, our regression analysis is conditional, that is, conditional on the given values of the
regressors.
5. Poor proxy variables: Although the classical regression model assumes that the variables Y
and X are measured accurately, in practice the data may be plagued by errors of measurement.
Consider, for example, Milton Friedman’s well-known theory of the consumption function. He
regards permanent consumption (Yp) as a function of permanent income (Xp). But since data
on these variables are not directly observable, in practice we use proxy variables, such as
current consumption (Y) and current income (X), which are observable.
Since the observed Y and X may not equal Yp and Xp, there is the problem of errors of
measurement. The disturbance term u may in this case then also represent the errors of
measurement. As we will see in a later chapter, if there are such errors of measurement, they
can have serious implications for estimating the regression coefficients, the β’s.
6. Principle of parsimony: We would like to keep our regression model as simple as possible.
If we can explain the behavior of Y “substantially” with two or three explanatory variables and
if our theory is not strong enough to suggest what other variables might be included, why
introduce more variables? Let ui represent all other variables. Of course, we should not exclude
relevant and important variables just to keep the regression model simple.
7. Wrong functional form: Even if we have theoretically correct variables explaining a
phenomenon and even if we can obtain data on these variables, very often we do not know the
form of the functional relationship between the regressand and the regressors. Is consumption
expenditure a linear function of income or a nonlinear function? If it is the former, Yi = β1 + β2Xi + ui is the proper functional relationship between Y and X, but if it is the latter, Yi = β1 + β2Xi + β3Xi² + ui may be the correct functional form. In two-variable models the functional
form of the relationship can often be judged from the scattergram. But in a multiple regression
model, it is not easy to determine the appropriate functional form, for graphically we cannot
visualize scatter graphs in multiple dimensions.
For all these reasons, the stochastic disturbances ui assume an extremely critical role in
regression analysis, which we will see as we progress.
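The wrong-functional-form problem can be made concrete with a small simulation (all numbers made up): data generated from a quadratic relationship are fit with both a linear and a quadratic specification, and the quadratic one leaves a smaller sum of squared residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
# Made-up data that are truly quadratic in x: Y = 1 + 0.5x + 0.2x^2 + u
y = 1 + 0.5 * x + 0.2 * x**2 + rng.normal(0, 1.0, 300)

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on the columns of X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(((y - X @ b) ** 2).sum())

ssr_linear = ssr(np.column_stack([np.ones_like(x), x]), y)
ssr_quadratic = ssr(np.column_stack([np.ones_like(x), x, x**2]), y)
print(ssr_quadratic < ssr_linear)  # True: the quadratic form fits better
```

In two dimensions a scatter plot would reveal the curvature; with many regressors, comparisons like this one are a practical substitute.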
Having obtained the data, the important question is: how do we estimate the LRM given in
Eq. (2.1)? Suppose we want to estimate a wage function of a group of workers. To explain the
hourly wage rate (Y), we may have data on variables such as gender, ethnicity, union status,
education, work experience, and many others, which are the X regressors. Further, suppose
that we have a random sample of 1,000 workers. How then do we estimate Eq. (2.1)?
The method of Ordinary Least Squares (OLS) is one of the most commonly used techniques
to estimate the coefficients of a linear regression model. The goal of OLS is to find the line
(regression line) that minimizes the sum of the squared differences (errors) between the
observed values of the dependent variable (Y) and the values predicted by the regression model.
The linear regression model can be written as:
Yi = b0 + b1Xi + ui
Where Yi is the dependent variable, Xi is the independent variable, b0 is the intercept, b1 is the slope coefficient, and ui is the error term.
Objective of OLS
The objective is to estimate the coefficients b0 and b1 such that the sum of the squared errors
(residuals) is minimized. The error (or residual) for each data point i is:
ui = Yi – (b0 + b1Xi)
Minimizing the sum of squared errors: The goal is to minimize the sum of squared residuals, SSR = Σ(Yi − b̂0 − b̂1Xi)², with respect to b̂0 and b̂1. To do this, we take the partial derivatives of the SSR with respect to b̂0 and b̂1 and set them equal to zero.
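A quick numerical check, with made-up data, that the coefficients obtained this way do minimize the SSR: perturbing either coefficient away from the OLS solution raises it.

```python
# Made-up data points roughly on a line
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.9]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# OLS estimates from the first-order conditions
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

def ssr(a0, a1):
    """Sum of squared residuals for candidate coefficients a0, a1."""
    return sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))

# Moving either coefficient away from the OLS solution increases the SSR
print(ssr(b0, b1) <= ssr(b0 + 0.1, b1))  # True
print(ssr(b0, b1) <= ssr(b0, b1 + 0.1))  # True
```

Because SSR is a convex quadratic in (b0, b1), the point where both partial derivatives vanish is its global minimum.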
Partial derivatives: First, take the derivative of SSR with respect to b̂0 and set it to zero:
∂SSR/∂b̂0 = −2 Σ(Yi − b̂0 − b̂1Xi) = 0
• Simplifying: ΣYi = n·b̂0 + b̂1 ΣXi
• Dividing by n: Ȳ = b̂0 + b̂1X̄, which gives b̂0 = Ȳ − b̂1X̄.
Likewise, setting ∂SSR/∂b̂1 = −2 ΣXi(Yi − b̂0 − b̂1Xi) = 0 yields the second normal equation, ΣXiYi = b̂0 ΣXi + b̂1 ΣXi².
• Final Estimates:
b̂1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
b̂0 = Ȳ − b̂1X̄
These formulas give us the estimated coefficients b^0 and b^1 for the regression line,
which minimizes the sum of squared errors between the observed values and the predicted
values.
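These final formulas can be sketched directly in Python. The data below are made up; points lying exactly on a line are used so the recovery is easy to verify.

```python
def ols(x, y):
    """OLS slope and intercept: b1 = Sxy/Sxx, b0 = y_bar - b1*x_bar."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# On points lying exactly on y = 5 + 2x (made up), OLS recovers the line
x = [0, 1, 2, 3, 4]
y = [5 + 2 * xi for xi in x]
print(ols(x, y))  # (5.0, 2.0)
```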
• Interpretation
The intercept and slope of an estimated regression can provide useful information. For
example:
Sales = 268 + 7.37 Ads
Each extra $1 million of advertising will generate $7.37 million of sales on average. The
firm would average $268 million of sales with zero advertising. However, the intercept may
not be meaningful because Ads = 0 may be outside the range of observed data.
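As a quick arithmetic check of this interpretation, using the estimated line from the text:

```python
# The text's estimated line: Sales = 268 + 7.37 * Ads (both in $ millions)
def predicted_sales(ads):
    return 268 + 7.37 * ads

print(predicted_sales(10))   # 341.7: $10M of advertising predicts ~$341.7M of sales
print(predicted_sales(0))    # 268.0: the intercept, which may not be meaningful
```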
Problem 2.2:
The provided data in the following table shows the relationship between Years of Schooling
and Mean Wage. Use the data to model this relationship using a simple linear regression.
• David P. Doane and Lori E. Seward (2016). Applied Statistics in Business and Economics, 5th edition. McGraw-Hill, Boston.
• Damodar N. Gujarati and Dawn C. Porter (2009). Basic Econometrics, 5th edition. McGraw-Hill, Boston.
• Damodar Gujarati (2012). Econometrics by Example. Palgrave Macmillan, London.
• Neil A. Weiss (2012). Introductory Statistics, 9th edition. Pearson, Boston.
• Neil A. Weiss (2017). Introductory Statistics, 10th edition. Pearson, Boston.