
Lecture 2.2: Simple Regression Model: Linear Equation with One Independent Variable


Learning Outcomes: When you finish this lecture, you should be able to:
1- Understand the linear relationship, its visual display, and its mathematical formula.
2- Calculate a correlation coefficient.
3- Recognize the nature of regressors and the error term.
4- Understand OLS technique.
5- Estimate the simple regression equation and predict the Yi value.
Short content: This lecture will develop the following concepts:
2.1. What Does a linear relationship Look Like?
2.2. The Visual Display of a Linear Relationship.
2.3. Correlation Coefficient.
2.4. The linear regression model (LRM).
2.5. The meaning of linear regression (Linear coefficients).
2.6. The nature of X variables or regressors.
2.7. The nature of the error term.
2.8. The method of ordinary least squares (OLS).

2.1. What Does a linear relationship Look Like?


To understand linear regression, let’s first review linear equations with one independent
variable. The general form of a linear equation1 with one independent variable can be written as:
y = b0 + b1x,
where b0 and b1 are constants (fixed numbers), x is the independent variable, and y is the
dependent variable. The graph of a linear equation with one independent variable is a straight
line, or simply a line; furthermore, any nonvertical line can be represented by such an equation.
Examples of linear equations with one independent variable are y = 4 + 0.2x, y = −1.5 − 2x,
and y = −3.4 + 1.8x. The graphs of these three linear equations are shown in Fig 2.1.
For a linear equation y = b0 + b1x, the number b0 is the y-value of the point of intersection
of the line and the y-axis. The number b1 measures the steepness of the line; more precisely, b1
indicates how much the y-value changes when the x-value increases by 1 unit.

1 • Regression line: The line that best fits a set of data points according to the least-squares criterion.
• Regression equation: The equation of the regression line.


Module: Statistical Modeling
Lecture 2.2: Simple Linear Regression Model- linear equation with one
independent variable
Linear equations with one independent variable occur frequently in applications of
mathematics to many different fields, including the management, life, and social sciences, as
well as the physical and mathematical sciences.
Figure 2.1. Graphs of linear equations.

In the management field, we find many examples of linear equations. Examples of quantitative
variables in business that might be related to each other include: spending on advertising and
sales revenue, produce delivery time and percentage of spoiled produce, premium and regular
gas prices, and preventive maintenance spending and manufacturing productivity rates.
Example 2.1: Sales and Advertising Spending: A business might find that for every $1,000
spent on advertising, it gains 100 additional sales. This represents a linear relationship between
ad spend and sales.
Sales = 100 + 0.1 × Ad Spend
For every $1,000 spent on advertising, sales increase by 100 units. The slope of 0.1 indicates
the rate of change, and the intercept 100 represents the base sales without advertising.
Example 2.2: Cost of Goods Sold (COGS) and Production Volume: For every unit produced,
the cost increases by a fixed amount, such as $10 per unit, reflecting a linear relationship.
COGS = 500 + 10 × Units Produced
The fixed cost is $500, and for each additional unit produced, costs increase by $10. These
equations represent simple linear models in which the dependent variable changes at a constant rate.
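The two business examples above can be sketched as plain Python functions, using the figures given in the examples:

```python
def sales(ad_spend):
    """Example 2.1: sales as a linear function of advertising spend ($)."""
    return 100 + 0.1 * ad_spend

def cogs(units):
    """Example 2.2: cost of goods sold as a linear function of units produced."""
    return 500 + 10 * units

# Each extra $1,000 of advertising adds 0.1 * 1000 = 100 units of sales.
print(sales(1000) - sales(0))  # ~100 additional units
# Each extra unit produced adds $10 on top of the $500 fixed cost.
print(cogs(50))  # 1000
```

The constant slope is what makes these models linear: the marginal effect of one more dollar of advertising, or one more unit produced, is the same everywhere.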

2.2. The Visual Display of a Linear Relationship


Analyzing a relationship between two variables, or bivariate data, typically begins with a
scatter plot that displays each observed data pair (xi, yi) as a dot on an X-Y grid. This diagram
provides a visual indication of the strength of the relationship or association between the two
random variables. This simple display requires no assumptions or computation. A scatter plot
is typically the precursor to more complex analytical techniques. Figure 2.2 shows a scatter
plot comparing the average price per gallon of regular unleaded gasoline to the average price
per gallon of premium gasoline for all 50 U.S. states.

We look at scatter plots to get an initial idea of the relationship between two random
variables. Is there an evident pattern to the data? Is the pattern linear or nonlinear? Are there
data points that are not part of the overall pattern? We would characterize the fuel price
relationship as linear (although not perfectly linear) and positive (as premium prices increase,
so do regular unleaded prices). We see one pair of values set slightly apart from the rest, above
and to the right. This happens to be the state of Hawaii.
Figure.2.2. An example of a visual display for bivariate data

2.3. Correlation Coefficient


A visual display is a good first step in analysis, but we would also like to quantify the strength
of the association between two variables. Therefore, accompanying the scatter plot is the
sample correlation coefficient (also called the Pearson correlation coefficient). This statistic,
denoted r, measures the strength and direction of the linear relationship between two random
variables X and Y. It ranges over the interval [-1, +1]:
r = 1: Perfect positive linear relationship (as one variable increases, the other increases).
r = -1: Perfect negative linear relationship (as one variable increases, the other decreases).
r = 0: No linear relationship.
The closer the value of r is to 1 or -1, the stronger the linear relationship. It’s commonly used
in statistics to analyze and interpret data.

The Sample Correlation Coefficient (r)

The sample correlation coefficient is built from the products of deviations from the means:

r = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² Σ(yi − ȳ)²]

The product (xi − x̄)(yi − ȳ) will be negative, that is, a negative correlation, when xi tends to be above its
mean while the associated yi is below its mean. Conversely, the correlation coefficient will be
positive when xi and the associated yi tend to be above their means at the same time or below
their means at the same time. To simplify the notation, we define three terms called sums of
squares:

SSxx = Σ(xi − x̄)²,  SSyy = Σ(yi − ȳ)²,  SSxy = Σ(xi − x̄)(yi − ȳ)

Using this notation, the formula for the sample correlation coefficient can be written as:

r = SSxy / √(SSxx · SSyy)

Figure 2.3 shows scatter plots for different values of the correlation coefficient r (n = 100).
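The sums-of-squares form of r can be implemented directly. A minimal sketch in plain Python:

```python
import math

def correlation(x, y):
    """Sample correlation coefficient r = SSxy / sqrt(SSxx * SSyy)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - xbar) ** 2 for xi in x)
    ss_yy = sum((yi - ybar) ** 2 for yi in y)
    ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# A perfect positive linear relationship gives r = 1; a perfect negative one gives r = -1.
x = [1, 2, 3, 4, 5]
print(correlation(x, [2 * xi + 1 for xi in x]))  # 1.0
print(correlation(x, [-2 * xi for xi in x]))     # -1.0
```

For data that are only roughly linear, r falls strictly between -1 and +1, with values near 0 indicating little linear association.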

Correlation analysis has many business applications. For example:

• Financial planners study correlations between asset classes over time in order to help
their clients diversify their portfolios.
• Marketing analysts study correlations between customer online purchases in order to
develop new web advertising strategies.
• Human resources experts study correlations between measures of employee performance
in order to devise new job-training programs.
Problem 2.1: Use the data in Table 2.1 to:
(a) Make a scatter plot of the data.
(b) Describe what the plot suggests about the correlation between X and Y.
(c) Calculate the correlation coefficient r.
Table 2.1: Hours Worked and Weekly Pay for a sample of 5 college students

2.4. The linear regression model (LRM):


The LRM in its general form may be written as:

Yi = B1 + B2X2i + B3X3i + … + BkXki + ui ………….(2.1)

while its simple form takes the following formula:

Yi = B1 + B2Xi + ui ………….(2.2)

The variable Y is known as the dependent variable, and the X variables are known as the
independent variables, and u is known as a random, or stochastic, error term. The subscript i
denotes the ith observation.
Equation (2.1) is known as the population or true model. It consists of two components: (1)
a deterministic component, (B1 + B2X2i + B3X3i + … + BkXki), and (2) a nonsystematic, or
random, component (ui). As shown below, (B1 + B2X2i + B3X3i + … + BkXki) can be interpreted
as the conditional mean of Yi, E(Yi|X), conditional upon the given X values.
Therefore, Eq. (2.1) states that an individual Yi value is equal to the mean value of the
population of which it is a member plus or minus a random term. The concept of population is
general and refers to a well-defined entity (people, firms, cities, states, countries, and so on)
that is the focus of a statistical analysis.

For example, if Y represents family expenditure on food and X represents family income,
Eq. (2.2) states that the food expenditure of an individual family is equal to the mean food
expenditure of all the families with the same level of income, plus or minus a random
component that may vary from individual to individual and that may depend on several factors.
In Eq. (2.1) B1 is known as the intercept and B2 to Bk are known as the slope coefficients.
Collectively, they are called regression coefficients or regression parameters. In regression
analysis our primary objective is to explain the mean, or average, behavior of Y in relation to
the regressors, that is, how mean Y responds to changes in the values of the X variables. An
individual Y value will hover around its mean value.
Each slope coefficient measures the (partial) rate of change in the mean value of Y for a unit
change in the value of a regressor, holding the values of all other regressors constant, hence
the adjective partial. How many regressors are included in the model depends on the nature of
the problem and will vary from problem to problem.
TABLE 2.2 Weekly Family Income X, $ (unconditional mean = $121.20 = $7,272/60)

FIGURE 2.4 Conditional distribution of expenditure for various levels of income

The dark circled points in Figure 2.4 show the conditional mean values of Y against the
various X values. If we join these conditional mean values, we obtain what is known as the
population regression line (PRL), or more generally, the population regression curve. More
simply, it is the regression of Y on X. The adjective “population” comes from the fact that we

are dealing in this example with the entire population of 60 families. Of course, in reality a
population may contain many more families.
Geometrically, then, a population regression curve is simply the locus of the conditional
means of the dependent variable for the fixed values of the explanatory variable(s). More
simply, it is the curve connecting the means of the subpopulations of Y corresponding to the
given values of the regressor X. It can be depicted as in Figure 2.5.
FIGURE 2.5 Population regression line (data of Table 2.2).

This figure shows that for each X (i.e., income level) there is a population of Y values (weekly
consumption expenditures) that are spread around the (conditional) mean of those Y values.
For simplicity, we are assuming that these Y values are distributed symmetrically around their
respective (conditional) mean values. And the regression line (or curve) passes through these
(conditional) mean values.
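The idea that conditional means cluster around the population regression line can be illustrated with a small simulation. The parameters B1 = 17 and B2 = 0.6, the income levels, and the noise level are hypothetical choices for illustration, not values taken from Table 2.2:

```python
import random

random.seed(42)  # reproducible illustration

# Hypothetical population parameters: E(Y|X) = B1 + B2*X.
B1, B2 = 17.0, 0.6

# A few fixed income levels, each defining a subpopulation of families.
incomes = [80, 100, 120, 140, 160]
population = {x: [B1 + B2 * x + random.gauss(0, 5) for _ in range(10_000)]
              for x in incomes}

# The conditional mean of Y at each X should sit close to the PRL value.
for x, ys in population.items():
    cond_mean = sum(ys) / len(ys)
    print(f"X = {x}: conditional mean = {cond_mean:6.2f}, PRL = {B1 + B2 * x:6.2f}")
```

With many families per income level, each conditional mean lands very close to the corresponding point on the population regression line, just as Figure 2.5 depicts.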

2.5. The meaning of linear regression (Linear coefficients)

In the LRM it is assumed that the regression coefficients are some fixed numbers and not
random, even though we do not know their actual values. It is the objective of regression
analysis to estimate their values on the basis of sample data.
For our purpose, the term “linear” in the linear regression model refers to linearity in the
regression coefficients, the Bs, and not linearity in the Y and X variables. For instance, the Y
and X variables can be logarithmic (e.g. ln X2), reciprocal (e.g. 1/X3), or raised to a power (e.g.
X3²), where ln stands for natural logarithm, that is, logarithm to the base e.
Linearity in the B coefficients means that they are not raised to any power (e.g. B2²), are not
divided by other coefficients (e.g. B2/B3), and are not transformed (e.g. ln B4). There are occasions
where we may have to consider regression models that are not linear in the regression
coefficients.

FIGURE 2.6 Linear-in-parameter functions.

TABLE 2.3 Linear Regression Models

Note: LRM = linear regression model; NLRM = nonlinear regression model.

2.6. The nature of X variables or regressors


The regressors can be measured on any one of the four measurement scales2, although in many
applications the regressors are measured on ratio or interval scales. In the
standard, or linear, regression model (LRM), which we will discuss shortly, it is assumed that
the regressors are non-random, in the sense that their values are fixed in repeated sampling.

2
1-Ratio scale: A ratio scale variable has three properties: (1) ratio of two variables, (2) distance between two
variables, and (3) ordering of variables. On a ratio scale if, say, Y takes two values, Y1 and Y2, the ratio Y1/Y2 and
the distance (Y2 – Y1) are meaningful quantities, as are comparisons or ordering such as Y1≤ Y2 or Y1≥Y2 . Most
economic variables belong to this category. Thus we can talk about whether GDP is greater this year than the last
year, or whether the ratio of GDP this year to the GDP last year is greater than or less than one.
2-Interval scale: Interval scale variables do not satisfy the first property of ratio scale variables. For example,
the distance between two time periods, say, 2007 and 2000 (2007 – 2000) is meaningful, but not the ratio
2007/2000.
3-Ordinal scale: Variables on this scale satisfy the ordering property of the ratio scale, but not the other two
properties. For example, grading systems, such as A, B, C, or income classifications, such as low income, middle
income, and high income, are ordinal scale variables, but quantities such as grade A divided by grade B are not
meaningful.
4-Nominal scale: Variables in this category do not have any of the features of the ratio scale variables.
Variables such as gender, marital status, and religion are nominal scale variables. Such variables are often called
dummy or categorical variables. They are often “quantified” as 1 or 0, 1 indicating the presence of an attribute and
0 indicating its absence. Thus, we can “quantify” gender as male = 1 and female = 0, or vice versa.

As a result, our regression analysis is conditional, that is, conditional on the given values of the
regressors.

2.7. The nature of the stochastic error term ui:


The error term ui is a catchall for all those variables that cannot be introduced into the model
for a variety of reasons: they may not be readily quantifiable, data on them may be unavailable,
or they may reflect errors of measurement or intrinsic randomness in human behavior. Whatever
the source of the random term u, it is assumed that its average influence on the regressand is
negligible. However, we will have more to say about this shortly.
The disturbance term ui is thus a proxy for all those variables that are omitted from the model but
that collectively affect Y. The obvious question is: why not introduce these variables into the
model explicitly? Stated otherwise, why not develop a multiple regression model with as many
variables as possible? The reasons are many.
1. Vagueness of theory: The theory, if any, determining the behavior of Y may be, and often
is, incomplete. We might know for certain that weekly income X influences weekly consumption
expenditure Y, but we might be ignorant or unsure about the other variables affecting Y.
Therefore, ui may be used as a substitute for all the excluded or omitted variables from the
model.
2. Unavailability of data: Even if we know what some of the excluded variables are and
therefore consider a multiple regression rather than a simple regression, we may not have
quantitative information about these variables. It is a common experience in empirical analysis
that the data we would ideally like to have often are not available. For example, in principle we
could introduce family wealth as an explanatory variable in addition to the income variable to
explain family consumption expenditure. But unfortunately, information on family wealth
generally is not available. Therefore, we may be forced to omit the wealth variable from our
model despite its great theoretical relevance in explaining consumption expenditure.
3. Core variables versus peripheral variables: Assume in our consumption-income example
that besides income X1, the number of children per family X2, sex X3, religion X4, education
X5, and geographical region X6 also affect consumption expenditure. But it is quite possible
that the joint influence of all or some of these variables may be so small, and at best
nonsystematic or random, that as a practical matter and for cost considerations it does not pay to
introduce them into the model explicitly. One hopes that their combined effect can be treated
as a random variable ui.
4. Intrinsic randomness in human behavior: Even if we succeed in introducing all the
relevant variables into the model, there is bound to be some “intrinsic” randomness in
individual Y’s that cannot be explained no matter how hard we try. The disturbances, the u’s,
may very well reflect this intrinsic randomness.

5. Poor proxy variables: Although the classical regression model assumes that the variables Y
and X are measured accurately, in practice the data may be plagued by errors of measurement.
Consider, for example, Milton Friedman’s well-known theory of the consumption function. He
regards permanent consumption (Yp) as a function of permanent income (Xp). But since data
on these variables are not directly observable, in practice we use proxy variables, such as
current consumption (Y) and current income (X), which can be observable.
Since the observed Y and X may not equal Yp and Xp, there is the problem of errors of
measurement. The disturbance term u may in this case then also represent the errors of
measurement. As we will see in a later chapter, if there are such errors of measurement, they
can have serious implications for estimating the regression coefficients, the β’s.
6. Principle of parsimony: We would like to keep our regression model as simple as possible.
If we can explain the behavior of Y “substantially” with two or three explanatory variables and
if our theory is not strong enough to suggest what other variables might be included, why
introduce more variables? Let ui represent all other variables. Of course, we should not exclude
relevant and important variables just to keep the regression model simple.
7. Wrong functional form: Even if we have theoretically correct variables explaining a
phenomenon and even if we can obtain data on these variables, very often we do not know the
form of the functional relationship between the regressand and the regressors. Is consumption
expenditure a linear function of income or a nonlinear function? If it is the former, Yi = β1 +
β2Xi + ui is the proper functional relationship between Y and X, but if it is the latter, Yi = β1
+ β2Xi + β3Xi² + ui may be the correct functional form. In two-variable models the functional
form of the relationship can often be judged from the scattergram. But in a multiple regression
model, it is not easy to determine the appropriate functional form, for graphically we cannot
visualize scatter graphs in multiple dimensions.
For all these reasons, the stochastic disturbances ui assume an extremely critical role in
regression analysis, which we will see as we progress.

2.8. The method of ordinary least squares (OLS)

Having obtained the data, the important question is: how do we estimate the LRM given in
Eq. (2.1)? Suppose we want to estimate a wage function of a group of workers. To explain the
hourly wage rate (Y), we may have data on variables such as gender, ethnicity, union status,
education, work experience, and many others, which are the X regressors. Further, suppose
that we have a random sample of 1,000 workers. How then do we estimate Eq. (2.1)?
The method of Ordinary Least Squares (OLS) is one of the most commonly used techniques
to estimate the coefficients of a linear regression model. The goal of OLS is to find the line
(regression line) that minimizes the sum of the squared differences (errors) between the
observed values of the dependent variable (Y) and the values predicted by the regression model.
The linear regression model can be written as:
Yi = b0 + b1Xi + ui

Where:

• Yi is the dependent variable (the value we want to predict) for observation i,


• Xi is the independent variable (the predictor) for observation i,
• b0 is the intercept of the regression line (the value of Y when X=0),
• b1 is the slope of the regression line (it measures the change in Y for a one-unit change
in X),
• ui is the error term (the difference between the actual Yi and the predicted value).

FIGURE 2.7: Least-squares criterion.

Objective of OLS

The objective is to estimate the coefficients b0 and b1 such that the sum of the squared errors
(residuals) is minimized. The error (or residual) for each data point i is:

ui = Yi – (b0 + b1Xi)

The sum of squared residuals (SSR) is:

SSR = Σ ui² = Σ (Yi − b0 − b1Xi)², with the sum taken over i = 1, …, n,

where n is the number of observations.
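The least-squares criterion can be verified numerically: the sum of squared residuals for a line that tracks the data is smaller than for a competing line. The data below are hypothetical, chosen to lie near y = 2x:

```python
def ssr(x, y, b0, b1):
    """Sum of squared residuals for a candidate line y = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data lying close to the line y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

print(ssr(x, y, 0.0, 2.0))  # small: this line tracks the data closely
print(ssr(x, y, 1.0, 2.0))  # larger: shifting the intercept inflates every residual
```

OLS formalizes this comparison: among all possible (b0, b1) pairs, it picks the one with the smallest SSR.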

Steps to Derive b^0 and b^1

Minimizing the sum of squared errors: The goal is to minimize SSR with respect to b^0 and
b^1. To do this, we take the partial derivatives of SSR with respect to b^0 and b^1 and set
them equal to zero.

Partial derivatives: First, take the derivative of SSR with respect to b^0:

∂SSR/∂b^0 = −2 Σ (Yi − b^0 − b^1Xi) = 0

Second, take the derivative of SSR with respect to b^1:

∂SSR/∂b^1 = −2 Σ Xi (Yi − b^0 − b^1Xi) = 0

Solving for b^0 and b^1:

• From the first equation: Σ Yi − n b^0 − b^1 Σ Xi = 0.

• Dividing by n and rearranging:

b^0 = Ȳ − b^1 X̄

where Ȳ and X̄ are the sample means of Y and X, respectively.

• Substituting b^0 = Ȳ − b^1 X̄ into the second equation, then simplifying and rearranging, gives:

b^1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²

Final estimates:

• The slope of the regression line is b^1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)².

• The intercept is b^0 = Ȳ − b^1 X̄.

These formulas give us the estimated coefficients b^0 and b^1 for the regression line, which
minimize the sum of squared errors between the observed values and the predicted
values.
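The closed-form estimates above translate directly into code. A minimal sketch in plain Python, with no libraries:

```python
def ols_fit(x, y):
    """Estimate the intercept and slope by ordinary least squares:
    b1 = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2),  b0 = Ybar - b1*Xbar.
    """
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    return b0, b1

# Noise-free data generated from y = 3 + 2x recover the coefficients exactly.
x = [0, 1, 2, 3, 4]
y = [3 + 2 * xi for xi in x]
print(ols_fit(x, y))  # (3.0, 2.0)
```

With noisy data the recovered coefficients will only approximate the true values, but they still minimize the SSR for the sample at hand.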

Interpretation

• b^0: The intercept tells us the expected value of Y when X = 0.

• b^1: The slope tells us how much Y changes for a one-unit change in X.
- Interpreting an Estimated Regression Equation

The intercept and slope of an estimated regression can provide useful information. For
example:
Sales = 268 + 7.37 Ads

Each extra $1 million of advertising will generate $7.37 million of sales on average. The
firm would average $268 million of sales with zero advertising. However, the intercept may
not be meaningful because Ads = 0 may be outside the range of observed data.
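As a quick numeric check of this interpretation, using the estimated equation from the example above:

```python
def predicted_sales(ads):
    """Prediction from the estimated equation Sales = 268 + 7.37*Ads
    (both variables in $ millions, as in the example above)."""
    return 268 + 7.37 * ads

print(predicted_sales(10))  # about 341.7: $10M of ads adds $73.7M of predicted sales
print(predicted_sales(0))   # 268: the intercept, an extrapolation if Ads = 0
                            # lies outside the observed data
```

The slope interpretation (each $1M of ads adds $7.37M of average sales) holds anywhere on the line; the intercept interpretation is reliable only if Ads = 0 falls within the range of the data used to fit the model.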

Problem 2.2:

The data in the following table show the relationship between years of schooling
and mean wage. Use the data to model this relationship with a simple linear regression.


The chapter’s references:

• David P. Doane and Lori E. Seward (2016). Applied Statistics in Business and
Economics, 5th edition. McGraw-Hill Companies, Inc., Boston.
• Damodar N. Gujarati and Dawn C. Porter (2009). Basic Econometrics, 5th edition.
McGraw-Hill Companies, Inc., Boston.
• Damodar Gujarati (2012). Econometrics by Example. Palgrave Macmillan, London.
• Neil A. Weiss (2012). Introductory Statistics, 9th edition. Pearson Education, Inc.,
Boston, USA.
• Neil A. Weiss (2017). Introductory Statistics, 10th edition. Pearson Education, Inc.,
Boston, USA.
