
Welcome to PowerPoint slides

for

Chapter 10
Correlation and
Regression:
Explaining Association
and Causation
Marketing Research
Text and Cases
by
Rajendra Nargundkar

Application Areas: Correlation

1. Correlation and regression are generally performed together. The application of correlation analysis is to measure the degree of association between two sets of quantitative data. The correlation coefficient measures this association. It has a value ranging from 0 (no correlation) to 1 (perfect positive correlation), or -1 (perfect negative correlation).
2. For example, how are sales of product A correlated
with sales of product B? Or, how is the advertising
expenditure correlated with other promotional
expenditure? Or, are daily ice cream sales
correlated with daily maximum temperature?

3. Correlation does not necessarily mean there is a causal effect. Given any two series of numbers, there will be some correlation between them. It does not imply that one variable is causing a change in the other, or is dependent upon the other.
4. In many applications, correlation is usually followed by regression analysis.
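Point 2's ice cream example can be made concrete. A minimal Python sketch of the correlation coefficient; the temperature and sales figures below are invented for illustration, not data from the text:

```python
import math

def pearson_r(x, y):
    # Pearson correlation coefficient between two equal-length lists
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# hypothetical daily maximum temperatures (deg C) and ice cream sales (units)
temps = [28, 31, 25, 35, 30, 33]
sales = [310, 360, 270, 420, 330, 400]
r = pearson_r(temps, sales)  # close to +1: strong positive association
```

A value of r near +1 here would say hot days and high ice cream sales move together; it would not, by itself, say the heat causes the sales.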

Application Areas: Regression

1. The main objective of regression analysis is to explain the variation in one variable (called the dependent variable), based on the variation in one or more other variables (called the independent variables).
2. The application areas include explaining variations in sales of a product based on advertising expenses, or the number of sales people, or the number of sales offices, or on all of the above variables.

3. If only one independent variable is used to explain the variation in one dependent variable, the model is known as a simple regression.
4. If multiple independent variables are used to explain the variation in a dependent variable, it is called a multiple regression model.
5. Even though the form of the regression equation could be either linear or non-linear, we will limit our discussion to linear (straight line) models.

6. As seen from the preceding discussion, the major application of regression analysis in marketing is in the area of sales forecasting, based on some independent (or explanatory) variables. This does not mean that regression analysis is the only technique used in sales forecasting. There are a variety of quantitative and qualitative methods used in sales forecasting, and regression is only one of the better known (and often used) quantitative techniques.
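The simple-regression case above can be illustrated with a one-variable least-squares fit. A sketch in pure Python; the advertising and sales figures are invented for illustration:

```python
def simple_linear_fit(x, y):
    # least-squares estimates of intercept a and slope b in y = a + b*x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

# hypothetical data: sales (Rs. lakhs) vs advertising expense (Rs. lakhs)
ads = [1, 2, 3, 4, 5]
sls = [12, 15, 17, 20, 21]
a, b = simple_linear_fit(ads, sls)  # slope b: estimated sales gain per extra lakh of advertising
```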

Methods
There are basically two approaches to regression:
A hit-and-trial approach.
A pre-conceived approach.
Hit-and-Trial Approach
In the hit-and-trial approach, we collect data on a large number of independent variables and then try to fit a regression model with a stepwise regression procedure, entering one variable into the regression equation at a time.
The general regression model (linear) is of the type
Y = a + b1x1 + b2x2 + ... + bnxn

where Y is the dependent variable and x1, x2, ..., xn are the independent variables expected to be related to Y and expected to explain or predict Y. b1, b2, ..., bn are the coefficients of the respective independent variables, which will be determined from the input data.
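The b coefficients of the general model above are estimated by ordinary least squares. A minimal sketch using numpy's `lstsq` rather than any particular package from the text; the data below are invented, not the chapter's Regdata 1:

```python
import numpy as np

# hypothetical data: y depends on two x variables plus a little noise
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 2))
y = 4.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=30)

# append a column of ones so the intercept a is estimated alongside b1, b2
Xa = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(Xa, y, rcond=None)
a, b1, b2 = coef  # estimates should land close to 4.0, 2.0 and -1.5
```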
Pre-conceived Approach
The pre-conceived approach assumes the researcher
knows reasonably well which variables explain y
and the model is pre-conceived, say, with 3
independent variables x1, x2, x3. Therefore, not too
much experimentation is done. The main objective is
to find out if the pre-conceived model is good or not.
The equation is of the same form as earlier.

Data
1. Input data on y and each of the x variables is
required to do a regression analysis. This data
is input into a computer package to perform the
regression analysis.
2. The output consists of the b coefficients for
all the independent variables in the model. The
output also gives you the results of a t test for
the significance of each variable in the model,
and the results of the F test for the model on
the whole.

3. Assuming the model is statistically significant at the desired confidence level (usually 90 or 95% for typical applications in the marketing area), the coefficient of determination, or R2, of the model is an important part of the output. The R2 value is the percentage (or proportion) of the total variance in y explained by all the independent variables in the regression equation.

Recommended usage
1. It is recommended by the author that for exploratory research, the hit-and-trial approach may be used. But for serious decision-making, there has to be a priori knowledge of the variables which are likely to affect y, and only such variables should be used in the regression analysis.
2. It is also recommended that unless the model itself is significant at the desired confidence level (as evidenced by the F test results printed out for the model), the R2 value should not be interpreted.

3. The variables used (both independent and dependent) are assumed to be either interval scaled or ratio scaled.
Nominally scaled variables can also be used as
independent variables in a regression model, with dummy
variable coding. Please refer to either Marketing
Research: Methodological Foundations by Churchill or
Research for Marketing Decisions by Green, Tull &
Albaum for further details on the use of dummy variables
in regression analysis. Our worked example confines itself
to metric interval scaled variables.
4. If the dependent variable happens to be a nominally
scaled one, discriminant analysis should be the technique
used instead of regression.
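The dummy-variable coding mentioned in point 3 can be sketched as follows; the 'region' variable and its categories are invented for illustration:

```python
# hypothetical nominal variable with 3 categories -> 2 dummy columns,
# taking "North" as the baseline category coded (0, 0)
regions = ["North", "South", "West", "North", "West"]
levels = ["South", "West"]  # one dummy column per non-baseline category
dummies = [[1 if r == lvl else 0 for lvl in levels] for r in regions]
# rows: North=(0,0), South=(1,0), West=(0,1), North=(0,0), West=(0,1)
```

With k categories we create k-1 dummy columns; including all k would make the columns sum to the intercept column and break the estimation.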

Worked Example: Problem


1. A manufacturer and marketer of electric motors
would like to build a regression model consisting
of five or six independent variables, to predict
sales. Past data has been collected for 15 sales
territories, on Sales and six different independent
variables. Build a regression model and
recommend whether or not it should be used by
the company.
2. We will assume that data are for a particular year,
in different sales territories in which the company
operates, and the variables on which data are
collected are as follows:

Dependent Variable
Y = sales in Rs. lakhs in the territory
Independent Variables
X1 = market potential in the territory (in Rs. lakhs)
X2 = no. of dealers of the company in the territory
X3 = no. of salespeople in the territory
X4 = index of competitor activity in the territory on a 5-point scale (1 = low, 5 = high level of activity by competitors)
X5 = no. of service people in the territory
X6 = no. of existing customers in the territory

The data set, consisting of 15 observations (from 15 different sales territories), is given in Table 10.1. The dataset is referred to as Regdata 1. Its seven columns are:
1 SALES, 2 POTENTIAL, 3 DEALERS, 4 PEOPLE, 5 COMPET, 6 SERVICE, 7 CUSTOM
[Table 10.1: the 15 x 7 grid of data values was garbled in extraction and is omitted here.]

Correlation

Fig. 2 : Correlations Table

STAT. MULTIPLE REGRESS.   Correlations (regdata1.sta)

Variable    POTENTL   DEALERS   PEOPLE   COMPET   SERVICE   CUSTOM   SALES
POTENTL        1.00       .84      .88      .14       .61      .83     .94
DEALERS         .84      1.00      .85     -.08       .68      .86     .91
PEOPLE          .88       .85     1.00     -.04       .79      .85     .95
COMPET          .14      -.08     -.04     1.00      -.18     -.01    -.05
SERVICE         .61       .68      .79     -.18      1.00      .82     .73
CUSTOM          .83       .86      .85     -.01       .82     1.00     .88
SALES           .94       .91      .95     -.05       .73      .88    1.00

First, let us look at the correlations of all the variables with each other. The correlation table (output from the computer for the Pearson Correlation procedure) is shown in Fig. 2. The values in the correlation table are standardised, and range from -1 to +1.

1. Looking at the last column of the table, we find that, except for COMPET (index of competitor activity), all other variables are highly correlated with Sales (correlations ranging from .73 to .95).
2. This means we may have chosen a fairly good set
of independent variables (No. of Dealers, Sales
Potential, No. of Customers, No. of Service People,
No. of Sales People) to try and correlate with Sales.

3. Only the Index of Competitor Activity does not appear to be strongly correlated with Sales (its correlation coefficient is -.05). But we must remember that the correlations in Fig. 2 are one-to-one correlations of each variable with another. So we may still want to do a multiple regression with an independent variable showing low correlation with the dependent variable, because in the presence of other variables, this independent variable may become a good predictor of the dependent variable.

4. The other point to be noted in the correlation table is whether independent variables are highly correlated with each other. If they are, as in Fig. 2, this may indicate that they are not independent of each other, and we may be able to use only 1 or 2 of them to predict the dependent variable.
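The screening described in point 4 can be automated: compute the correlation matrix of the predictors and flag highly correlated pairs. A sketch with invented predictors (x2 is deliberately built as a near-copy of x1); the 0.8 cut-off is an illustrative convention, not from the text:

```python
import numpy as np

# hypothetical predictors: x1 and x2 nearly collinear, x3 independent
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)   # almost a copy of x1
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

r = np.corrcoef(X, rowvar=False)             # 3x3 correlation matrix
pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)
         if abs(r[i, j]) > 0.8]              # flag highly correlated pairs
```

Here `pairs` should contain only (0, 1), suggesting one of x1 or x2 could be dropped from the model.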

5. As we will see later, our regression ends up eliminating some of the independent variables, because not all six of them are required. Some of them, being correlated with other variables, do not add any value to the regression model.
6. We now move on to the regression analysis of the same data.

Regression
We will first run the regression model of the following form, by entering all 6 'x' variables in the model:
Y = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6   ...Equation 1
and determine the values of a, b1, b2, b3, b4, b5 and b6.

Regression Output:
The results (output) of this regression model are shown in table form in Fig. 4.
Column 4 of the table, titled B, lists all the coefficients for the model. According to this,
a (intercept) = -3.17298
b1 = .22685
b2 = .81938
b3 = 1.09104
b4 = -1.89270
b5 = -0.54925
b6 = 0.06594

These values of a, b1, b2, ..., b6 can be substituted in Equation 1. Rounding off all coefficients to 2 decimals, we can write the equation as
Sales = -3.17 + .23 (potential) + .82 (dealers) + 1.09 (salespeople) - 1.89 (competitor activity) - 0.55 (service people) + 0.07 (existing customers)
Before we use this equation, however, we need to look at the statistical significance of the model and the R2 value. These are available from Fig. 3, the analysis of variance table, and Fig. 4.
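Using the unrounded coefficients listed above, the full six-variable equation can be applied to a hypothetical territory. The x values below are invented for illustration, not taken from Regdata 1:

```python
# unrounded coefficients of the full 6-variable model, as listed above
a = -3.17298
b = [0.22685, 0.81938, 1.09104, -1.89270, -0.54925, 0.06594]

def predict_sales(x):
    # x = (POTENTIAL, DEALERS, PEOPLE, COMPET, SERVICE, CUSTOM)
    return a + sum(bi * xi for bi, xi in zip(b, x))

# hypothetical territory: potential 50 lakhs, 10 dealers, 6 salespeople,
# competitor index 3, 5 service people, 30 existing customers
est = predict_sales((50, 10, 6, 3, 5, 30))  # estimated sales in Rs. lakhs
```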

Fig. 3 : The ANOVA Table

STAT. MULTIPLE REGRESS.   Analysis of Variance; Depend. Var: SALES (regdata1.sta)

Effect      Sums of Squares   df   Mean Squares   F          p-level
Regress.    6609.484           6   1101.581       57.13269   .000004
Residual     154.249           8     19.281
Total       6763.733

From Fig. 3, the analysis of variance table, the last column indicates the p-level to be 0.000004. This indicates that the model is statistically significant at a confidence level of (1 - 0.000004) * 100, i.e. 99.9996 percent.
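The R2 described earlier can be recovered directly from the ANOVA table in Fig. 3, as the regression sum of squares divided by the total sum of squares:

```python
ss_regression = 6609.484  # "Regress." sum of squares, Fig. 3
ss_total = 6763.733       # "Total" sum of squares, Fig. 3

# proportion of the total variance in SALES explained by the model
r_squared = ss_regression / ss_total  # about 0.977
```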

However, ignoring the significance of individual variables for now, we shall use the model as it is, and try to apply it for decision making.
The real use of the regression model would be to try and predict sales in Rs. lakhs, given all the independent variable values.
The equation we have obtained means, in effect, that sales will increase in a territory if the market potential increases, if the number of dealers increases, if the number of salespeople increases, if the level of competitor activity decreases, if the number of service people decreases, and if the number of existing customers increases.

The estimated increase in sales for every unit increase or decrease in these variables is given by the coefficients of the respective variables. For instance, if the number of sales people is increased by 1, sales in Rs. lakhs are estimated to increase by 1.09, if all other variables are unchanged. Similarly, if 1 more dealer is added, sales are expected to increase by 0.82 lakh, if other variables are held constant.

There is one coefficient, that of the SERVICE variable, which does not make much intuitive sense. If we increase the number of service people, sales are estimated to decrease, according to the -0.55 coefficient of the variable "No. of Service People" (SERVICE).
But if we look at the individual variable t tests, we find that the coefficient of the variable SERVICE is statistically not significant (p-level 0.735204 from Fig. 4). Therefore, the coefficient for SERVICE is not to be used in interpreting the regression, as it may lead to wrong conclusions.

Strictly speaking, only two variables, potential (POTENTL) and no. of sales people (PEOPLE), are statistically significant at the 90 percent confidence level, since their p-levels are less than 0.10. One should therefore only look at the relationship of sales with one of these variables, or with both.

Making Predictions/Sales Forecasts

Given the levels of X1, X2, X3, X4, X5 and X6 for a particular territory, we can use the regression model for prediction of sales.
Before we do that, we have the option of redoing the regression model so that the variables which are not statistically significant are minimized or eliminated. We can follow either the Forward Stepwise Regression method or the Backward Stepwise Regression method to try and eliminate the 'insignificant' variables from the full regression model containing all six independent variables.

Forward Stepwise Regression

For example, we could ask the computer for a Forward Stepwise Regression model, in which case the algorithm adds one independent variable at a time, starting with the one which explains most of the variation in sales (y), adding one more x variable to it, rechecking the model to see that both variables form a good model, then adding a third variable if it still adds to the explanation of y, and so on. Fig. 5 shows the result of running a Forward Stepwise Regression, which ends up with only 4 out of 6 independent variables remaining in the regression model.
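The forward procedure just described can be sketched in a few lines of Python. This simplified version enters variables on R2 gain rather than the F-to-enter criterion a package like STATISTICA uses, and it runs on invented data, not Regdata 1:

```python
import numpy as np

def fit_r2(X, y):
    # ordinary least squares fit of y on X (plus intercept); returns R-squared
    Xa = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    resid = y - Xa @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def forward_stepwise(X, y, min_gain=0.01):
    # enter, one at a time, the column that raises R-squared most,
    # stopping when no remaining column adds at least min_gain
    remaining = list(range(X.shape[1]))
    chosen, best_r2 = [], 0.0
    while remaining:
        r2, j = max((fit_r2(X[:, chosen + [j]], y), j) for j in remaining)
        if r2 - best_r2 < min_gain:
            break
        chosen.append(j)
        remaining.remove(j)
        best_r2 = r2
    return chosen, best_r2

# invented data: y really depends only on columns 1 and 3
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 3.0 * X[:, 1] + 0.5 * X[:, 3] + rng.normal(scale=0.1, size=50)
chosen, r2 = forward_stepwise(X, y)  # should select columns 1 then 3
```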

STATISTICA: Multiple Regression

Data file: REGDATA1.STA (15 cases with 7 variables)
MULTIPLE REGRESSION RESULTS:
Forward stepwise regression, no. of steps: 4
Dependent variable: SALES
Multiple R: .988317862
Multiple R-Square: .976772197
Adjusted R-Square: .967481076
F(4,10) = 105.1296   p < .000000
Standard Error of Estimate: 3.963668333
Intercept: -3.741938802   Std. Error: 4.847682   t(10) = -.7719   p < .458025
No other F to enter exceeds specified limit

STAT.       Regression Summary for Dependent Variable: SALES
MULTIPLE    R = .98831786   R2 = .97677220   Adjusted R2 = .96748108
REGRESS.    F(4,10) = 105.13   p < .00000   Std. Error of estimate: 3.9637
N = 15

             BETA      St.Err.     B          St.Err.     t(10)      p-level
                       of BETA                of B
Intercept                          -3.74194    4.847683    -.77190    .458025
PEOPLE       .390134   .115138     1.02822     .303453    3.38841    .006904
POTENTL      .462686   .117988      .23905     .060959    3.92147    .002860
DEALERS      .180700   .102687      .90109     .512065    1.75971    .108955
COMPET      -.081195   .053434    -1.81074    1.191624   -1.51955    .159589

The 4 variables in the model are PEOPLE (no. of salespeople), POTENTL (sales potential), DEALERS (no. of dealers) and COMPET (competitor activity index). Again we notice that the only two variables significant at the 90% confidence level (those with p-level < .10) are PEOPLE and POTENTL (p-levels of .006904 and .002860). But DEALERS is now at a p-level of .108955, very close to significance at the 90% confidence level.

This could be the equation, instead of the one with 6 independent variables, that we could use. We would be economising on two variables, which are not required, if we decide to use the model from Table 10.5 instead of that from Table 10.4. The F-test for the model in Table 10.5 also indicates it is highly significant (from the top of Table 10.5, F = 105.1296, p < .000000). The R2 value for the model is 0.9767, which is very close to that of the 6 independent variable model of Table 10.4. If we decide to use the model from Table 10.5, it would be written as follows:
Sales = -3.74 + 1.03 (PEOPLE) + .24 (POTENTL) + .90 (DEALERS) - 1.81 (COMPET)   ...Equation 2

STATISTICA: Multiple Regression

Data file: REGDATA1.STA (15 cases with 7 variables)
MULTIPLE REGRESSION RESULTS:
Backward stepwise regression, no. of steps: 4
Dependent variable: SALES
Multiple R: .979756241
Multiple R-Square: .959922293
Adjusted R-Square: .953242675
Number of cases: 15
F(2,12) = 143.71   p < .00000
Standard Error of Estimate: 4.752849362
Intercept: -10.614641069   Std. Error: 2.659532   t(12) = -3.992   p < .001788
No other F to remove is less than specified limit

STAT.       Regression Summary for Dependent Variable: SALES
MULTIPLE    R = .97975624   R2 = .95992229   Adjusted R2 = .95324267
REGRESS.    F(2,12) = 143.71   p < .00000   Std. Error of estimate: 4.7528
N = 15

             BETA      St.Err.     B           St.Err.     t(12)      p-level
                       of BETA                 of B
Intercept                          -10.6164    2.659532   -3.99183    .001788
POTENTL      .470825   .120127       .2433      .062065    3.91939    .002037
PEOPLE       .540454   .120127      1.4244      .316602    4.49902    .000728

We could, as another alternative, perform a Backward Stepwise Regression on the same set of 6 independent variables. This procedure starts with all 6 variables in the model, and gradually eliminates, one after another, those which do not explain much of the variation in y, until it ends with an optimal mix of independent variables according to pre-set criteria for the exit of variables. This results in a model with only 2 independent variables, POTENTL and PEOPLE, remaining in the model. This model is shown in Fig. 6.

The R2 for the model has dropped only slightly, to 0.9599; the F-test for the model is highly significant; and both the independent variables, POTENTL and PEOPLE, are significant at the 90% confidence level (p-levels of .002037 and .000728 from the last column of Fig. 6).
If we were to decide to use this model for prediction, we only require data to be collected on the number of sales people (PEOPLE) and the sales potential (POTENTL) in a given territory. We could form the equation using the intercept and coefficients from column B in Fig. 6, as follows:

Sales = -10.6164 + .2433 (POTENTL) + 1.4244 (PEOPLE)   ...Equation 3
Thus, if potential in a territory were to be Rs. 50 lakhs, and the territory had 6 salespeople, then expected sales, using the above equation, would be
= -10.6164 + .2433 (50) + 1.4244 (6)
= Rs. 10.095 lakhs.
Similarly, we could use this model to make predictions regarding sales in any territory for which the potential and the number of salespeople were known.
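Equation 3 and the worked example above can be checked mechanically with a small helper (the function name is ours, not the text's):

```python
def predict_sales_eq3(potential, people):
    # Equation 3: the two-variable model retained by backward stepwise regression
    return -10.6164 + 0.2433 * potential + 1.4244 * people

# the worked example: potential Rs. 50 lakhs, 6 salespeople
est = predict_sales_eq3(50, 6)  # about Rs. 10.095 lakhs
```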

Additional comments
1. As we can see from the example discussed, regression analysis is a very simple (particularly on a computer) and useful technique to predict one metric dependent variable based on a set of metric independent variables. Its use, however, gets more complex if, for instance, the independent variables are nominally scaled into two (dichotomous) or more (polytomous) categories.

2. It is also a good idea to define the range of all independent variables used for constructing the regression model. For the predictions to be effective, only those X values which fall within or close to this range (the range used earlier, at the model construction stage) must be used for prediction of Y values.
3. Finally, we have assumed that a linear model is the only option available to us. That is not the only choice. A regression model could be of any non-linear variety, and some of these could be more suitable for particular cases.
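Point 2's advice can be encoded as a simple guard before predicting; the 10% slack allowance below is an invented convention for "close to" the fitted range:

```python
def within_fit_range(x_new, x_min, x_max, slack=0.10):
    # True if x_new lies inside the training range of this predictor,
    # extended by a small slack fraction of the range on either side
    span = x_max - x_min
    return (x_min - slack * span) <= x_new <= (x_max + slack * span)

# hypothetical case: suppose a predictor ranged from 10 to 81 in the fitting
# data; predicting for a territory where it equals 150 would be extrapolation
ok = within_fit_range(150, 10, 81)  # False -> treat the prediction with caution
```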

4. Generally, in the case of a simple regression model, a look at the plot of Y against X tells us whether the linear (straight line) approach is best or not. But in a multiple regression, this visual plot may not indicate the best kind of model, as there are many independent variables, and a plot in 2 dimensions is not possible.

5. In this particular example, we have not used any macroeconomic variables, but in industrial marketing, we may use such industry or macroeconomic variables in a regression model. For example, to forecast sales of steel, we may use as independent variables the growth rate of a country's GDP, new construction starts, and the growth rate of the automobile industry.
