Notes - Part III

INTERVENTION ANALYSIS

In this section we will introduce the topic of intervention analysis as it applies to regression models. Besides introducing intervention analysis, our other objectives are to review the three-phase model-building process and other regression concepts previously discussed. The format is a brief introduction to a case scenario, followed by an edited discussion that took place between an instructor and his class when this case was presented. The reader is encouraged to work through the analysis on the computer while reading the narrative. (The data reside in the file FRED.SF.)

As you work through the analysis, keep in mind that the sequence of steps taken by one analyst may differ from that of another, yet both end up with the same result. What is important is the thought process that is undertaken.

Scenario: You have been provided with the monthly sales (FRED.SALE) and
advertising (FRED.ADVERT) for Fred’s Deli, with the intention that you will
construct a regression model which explains and forecasts sales. The data set
starts with December 1992.

Instructor: What is the first step you need to do in your analysis?


Students: Plot the data.

Instructor: Why?
Students: To see if there is any pattern or information that helps specify the model.

Instructor: What data should be plotted?


Students: Let’s first plot the series of sales.

Instructor: Here is the plot of the sales series. What do you see?

[Time series plot of monthly sales]
Students: The series seems fairly stationary. There is a peak somewhere in 1997; it is a little higher and might be part of a pattern.

Instructor: What kind of pattern? How do you determine it?


Students: There may be a seasonality pattern.

Instructor: How would you see if there is a seasonality pattern?


Students: Try the autocorrelation function and see if there is any value that would indicate a
seasonal pattern.

Instructor: OK. Let’s go ahead and run the autocorrelation function for sales. How many time
periods would you like to lag it for?
Students: Twenty-four.

Instructor: Why?
Students: Twenty-four would be two years' worth of monthly values.

Instructor: OK, let’s take a look at the autocorrelation function of sales for 24 lags.

[Autocorrelation function of sales for 24 lags]
Instructor: What do you see?
Students: There appears to be a significant value at lag 3, but besides that there may also be
some seasonality at period 12. However, it’s hard to pick it up because the values
are not significant. So, in this case we don’t see a lot of information about sales as a
function of itself.
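(Aside: readers following along outside Statgraphics can reproduce the autocorrelation check in a few lines of Python. This is only a sketch; the file name fred.csv and the column name sale are stand-ins for however the FRED.SF data has been exported.)

    # Sketch: autocorrelation function of sales for 24 lags.
    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf

    fred = pd.read_csv("fred.csv")       # hypothetical export of FRED.SF
    plot_acf(fred["sale"], lags=24)      # 24 lags = two years of monthly data
    plt.show()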

Instructor: What do you do now?


Students: See if advertising fits sales.

Instructor: What is the model that you will estimate or specify?


Students: Sales_t = β_0 + β_1 Advert_t + ε_t.

Instructor: What is the time relationship between sales and advertising?


Students: They are the same time period.

Instructor: OK, so what you are hypothesizing or specifying is that sales in the current time
period is a function of advertising in the current time period, plus the error term,
correct?
Students: Yes.

Instructor: Let’s go ahead and estimate the model. To do so, you select model, regression, and
let’s select a simple regression for right now. The results appear on the following
page.

[Estimation results: simple regression of sales on current advertising]
Instructor: What do you see from the result? What are the diagnostic checks you would come
up with?
Students: Advertising is not significant.

Instructor: Why?
Students: The p-value is 0.6335; hence, advertising is a non-significant variable and should be
thrown out. Also, the R-squared is 0.000, which indicates advertising is not
explaining sales.
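(Aside: a minimal sketch of the same simple regression in Python, under the same file- and column-name assumptions as before:)

    # Sketch: regress current sales on current advertising.
    import pandas as pd
    import statsmodels.api as sm

    fred = pd.read_csv("fred.csv")           # hypothetical export of FRED.SF
    X = sm.add_constant(fred["advert"])      # intercept term b0 plus advertising
    print(sm.OLS(fred["sale"], X).fit().summary())   # check the p-value and R-squared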

Instructor: OK, what do we do now? Sales carries little information as a function of its own past, and advertising in the current time period doesn't explain it either. What do you do?
Students: See whether past values of advertising affect sales.

Instructor: How would you do this?


Students: Look at the cross-correlation function.

Instructor: OK. Let’s look at the cross-correlation between the sales and advertising. Let’s put
in advertising as the input, sales as the output, and run it for 12 lags - one year on

87
either side. Here is the result of doing the cross-correlation function, what do you
see?

Students: There is a large “spike” at lag 2 on the positive side. What it means is that there is a
strong correlation (relationship) between advertising two time periods ago and sales
in the current time period.
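(Aside: the cross-correlations can also be computed directly; a sketch under the same file-name assumptions as before:)

    # Sketch: cross-correlation between advertising (input) and sales (output),
    # i.e. corr(advert[t-k], sale[t]) for k = 0..12.
    import pandas as pd

    fred = pd.read_csv("fred.csv")           # hypothetical export of FRED.SF
    for k in range(13):
        r = fred["sale"].corr(fred["advert"].shift(k))
        print(f"lag {k:2d}: r = {r:+.3f}")   # expect a spike at lag 2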

Instructor: OK, then, what do you do now?


Students: Run a regression model where sales is the dependent variable and advertising
lagged two (2) time periods will be the explanatory variable.

Instructor: OK, this is the model we are now going to specify:

Sales_t = β_0 + β_1 Advert_{t-2} + ε_t

What we are saying here is that sales is a function of advertising two time periods ago. So, at this point this is the model that you have specified. Going to the three-phase model-building process, let's now estimate the model, and then we will diagnostically check it. The estimation results for this model are as follows:

[Estimation results: regression of sales on advertising lagged two periods]
Instructor: Looking at the estimation results, we are now ready to go ahead and do the
diagnostic checking. How would you analyze the results at this point from the
estimation phase?
Students: Advertising lagged two periods is significant, since the p-value is 0.0000. So it is extremely significant, and the R-squared is now 0.3776.

Instructor: Are you satisfied at this point?


Students: No.

Instructor: What would you do next?


Students: Take a look at some diagnostics that are available.

Instructor: Such as what?


Students: We can plot the residuals, look at the influence measures, and a couple other things.

Instructor: OK. Let’s go ahead and first of all plot the residuals. What do residuals represent?
Remember that the residuals represent the difference between the actual values and
the fitted values. Here is the plot of the residuals against time (the index):

[Residual plot: residuals (×10,000) versus row number]

Instructor: What do you see?


Students: There is a clear pattern of points above the line, which indicates some kind of
information there.

Instructor: What kind of information?


Students: It depends on what those values are.

Instructor: Let us take a look at a feature in Statgraphics. When one maximizes the pane, which
displays the residual graph versus time (row), one is then able to click on any point
(square) and find out which observation it is by looking above the graph in the "row"
box.

We are now able to identify each of the points by lining up the plus mark on each
point and clicking. If you do that for the first point, you will notice that X is 13, the
second point, X is 25 and the third point, X is 37. The fourth point that is out by
itself is 49.

As you see what is going on there, you have a pattern of every 12 months. Recall that we started off in December; hence each of the clicked points is a December. Likewise, if you look at the cluster in the middle, you will notice that those points correspond to observations 56, 57, 58, 59, 60, and 61. Obviously, something is going on at observations 56 through 61.

So, if you summarize the residuals, you have some seasonality going on at months 13, 25, ..., i.e. every December has a large value, plus something extra happened starting with the 56th value and continuing through the 61st value. We could also obtain very similar information by taking a look at the "Unusual Residuals" and "Influential Points" options, or by flagging large residuals directly, as sketched below.
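(A sketch of that direct flagging in Python; the file and column names are assumptions as before, and the two-standard-deviation cutoff is only a common convention:)

    # Sketch: fit the lag-2 model, then list rows with unusually large residuals.
    import pandas as pd
    import statsmodels.api as sm

    fred = pd.read_csv("fred.csv")                  # hypothetical export of FRED.SF
    fred["advert_lag2"] = fred["advert"].shift(2)
    data = fred.dropna()

    model = sm.OLS(data["sale"], sm.add_constant(data["advert_lag2"])).fit()
    resid = model.resid
    print(resid[resid.abs() > 2 * resid.std()])     # should flag the Decembers and rows 56-61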

Instructor: To summarize from our residuals and influential values, one can see that what we have left out of the model at this time are really two factors: one, the seasonality factor for each December, and two, an intervention that occurred in the middle part of 1997, starting with July and lasting through the end of 1997. This may be a case where a particular salesperson came on board, or some other kind of policy/event may have caused sales to increase substantially over the previous level. So, what do you do at this point? We need to go back and incorporate the seasonality and the intervention.
Students: The seasonality can be accounted for by creating a new variable and assigning "1" for each December and "0" elsewhere.

Instructor: OK, what about the intervention variable?


Students: Create another variable by assigning a "1" to the months 56, 57, 58, 59, 60, and 61. Or, equivalently, identify those observations by date, i.e., "1" for the values from July 1997 to December 1997 inclusive, and zero elsewhere.

Instructor: Very good. So, what we are going to do is to run a regression with these two additional variables. Those variables are already included in the file. One variable is called FRED.INTERVENT and, if you look at it, it has "1" for the values from 56 to 61 and "0" elsewhere. The other variable, FRED.DEC, has values of "1" only for the December observations and "0" elsewhere. So, what is the model we are going to estimate?
Students: Sales_t = β_0 + β_1 Advert_{t-2} + β_2 Dec_t + β_3 Intervent_t + ε_t.

Instructor: What does this model say in words at this point?


Students: Sales in the current time period is a function of advertising two time periods ago, a dummy variable for December, and an intervention variable for the event that occurred in 1997.
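(Aside: a sketch of building those two dummies and fitting the revised model in Python; the date column and file name are assumptions:)

    # Sketch: construct the December and intervention dummies, fit the revised model.
    import pandas as pd
    import statsmodels.api as sm

    fred = pd.read_csv("fred.csv", parse_dates=["date"])    # hypothetical export of FRED.SF
    fred["advert_lag2"] = fred["advert"].shift(2)
    fred["dec"] = (fred["date"].dt.month == 12).astype(int)           # 1 for each December
    fred["intervent"] = fred["date"].between("1997-07-01", "1997-12-31").astype(int)

    data = fred.dropna()
    X = sm.add_constant(data[["advert_lag2", "dec", "intervent"]])
    print(sm.OLS(data["sale"], X).fit().summary())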

Instructor: Good. Let’s summarize what we have done.


You started off with a model that has advertising two time periods ago as the explanatory variable, but we saw that some information was not included in that model; that is, we were missing some information contained in the data. Then we looked at the residuals and the influence values, and we came up with two new variables that incorporated that missing information. Having re-specified the model, we are now going to re-estimate and diagnostically check the revised model. The estimation results for the revised model are shown below:

[Estimation results: revised model with December and intervention dummies]
Instructor: Given these estimation results, how would you analyze (i.e. diagnostically check) the
revised model?
Students: All the variables are significant, since the p-values are all 0.0000 (truncated). In addition, the R-squared value has gone up tremendously to 0.969 (roughly 97 percent). In other words, R² has jumped from roughly 38 percent to approximately 97 percent, and the standard error has gone down substantially, from about 17000 to about 3800. As a result, the model looks much better at this time.

Instructor: Is there anything else you would do?


Students: Yes, we will go back to diagnostic check again to see if this revised model still has
any information that has not been included, and hence can be improved.

Instructor: What is some diagnostic checking you would try?


Students: Look at the residuals again, and plot them against time.

Instructor: OK, here is the plot of the residuals against time. Do you see any information?

[Residual plot for the revised model: residuals (×1,000) versus row number]

Students: No, the pattern looks pretty much random. We cannot detect any remaining information left out of the model from the structure of the series.

Instructor: OK, anything else you would look at?


Students: Yes, let us look at the influence measures.

Instructor: OK, when you look at the "Unusual Residuals" and "Influential Points" options, what do you notice about these points?
Students: They have already been accounted for with the December and Intervention
variables.

Instructor: Would you do anything differently to the model at this point?


Students: We don’t think so.

Instructor: Unless you are able to identify those points with particular events that occurred, we do not just keep adding dummy variables to get rid of the values that have been flagged as possible outliers. As a result, let us assume that we have pretty much cleaned things up, and at this point you can be satisfied with the model that you have obtained.

Summary

The objectives in this section, once again, are to introduce the concept of intervention analysis and to review the three-phase model-building process. To do this, we looked at a situation where we have sales and advertising; in particular, we have monthly values starting in December 1992.

The three-phase model-building process consists of specifying, estimating, and diagnostically checking a model. In our analysis the first step was to decide what would be an appropriate model to specify, that is, what variable or variables helped explain the variation in sales. As we saw, advertising in the current time period did not affect sales. When we used the cross-correlation function, however, we were able to see that advertising two time periods prior had an effect on sales. Thus, we ran the simple linear regression of sales against advertising two time periods prior. From this regression, we looked at the diagnostic checks and noticed that a fair amount of information had been left out of the model. In particular, we had left out two factors. The first one was the seasonality factor that occurred each December, and the second one was an intervention that happened in the last half of 1997, from July to December. To incorporate these two factors into the model, we set up two additional variables. The revised model increased R-squared substantially and reduced the mean squared error. Thus, the revised model was our final model.

SAMPLING

As an efficient method to obtain information about a population, one frequently needs to sample from the population. There are many different probabilistic sampling methods. In addition to random sampling, two other frequently used techniques are stratified sampling and systematic sampling.[1] The type of sampling method appropriate for a given situation depends on the attributes of the population being sampled, the sampling cost, and the desired level of precision.

[1] There are many other techniques available, but we will restrict our discussion to these.

Random Sampling

A simple random sample is a sample in which every group of size n has an equal chance of being selected. In order to conduct a random sample, one needs the frame (a listing of all elements) and then, either by "drawing from a hat" or by using a random number table[2], one obtains the elements selected for the sample.

[2] Many software packages, such as Statgraphics, have random number generators.

Stratified Sample

A stratified sample is appropriate to use when the population of concern has subpopulations (strata) that are homogeneous within and heterogeneous between each other with regard to the parameter of concern. The reason it may be appropriate to use stratified sampling, as opposed to simple random sampling of the whole population, is that each subgroup will have a relatively smaller variance than the overall population. Hence, when we combine the results from the different subgroups, the aggregated variance (standard error) will be smaller than that of a same-size sample from the entire population using simple random sampling.

For example, assume we desire to estimate the average number of hours business students study per week. One could use a simple random sample. However, if one were to stratify based upon concentrations[3], then take a random sample from each concentration, the aggregated result would probably be more precise (smaller confidence interval) than the one from a random sample of the same size. The greater precision would come from the aggregation of strata (subpopulations) whose individual variances are less than the variance of the entire population.
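(A sketch of a stratified sample in Python under this scenario; the file students.csv and its columns "concentration" and "hours" are assumptions for illustration:)

    # Sketch: stratified random sample, n = 10 students per concentration,
    # with stratum means re-weighted by the strata's population shares.
    import pandas as pd

    students = pd.read_csv("students.csv")
    sample = (students.groupby("concentration", group_keys=False)
                      .apply(lambda g: g.sample(n=10, random_state=1)))

    weights = students["concentration"].value_counts(normalize=True)  # stratum shares
    strat_means = sample.groupby("concentration")["hours"].mean()
    estimate = (strat_means * weights).sum()
    print(f"stratified estimate of mean study hours: {estimate:.1f}")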

Systematic Sample

Systematic sampling is a widely used technique when there is no pattern to the way in which the data set is organized. The lack of pattern is important, since a systematic sample involves selecting every nth observation; for example, one may select every 4th observation. Clearly, the technique could provide a biased estimate if there is a periodicity (seasonality) to the data and the sampling interval is a multiple of the period.
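(A sketch of the selection mechanics; the frame here is just a numbered list for illustration:)

    # Sketch: systematic sample - every k-th element after a random start.
    import random

    def systematic_sample(frame, k):
        # pick a random starting position in [0, k), then take every k-th element
        start = random.randrange(k)
        return frame[start::k]

    population = list(range(1, 201))            # a frame of 200 numbered elements
    print(systematic_sample(population, k=4))   # roughly a 25% sample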

Comparison Of Survey Sampling Designs

Design          How to Select                          Strengths/Weaknesses
--------------  -------------------------------------  ---------------------------------------------
Simple Random   Assign numbers to elements using a     Basic, simple, often costly. Must assign a
                random numbers table.                  number to each element in the target
                                                       population.

Stratified      Divide population into groups that     With proper strata, can produce very accurate
                are similar within and different       estimates. Less costly than simple random
                between on the variable of interest.   sampling. Must stratify the target population
                Use random numbers to select a         correctly.
                sample from each stratum.

Systematic      Select every kth element from a        Produces very accurate estimates when
                list after a random start.             elements in the population exhibit no order.
                                                       Used when simple random or stratified
                                                       sampling is not practical (e.g. population
                                                       size not known). Simplifies the selection
                                                       process. Do not use with periodic populations.

[3] Other discriminating variables could be used, such as age, premajor vs. upper division, etc.

CROSSTABULATIONS

In this section we will be focusing our attention on a technique frequently used in analyzing survey results: cross tabulation. The purpose of cross tabulation is to determine whether two variables are independent or whether there is a relationship between them.

To illustrate cross tabulation, assume that a survey has been conducted in which the following questions were asked:

-- What is your age?
   ____ less than 25 years   ____ 25-40   ____ more than 40

-- What paper do you subscribe to?
   ____ Chronicle   ____ BEE   ____ Times

-- What is your annual household gross income?
   ____ < $15,000   ____ $15,000 - $40,000   ____ > $40,000

Letting the first response for each question be recorded as a 1, the second as a 2, and the third as a 3[4], the file CLTRES.SF contains 200 responses.

We will first consider the hypothesis test generally referred to as a test of dependence:

H0: AGE and PAPER are independent
H1: AGE and PAPER are dependent.

To perform this test via Statgraphics, we first pull up the data file CLTRES.SF, then we go to the main
menu and select

Describe
Categorical Data
Cross tabulation

[4] For example, for the second question about the paper, we will create a variable called PAPER, with Chronicle = 1, BEE = 2 and Times = 3.

and fill in one of the variables as the row variable and the other as the column variable. For our example we will select Age as the row variable and Paper as the column variable. For the desired output we go to the tabular options and select the Chi-square and frequency table options.

The chi-square option gives us the value of the chi-square statistic for the hypothesis (see Figure 2). This value is calculated by comparing the actual observed number for each cell (combination of levels for each of the two variables) with the expected number under the assumption that the two variables are independent.

Figure 2
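(Aside: the same test of independence can be reproduced outside Statgraphics; a sketch in Python, with the exported file name assumed:)

    # Sketch: chi-square test of independence for AGE vs. PAPER.
    import pandas as pd
    from scipy.stats import chi2_contingency

    resp = pd.read_csv("cltres.csv")                 # hypothetical export of CLTRES.SF
    table = pd.crosstab(resp["age"], resp["paper"])  # observed cell counts
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi-square = {chi2:.3f}, p-value = {p:.4f}")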

Since the p-value for the chi-square test is 0.6218, which exceeds α = 0.05, we conclude that there is not enough evidence to suggest that AGE and PAPER are dependent. Hence it is appropriate to conclude that age is not a factor in determining who subscribes to which paper. Selecting the frequency option provides us with the following output (pane):

Figure 3

Note that the top entry for each cell represents the actual number of responses for the cell from the survey. The bottom entry in each cell represents the cell's percentage of the entire sample (array). By right-clicking on the output pane displayed in Figure 3, one can choose the pane analysis option and select either column or row percentages for the lower entry. If the hypothesis test had concluded that AGE and PAPER were dependent, then a comparison of the cell's percentage with the value in the Column Total column would suggest how the variables are related. If, looking at Figure 1, we had selected the row option, then one should substitute row for column in the above discussion.

THE READER IS ENCOURAGED TO ANALYZE WHETHER PAPER AND INCOME ARE RELATED.

Practice Problem

A survey was administered to determine whether various categories describing a student were
independent. Part of the survey questionnaire appears below:

PLEASE PROVIDE THE REQUESTED INFORMATION BY CHECKING (ONCE).

What is your:

- age            ____ < 18   ____ 18 - 26   ____ > 26

- gender         ____ male   ____ female

- course load    ____ < 6 units   ____ 6 - 12 units   ____ > 12 units

- gpa            ____ < 2.0   ____ 2.0 - 2.5   ____ 2.6 - 3.0   ____ 3.1 - 3.5   ____ > 3.5

- annual income  ____ < $10,000   ____ $10,000 - $20,000   ____ > $20,000

The information is coded and entered in the file STUDENT.SF by letting the first response be
recorded as a 1, the second as a 2, etc.

a. Test whether a relationship exists between the categories “age” and “gpa.”
H0: ________________________________________________
H1: ________________________________________________
p-value: _________ Decision: ___________________________

b. Test whether a relationship exists between the categories “gender” and “income.”
H0: ________________________________________________
H1: ________________________________________________
p-value: _________ Decision: ___________________________

c. Describe your observation of the table display for the categories “gender” and
“income.”

THE ANALYSIS OF VARIANCE

In this section we will study the technique of analysis of variance (ANOVA), which is designed to allow one to test whether the means of more than two populations are equal. As a follow-up, we will discuss what interpretations can be made should one decide that the means are statistically different.

We will discuss two different models (experimental designs): one-way ANOVA and two-way ANOVA. Each model assumes that the random variable of concern is continuous and comes from a normal distribution[5], and that the sources of specific variation are strictly qualitative. A one-way ANOVA model assumes there is only one possible source of specific variation, while the two-way ANOVA model assumes that there are two sources of specific variation.

One-Way Analysis Of Variance

The one-way ANOVA model assumes that the variation of the random variable of concern is made up of common variation and one possible source of specific variation, which is qualitative. The purpose of the one-way ANOVA analysis is to see whether the population means of the different populations, as defined by the specific source of variation, are equal or not.

For example, assume you are the utility manager for a city and you want to enter into a contract with a single supplier of streetlights.

[5] The results of the ANOVA models are robust to the assumption of normality (i.e. one need not be overly concerned about the normality assumption).

You are currently considering four possible vendors. Since their prices are identical, you wish to see if there is a significant difference in the mean number of hours per streetlight.[6]

Design

The design we employ randomly assigns experimental units to each of the populations. In the streetlight example we will randomly select light bulbs from each population and then randomly assign them to various streetlights. When there are an equal number of observations per population, the design is said to be a balanced design. Most texts, when introducing one-way ANOVA, discuss a balanced design first, since the mathematical formulas that result are easier to present for a balanced design than for an unbalanced one. Since our presentation will not discuss the formulas, what we present does not require a balanced design, although our first example happens to be balanced.

Going back to our example, we randomly selected 7 light bulbs from each of the populations and recorded the length of time each bulb lasted until burning out. The results are shown below, where the value recorded is in 10,000s of hours.

GE DOT West Generic


2.29 1.92 1.69 2.22
2.50 1.92 1.92 2.01
2.50 2.24 1.84 2.11
2.60 1.92 1.92 2.06
2.19 1.84 1.69 2.19
2.29 2.00 1.61 1.94
1.98 2.16 1.84 2.17

One can easily calculate the sample means (X-bars) for each population, with the results[7] being 2.34, 2.00, 1.79 and 2.10 for GE, DOT, West, and Generic respectively. Recall that our objective is to determine if there is a statistically significant difference between the four population means, not the sample means. To do this, note that there is variation within each population and between the populations. Since we are assuming that the within variations are all the same, a significant between-population variance will be due to a difference in the population means. To determine if the between-population variation is significant, we employ the following Statgraphics steps so that we can conduct, via an F statistic, the hypothesis test:

H0: All of the population means are the same
H1: Not all of the population means are the same

[6] In this example the random variable is the number of hours per light and the source of specific variation is the different vendors (qualitative).
[7] There is some rounding.

Create a Statgraphics file [LGHTBULB -- notice the spelling, 8 letters] with three variables. The first variable [HRS] represents the measured value (hours per light bulb, in 10,000s of hours) and the second variable [BRAND] indicates to which population the observation belongs. This can be accomplished by letting GE be represented by a 1, DOT by a 2, West by a 3, and Generic by a 4. We also create a third variable [NAMES], which is unnecessary for the Windows version of Statgraphics.[8]

ROW HRS BRAND NAMES


----- ------- ------- ---------
1 2.29 1 GE
2 1.92 2 DOT
3 1.69 3 WEST
4 2.22 4 GENERIC
. . .
. . .
25 1.98 1
26 2.16 2
27 1.84 3
28 2.10 4

Using the created data file LGHTBULB.SF, as shown above, we are now able to select the one way
ANOVA option in Statgraphics by going to the main menu and selecting:
Compare
Analysis of Variance
One-way ANOVA

and declaring hours as the dependent variable, along with brand as the factor variable.
[8] The data file accessed from the WWW page was released prior to our converting this file over to the Windows version of Statgraphics.

The resulting output pane, when selecting the ANOVA table option under tabular options, is:

[Table 1. Output for One-Way ANOVA]

From this output we can now conduct the hypothesis test:

H0: All four population means are the same
H1: Not all four population means are the same

by means of the F test. Note that the F-ratio is the ratio of the between-groups (populations) variation to the within-groups (populations) variation.[9] When this ratio is large enough, we say there is significant evidence that the population means are not the same. To determine what is large enough, we utilize the p-value (sig. level) and compare it to alpha. Setting α = 0.05, we can see for our example that the p-value is less than alpha. This indicates that there is enough evidence to suggest that the population means are different, and we reject the null hypothesis. To go one step further and see what kind of interpretation one can make about the population means, once it is determined that they are not all equal, we can utilize the means plot option under the graphics options icon. The resulting pane is shown below.

[9] The mean square values are estimates of the respective variances.

[Figure 1. Intervals for Factor Means: means and 95.0 percent LSD intervals for hours, by brand]

To interpret the means plot, note that the vertical axis is numeric and the figure depicts, for each brand, the confidence interval for the respective population mean. When the confidence intervals overlap, we conclude the population means are not significantly different; when there is no overlap, we conclude that the population means are significantly different. The interpretations are made by taking the various pair-wise comparisons. Interpreting Figure 1, one can see that GE (brand 1) is significantly greater than all of the other three brands, WEST (brand 3) is significantly less than all of the others, and DOT (brand 2) and GENERIC (brand 4) are not significantly different. The means table provides the same information, but in a numerical format.
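(Aside: the same one-way ANOVA can be run on the streetlight data above in a few lines of Python:)

    # Sketch: one-way ANOVA on the light bulb data (hours in 10,000s).
    from scipy.stats import f_oneway

    ge      = [2.29, 2.50, 2.50, 2.60, 2.19, 2.29, 1.98]
    dot     = [1.92, 1.92, 2.24, 1.92, 1.84, 2.00, 2.16]
    west    = [1.69, 1.92, 1.84, 1.92, 1.69, 1.61, 1.84]
    generic = [2.22, 2.01, 2.11, 2.06, 2.19, 1.94, 2.17]

    f_stat, p = f_oneway(ge, dot, west, generic)
    print(f"F = {f_stat:.2f}, p-value = {p:.5f}")   # a small p-value => means differ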

Practice Problems

1. A consumer organization was interested in determining whether any difference existed in the average battery life of four different brands of walkmans. A random sample of four batteries of each brand was tested. Using the data in the table, at the 0.05 level of significance, is there evidence of a difference in the average life of these four brands of Walkman batteries? [Create the file WALKBAT.SF.]

Brand 1 Brand 2 Brand 3 Brand 4


12 19 20 14
10 17 19 21
18 12 21 25
15 14 23 20

2. A toy company wanted to compare the price of a particular toy in three types of stores in a suburban county: discount toy stores, specialty stores, and variety stores. A random sample of four discount toy stores, six specialty stores, and five variety stores was selected. At the 0.05 level of significance, is there evidence of a difference in the average price between the types of stores? [Create the file TOY.SF.]

Discount Toy   Specialty   Variety
     12            15          19
     14            18          16
     15            14          16
     16            18          18
                   18          15
                   15

Two-Way Analysis Of Variance

Given our discussion of the one-way ANOVA model, we can easily extend the discussion to a two-way ANOVA model. As stated previously, the difference between a one-way ANOVA and a two-way ANOVA depends on the number of qualitative sources of specific variation for the variable of concern.

The two-way ANOVA model we will consider has basically the same assumptions as the one-way ANOVA model presented previously. In addition, we will assume the factors influence the variable of concern in an additive fashion. The analysis will be similar to the one-way ANOVA, in that each factor is analyzed.

To illustrate the two-way ANOVA model, we consider an example where the dependent variable is the weekly sales of Maggie Dog Food. In its pilot stage of development, Maggie Dog Food is packaged in four different colored containers (blue, yellow, green and red) and placed at different shelf heights (low, medium, and high). As the marketing manager you are interested in seeing what impact the different levels of each of the two factors have on sales. To do this you randomly assign different weeks to the different combinations of package color and shelf height. The results are shown below:

                        Shelf Height
                    Low     Med     High
 Can     Blue       125     140     152
 Color   Yellow     112     130     124
         Green       85     105      93
         Red         85      97      98

Given this design, we can test two sets of hypotheses:

H0: The population means for all four colors are the same
H1: The population means for at least two colors are different

and

H0: The population means for the different shelf heights are the same
H1: The population means for at least two of the shelf heights are different

To conduct this analysis using Statgraphics, we enter the data into a file called DOG.SF as shown in Table 2 below:

Table 2.

SALES COLOR HGT


-------------------------------------------------
125. B L
112. Y L
85. R L
85. G L
140. B M
130. Y M
105. R M
97. G M
152. B H
124. Y H
93. R H
98. G H

Now that the data is entered into the file DOG.SF, we are ready to have Statgraphics generate the required output. To accomplish this we escape back to the main menu and select

Compare
Analysis of Variance
Multifactor ANOVA

then select SALES as the dependent variable and, for the factors, COLOR and HEIGHT. (We do not choose to consider a covariate for this model.) When selecting the tabular option ANOVA Table, we get the following pane:

Table 4

Looking at the two-way ANOVA table (Table 4), one can see that the total variation is comprised of variation for each of the two factors (height and color) and the residual. The F-ratios for the factors are significant, as indicated by their respective p-values. Hence, one can conclude that there is enough evidence to suggest that the means are not all the same for the different colors, and that the means are not all the same for the different shelf heights.

To determine what one can conclude about the relationship of the population means for each of the factors, we look at the means plot (table) for each of the factors [see Graphics options]. The means plots for shelf height and color are shown in Figures 5 and 6.[10]

[10] To change the means plot from one variable to the other, one needs to right-click on the pane and choose the appropriate Pane Option(s).

[Figure 5. Means and 95.0 percent LSD intervals for sales, by shelf height]

[Figure 6. Means and 95.0 percent LSD intervals for sales, by can color]

Interpreting the means plots just as we did for the one-way ANOVA example, we can make the following conclusions. With regard to shelf height, the low shelf height has a lower population mean than both the medium and high shelf heights, while we are unable to detect a significant difference between the medium and high shelf heights. With regard to the colors, the blue population mean is greater than the yellow population mean, which in turn is greater than both the green and the red population means; we are unable to detect a significant difference between the green and red population means. The means tables (tabular options) provide the same results as the means plots, just in numerical format.
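(Aside: a sketch of the same additive two-way ANOVA in Python, entering the twelve observations from the table above:)

    # Sketch: additive two-way ANOVA on the dog food data.
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    dog = pd.DataFrame({
        "sales":  [125, 112, 85, 85, 140, 130, 105, 97, 152, 124, 93, 98],
        "color":  ["B", "Y", "R", "G"] * 3,
        "height": ["L"] * 4 + ["M"] * 4 + ["H"] * 4,
    })
    model = ols("sales ~ C(color) + C(height)", data=dog).fit()
    print(sm.stats.anova_lm(model, typ=2))   # F-ratio and p-value for each factor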

Practice Problems

1. The Environmental Protection Agency of a large suburban county is studying coliform bacteria counts (in parts per thousand) at beaches within the county. Three types of beaches are to be considered -- ocean, bay, and sound -- in three geographical areas of the county -- west, central, and east. Two beaches of each type are randomly selected in each region of the county. The coliform bacteria counts at each beach on a particular day were as follows:

Geographic Area
Type of Beach West Central East
Ocean 25 20 9 6 3 6
Bay 32 39 18 24 9 13
Sound 27 30 16 21 5 7

Enter the data and save as the file WATER.SF.

At the 0.05 level of significance, is there an

a. effect due to type of beach?

H0: ________________________ H1: _______________________

p-value: _____________________ Decision: __________________

b. effect due to type of geographical area?

H0: ________________________ H1: _______________________

p-value: _____________________ Decision: __________________

c. effect due to type of beach and geographical area? OPTIONAL

H0: ________________________ H1: _______________________

p-value: _____________________ Decision: __________________

d. Based on your results, what conclusions concerning average bacteria count can be
reached?

2. A videocassette recorder (VCR) repair service wished to study the effect of VCR brand and
service center on the repair time measured in minutes. Three VCR brands (A, B, C) were
specifically selected for analysis. Three service centers were also selected. Each service
center was assigned to perform a particular repair on two VCRs of each brand. The results
were as follows:

Service Center   Brand A   Brand B   Brand C
      1          52, 57    48, 39    59, 67
      2          51, 43    61, 52    58, 64
      3          37, 46    44, 50    65, 69

Enter the data and save as the file VCR.SF

At the .05 level of significance:

(a) Is there an effect due to service centers?


(b) Is there an effect due to VCR brand?
(c) Is there an interaction due to service center and VCR brand? OPTIONAL

3. The board of education of a large state wishes to study differences in class size between
elementary, intermediate, and high schools of various cities. A random sample of three cities within
the state was selected. Two schools at each level were chosen within each city, and the average class
size for the school was recorded with the following results:

Education Level    City A    City B    City C
Elementary         32, 34    26, 30    20, 23
Intermediate       35, 39    33, 30    24, 27
High School        43, 38    37, 34    31, 28

Enter the data and save as the file SCHOOL.SF.

At the .05 level of significance:

(a) Is there an effect due to education level?


(b) Is there an effect due to cities?
(c) Is there an interaction due to educational level and city? OPTIONAL

4. The quality control director for a clothing manufacturer wanted to study the effect of operators and machines on the breaking strength (in pounds) of wool serge material. A batch of material was cut into square-yard pieces, and these were randomly assigned, three each, to all twelve combinations of the four operators and three machines chosen specifically for the experiment. The results were as follows:

Operator    Machine I        Machine II       Machine III
    A       115, 115, 119    111, 108, 114    109, 110, 107
    B       117, 114, 114    105, 102, 106    110, 113, 114
    C       109, 110, 106    100, 103, 101    103, 102, 105
    D       112, 115, 111    105, 107, 107    108, 111, 110

Enter the data and save as the file SERGE.SF.

At the .05 level of significance:

(a) Is there an effect due to operator?


(b) Is there an effect due to machine?
(c) Is there an interaction due to operator and machine? OPTIONAL

Appendices

Quality

The Concept of Stock Beta

QUALITY

Common Causes and Specific Causes

As stated earlier, and repeated here because of the concept's importance, in order to reduce the variation of a process one needs to recognize that the total variation is comprised of common causes and specific causes. Those factors which are not readily identifiable and occur randomly are referred to as the common causes, while those which have a large impact and can be associated with special circumstances or factors are referred to as specific causes.

It is important to get a feel for the distinction: a specific source is something that can produce a significant change, while there can be numerous common sources which individually have an insignificant impact on the process's variation.

Stable and Unstable Processes

When a process has variation made up of only common causes, the process is said to be a stable process, which means that the process is in statistical control and remains relatively the same over time. This implies that the process is predictable, but does not necessarily suggest that the process is producing outputs that are acceptable, as the amount of common variation may exceed the amount of acceptable variation. If a process has variation which is comprised of both common causes and specific causes, then it is said to be an unstable process -- the process is not in statistical control. An unstable process does not necessarily mean that the process is producing unacceptable products, since the total variation (common variation + specific variation) may still be less than the acceptable level of variation.

Tampering with a stable process will usually result in an increase in the variation, which will decrease the quality. Improving the quality of a stable process (i.e. decreasing common variation) is usually only accomplished by a structural change, which will identify some of the common causes and eliminate them from the process.

Identification Tools

There are a number of tools used in practice to determine whether specific causes of variation exist within a process. In the remaining part of this chapter we will discuss how time series plots, the runs test, a test for normality, and control charts are used to identify specific sources of variation. As will become evident, there is a great deal of similarity between time series plots and control charts. In particular, control charts are time series plots of statistics calculated from subgroups of observations, whereas when we speak of time series plots we are referring to plots of consecutive observations.

Time Series Plots

One of the first things one should do when analyzing a time series is to plot the data since, as the saying goes, "a picture is worth a thousand words." A time series plot is a graph where the horizontal axis represents time and the vertical axis represents the units in which the variable of concern is measured. For example, consider the series where the variable of concern is the price of Anheuser Busch Co. stock on the last trading day of each month from June 1995 to June 2000 inclusive (for the data see STOCK03 in the Statgraphics file STOCK.SF). Using the computer we are able to generate a time series plot in which the horizontal axis represents time and the vertical axis represents the price of the stock, measured in dollars.
When using a time series plot to determine whether a process is stable, one is seeking answers to the following questions:

1. Is the mean constant?
2. Is the variance constant?
3. Is the series random (i.e. no pattern)?

Rather than initially showing the reader time series plots of stable processes, we show examples of nonstable processes commonly experienced in practice.

[Figures (a)-(d): example time series plots of nonstable processes]
In figures (a) and (b) a change in mean is illustrated: in figure (a) there is an upward trend, while in figure (b) there is a downward trend. In figure (c) a change in variance (dispersion) is shown, while figure (d) demonstrates a cyclical pattern, which is typical of seasonal data. Naturally, combinations of these departures are also examples of nonstable processes.

Runs Test

Frequently, nonstable processes can be detected by visually examining their time series plots. However, there are times when patterns exist that are not easily detected visually. A tool that can be used to identify nonrandom data in these cases is the runs test. The logic behind this nonparametric test is as follows:

Between any two consecutive observations of a series, the series either increases, decreases, or stays the same. Defining a run as a sequence of exclusively positive or exclusively negative steps (not mixed), one can count the number of observed runs for a series. For the given number of observations in the series, one can calculate the number of expected runs, assuming the series is random. If the number of observed runs is significantly different from the number of expected runs, then one can conclude that there is enough evidence to suggest that the series is not random. Note that the runs test is a two-tailed test, since there can be either too few observed runs [once the series goes up (down) it tends to continue going up (down)] or too many runs [an oscillating pattern (up, down, up, down, etc.)]. To determine if the observed number significantly differs from the expected number, we encourage the reader to rely on statistical software (Statgraphics) and utilize the p-values that are generated.
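(Aside: a sketch of the runs up-and-down test in Python. The mean and variance formulas below are the standard ones for this test, and they are consistent with the expected count of 66.33 quoted for the n = 100 series later in these notes; ties are simply dropped here, which is a simplification.)

    # Sketch: runs (up and down) test for randomness.
    import math
    from scipy.stats import norm

    def runs_up_down_test(series):
        # +1 for each step up, -1 for each step down; ties are dropped
        signs = [1 if b > a else -1 for a, b in zip(series, series[1:]) if b != a]
        runs = 1 + sum(1 for s, t in zip(signs, signs[1:]) if s != t)
        n = len(series)
        mean = (2 * n - 1) / 3               # expected number of runs
        var = (16 * n - 29) / 90             # variance of the number of runs
        z = (runs - mean) / math.sqrt(var)
        return runs, mean, 2 * norm.sf(abs(z))   # observed, expected, two-tailed p-value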

Normal Distribution?

Another attribute of a stable process, which you may recall lacks specific causes of variation, is that the series follows a normal distribution. To determine whether a variable follows a normal distribution, one can examine the data via a graph called a histogram and/or utilize a test which incorporates a chi-square test statistic.

A histogram is a two-dimensional graph in which one axis (usually the horizontal) represents the range of values the variable may assume, divided into mutually exclusive classes (usually of equal length), while the other axis represents the observed frequency for each of the individual classes. Recalling the attributes of a normal distribution:

- symmetry
- bell shaped
- approximately 2/3 of the observations are within one (1) standard deviation of the mean
- approximately 95 percent of the observations are within two (2) standard deviations of the mean

one can visually check whether the data approximates a normal distribution. Many software packages, such as Statgraphics, will overlay the observed data with a theoretical distribution calculated from the sample mean and sample standard deviation in order to assist in the evaluation.

Even so, many individuals still find this visual evaluation difficult and hence prefer to rely on statistical testing. The underlying logic of the statistical test for normality is that, as with the visual inspection, the observed frequencies are compared with expected values which are a function of an assumed normal distribution, with the sample mean and sample standard deviation serving as the parameters. The test statistic:

χ² = Σ_i (O_i − E_i)² / E_i

where O_i is the number of observed observations in the ith class and E_i is the number of expected observations for the ith class, follows (approximately) a chi-square distribution with k − 1 − m degrees of freedom, where k is the number of classes and m is the number of parameters estimated from the sample (here m = 2, the mean and standard deviation).

In particular, the null hypothesis that the series is normally distributed is rejected when the χ² value is too large (i.e. the observed values are not close enough to the expected). To determine what is too large, we encourage the reader to rely on the results generated by their statistical software package, especially the p-values that are calculated.
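(Aside: a sketch of this goodness-of-fit calculation in Python; the interior class boundaries are passed in, mirroring the class limits shown in the HW.SIX output later in this section:)

    # Sketch: chi-square test of normality against classes
    # (-inf, c1], (c1, c2], ..., (ck, +inf).
    import numpy as np
    from scipy.stats import norm, chi2

    def chi_square_normality(data, cuts):
        data = np.asarray(data, dtype=float)
        mu, sigma = data.mean(), data.std(ddof=1)          # estimated parameters
        idx = np.searchsorted(cuts, data, side="left")     # class index for each point
        observed = np.bincount(idx, minlength=len(cuts) + 1)
        edges = np.concatenate(([-np.inf], cuts, [np.inf]))
        expected = np.diff(norm.cdf(edges, mu, sigma)) * len(data)
        stat = ((observed - expected) ** 2 / expected).sum()
        df = len(observed) - 1 - 2                         # classes - 1 - parameters
        return stat, chi2.sf(stat, df)                     # statistic, p-value

    # e.g. stat, p = chi_square_normality(series, cuts=[11.3, 12.3, 13.3])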

Exercises

The data for these exercises are in the file HW.SF. For each series determine whether the series is stationary (i.e. constant mean and constant variance), normal, and random. If a series violates any of the conditions (stationarity, normality, randomness), then there is information, and you only need to cite the violation.

You are encouraged to examine each series before looking at the solution provided. The series are:

HW.ONE
HW.TWO
HW.THREE
HW.FOUR
HW.FIVE
HW.SIX

For each series, the time units selection is "index", since the series is not monthly, daily, or workday data.

1. HW.ONE

The time series (horizontal) plot shows:

[Time Series Plot: HW.ONE]

Stationarity?

From the visual inspection, one can tell the series is stationary. This may not be obvious to you at this time; however, it will be with more experience. Remember, one way to determine if the series is stationary is to take snapshots of the series over different time increments, superimpose them on different time intervals, and see if they match up. If you do that with this series, you will indeed see that it is in fact stationary.

Normality?

Shown below is the histogram generated by Statgraphics for HW.ONE.[11]

[Histogram: HW.ONE]

Remember, a histogram shows the frequency with which the series' values occur in different intervals along the horizontal axis. From this, one can see that the distribution of HW.ONE appears somewhat like a normal distribution. Not exactly, but to see how closely it relates to a theoretical normal distribution we rely on the Chi-square test. For the Chi-square test, we go to the tabular options and select Tests for Normality. The Chi-square statistic is then calculated, as displayed in the table shown below.

[11] To obtain such a graph using Statgraphics, we selected the data file HW.SF and then selected Describe, Distribution Fitting, Uncensored Data, specified One (data series), then Graphical Options and Frequency Histogram.

Using the information displayed in the table, one is now able to perform the following hypothesis test:

H0: the series is normal
H1: the series is not normal.

As we can see from the table, the p-value (significance level) equals 0.1356. Since the p-value is greater than alpha (0.05), we retain the null hypothesis; there is not enough evidence to say that the distribution is not normal. Thus, we are able to pass the series as being normally distributed at this time.

Random?

Relying upon the nonparametric test for randomness, we are now able to look at the series HW.ONE[12] and determine whether in fact we think the series is random.

[12] This result was generated by the following Statgraphics steps after selecting the data file HW.SF: select Special, Time Series, Descriptive, specifying One as the data series, followed by Tabular Options, and in particular selecting Test for Randomness.

Test for Randomness

Recall again that one will reject the null hypothesis that the series is random, with the alternative being that the series is not random, if there are either too few or too many runs. Ignoring the information about the median, and just looking at what is said with regard to the number of runs up and down, we note that for HW.ONE there are 74 runs. The expected number of runs is 66.3. We do not need to rely on a table in a book, as stated before; we can just look at the p-value, which in this case is 0.086 (rounded). Since the p-value is larger than our value of α = 0.05, we cannot reject the null hypothesis, and hence we conclude that the series may in fact be random.

Summary

Having checked the series for stationarity, normality and randomness, and not having rejected any of those particular tests, we are able to say that we do not feel there is any information in the series based upon these particular criteria.

2. HW.TWO

Stationarity?

As one can see in the horizontal time series plot shown below, HW.TWO is not stationary.

[Time Series Plot: HW.TWO]

In particular, notice that at around observation 60 there is a shift in the series: the mean increases. Hence, for this process, there is information in the series because there is a shift in the mean. Given that piece of information, we will not go on to the remaining steps of checking for normality and randomness. If it is difficult for you to see the shift in the mean, take a snapshot of the series from, say, 0 to 20, superimpose it on the values from 60 to 80, and you will see that there is in fact a difference in the mean itself.

3. HW.THREE

Stationarity?

The initial step of our process is once again to take a look at the visual plot of the data itself. As one can see from the plot shown below, there is a change in variance after the 40th time period.

[Time Series Plot: HW.THREE]

In particular, the variance increases substantially when compared to the variance in the first 40 time periods. This is the source of information, and once again we will not consider the test for normality or the runs test; we have already acquired information about the change in variance.

If you were the manager of a manufacturing process and saw this type of plot, you would be particularly concerned about the increase in variability at the 40th time period. Some kind of intervention took place, and one should try to determine what caused that particular shift in variance.

4. HW.FOUR

The time series plot of HW.FOUR appears below:

[Time Series Plot: HW.FOUR]

Stationarity?

As one can clearly see from this plot, the values are linear in that the values fall on a straight line.

This series is clearly not stationary. Once again, if one were to take a snap shot of the values say

between 0 and 40, and just shift that over so they match up between 60 and 100, you have two

separate lines clearly the means are not the same. The mean is changing over time. (We will

discuss this kind of series when we are applying regression analysis techniques.)

5. HW.FIVE

Shown below is the time series plot of HW.FIVE:

[Time Series Plot: HW.FIVE]

Stationarity?

The series is clearly stationary. It has a constant mean and a constant variance as we move in time.

Once again, recall that one can take a snap shot of the series between a couple time periods, say the

0 and 20, and that will look very similar to any other increments of 20 time periods that are shown

on the time series plot of the series. We now think that the series may in fact be stationary. Recall

we also want to check for normality and the runs test. Hence, we now perform these two tests.

Normality?

Once again, utilizing the Statgraphics options, we are able to fit the series to a normal distribution. A theoretical distribution is generated using the sample mean and standard deviation as the parameters. Using those values, we can compare the frequencies of our actual observations with the theoretical normal distribution. Selecting the default options provided by Statgraphics, the following figure is displayed:

[Histogram: HW.FIVE]

Note again that the distribution is not exactly normally distributed, but it may closely follow a normal distribution. To have an actual test, we revert back to the Chi-square test and, again using the default options provided by Statgraphics, we come up with the output shown below.

As one can see from the information provided above, the significance level is 0.836735. Since this value is greater than 0.05, we are not able to reject the null hypothesis that the series is normal, and hence we feel the series may in fact be approximately normally distributed. We now test the series for randomness.

Random?

Again, we use the nonparametric test for randomness and what we are able to determine for

this particular series is based upon information shown below:

Ignoring the information about the median, we focus our attention on the actual number of runs up and down. Note that the actual number is 71 and the expected number is 66.3. Is that discrepancy large enough for us to conclude that there are too many runs in the series, and hence possibly a pattern? To answer that question, we rely on the z-value, which is 0.997291, and on the corresponding p-value, which is 0.319. Since the p-value exceeds α = 0.05, we are not able to reject the null hypothesis that the series is random; thus we retain (fail to reject) the null hypothesis.

Summary

As with HW.ONE, the series we just looked at, HW.FIVE, is stationary by visual inspection, can pass for a normal distribution, and can pass for a random series. Hence, based upon these criteria, we are again not able to find any information in this particular series.

6. HW.SIX

Stationarity? Again, the first step of our investigation is to take a look at the time series plot.

[Time Series Plot: HW.SIX]

From this time series plot, two values stand out; we call those values outliers. They occur at approximately the 58th and the 70th observations. Aside from these two observations, which may carry important information in themselves, the rest of the series appears to be stationary.

Normality?

Using the sample statistics (mean 12.0819 and standard deviation 1.06121), we now compare our actual observations with the theoretical normal distribution. As one can see from the histogram displayed below, the two outliers appear at the extremes, but the rest of the series very closely approximates a normal distribution.

[Frequency Histogram: HW.SIX]

Going to the Chi-square test, and again accepting the options provided by Statgraphics, we note the following result:

Chi-square Test
---------------------------------------------------------------
Lower Upper Observed Expected
Limit Limit Frequency Frequency Chi-square
---------------------------------------------------------------
at or below 11.300 18 23 1.111
11.300 12.300 39 35 .438
12.300 13.300 34 29 .752
above 13.300 9 13 1.005
---------------------------------------------------------------
Chi-square = 3.30589 with 1 d.f. Sig. level = 0.069032

We note that the p-value is 0.069032; again, since this value is greater than α = 0.05, we conclude that the series may in fact pass for a normal distribution. The Chi-square statistic is sensitive to outlier observations, as outliers tend to inflate the statistic itself. {The inflation of the Chi-square statistic would tend to make the significance level much smaller than it would be without the outlier observations.} Given that fact and the significance level we have obtained, we should feel comfortable that the series may in fact follow a normal distribution. Hence, we are unable to reject the null hypothesis that the series is normally distributed.

Randomness?

To determine whether the series can pass for random, we once again utilize the nonparametric runs test. (The information provided is in an abbreviated format.)

The actual number of runs up and down is 72, versus the expected number of 66.3333. The question we need to ask now is: "Is the difference significant, which would imply that we have too many runs versus the theoretical distribution?" As one can see from the p-value, which is 0.21662, we will not reject the null hypothesis that the series is random, because the p-value again exceeds our stated value of alpha. Thus, we are able to conclude that the series may in fact be random.

Summary

We have observed that HW.SIX may in fact be stationary, normally distributed, and random. We remain concerned, however, about the two outlier observations, numbers 58 and 70. As a manager, one would naturally want to ask what happened at those time periods and see if there is information there. Note that without the visual plot we would never have suspected the series of containing information based solely upon the normality test and the runs test. Thus, one can see that the visual plot of the data is extremely important if we are to find information in the series itself. Of all the tests we have looked at, the visual plot is probably the most important one, and one that we should always do whenever looking at a set of data.

The Concept of Stock Beta

An important application of simple linear regression in business is calculating the β (beta) of a stock. The β's are a measure of risk and are used by portfolio managers when selecting stocks. The model used (specified) to calculate a stock's β is as follows:

R_{j,t} = α + β R_{m,t} + ε_t

where: R_{j,t} is the rate of return for the jth stock in time period t
       R_{m,t} is the market rate of return in time period t
       ε_t is the error term in time period t
       α and β are constants

A formula for R_{j,t} (the rate of return for the jth stock in time period t) follows:

R_{j,t} = ((P + D)_t − (P + D)_{t−1}) / (P + D)_{t−1}

where: P is the price of the stock
       D is the stock's dividend
       (P + D)_t is the sum of the price and dividend for the given time period
       (P + D)_{t−1} is the sum of the price and dividend for the previous time period (in this case, one time period prior to the "given" time period)
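(Aside: a sketch of estimating β by ordinary least squares in Python; the file prices.csv and its column names are assumptions, and percent changes of dividend-adjusted prices stand in for the return formula above:)

    # Sketch: estimate a stock's beta from monthly returns.
    import pandas as pd
    import statsmodels.api as sm

    prices = pd.read_csv("prices.csv")            # monthly closes: columns "stock", "market"
    returns = prices[["stock", "market"]].pct_change().dropna()

    X = sm.add_constant(returns["market"])        # market return R_m,t plus intercept (alpha)
    fit = sm.OLS(returns["stock"], X).fit()
    print(f"estimated beta: {fit.params['market']:.2f}")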

Generally, the average stock moves up and down with the general market as measured by some accepted index such as the S&P 500 or the New York Stock Exchange (NYSE) Index. By definition, such an average stock has a beta of one (1.0): as the market moves up or down by one percentage point, the stock will also tend to move up or down by one percentage point. A portfolio of such stocks will likewise move up or down with the broad market averages.

If a stock has a beta of 0.5, the stock is considered to be one-half as volatile as the market; it is one-half as risky as a portfolio with a beta of one. Likewise, a stock with a beta of two (2.0) is considered to be twice as volatile as an average stock, and a portfolio of such stocks will be twice as risky as an average portfolio.

Betas are calculated and published by Value Line and numerous other organizations. The beta coefficients shown in the table below were calculated using data available at http://investor.msn.com for the time period from June 1995 to June 2000. Most stocks have betas in the range of 0.75 to 1.50, with the average over all stocks being a beta of 1.0. Which stock is the most stable? Which stock is the most risky? Is it possible for a stock to have a negative beta (consider gold stocks)? If so, what industry might it represent?

Stock Beta
Ebay.com 4.29
Amazon.com 1.70
Seagate Technology 1.30
Harley Davidson 1.13
Dow Chemical 1.11
General Electric 0.98
Intel 0.77
Chevron 0.73
Pacific Gas & Electric 0.00

In summary, the regression coefficient β (the beta coefficient) is a market sensitivity index; it measures the relative volatility of a given stock versus the average stock, or "the market." The tendency of an individual stock to move with the market constitutes a risk, because the stock market does fluctuate daily. Even well-diversified portfolios are exposed to market risk.

As asked in your text, given the β's for Anheuser Busch Company (0.579), The Boeing Company (0.904), and American Express (1.080), what are the implications for the three companies?

[Note: If the concept of stock risk is of special interest, please refer to any intermediate financial management text for a more in-depth explanation. The concept is critically important to financial management.]
