
Statistika Ekonomi Dan Bisnis Lanjutan

The document discusses sampling fundamentals and techniques. It covers: 1) When census and sampling are appropriate based on population size, cost, and need for accuracy. 2) The sampling process, including identifying the target population, determining the sampling frame, selecting a sampling technique, and determining sample size. 3) Common sampling techniques like simple random sampling, stratified sampling, cluster sampling, and non-probability sampling. It provides examples of how to calculate sample sizes and selection probabilities for different stratified sampling techniques. It also discusses addressing non-response biases and sources of bias in shopping center sampling.

Uploaded by

Rasyid Fikri

Masterbook of Business and Industry (MBI)

STATISTIKA EKONOMI DAN BISNIS LANJUTAN
(Advanced Economic and Business Statistics)

SESSION 1

SAMPLING FUNDAMENTALS

When is a census appropriate?
• Population size is quite small
• Information is needed from every individual in the population
• Cost of making an incorrect decision is high
• Sampling errors are high

When is a sample appropriate?
• Population size is large
• Both the cost and the time of obtaining the information from the population are high
• A quick decision is needed
• To increase response quality, since more time can be spent on each interview
• The population being dealt with is homogeneous
• A census is impossible

Error in Sampling
Total Error: Difference between the true value and the observed value of a variable
Sampling Error: Error due to sampling
Non-sampling Error: Error observed in both census and sample

Sources of non-sampling error
1. Measurement Error
2. Data Recording Error
3. Data Analysis Error
4. Non-response Error

Sampling Process

1.) Determining Target Population
• Well thought out research objectives
• Consider all alternatives
• Know your market
• Consider the appropriate sampling unit
• Specify clearly what is excluded
• Should be reproducible
• Consider convenience

2.) Determining Sampling Frame
List of population members used to obtain a sample
Issues:
▫ Obtaining appropriate lists
▫ Dealing with differences between the population and the sampling frame
  - Superset problem
  - Intersection problem

3.) Selecting a Sampling Procedure
• Choose between Bayesian and Traditional sampling procedure
• Decide whether to sample with or without replacement

The Sampling Process
1. Identifying the Target Population
2. Determining the Sampling Frame (Reconciling the Population, Sampling Frame Differences)
3. Selecting a Sampling Procedure (Probability Sampling or Non-Probability Sampling)
4. Determining the Relevant Sample Size
5. Executing Sampling
6. Data Collection From Respondents (Handling the Non-Response Problem)
7. Information for Decision-Making

Sampling Techniques
Probability Sampling: All population members have a known probability of being in the sample
Simple Random Sampling: Each population member and each possible sample has an equal probability of being selected
Stratified Sampling: The chosen sample is forced to contain units from each of the segments or strata of the population

Types of Stratified Sampling

Proportionate Stratified Sampling
• Number of objects/sampling units chosen from each group is proportional to the number in the population
• Can be classified as directly proportional or inversely proportional stratified sampling

Disproportionate Stratified Sampling
• Sample size in each group is not proportional to the respective group sizes
• Used when multiple groups are compared and the respective group sizes are small

Directly Proportional Stratified Sampling

Inversely Proportional Stratified Sampling
Muhammad Firman (University of Indonesia - Accounting ) 153

Assume that among the 600 consumers in the population, 200 are heavy drinkers and 400 are light drinkers. If a researcher values the opinion of the heavy drinkers more than that of the light drinkers, more people will have to be sampled from the heavy drinkers group. If a sample size of 60 is desired, a 10 percent inversely proportional stratified sample is employed. The selection probabilities are computed as follows: the inverse weights are 600/200 = 3 for heavy drinkers and 600/400 = 1.5 for light drinkers, which sum to 4.5. Heavy drinkers therefore make up 3/4.5 = 0.667 of the sample (40 respondents) and light drinkers 1.5/4.5 = 0.333 (20 respondents).

Cluster Sampling
• Involves dividing the population into subgroups
• A random sample of subgroups/clusters is selected and all members of the selected subgroups are interviewed
• Very cost effective
• Useful when subgroups can be identified that are representative of the entire population

Comparison of Stratified & Cluster Sampling Processes

Systematic Sampling
• Involves systematically spreading the sample through the list of population members
• Commonly used in telephone surveys
• Sampling efficiency depends on the ordering of the list in the sampling frame

Non Probability Sampling
• Costs and trouble of developing a sampling frame are eliminated
• Results can contain hidden biases and uncertainties
• Used in:
  - The exploratory stages of a research project
  - Pre-testing a questionnaire
  - Dealing with a homogeneous population
  - When a researcher lacks statistical knowledge
  - When operational ease is required

Types of Non Probability Sampling

Judgmental
• An "expert" uses judgment to identify representative samples

Snowball
• Form of judgmental sampling
• Appropriate when reaching small, specialized populations
• Each respondent, after being interviewed, is asked to identify one or more others in the field

Convenience
• Used to obtain information quickly and inexpensively

Quota
• Minimum number from each specified subgroup in the population
• Often based on demographic data

Quota Sampling - Example

Non-Response Problems
1. Respondents may refuse to respond, lack the ability to respond, or be inaccessible
2. Sample size has to be large enough to allow for non-response
3. Those who respond may differ from non-respondents in a meaningful way, creating biases
4. Seriousness of non-response bias depends on the extent of non-response

Solutions to Non-response Problem
• Improve research design to reduce the number of non-responses
• Repeat the contact one or more times (call back) to try to reduce non-responses
• Attempt to estimate the non-response bias

Shopping Center Sampling
• 20% of all questionnaires completed or interviews granted are store-intercept interviews
• Bias is introduced by the methods used to select respondents

Source of Bias

1. Selection of shopping center
2. Point of shopping center from which respondents are drawn
3. Time of day
4. More frequent shoppers will be more likely to be selected

Solutions to Bias

Shopping Center Bias
1. Use several shopping centers in different neighborhoods
2. Use several diverse cities

Sample Locations Within a Center
1. Stratify by entrance location
2. Take a separate sample from each entrance
3. To obtain an overall average, strata averages should be combined by weighting them to reflect the traffic associated with each entrance

Time Sampling
1. Stratify by time segments
2. Interview during each segment
3. Final counts should be weighted according to traffic counts

Sampling People versus Shopping Visits – Options:
1. Ask respondents how many times they visited the shopping center during a specified time period, such as the last four weeks, and weight results according to frequency
2. Use quotas, which serve to reduce the biases to levels that may be acceptable
3. Control for sex, age, employment status, etc.
4. The number sampled should be proportional to the number of the quota in the population

Types of Samples
• Nonprobability Sample: Items included are chosen without regard to their probability of occurrence

Simple Random Samples
Suppose that a sample of n objects is to be selected from a population of N objects. A simple random sample procedure is one in which every possible sample of n objects is equally likely to be chosen. Only sampling without replacement is considered here. Random samples can be obtained from a table of random numbers or from computer random number generators.

Systematic Sampling
1. Decide on sample size: n
2. Divide the frame of N individuals into groups of j individuals: j = N/n
3. Randomly select one individual from the 1st group
4. Select every jth individual thereafter
Example: N = 64, n = 8, j = 8
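The systematic sampling steps (choose n, form groups of j = N/n, pick one unit at random in the first group, then take every jth unit) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `systematic_sample`, the labels 1..64 and the fixed seed are hypothetical and are not part of the notes.

```python
import random

def systematic_sample(frame, n, rng=random):
    """Pick one unit at random from the first group, then every j-th unit after it."""
    j = len(frame) // n            # group size (sampling interval), j = N/n
    start = rng.randrange(j)       # random position inside the first group
    return [frame[start + k * j] for k in range(n)]

# Hypothetical frame for the N = 64, n = 8, j = 8 example
frame = list(range(1, 65))         # individuals labelled 1..64
sample = systematic_sample(frame, n=8, rng=random.Random(0))
print(sample)                      # 8 labels, each 8 apart
```

Because only the starting point is random, the quality of the sample depends on how the frame is ordered, which is the point made about sampling efficiency in systematic sampling.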
Steps of a Sampling Study
Step 1: Information Required?
Step 2: Relevant Population?
Step 3: Sample Selection?
Step 4: Obtaining Information?
Step 5: Inferences From Sample?
Step 6: Conclusions?

Sampling and Nonsampling Errors
A sample statistic is an estimate of an unknown population parameter. Sample evidence from a population is variable, and sample-to-sample variation is expected. Sampling error results from the fact that we only see a subset of the population when a sample is selected. Statistical statements can be made about sampling error: it can be measured and interpreted using confidence intervals, probabilities, etc.

Nonsampling error results from sources not related to the sampling procedure used.
Examples:
1. The population actually sampled is not the relevant one
2. Survey subjects may give inaccurate or dishonest answers
3. Nonresponse to survey questions

Types of Samples
• Probability Sample: Items in the sample are chosen on the basis of known probabilities

Finite Population Correction Factor
Suppose sampling is without replacement and the sample size is large relative to the population size. Assume the population size is large enough to apply the central limit theorem. Apply the finite population correction factor when estimating the population variance.

Estimating the Population Mean
Let a simple random sample of size n be taken from a population of N members with mean μ. The sample mean is an unbiased estimator of the population mean μ. The point estimate is:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
An unbiased estimation procedure for the variance of the sample mean yields the point estimate
$$\hat{\sigma}^2_{\bar{x}} = \frac{s^2}{n}\cdot\frac{N-n}{N}$$
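To make the estimation of a population mean from a simple random sample concrete, here is a minimal Python sketch. It assumes the estimated variance of the sample mean is (s²/n)(N − n)/N, i.e. the sample variance over n multiplied by a finite population correction, and uses a large-sample normal interval. The function name, the data and N = 400 are hypothetical illustrations, not values from the notes.

```python
import math
from statistics import NormalDist, mean, variance

def srs_mean_estimate(sample, N, conf=0.95):
    """Point estimate, estimated variance and large-sample CI for the population mean."""
    n = len(sample)
    xbar = mean(sample)
    s2 = variance(sample)                      # sample variance (divisor n - 1)
    var_xbar = (s2 / n) * (N - n) / N          # includes the finite population correction
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half_width = z * math.sqrt(var_xbar)
    return xbar, var_xbar, (xbar - half_width, xbar + half_width)

# Hypothetical data: n = 40 observations drawn from a population of N = 400
sample = [12, 15, 11, 14, 13, 16, 10, 15, 14, 12] * 4
xbar, var_xbar, (low, high) = srs_mean_estimate(sample, N=400)
print(f"mean {xbar:.2f}, variance of mean {var_xbar:.4f}, 95% CI ({low:.2f}, {high:.2f})")
```

The correction factor (N − n)/N shrinks the variance as the sample covers more of the population; with n much smaller than N it is close to 1.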


Provided the sample size is large, 100(1 - α)% confidence intervals for the population mean are given by
$$\bar{x} \pm z_{\alpha/2}\,\hat{\sigma}_{\bar{x}}$$
where $\hat{\sigma}_{\bar{x}}$ is the square root of the estimated variance of the sample mean.

Estimating the Population Total
Consider a simple random sample of size n from a population of size N. The quantity to be estimated is the population total Nμ. An unbiased estimation procedure for the population total Nμ yields the point estimate $N\bar{x}$. An unbiased estimator of the variance of the population total is
$$N^2\hat{\sigma}^2_{\bar{x}} = \frac{s^2}{n}\,N(N-n)$$
Provided the sample size is large, a 100(1 - α)% confidence interval for the population total is
$$N\bar{x} \pm z_{\alpha/2}\,N\hat{\sigma}_{\bar{x}}$$

Stratified Sampling
Overview of stratified sampling:
• Divide the population into two or more subgroups (called strata) according to some common characteristic
• A simple random sample is selected from each subgroup
• Samples from the subgroups are combined into one

Stratified Random Sampling
Suppose that a population of N individuals can be subdivided into K mutually exclusive and collectively exhaustive groups, or strata. Stratified random sampling is the selection of independent simple random samples from each stratum of the population.

Let the K strata in the population contain N1, N2, . . ., NK members, so that
N1 + N2 + . . . + NK = N
Let the numbers in the samples be n1, n2, . . ., nK. Then the total number of sample members is
n1 + n2 + . . . + nK = n

Confidence Interval for Population Total: Example
A firm has a population of 1000 accounts and wishes to estimate the total population value. A sample of 80 accounts is selected, with an average balance of $87.6 and a standard deviation of $22.3. Find the 95% confidence interval estimate of the total balance.
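The accounts example (N = 1000 accounts, n = 80 sampled, average balance 87.6, standard deviation 22.3) can be worked through numerically. The sketch below assumes the variance of the estimated total is (s²/n)·N(N − n) and a large-sample normal interval; the function name is a hypothetical helper, not something defined in the notes.

```python
import math
from statistics import NormalDist

def total_confidence_interval(N, n, xbar, s, conf=0.95):
    """Estimate of the population total N*mu with a large-sample confidence interval."""
    total_hat = N * xbar
    var_total = (s ** 2 / n) * N * (N - n)     # N^2 times the estimated variance of xbar
    se_total = math.sqrt(var_total)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return total_hat, se_total, (total_hat - z * se_total, total_hat + z * se_total)

total_hat, se_total, (low, high) = total_confidence_interval(N=1000, n=80, xbar=87.6, s=22.3)
print(f"estimate {total_hat:.0f}, std. error {se_total:.2f}, 95% CI ({low:.0f}, {high:.0f})")
```

Under these assumptions the estimated total is 87,600 with a standard error of about 2,391, so the 95% interval runs from roughly 82,913 to 92,287.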
Estimating the Population Proportion
1. Let the true population proportion be P
2. Let $\hat{p}$ be the sample proportion from n observations from a simple random sample
3. The sample proportion, $\hat{p}$, is an unbiased estimator of the population proportion, P

An unbiased estimator for the variance of the sample proportion is
$$\hat{\sigma}^2_{\hat{p}} = \frac{\hat{p}(1-\hat{p})}{n-1}\cdot\frac{N-n}{N}$$
Provided the sample size is large, a 100(1 - α)% confidence interval for the population proportion is
$$\hat{p} \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{p}}$$

Estimation of the Population Mean, Stratified Random Sample
Let random samples of nj individuals be taken from strata containing Nj individuals (j = 1, 2, . . ., K). Denote the sample means and variances in the strata by $\bar{x}_j$ and $s_j^2$ and the overall population mean by μ. An unbiased estimator of the overall population mean μ is:
$$\bar{x}_{st} = \frac{1}{N}\sum_{j=1}^{K} N_j\bar{x}_j$$
An unbiased estimator for the variance of the overall population mean is
$$\hat{\sigma}^2_{\bar{x}_{st}} = \frac{1}{N^2}\sum_{j=1}^{K} N_j^2\,\hat{\sigma}^2_{\bar{x}_j}$$
where
$$\hat{\sigma}^2_{\bar{x}_j} = \frac{s_j^2}{n_j}\cdot\frac{N_j-n_j}{N_j}$$
Provided the sample size is large, a 100(1 - α)% confidence

interval for the population mean for stratified random samples is
$$\bar{x}_{st} \pm z_{\alpha/2}\,\hat{\sigma}_{\bar{x}_{st}}$$

Suppose that random samples of nj individuals from strata containing Nj individuals (j = 1, 2, . . ., K) are selected and that the quantity to be estimated is the population total, Nμ. An unbiased estimation procedure for the population total Nμ yields the point estimate
$$N\bar{x}_{st} = \sum_{j=1}^{K} N_j\bar{x}_j$$
An unbiased estimation procedure for the variance of the estimator of the population total yields the point estimate
$$N^2\hat{\sigma}^2_{\bar{x}_{st}} = \sum_{j=1}^{K} N_j^2\,\hat{\sigma}^2_{\bar{x}_j}$$
Provided the sample size is large, 100(1 - α)% confidence intervals for the population total for stratified random samples are obtained from
$$N\bar{x}_{st} \pm z_{\alpha/2}\,N\hat{\sigma}_{\bar{x}_{st}}$$

Estimation of the Population Proportion, Stratified Random Sample
Suppose that random samples of nj individuals from strata containing Nj individuals (j = 1, 2, . . ., K) are obtained. Let Pj be the population proportion, and $\hat{p}_j$ the sample proportion, in the jth stratum. If P is the overall population proportion, an unbiased estimation procedure for P yields
$$\hat{p}_{st} = \frac{1}{N}\sum_{j=1}^{K} N_j\hat{p}_j$$
An unbiased estimation procedure for the variance of the estimator of the overall population proportion is
$$\hat{\sigma}^2_{\hat{p}_{st}} = \frac{1}{N^2}\sum_{j=1}^{K} N_j^2\,\hat{\sigma}^2_{\hat{p}_j}$$
where
$$\hat{\sigma}^2_{\hat{p}_j} = \frac{\hat{p}_j(1-\hat{p}_j)}{n_j-1}\cdot\frac{N_j-n_j}{N_j}$$
is the estimate of the variance of the sample proportion in the jth stratum.
Provided the sample size is large, 100(1 - α)% confidence intervals for the population proportion for stratified random samples are obtained from
$$\hat{p}_{st} \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{p}_{st}}$$

Proportional Allocation: Sample Size
One way to allocate sampling effort is to make the proportion of sample members in any stratum the same as the proportion of population members in the stratum. If so, for the jth stratum,
$$\frac{n_j}{n} = \frac{N_j}{N}$$
The sample size for the jth stratum using proportional allocation is
$$n_j = \frac{N_j}{N}\cdot n$$

Optimal Allocation
To estimate an overall population mean or total, if the population variances in the individual strata are denoted σj², the most precise estimators are obtained with optimal allocation. The sample size for the jth stratum using optimal allocation is
$$n_j = \frac{N_j\sigma_j}{\sum_{i=1}^{K} N_i\sigma_i}\cdot n$$
To estimate the overall population proportion, estimators with the smallest possible variance are obtained by optimal allocation. The sample size for the jth stratum for a population proportion using optimal allocation is
$$n_j = \frac{N_j\sqrt{P_j(1-P_j)}}{\sum_{i=1}^{K} N_i\sqrt{P_i(1-P_i)}}\cdot n$$

Determining Sample Size
The sample size is directly related to the size of the variance of the population estimator. If the researcher sets the allowable size of the variance in advance, the necessary sample size can be determined.

Sample Size, Mean, Simple Random Sampling
Consider estimating the mean of a population of N members, which has variance σ². If the desired variance, $\sigma^2_{\bar{x}}$, of the sample mean is specified, the required sample size to estimate the population mean through simple random sampling is
$$n = \frac{N\sigma^2}{(N-1)\sigma^2_{\bar{x}} + \sigma^2}$$
Often it is more convenient to specify directly the desired width of the confidence interval for the population mean rather than $\sigma^2_{\bar{x}}$. Thus the researcher specifies the desired margin of error for the mean. Calculations are simple since, for example, a 95% confidence interval for the population mean will extend an approximate amount $1.96\,\sigma_{\bar{x}}$ on each side of the sample mean, $\bar{x}$.

Required Sample Size Example
2000 items are in a population. If σ = 45, what sample size is needed to estimate the mean within ± 5 with 95% confidence?
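The required-sample-size example (N = 2000, σ = 45, margin ± 5, 95% confidence) can be checked with a short Python sketch. It assumes the finite-population formula n = Nσ² / ((N − 1)σ²ₓ̄ + σ²), with the desired standard error obtained as margin / 1.96, and rounds up to the next whole unit (the rounding rule is my assumption). The analogous calculation for a proportion, with P(1 − P) in place of σ², is included for the voter example that follows in the notes (N = 3400, margin ± 3%). Both function names are hypothetical.

```python
import math

def n_for_mean(N, sigma, margin, z=1.96):
    """Required n so that the CI for the mean extends `margin` on each side."""
    var_xbar = (margin / z) ** 2                 # desired variance of the sample mean
    n = N * sigma ** 2 / ((N - 1) * var_xbar + sigma ** 2)
    return math.ceil(n)

def n_for_proportion(N, margin, z=1.96, P=0.5):
    """Required n for a proportion; P = 0.5 gives the largest (most cautious) n."""
    var_p = (margin / z) ** 2                    # desired variance of the sample proportion
    pq = P * (1 - P)
    return math.ceil(N * pq / ((N - 1) * var_p + pq))

print(n_for_mean(N=2000, sigma=45, margin=5))    # items needed for the mean example
print(n_for_proportion(N=3400, margin=0.03))     # voters needed for the proportion example
```

With these assumptions the mean example needs 270 items and the voter example needs 813 voters.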

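Sample Size, Proportion, Simple Random Sampling
Consider estimating the proportion P of individuals in a population of size N who possess a certain attribute. If the desired variance, $\sigma^2_{\hat{p}}$, of the sample proportion is specified, the required sample size to estimate the population proportion through simple random sampling is
$$n = \frac{NP(1-P)}{(N-1)\sigma^2_{\hat{p}} + P(1-P)}$$
The largest possible value for this expression occurs when P = 0.5, so that P(1 - P) = 0.25:
$$n_{max} = \frac{0.25\,N}{(N-1)\sigma^2_{\hat{p}} + 0.25}$$
A 95% confidence interval for the population proportion will extend an approximate amount $1.96\,\sigma_{\hat{p}}$ on each side of the sample proportion.

How large a sample would be necessary to estimate the true proportion of voters who will vote for proposition A, within ±3%, with 95% confidence, from a population of 3400 voters?

Solution:
N = 3400
For 95% confidence, use z = 1.96
The margin of error is 1.96 σ = 0.03 for the sample proportion, so its standard deviation is 0.03/1.96 ≈ 0.0153. Substituting into the expression for the largest sample size gives about 813 voters after rounding up.

Sample Size, Mean, Stratified Sampling
Suppose that a population of N members is subdivided into K strata containing N1, N2, . . ., NK members. Let σj² denote the population variance in the jth stratum. An estimate of the overall population mean is desired. If the desired variance, $\sigma^2_{\bar{x}_{st}}$, of the sample estimator is specified, the required total sample size, n, can be found.
For proportional allocation:
$$n = \frac{\sum_{j=1}^{K} N_j\sigma_j^2}{N\sigma^2_{\bar{x}_{st}} + \frac{1}{N}\sum_{j=1}^{K} N_j\sigma_j^2}$$
For optimal allocation:
$$n = \frac{\frac{1}{N}\left(\sum_{j=1}^{K} N_j\sigma_j\right)^2}{N\sigma^2_{\bar{x}_{st}} + \frac{1}{N}\sum_{j=1}^{K} N_j\sigma_j^2}$$

Cluster Sampling
The population is divided into several "clusters," each representative of the population. A simple random sample of clusters is selected. Generally, all items in the selected clusters are examined. An alternative is to choose items from the selected clusters using another probability sampling technique.

A population is subdivided into M clusters, a simple random sample of m of these clusters is selected, and information is obtained from every member of the sampled clusters.
1. Let n1, n2, . . ., nm denote the numbers of members in the m sampled clusters.
2. Denote the means of these clusters by $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_m$
3. Denote the proportions of cluster members possessing an attribute of interest by P1, P2, . . ., Pm

The objective is to estimate the overall population mean μ and proportion P. Unbiased estimation procedures give:
$$\bar{x}_c = \frac{\sum_{i=1}^{m} n_i\bar{x}_i}{\sum_{i=1}^{m} n_i}, \qquad \hat{p}_c = \frac{\sum_{i=1}^{m} n_iP_i}{\sum_{i=1}^{m} n_i}$$
Estimates of the variance of these estimators, following from unbiased estimation procedures, are
$$\hat{\sigma}^2_{\bar{x}_c} = \frac{M-m}{Mm\bar{n}^2}\cdot\frac{\sum_{i=1}^{m} n_i^2(\bar{x}_i-\bar{x}_c)^2}{m-1}, \qquad \hat{\sigma}^2_{\hat{p}_c} = \frac{M-m}{Mm\bar{n}^2}\cdot\frac{\sum_{i=1}^{m} n_i^2(P_i-\hat{p}_c)^2}{m-1}$$
where $\bar{n} = \frac{1}{m}\sum_{i=1}^{m} n_i$ is the average number of individuals in the sampled clusters.
Provided the sample size is large, 100(1 - α)% confidence intervals using cluster sampling are
for the population mean
$$\bar{x}_c \pm z_{\alpha/2}\,\hat{\sigma}_{\bar{x}_c}$$
for the population proportion
$$\hat{p}_c \pm z_{\alpha/2}\,\hat{\sigma}_{\hat{p}_c}$$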
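A small Python sketch may help with the cluster-sampling estimator of the mean. It assumes the size-weighted mean of the cluster means as the point estimate and the variance estimate (M − m)/(M·m·n̄²) · Σ nᵢ²(x̄ᵢ − x̄c)²/(m − 1), where n̄ is the average sampled cluster size. The function name and all numbers are hypothetical; with only four clusters the large-sample interval is shown purely for illustration.

```python
import math
from statistics import NormalDist

def cluster_mean_interval(M, sizes, means, conf=0.95):
    """Cluster-sample estimate of the population mean with a large-sample CI."""
    m = len(sizes)                                   # number of sampled clusters
    total_n = sum(sizes)
    n_bar = total_n / m                              # average sampled cluster size
    x_c = sum(n * x for n, x in zip(sizes, means)) / total_n
    ss = sum(n ** 2 * (x - x_c) ** 2 for n, x in zip(sizes, means))
    var_c = (M - m) / (M * m * n_bar ** 2) * ss / (m - 1)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half_width = z * math.sqrt(var_c)
    return x_c, var_c, (x_c - half_width, x_c + half_width)

# Hypothetical data: m = 4 clusters sampled out of M = 20
sizes = [10, 20, 30, 40]
means = [5.0, 6.0, 7.0, 8.0]
x_c, var_c, (low, high) = cluster_mean_interval(M=20, sizes=sizes, means=means)
print(f"cluster estimate {x_c:.2f}, 95% CI ({low:.2f}, {high:.2f})")
```

The same function works for a proportion if the cluster proportions are passed in place of the cluster means.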


Two-Phase Sampling
Sometimes sampling is done in two steps. An initial pilot sample can be done.

Disadvantage:
• Takes more time

Advantages:
1. Can adjust survey questions if problems are noted
2. Additional questions may be identified
3. Initial estimates of response rate or population parameters can be obtained

Non-Probability Samples
It may be simpler or less costly to use a non-probability based sampling method:
1. Judgement sample
2. Quota sample
3. Convenience sample
These methods may still produce good estimates of population parameters, but they:
• Are more subject to bias
• Offer no valid way to determine reliability

SESSION 2

HYPOTHESIS TESTING

What is a Hypothesis?
A hypothesis is a claim (assumption) about a population parameter:
• Population mean. Example: The mean monthly cell phone bill of this city is μ = $42
• Population proportion. Example: The proportion of adults in this city with cell phones is p = .68

The Null Hypothesis, H0
• States the assumption (numerical) to be tested. Example: The average number of TV sets in U.S. homes is equal to three (H0: μ = 3)
• Is always about a population parameter, not about a sample statistic
• Begins with the assumption that the null hypothesis is true, similar to the notion of innocent until proven guilty
• Refers to the status quo
• Always contains the "=", "≤" or "≥" sign
• May or may not be rejected

The Alternative Hypothesis, H1
• Is the opposite of the null hypothesis, e.g. the average number of TV sets in U.S. homes is not equal to 3 (H1: μ ≠ 3)
• Challenges the status quo
• Never contains the "=", "≤" or "≥" sign
• May or may not be supported
• Is generally the hypothesis that the researcher is trying to support

Hypothesis Testing Process

Reason for Rejecting H0

Level of Significance, α
• Defines the unlikely values of the sample statistic if the null hypothesis is true (defines the rejection region of the sampling distribution)
• Is designated by α (level of significance). Typical values are .01, .05, or .10
• Is selected by the researcher at the beginning
• Provides the critical value(s) of the test

Level of Significance and the Rejection Region


Errors in Making Decisions

Type I Error
• Reject a true null hypothesis
• Considered a serious type of error
• The probability of Type I Error is α
  - Called the level of significance of the test
  - Set by the researcher in advance

Type II Error
• Fail to reject a false null hypothesis
• The probability of Type II Error is β

Outcomes and Probabilities

Type I & II Error Relationship
Type I and Type II errors cannot happen at the same time:
• Type I error can only occur if H0 is true
• Type II error can only occur if H0 is false
If the Type I error probability (α) ↑, then the Type II error probability (β) ↓

Factors Affecting Type II Error
All else equal:
• β ↑ when the difference between the hypothesized parameter and its true value ↓
• β ↑ when α ↓
• β ↑ when σ ↑
• β ↑ when n ↓

Power of the Test
The power of a test is the probability of rejecting a null hypothesis that is false, i.e., Power = P(Reject H0 | H1 is true). The power of the test increases as the sample size increases.

Hypothesis Tests for the Mean

Test of Hypothesis for the Mean (σ Known)
Convert the sample result ($\bar{x}$) to a z value:
$$z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}$$
The decision rule is:
Reject H0 if $z > z_{\alpha}$ (for an upper-tail test of H0: μ = μ0 against H1: μ > μ0)

p-Value Approach to Testing
p-value: Probability of obtaining a test statistic more extreme (≤ or ≥) than the observed sample value, given H0 is true.
• Also called the observed level of significance
• Smallest value of α for which H0 can be rejected
Steps:
• Convert the sample result (e.g., $\bar{x}$) to a test statistic (e.g., z statistic)
• Obtain the p-value. For an upper-tail test:
$$p\text{-value} = P\left(Z > \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} \,\middle|\, H_0 \text{ is true}\right)$$
• Decision rule: compare the p-value to α
  - If p-value < α, reject H0
  - If p-value ≥ α, do not reject H0

Example: Upper-Tail Z Test for Mean (σ Known)
A phone industry manager thinks that customer monthly cell phone bills have increased, and now average over $52 per month. The company wishes to test this claim. (Assume σ = 10 is known.)

Form hypothesis test:
H0: μ ≤ 52 (the average is not over $52 per month)
H1: μ > 52 (the average is greater than $52 per month, i.e., sufficient evidence exists to support the manager's claim)

Example: Find Rejection Region
Suppose that α = .10 is chosen for this test.
Find the rejection region: reject H0 if z > 1.28


Example: Sample Results
Obtain a sample and compute the test statistic. Suppose a sample is taken with the following results: n = 64, $\bar{x}$ = 53.1 (σ = 10 was assumed known). Using the sample results,
$$z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} = \frac{53.1-52}{10/\sqrt{64}} = 0.88$$

Example: Decision
Reach a decision and interpret the result: since z = 0.88 < 1.28, do not reject H0. There is not sufficient evidence that the mean bill is over 52 dollars.

Example: p-Value Solution
Calculate the p-value and compare to α (assuming that μ = 52.0):
p-value = P(z ≥ 0.88) = .1894
Since .1894 > α = .10, do not reject H0.

One-Tail Tests
In many cases, the alternative hypothesis focuses on one particular direction.
H0: μ ≥ 3
H1: μ < 3
This is a lower-tail test since the alternative hypothesis is focused on the lower tail below the mean of 3.
H0: μ ≤ 3
H1: μ > 3
This is an upper-tail test since the alternative hypothesis is focused on the upper tail above the mean of 3.

Upper-Tail Tests
There is only one critical value, since the rejection area is in only one tail.
H0: μ ≤ 3
H1: μ > 3

Lower-Tail Tests
There is only one critical value, since the rejection area is in only one tail.
H0: μ ≥ 3
H1: μ < 3

Two-Tail Tests
In some settings, the alternative hypothesis does not specify a unique direction. There are two critical values, defining the two regions of rejection.
H0: μ = 3
H1: μ ≠ 3
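The upper-tail, lower-tail and two-tail cases differ only in which tail area counts as "more extreme". The Python sketch below wraps the three cases in one hypothetical helper, `z_test_mean`, and applies it to the phone-bill example (H0: μ ≤ 52, σ = 10, n = 64, sample mean 53.1, α = .10). It assumes σ is known and uses the standard normal distribution from the Python standard library.

```python
import math
from statistics import NormalDist

def z_test_mean(xbar, mu0, sigma, n, alpha, tail):
    """z test for a mean with sigma known; tail is 'upper', 'lower' or 'two'."""
    std_normal = NormalDist()
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    if tail == "upper":
        p_value = 1 - std_normal.cdf(z)              # area above z
    elif tail == "lower":
        p_value = std_normal.cdf(z)                  # area below z
    elif tail == "two":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))   # both tails
    else:
        raise ValueError("tail must be 'upper', 'lower' or 'two'")
    return z, p_value, p_value < alpha               # True in last slot means reject H0

z, p_value, reject = z_test_mean(xbar=53.1, mu0=52, sigma=10, n=64, alpha=0.10, tail="upper")
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

For the phone-bill data the p-value (about .19) exceeds α = .10, so H0 is not rejected, in line with the critical-value approach.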

Hypothesis Testing Example
Test the claim that the true mean number of TV sets in US homes is equal to 3. (Assume σ = 0.8)

State the appropriate null and alternative hypotheses:
H0: μ = 3, H1: μ ≠ 3 (this is a two-tailed test)
Specify the desired level of significance:
Suppose that α = .05 is chosen for this test
Choose a sample size:
Suppose a sample of size n = 100 is selected
Determine the appropriate technique:
σ is known, so this is a z test
Set up the critical values:
For α = .05 the critical z values are ±1.96
Collect the data and compute the test statistic:
Suppose the sample results are n = 100, $\bar{x}$ = 2.84 (σ = 0.8 is assumed known). So the test statistic is:
$$z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} = \frac{2.84-3}{0.8/\sqrt{100}} = \frac{-0.16}{0.08} = -2.0$$
Is the test statistic in the rejection region?
Reject H0 if z < -1.96 or z > 1.96; otherwise do not reject H0
Reach a decision and interpret the result:
Since z = -2.0 < -1.96, we reject the null hypothesis and conclude that there is sufficient evidence that the mean number of TVs in US homes is not equal to 3

Example: p-Value
How likely is it to see a sample mean of 2.84 (or something further from the mean, in either direction) if the true mean is μ = 3.0?
p-value = P(z < -2.0) + P(z > 2.0) = .0228 + .0228 = .0456
Compare the p-value with α:
• If p-value < α, reject H0
• If p-value ≥ α, do not reject H0
Here: p-value = .0456, α = .05. Since .0456 < .05, we reject the null hypothesis.

t Test of Hypothesis for the Mean (σ Unknown)
For a two-tailed test, the decision rule is:
Reject H0 if $t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}$ is less than $-t_{n-1,\alpha/2}$ or greater than $t_{n-1,\alpha/2}$

Example: Two-Tail Test (σ Unknown)
The average cost of a hotel room in New York is said to be $168 per night. A random sample of 25 hotels resulted in a sample mean of $172.50 and a sample standard deviation of $15.40. Test at the α = 0.05 level. (Assume the population distribution is normal.)
H0: μ = 168
H1: μ ≠ 168
σ is unknown, so use a t statistic.
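The two examples above (the TV-sets z test with σ known, and the hotel-room t test with σ unknown) reduce to a few lines of arithmetic. The sketch below is an illustration only: the critical value 2.0639 for t with 24 degrees of freedom is taken from the worked solution in the notes rather than computed, because the Python standard library has no t distribution.

```python
import math
from statistics import NormalDist

# Two-tail z test, sigma known: TV sets example (n = 100, mean 2.84, sigma 0.8)
z = (2.84 - 3.0) / (0.8 / math.sqrt(100))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # tables round this to .0456
reject_tv = p_value < 0.05

# Two-tail t test, sigma unknown: hotel example (n = 25, mean 172.50, s = 15.40)
t = (172.50 - 168) / (15.40 / math.sqrt(25))
t_crit = 2.0639                                  # t(24, .025), quoted in the notes
reject_hotel = abs(t) > t_crit

print(f"TV sets: z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_tv}")
print(f"Hotel:   t = {t:.2f}, critical value = ±{t_crit}, reject H0: {reject_hotel}")
```

The z test rejects H0 for the TV-sets data, while the hotel t statistic (about 1.46) stays inside ±2.0639, so H0: μ = 168 is not rejected.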


Critical Value: t24, .025 = ± 2.0639
Test Statistic:
$$t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}} = \frac{172.50-168}{15.40/\sqrt{25}} = 1.46$$
Do not reject H0: there is not sufficient evidence that the true mean cost is different from 168 dollars.

Tests of the Population Proportion
• Involves categorical variables
• Two possible outcomes
  - "Success" (a certain characteristic is present)
  - "Failure" (the characteristic is not present)
• Fraction or proportion of the population in the "success" category is denoted by P
• Assume sample size is large

Proportions
Sample proportion in the success category is denoted by $\hat{p}$:
$$\hat{p} = \frac{x}{n} = \frac{\text{number of successes in sample}}{\text{sample size}}$$
When nP(1 – P) > 9, $\hat{p}$ can be approximated by a normal distribution with mean and standard deviation
$$\mu_{\hat{p}} = P, \qquad \sigma_{\hat{p}} = \sqrt{\frac{P(1-P)}{n}}$$

Hypothesis Tests for Proportions
The test statistic is a z value:
$$z = \frac{\hat{p}-P_0}{\sqrt{P_0(1-P_0)/n}}$$

Example: Z Test for Proportion
A marketing company claims that it receives 8% responses from its mailing. To test this claim, a random sample of 500 were surveyed with 25 responses. Test at the α = .05 significance level.
Check:
Our approximation for P is $\hat{p}$ = 25/500 = .05
nP(1 - P) = (500)(.05)(.95) = 23.75 > 9

Z Test for Proportion: Solution
H0: P = .08
H1: P ≠ .08
α = .05
n = 500, $\hat{p}$ = .05
Critical Values: ± 1.96
Test Statistic:
$$z = \frac{\hat{p}-P_0}{\sqrt{P_0(1-P_0)/n}} = \frac{.05-.08}{\sqrt{.08(1-.08)/500}} = -2.47$$
Decision: Reject H0 at α = .05
Conclusion: There is sufficient evidence to reject the company's claim of an 8% response rate.

p-Value Solution
Calculate the p-value and compare to α (for a two-sided test the p-value is always two sided):
p-value = P(z ≤ -2.47) + P(z ≥ 2.47) = 2(.0068) = .0136
Reject H0 since p-value = .0136 < α = .05

Power of the Test
Recall the possible hypothesis test outcomes:


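Type II Error
Assume the population is normal and the population variance is known. Consider the test
H0: μ = μ0
H1: μ > μ0
The decision rule is:
Reject H0 if $z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} > z_{\alpha}$, or equivalently if $\bar{x} > \bar{x}_c = \mu_0 + z_{\alpha}\,\sigma/\sqrt{n}$
If the null hypothesis is false and the true mean is μ*, then the probability of type II error is
$$\beta = P(\bar{x} < \bar{x}_c \mid \mu = \mu^*) = P\left(z < \frac{\bar{x}_c-\mu^*}{\sigma/\sqrt{n}}\right)$$

Type II Error Example
β is the probability of a Type II error, that is, of failing to reject a false H0. Suppose we fail to reject H0: μ ≥ 52 when in fact the true mean is μ* = 50.

Power of the Test Example
If the true mean is μ* = 50:
• The probability of Type II Error = β = 0.1539
• The power of the test = 1 – β = 1 – 0.1539 = 0.8461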
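The value β = 0.1539 can be reproduced approximately in Python. The sketch assumes the lower-tail test H0: μ ≥ 52 against H1: μ < 52 with n = 64, σ = 6 and α = .05 (the figures given under "Calculating β" in these notes) and a true mean of 50. The function name is hypothetical. The result differs slightly from the tabulated 0.1539 because tables round z to 1.02.

```python
import math
from statistics import NormalDist

def type_ii_error_lower_tail(mu0, mu_true, sigma, n, alpha):
    """beta for H0: mu >= mu0 vs H1: mu < mu0 when the true mean is mu_true."""
    std_normal = NormalDist()
    se = sigma / math.sqrt(n)
    x_crit = mu0 - std_normal.inv_cdf(1 - alpha) * se    # reject H0 if xbar <= x_crit
    beta = 1 - std_normal.cdf((x_crit - mu_true) / se)   # P(xbar > x_crit | mu_true)
    return x_crit, beta

x_crit, beta = type_ii_error_lower_tail(mu0=52, mu_true=50, sigma=6, n=64, alpha=0.05)
power = 1 - beta
print(f"critical mean = {x_crit:.3f}, beta = {beta:.4f}, power = {power:.4f}")
```

Raising n or α in the call lowers β, which matches the factors affecting Type II error listed earlier in the session.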

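Calculating β
Suppose we do not reject H0: μ ≥ 52 when in fact the true mean is μ* = 50. Suppose n = 64, σ = 6, and α = .05. H0 is rejected when the sample mean falls below
$$\bar{x}_c = \mu_0 - z_{\alpha}\frac{\sigma}{\sqrt{n}} = 52 - 1.645\cdot\frac{6}{\sqrt{64}} = 50.766$$
so that
$$\beta = P(\bar{x} \geq 50.766 \mid \mu = 50) = P\left(z \geq \frac{50.766-50}{6/\sqrt{64}}\right) = P(z \geq 1.02) = .1539$$

SESSION 3

TWO-SAMPLE TEST OF HYPOTHESIS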


Matched Pairs
Tests Means of 2 Related Populations
1. Paired or matched samples
2. Repeated measures (before/after)
3. Use difference between paired values: di = xi - yi
Assumptions: Both populations are normally distributed

The test statistic for the mean difference is a t value, with n – 1 degrees of freedom:
$$t = \frac{\bar{d}-D_0}{s_d/\sqrt{n}}$$
where
$\bar{d}$ = sample mean of the differences
D0 = hypothesized mean difference
sd = sample standard deviation of the differences
n = the sample size (number of pairs)

Decision Rules: Matched Pairs

Matched Pairs Example
Assume you send your salespeople to a "customer service" training workshop. Has the training made a difference in the number of complaints? You collect the following data:

Matched Pairs: Solution
Has the training made a difference in the number of complaints (at the α = 0.01 level)?
H0: μx – μy = 0
H1: μx – μy ≠ 0
d.f. = n - 1 = 4
Critical Value = ± 4.604
Test Statistic:
Decision: Do not reject H0 (the t statistic is not in the rejection region)
Conclusion: There is not a significant change in the number of complaints.

Difference Between Two Means
Population means, independent samples
Goal: Form a confidence interval for the difference between two population means, μx – μy
Different data sources:
1. Unrelated
2. Independent: a sample selected from one population has no effect on the sample selected from the other population


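σx² and σy² Known
Assumptions:
1. Samples are randomly and independently drawn
2. Both population distributions are normal
3. Population variances are known

When σx² and σy² are known and both populations are normal, the variance of $\bar{X}-\bar{Y}$ is
$$\sigma^2_{\bar{X}-\bar{Y}} = \frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}$$
and the random variable
$$Z = \frac{(\bar{x}-\bar{y}) - (\mu_x-\mu_y)}{\sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}}$$
has a standard normal distribution.

Test Statistic, σx² and σy² Known
The test statistic for μx – μy is:
$$z = \frac{(\bar{x}-\bar{y}) - D_0}{\sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}}$$
where D0 is the hypothesized difference μx – μy under H0.

Two Population Means, Independent Samples, Variances Known: Decision rules

σx² and σy² Unknown, Assumed Equal
Assumptions:
1. Samples are randomly and independently drawn
2. Populations are normally distributed
3. Population variances are unknown but assumed equal
Forming interval estimates:
The population variances are assumed equal, so use the two sample standard deviations and pool them to estimate σ. Use a t value with (nx + ny – 2) degrees of freedom.
The test statistic for μx – μy is:
$$t = \frac{(\bar{x}-\bar{y}) - D_0}{\sqrt{s_p^2\left(\frac{1}{n_x} + \frac{1}{n_y}\right)}}$$
where t has (nx + ny – 2) d.f., and
$$s_p^2 = \frac{(n_x-1)s_x^2 + (n_y-1)s_y^2}{n_x+n_y-2}$$

σx² and σy² Unknown, Assumed Unequal
Assumptions:
1. Samples are randomly and independently drawn
2. Populations are normally distributed
3. Population variances are unknown and assumed unequal
Forming interval estimates:
The population variances are assumed unequal, so a pooled variance is not appropriate. Use a t value with v degrees of freedom, where
$$v = \frac{\left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^2}{\frac{(s_x^2/n_x)^2}{n_x-1} + \frac{(s_y^2/n_y)^2}{n_y-1}}$$
The test statistic for μx – μy is:
$$t = \frac{(\bar{x}-\bar{y}) - D_0}{\sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}}}$$

Two Population Means, Independent Samples, Variances Unknown

Pooled Variance t Test: Example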

You are a financial analyst for a brokerage firm. Is there a difference in dividend yield between stocks listed on the NYSE & NASDAQ? You collect the following data:

Assuming both populations are approximately normal with equal variances, is there a difference in average yield (α = 0.05)?
Calculating the Test Statistic
The test statistic is:
$$ t = \frac{(\bar{x}_1-\bar{x}_2) - (\mu_1-\mu_2)}{\sqrt{s_p^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} $$
Solution
H0: μ1 - μ2 = 0, i.e. (μ1 = μ2)
H1: μ1 - μ2 ≠ 0, i.e. (μ1 ≠ μ2)
α = 0.05
df = 21 + 25 - 2 = 44
Critical Values: t = ± 2.0154
Test Statistic:

Decision: Reject H0 at α = 0.05
Conclusion: There is evidence of a difference in means.

Two Population Proportions
Goal: Test hypotheses for the difference between two population proportions, Px – Py
Assumptions: Both sample sizes are large, nP(1 – P) > 9
Test Statistic for Two Population Proportions
The test statistic for H0: Px – Py = 0 is a z value:
$$ z = \frac{\hat{p}_x-\hat{p}_y}{\sqrt{\frac{\hat{p}_0(1-\hat{p}_0)}{n_x}+\frac{\hat{p}_0(1-\hat{p}_0)}{n_y}}} $$
where
$$ \hat{p}_0 = \frac{n_x\hat{p}_x+n_y\hat{p}_y}{n_x+n_y} $$
Decision Rules: Population Proportions

Example: Two Population Proportions
Is there a significant difference between the proportion of men and the proportion of women who will vote Yes on Proposition A? In a random sample, 36 of 72 men and 31 of 50 women indicated they would vote Yes. Test at the .05 level of significance.
The hypothesis test is:
H0: PM – PW = 0 (the two proportions are equal)
H1: PM – PW ≠ 0 (there is a significant difference between proportions)
The sample proportions are:
Men: 36/72 = .50
Women: 31/50 = .62
The estimate for the common overall proportion is:
$$ \hat{p}_0 = \frac{36+31}{72+50} = \frac{67}{122} = .549 $$
The test statistic for PM – PW = 0 is:
$$ z = \frac{.50-.62}{\sqrt{.549(1-.549)\left(\frac{1}{72}+\frac{1}{50}\right)}} = -1.31 $$
Critical values = ±1.96 for α = .05
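The pooled z statistic for the Proposition A example can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `two_proportion_z_test` is an illustrative choice, not something defined in these notes, and the counts (36 of 72 men, 31 of 50 women) are the ones quoted in the example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled z test of H0: P1 - P2 = 0.

    x1, x2 are the numbers of "successes"; n1, n2 are the sample sizes.
    Returns (z, two-tailed p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    # Common overall proportion estimated from both samples combined
    p0 = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Men: 36 of 72 vote Yes; women: 31 of 50 vote Yes
z, p_value = two_proportion_z_test(36, 72, 31, 50)
print(round(z, 2))  # -1.31
```

Because |z| is smaller than the two-tailed critical value 1.96, the sketch leads to the same decision as the example: do not reject H0 at α = .05.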

Decision: Do not reject H0
Conclusion: There is not significant evidence of a difference between men and women in the proportions who will vote yes.

Hypothesis Tests of One Population Variance
Goal: Test hypotheses about the population variance, σ²
If the population is normally distributed,
$$ \chi^2_{n-1} = \frac{(n-1)s^2}{\sigma^2} $$
follows a chi-square distribution with (n – 1) degrees of freedom
Confidence Intervals for the Population Variance
$$ \frac{(n-1)s^2}{\chi^2_{n-1,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,1-\alpha/2}} $$
The test statistic for hypothesis tests about one population variance is
$$ \chi^2_{n-1} = \frac{(n-1)s^2}{\sigma_0^2} $$
Decision Rules: Population Variance

Hypothesis Tests for Two Variances (F test statistics)
Goal: Test hypotheses about two population variances
The two populations are assumed to be independent and normally distributed. The random variable
$$ F = \frac{s_x^2/\sigma_x^2}{s_y^2/\sigma_y^2} $$
has an F distribution with (nx – 1) numerator degrees of freedom and (ny – 1) denominator degrees of freedom. Denote an F value with ν1 numerator and ν2 denominator degrees of freedom by F(ν1, ν2). The test statistic for a hypothesis test about two population variances is:
$$ F = \frac{s_x^2}{s_y^2} $$
where F has (nx – 1) numerator degrees of freedom and (ny – 1) denominator degrees of freedom
Decision Rules: Two Variances
Use sx² to denote the larger variance.
Rejection region for a two-tail test is:
$$ \text{Reject } H_0 \text{ if } F > F_{n_x-1,\,n_y-1,\,\alpha/2} $$
where sx² is the larger of the two sample variances

Example: F Test
You are a financial analyst for a brokerage firm. You want to compare dividend yields between stocks listed on the NYSE & NASDAQ. You collect the following data:

Is there a difference in the variances between the NYSE & NASDAQ at the α = 0.10 level?
F Test: Example Solution
Form the hypothesis test:
H0: σx² = σy² (there is no difference between variances)
H1: σx² ≠ σy² (there is a difference between variances)
Find the F critical values for α = .10/2
Degrees of Freedom:
Numerator: nx – 1 = 21 – 1 = 20 d.f. (NYSE has the larger standard deviation)

Denominator: ny – 1 = 25 – 1 = 24 d.f.
The statistic is:
$$ F = \frac{s_x^2}{s_y^2} = 1.256 $$
F = 1.256 is not in the rejection region, so we do not reject H0
Conclusion: There is not sufficient evidence of a difference in variances at α = .10

SESSION 4 - 6

ANALYSIS OF VARIANCE (ANOVA)

One-Way Analysis of Variance
Evaluate the difference among the means of three or more groups. Examples: average production for 1st, 2nd, and 3rd shift; expected mileage for five brands of tires
Assumptions
1. Populations are normally distributed
2. Populations have equal variances
3. Samples are randomly and independently drawn
Hypotheses of One-Way ANOVA
H0: All population means are equal, i.e., no variation in means between groups
H1: At least one population mean is different, i.e., there is variation between groups. This does not mean that all population means are different (some pairs may be the same)
One-Way ANOVA
All means are the same: the null hypothesis is true (no variation between groups)
At least one mean is different: the null hypothesis is NOT true (variation is present between groups)
Variability
The variability of the data is a key factor in testing the equality of means. In each case below, the means may look different, but a large variation within groups in B makes the evidence that the means are different weak
Partitioning the Variation
Total variation can be split into two parts:
SST = SSW + SSG
SST = Total Sum of Squares
Total Variation = the aggregate dispersion of the individual data values across the various groups
SSW = Sum of Squares Within Groups
Within-Group Variation = dispersion that exists among the data values within a particular group
SSG = Sum of Squares Between Groups
Between-Group Variation = dispersion between the group sample means

Total Sum of Squares (SST)
$$ SST = \sum_{i=1}^{K}\sum_{j=1}^{n_i}(x_{ij}-\bar{x})^2 $$
Where:
SST = Total sum of squares
K = number of groups (levels or treatments)
ni = number of observations in group i
xij = jth observation from group i
x̄ = overall sample mean
Total variation

Between-Group Variation
$$ SSG = \sum_{i=1}^{K} n_i(\bar{x}_i-\bar{x})^2 $$
Where:
SSG = Sum of squares between groups
K = number of groups
ni = sample size from group i
x̄i = sample mean from group i
x̄ = grand mean (mean of all data values)
Variation Due to Differences Between Groups
Mean Square Between Groups = SSG / degrees of freedom
$$ MSG = \frac{SSG}{K-1} $$

Within-Group Variation
$$ SSW = \sum_{i=1}^{K}\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2 $$
Where:
SSW = Sum of squares within groups
K = number of groups
ni = sample size from group i
x̄i = sample mean from group i
xij = jth observation in group i
Summing the variation within each group and then adding over all groups
Mean Square Within = SSW / degrees of freedom
$$ MSW = \frac{SSW}{n-K} $$


Obtaining the Mean Squares
$$ MSG = \frac{SSG}{K-1}, \qquad MSW = \frac{SSW}{n-K} $$
K = number of groups
n = sum of the sample sizes from all groups
df = degrees of freedom
One-Factor ANOVA (F Test Statistic)
H0: μ1 = μ2 = … = μK
H1: At least two population means are different
Test statistic
$$ F = \frac{MSG}{MSW} $$
MSG is mean squares between variances
MSW is mean squares within variances
Degrees of freedom
df1 = K – 1 (K = number of groups)
df2 = n – K (n = sum of sample sizes from all groups)
Interpreting the F Statistic
The F statistic is the ratio of the between estimate of variance and the within estimate of variance. The ratio must always be positive
• df1 = K – 1 will typically be small
• df2 = n – K will typically be large
Decision Rule: Reject H0 if F > F(K–1, n–K, α)

One-Factor ANOVA F Test Example
You want to see if three different golf clubs yield different distances. You randomly select five measurements from trials on an automated driving machine for each club. At the .05 significance level, is there a difference in mean distance?

Club 1   Club 2   Club 3
254      234      200
263      218      222
241      235      197
237      227      206
251      216      204
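The sums of squares and the F ratio for the golf club data can be reproduced with a few lines of code. This is a minimal sketch in plain Python; the function name `one_way_anova` is an illustrative choice and is not part of the notes.

```python
def one_way_anova(groups):
    """Return (SSG, SSW, F) for a list of samples, one list per group."""
    n = sum(len(g) for g in groups)          # total number of observations
    k = len(groups)                          # number of groups
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Between-group variation: group means around the grand mean
    ssg = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variation: observations around their own group mean
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    msg = ssg / (k - 1)
    msw = ssw / (n - k)
    return ssg, ssw, msg / msw


clubs = [
    [254, 263, 241, 237, 251],  # Club 1
    [234, 218, 235, 227, 216],  # Club 2
    [200, 222, 197, 206, 204],  # Club 3
]
ssg, ssw, f_stat = one_way_anova(clubs)
print(round(ssg, 1), round(ssw, 1), round(f_stat, 3))  # 4716.4 1119.6 25.275
```

With df1 = 2 and df2 = 12 the tabulated 5% critical value of F is about 3.89, so an F ratio of roughly 25.3 falls well inside the rejection region.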


One-Factor ANOVA Example Solution
H0: μ1 = μ2 = μ3
H1: μi not all equal
α = .05
df1 = 2
df2 = 12
Test Statistic:
$$ F = \frac{MSG}{MSW} = \frac{2358.2}{93.3} = 25.275 $$
Decision: Reject H0 at α = 0.05
Conclusion: There is evidence that at least one μi differs from the rest

ANOVA -- Single Factor: Excel Output
EXCEL: tools | data analysis | ANOVA: single factor

Kruskal-Wallis Test
Use when the normality assumption for one-way ANOVA is violated
Assumptions:
1. The samples are random and independent
2. Variables have a continuous distribution
3. The data can be ranked
4. Populations have the same variability
5. Populations have the same shape
Kruskal-Wallis Test Procedure
1. Obtain relative rankings for each value. In the event of a tie, each of the tied values gets the average rank
2. Sum the rankings for data from each of the K groups
3. Compute the Kruskal-Wallis test statistic
4. Evaluate using the chi-square distribution with K – 1 degrees of freedom
The Kruskal-Wallis test statistic (chi-square with K – 1 degrees of freedom):
$$ W = \frac{12}{n(n+1)}\sum_{i=1}^{K}\frac{R_i^2}{n_i} - 3(n+1) $$
where:
n = sum of sample sizes in all groups
K = Number of samples
Ri = Sum of ranks in the ith group
ni = Size of the ith group
Complete the test by comparing the calculated W value to a critical χ² value from the chi-square distribution with K – 1 degrees of freedom
Decision rule:
Reject H0 if W > χ²(K–1, α)
Otherwise do not reject H0

Kruskal-Wallis Example
Do different departments have different class sizes?
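The Kruskal-Wallis procedure (rank the pooled data, sum the ranks per group, compute W, compare with a chi-square value) can be sketched as follows. The class sizes below are hypothetical illustration values, not data from these notes, and the helper names `average_ranks` and `kruskal_wallis` are likewise assumptions made for this sketch. For K = 3 groups the chi-square distribution has 2 degrees of freedom, for which the upper-tail probability has the closed form exp(-w/2), so no statistical library is needed.

```python
from math import exp, log


def average_ranks(values):
    """Rank values in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return W = 12/(n(n+1)) * sum(R_i^2 / n_i) - 3(n+1)."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        r_i = sum(ranks[start:start + len(g)])  # rank sum of this group
        total += r_i ** 2 / len(g)
        start += len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical class sizes for three departments (illustration only)
class_sizes = [
    [23, 41, 54, 66, 78],
    [45, 55, 60, 70, 72],
    [18, 30, 34, 40, 44],
]
w = kruskal_wallis(class_sizes)
critical_value = -2 * log(0.05)  # chi-square critical value, 2 d.f., alpha = .05
p_value = exp(-w / 2)            # upper-tail probability, valid for 2 d.f. only
print(round(w, 2), round(critical_value, 3), w > critical_value)
```

For these hypothetical values the rank sums are 44, 56 and 20, W exceeds the critical value, and H0 would be rejected at α = .05.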


H0: Mean M = Mean E = Mean B
H1: Not all population means are equal
The W statistic is:

Compare W = 6.72 to the critical value from the chi-square distribution for 3 – 1 = 2 degrees of freedom and α = .05:
$$ \chi^2_{2,\,0.05} = 5.991 $$
There is sufficient evidence to reject that the population means are all equal

Two-Way Analysis of Variance
• Examines the effect of two factors of interest on the dependent variable, e.g., percent carbonation and line speed on a soft drink bottling process
• Interaction between the different levels of these two factors, e.g., does the effect of one particular carbonation level depend on which level the line speed is set?
Assumptions
1. Populations are normally distributed
2. Populations have equal variances
3. Independent random samples are drawn
Randomized Block Design
Two factors of interest: A and B
K = number of groups of factor A
H = number of levels of factor B (sometimes called a blocking variable)
Two-Way Notation
Let xji denote the observation in the jth group and ith block. Suppose that there are K groups and H blocks, for a total of n = KH observations. Let the overall mean be x̄
Denote the group sample means by x̄j• (j = 1, 2, …, K)
Denote the block sample means by x̄•i (i = 1, 2, …, H)
Partition of Total Variation
SST = SSG + SSB + SSE
Total Sum of Squares (SST) = Variation due to differences between groups (SSG) + Variation due to differences between blocks (SSB) + Variation due to random sampling (unexplained error) (SSE)
The error terms are assumed to be independent, normally distributed, and have the same variance
Two-Way Sums of Squares
The sums of squares are
$$ SST = \sum_{j=1}^{K}\sum_{i=1}^{H}(x_{ji}-\bar{x})^2 \qquad \text{d.f. } n-1 $$
$$ SSG = H\sum_{j=1}^{K}(\bar{x}_{j\bullet}-\bar{x})^2 \qquad \text{d.f. } K-1 $$
$$ SSB = K\sum_{i=1}^{H}(\bar{x}_{\bullet i}-\bar{x})^2 \qquad \text{d.f. } H-1 $$
$$ SSE = \sum_{j=1}^{K}\sum_{i=1}^{H}(x_{ji}-\bar{x}_{j\bullet}-\bar{x}_{\bullet i}+\bar{x})^2 \qquad \text{d.f. } (K-1)(H-1) $$
The mean squares are
$$ MSG = \frac{SSG}{K-1}, \qquad MSB = \frac{SSB}{H-1}, \qquad MSE = \frac{SSE}{(K-1)(H-1)} $$


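Two-Way ANOVA: The F Test Statistic
$$ F = \frac{MSG}{MSE} \ \text{(groups)}, \qquad F = \frac{MSB}{MSE} \ \text{(blocks)} $$

General Two-Way Table Format

More than One Observation per Cell
A two-way design with more than one observation per cell allows one further source of variation. The interaction between groups and blocks can also be identified
K = number of groups
H = number of blocks
L = number of observations per cell
n = KHL = total number of observations

Sums of Squares with Interaction

Two-Way Mean Squares with Interaction
The mean squares are:
$$ MSG = \frac{SSG}{K-1}, \qquad MSB = \frac{SSB}{H-1}, \qquad MSI = \frac{SSI}{(K-1)(H-1)}, \qquad MSE = \frac{SSE}{KH(L-1)} $$

Two-Way ANOVA: The F Test Statistic
$$ F = \frac{MSG}{MSE} \ \text{(groups)}, \qquad F = \frac{MSB}{MSE} \ \text{(blocks)}, \qquad F = \frac{MSI}{MSE} \ \text{(interaction)} $$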


Two-Way ANOVA Summary Table

Features of Two-Way ANOVA F Test
Degrees of freedom always add up
• n – 1 = KHL – 1 = (K – 1) + (H – 1) + (K – 1)(H – 1) + KH(L – 1)
• Total = groups + blocks + interaction + error
The denominator of the F test is always the same but the numerator is different
The sums of squares always add up
• SST = SSG + SSB + SSI + SSE
• Total = groups + blocks + interaction + error
Examples: No Interaction vs. Interaction

SESSION 7 - 8

GOODNESS-OF-FIT TEST AND CONTINGENCY TABLES

Chi-Square Goodness-of-Fit Test
Does sample data conform to a hypothesized distribution?
Examples:
1. Do sample results conform to specified expected probabilities?
2. Are technical support calls equal across all days of the week? (i.e., do calls follow a uniform distribution?)
3. Do measurements from a production process follow a normal distribution?

Are technical support calls equal across all days of the week? (i.e., do calls follow a uniform distribution?)
Sample data for 10 days per day of week:
Sum of calls for this day:
Monday     290
Tuesday    250
Wednesday  238
Thursday   257
Friday     265
Saturday   230
Sunday     192
SUM = 1722
Logic of Goodness-of-Fit Test
If calls are uniformly distributed, the 1722 calls would be expected to be equally divided across the 7 days:
$$ \frac{1722}{7} = 246 \text{ expected calls per day if uniform} $$
Chi-Square Goodness-of-Fit Test: test to see if the sample results are consistent with the expected results
Observed vs. Expected Frequencies
           Observed Oi   Expected Ei
Monday     290           246
Tuesday    250           246
Wednesday  238           246
Thursday   257           246
Friday     265           246
Saturday   230           246
Sunday     192           246
TOTAL      1722          1722
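The goodness-of-fit statistic for the call data can be worked through in code. This is a minimal sketch in standard-library Python; the function names are illustrative assumptions. The p-value helper uses a closed form of the chi-square upper tail that holds only for an even number of degrees of freedom, which is the case here (7 – 1 = 6).

```python
from math import exp, factorial


def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom > x); closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))


# Calls per weekday, Monday through Sunday
calls = [290, 250, 238, 257, 265, 230, 192]
# Under H0 (uniform distribution) every day has the same expected count
expected = [sum(calls) / len(calls)] * len(calls)

chi2 = chi_square_gof(calls, expected)
p_value = chi2_upper_tail_even_df(chi2, df=len(calls) - 1)
print(round(chi2, 2))  # 23.05
```

The 5% critical value for 6 degrees of freedom is about 12.59, so a statistic of about 23.05 leads to rejecting the hypothesis that calls are uniform over the days of the week.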


Chi-Square Test Statistic
H0: The distribution of calls is uniform over days of the week
H1: The distribution of calls is not uniform
The test statistic is
$$ \chi^2 = \sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i} \qquad \text{(with } K-1 \text{ degrees of freedom)} $$
where:
K = number of categories
Oi = observed frequency for category i
Ei = expected frequency for category i
The Rejection Region
H0: The distribution of calls is uniform over days of the week
H1: The distribution of calls is not uniform
$$ \text{Reject } H_0 \text{ if } \chi^2 > \chi^2_{K-1,\,\alpha} $$

Goodness-of-Fit Tests, Population Parameters Unknown
Idea: Test whether data follow a specified distribution (such as binomial, Poisson, or normal) without assuming the parameters of the distribution are known. Use sample data to estimate the unknown population parameters.
Suppose that a null hypothesis specifies category probabilities that depend on the estimation (from the data) of m unknown population parameters. The appropriate goodness-of-fit test is the same as in the previous section, except that the number of degrees of freedom for the chi-square random variable is
$$ K - m - 1 $$
where K is the number of categories

Test of Normality
The assumption that data follow a normal distribution is common in statistics
Normality was assessed in:
1. Normal probability plots
2. Normal quantile plots
Here, a chi-square test is developed
Two population parameters can be estimated using sample data:
$$ \text{Skewness} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^3}{n s^3}, \qquad \text{Kurtosis} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^4}{n s^4} $$
For a normal distribution
• Skewness = 0
• Kurtosis = 3
Bowman-Shelton Test for Normality
Consider the null hypothesis that the population distribution is normal. The Bowman-Shelton Test for Normality is based on the closeness of the sample skewness to 0 and the sample kurtosis to 3. The test statistic is:
$$ B = n\left[\frac{(\text{Skewness})^2}{6}+\frac{(\text{Kurtosis}-3)^2}{24}\right] $$
As the number of sample observations becomes very large, this statistic has a chi-square distribution with 2 degrees of freedom. The null hypothesis is rejected for large values of the test statistic. The chi-square approximation is close only for very large sample sizes. If the sample size is not very large, the Bowman-


Shelton test statistic is compared to significance points from the table below:

Example: Bowman-Shelton Test for Normality
The average daily temperature has been recorded for 200 randomly selected days, with sample skewness 0.232 and kurtosis 3.319. Test the null hypothesis that the true distribution is normal
$$ B = 200\left[\frac{(0.232)^2}{6}+\frac{(3.319-3)^2}{24}\right] = 2.642 $$
From the table, the 10% critical value for n = 200 is 3.48, so there is not sufficient evidence to reject that the population is normal

Contingency Tables
• Used to classify sample observations according to a pair of attributes
• Also called a cross-classification or cross-tabulation table
• Assume r categories for attribute A and c categories for attribute B. Then there are (r x c) possible cross-classifications
r x c Contingency Table

Test for Association
Consider n observations tabulated in an r x c contingency table. Denote by Oij the number of observations in the cell that is in the ith row and the jth column. The null hypothesis is:
H0: No association exists between the two attributes in the population
The appropriate test is a chi-square test with (r – 1)(c – 1) degrees of freedom. Let Ri and Cj be the row and column totals. The expected number of observations in the cell in row i and column j, given that H0 is true, is
$$ E_{ij} = \frac{R_i C_j}{n} $$
A test of association at a significance level α is based on the chi-square distribution and the following decision rule:
$$ \text{Reject } H_0 \text{ if } \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} > \chi^2_{(r-1)(c-1),\,\alpha} $$

Contingency Table Example
Left-Handed vs. Gender
• Dominant Hand: Left vs. Right
• Gender: Male vs. Female
H0: There is no association between hand preference and gender
H1: Hand preference is not independent of gender
Sample results organized in a contingency table:

Logic of the Test
If H0 is true, then the proportion of left-handed females should be the same as the proportion of left-handed males. The two proportions above should be the same as the proportion of left-handed people overall
Finding Expected Frequencies
If no association, then P(Left Handed | Female) = P(Left Handed | Male) = .12


So we would expect 12% of the 120 females and 12% of the 180 males to be left handed…
i.e., we would expect
(120)(.12) = 14.4 females to be left handed
(180)(.12) = 21.6 males to be left handed
Expected Cell Frequencies
$$ E_{ij} = \frac{(i\text{th row total})(j\text{th column total})}{\text{total sample size}} $$
Example:
$$ E_{\text{female, left}} = \frac{(120)(36)}{300} = 14.4 $$
Observed vs. Expected Frequencies
Observed frequencies vs. expected frequencies:

The Chi-Square Test Statistic
The Chi-square test statistic is:
$$ \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} $$
where:
Oij = observed frequency in cell (i, j)
Eij = expected frequency in cell (i, j)
r = number of rows
c = number of columns
Contingency Analysis

SESSION 9 - 10

NONPARAMETRIC ANALYSIS

Nonparametric Statistics
Fewer restrictive assumptions about data levels and underlying probability distributions
• Population distributions may be skewed
• The level of data measurement may only be ordinal or nominal
Sign Test and Confidence Interval
A sign test for paired or matched samples:
• Calculate the differences of the paired observations
• Discard the differences equal to 0, leaving n observations
• Record the sign of the difference as + or –
For a symmetric distribution, the signs are random and + and – are equally likely
Sign Test
Define + to be a “success” and let P = the true proportion of +’s in the population. The sign test is used for the hypothesis test
H0: P = 0.5
The test statistic S for the sign test is:
S = the number of pairs with a positive difference
S has a binomial distribution with P = 0.5 and n = the number of nonzero differences
Determining the p-value
The p-value for a Sign Test is found using the binomial distribution with n = number of nonzero differences, S = number of positive differences, and P = 0.5
For an upper-tail test, H1: P > 0.5, p-value = P(x ≥ S)
For a lower-tail test, H1: P < 0.5, p-value = P(x ≤ S)
For a two-tail test, H1: P ≠ 0.5, p-value = 2 × (one-tail p-value)
Sign Test Example
Ten consumers in a focus group have rated the attractiveness of two package designs for a new product
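The binomial p-value used by the sign test is easy to compute directly. The sketch below uses only the Python standard library; the function name `sign_test_p_value` and its `tail` argument are illustrative assumptions. The inputs S = 2 and n = 9 are the values used in the package design example (one of the ten differences is zero and is discarded).

```python
from math import comb


def sign_test_p_value(s, n, tail="lower"):
    """Binomial p-value for the sign test with P = 0.5.

    s    = number of positive differences
    n    = number of nonzero differences
    tail = "lower" (H1: P < 0.5), "upper" (H1: P > 0.5) or "two"
    """
    def pmf(k):
        # Binomial probability of exactly k successes when P = 0.5
        return comb(n, k) / 2 ** n

    lower = sum(pmf(k) for k in range(0, s + 1))   # P(x <= s)
    upper = sum(pmf(k) for k in range(s, n + 1))   # P(x >= s)
    if tail == "lower":
        return lower
    if tail == "upper":
        return upper
    if tail == "two":
        return min(1.0, 2 * min(lower, upper))
    raise ValueError("tail must be 'lower', 'upper' or 'two'")


p = sign_test_p_value(2, 9, tail="lower")
print(round(p, 3))  # 0.09
```

The lower-tail probability is 46/512, which rounds to 0.090 and is just below α = 0.10.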


Test the hypothesis that there is no overall package preference using α = 0.10

The test statistic S for the sign test is
S = the number of pairs with a positive difference = 2
S has a binomial distribution with P = 0.5 and n = 9 (there was one zero difference)
The p-value for this sign test is found using the binomial distribution with n = 9, S = 2, and P = 0.5:
For a lower-tail test,
$$ p\text{-value} = P(x \le 2 \mid n = 9,\ P = 0.5) = 0.090 $$
Since 0.090 < α = 0.10 we reject the null hypothesis and conclude that consumers prefer package 2
Sign Test: Normal Approximation
If the number n of nonzero sample observations is large, then the sign test is based on the normal approximation to the binomial with mean and standard deviation
$$ \mu = nP = 0.5n, \qquad \sigma = \sqrt{nP(1-P)} = 0.5\sqrt{n} $$
The test statistic is
$$ Z = \frac{S^* - \mu}{\sigma} = \frac{S^* - 0.5n}{0.5\sqrt{n}} $$
where S* is the test statistic corrected for continuity:
For a two-tail test, S* = S + 0.5 if S < μ, or S* = S – 0.5 if S > μ
For an upper-tail test, S* = S – 0.5
For a lower-tail test, S* = S + 0.5
Sign Test for Single Population Median
The sign test can be used to test that a single population median is equal to a specified value
1. For small samples, use the binomial distribution
2. For large samples, use the normal approximation

Wilcoxon Signed Rank Test for Paired Samples
• Uses matched pairs of random observations
• Still based on ranks
• Incorporates information about the magnitude of the differences
• Tests the hypothesis that the distribution of differences is centered at zero
• The population of paired differences is assumed to be symmetric
Conducting the test:
• Discard pairs for which the difference is 0
• Rank the remaining n absolute differences in ascending order (ties are assigned the average of their ranks)
• Find the sums of the positive ranks and the negative ranks
• The smaller of these sums is the Wilcoxon Signed Rank Statistic T:
T = min(T+ , T- )
where T+ = the sum of the positive ranks
T- = the sum of the negative ranks
n = the number of nonzero differences
The null hypothesis is rejected if T is less than or equal to the value in Appendix Table 10
Signed Rank Test Example
Ten consumers in a focus group have rated the attractiveness of two package designs for a new product
Test the hypothesis that the distribution of paired differences is centered at zero, using α = 0.10
Conducting the test:
• The smaller of T+ and T- is the Wilcoxon Signed Rank Statistic T:
T = min(T+ , T- ) = 3
• Use Appendix Table 10 with n = 9 to find the critical value:
The null hypothesis is rejected if T ≤ 4
• Since T = 3 < 4, we reject the null hypothesis
Wilcoxon Signed Rank Test Normal Approximation
A normal approximation can be used when
1. Paired samples are observed
2. The sample size is large
3. The hypothesis test is that the population distribution of differences is centered at zero
The table in Appendix 10 includes Tα values only for sample sizes from 4 to 20. The T statistic approaches a normal distribution as sample size increases. If the number of paired values is larger than 20, a normal approximation can be used
Wilcoxon Matched Pairs Test for Large Samples

The mean and standard deviation for Wilcoxon T:
$$ E(T) = \mu_T = \frac{n(n+1)}{4}, \qquad \sigma_T = \sqrt{\frac{n(n+1)(2n+1)}{24}} $$
where n is the number of paired values
Normal approximation for the Wilcoxon T Statistic:
$$ Z = \frac{T-\mu_T}{\sigma_T} $$
If the alternative hypothesis is one-sided, reject the null hypothesis if
$$ Z < -z_{\alpha} $$
If the alternative hypothesis is two-sided, reject the null hypothesis if
$$ Z < -z_{\alpha/2} $$

Mann-Whitney U-Test
Used to compare two samples from two populations
Assumptions:
1. The two samples are independent and random
2. The value measured is a continuous variable
3. The two distributions are identical except for a possible difference in the central location
4. The sample size from each population is at least 10
Consider two samples
• Pool the two samples (combine into a single list) but keep track of which sample each value came from
• Rank the values in the combined list in ascending order
• For ties, assign each the average rank of the tied values
• Sum the resulting rankings separately for each sample
If the sum of rankings from one sample differs enough from the sum of rankings from the other sample, we conclude there is a difference in the population medians
Mann-Whitney U Statistic
Consider n1 observations from the first population and n2 observations from the second. Let R1 denote the sum of the ranks of the observations from the first population. The Mann-Whitney U statistic is
$$ U = n_1 n_2 + \frac{n_1(n_1+1)}{2} - R_1 $$
The null hypothesis is that the central locations of the two population distributions are the same. The Mann-Whitney U statistic has mean and variance
$$ E(U) = \mu_U = \frac{n_1 n_2}{2}, \qquad Var(U) = \sigma_U^2 = \frac{n_1 n_2 (n_1+n_2+1)}{12} $$
Then for large sample sizes (both at least 10), the distribution of the random variable
$$ Z = \frac{U-\mu_U}{\sigma_U} $$
is approximated by the normal distribution
Decision Rules for Mann-Whitney Test
The decision rule for the null hypothesis that the two populations have the same central location:
• For a one-sided upper-tailed alternative hypothesis: reject H0 if Z < –zα
• For a one-sided lower-tailed hypothesis: reject H0 if Z > zα
• For a two-sided alternative hypothesis: reject H0 if Z < –zα/2 or Z > zα/2

Mann-Whitney U-Test Example
Claim: Median class size for Math is larger than the median class size for English. A random sample of 10 Math and 10 English classes is selected (samples do not have to be of equal size). Rank the combined values and then determine rankings by original sample. Suppose the results are:

Ranking for combined samples:

Rank by original sample:
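The pooling, ranking and U computation described above can be sketched in code. The class sizes below are hypothetical illustration values (the example's own table is not reproduced here), and the function name `mann_whitney_u` is an assumption of this sketch. U is defined as n1·n2 + n1(n1 + 1)/2 – R1, the form used in these notes, so a sample 1 with larger values yields a small U and a negative z.

```python
from math import sqrt


def mann_whitney_u(sample1, sample2):
    """Return (U, z) using U = n1*n2 + n1*(n1+1)/2 - R1 and its normal approximation."""
    pooled = sorted(sample1 + sample2)

    # Map each distinct value to its 1-based rank; ties get the average rank
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + j) / 2 + 1
        i = j + 1

    n1, n2 = len(sample1), len(sample2)
    r1 = sum(rank_of[x] for x in sample1)      # rank sum of the first sample
    u = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, (u - mean_u) / sd_u


# Hypothetical class sizes (illustration only, not the data from the example)
math_sizes = [30, 41, 52, 47, 65, 38, 55, 60, 72, 49]
english_sizes = [28, 35, 44, 31, 50, 39, 26, 42, 57, 33]

u_math, z_math = mann_whitney_u(math_sizes, english_sizes)
u_english, z_english = mann_whitney_u(english_sizes, math_sizes)
print(u_math, round(z_math, 2))
```

For the upper-tailed claim that the first population has the larger median, H0 would be rejected when z falls below –zα (–1.645 at α = 0.05). As a consistency check, the U values computed with either sample placed first always sum to n1·n2.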

The decision rule for this one-sided upper-tailed alternative hypothesis:
$$ \text{Reject } H_0 \text{ if } z < -z_{\alpha} $$
For α = 0.05, -zα = -1.645
The calculated z value is not in the rejection region, so we conclude that there is not sufficient evidence of difference in class size medians

Wilcoxon Rank Sum Test
• Similar to Mann-Whitney U test
• Results will be the same for both tests
n1 observations from the first population
n2 observations from the second population
Pool the samples and rank the observations in ascending order. Let T denote the sum of the ranks of the observations from the first population (T in the Wilcoxon Rank Sum Test is the same as R1 in the Mann-Whitney U Test)
The Wilcoxon Rank Sum Statistic, T, has mean
$$ E(T) = \mu_T = \frac{n_1(n_1+n_2+1)}{2} $$
and variance
$$ Var(T) = \sigma_T^2 = \frac{n_1 n_2 (n_1+n_2+1)}{12} $$
Then, for large samples (n1 ≥ 10 and n2 ≥ 10), the distribution of the random variable
$$ Z = \frac{T-\mu_T}{\sigma_T} $$
is approximated by the normal distribution

Wilcoxon Rank Sum Example
We wish to test
H0: Median1 ≥ Median2
H1: Median1 < Median2
Use α = 0.05
Suppose two samples are obtained: n1 = 40, n2 = 50
When rankings are completed, the sum of ranks for sample 1 is

When rankings are completed, the sum of ranks for sample 2 is

Using the normal approximation:
$$ \mu_T = \frac{40(40+50+1)}{2} = 1820, \qquad \sigma_T = \sqrt{\frac{(40)(50)(91)}{12}} \approx 123.15, \qquad z = \frac{T-\mu_T}{\sigma_T} = -2.80 $$
Since z = -2.80 < -1.645, we reject H0 and conclude that median 1 is less than median 2 at the 0.05 level of significance

Spearman Rank Correlation
Consider a random sample (x1, y1), . . ., (xn, yn) of n pairs of observations. Rank xi and yi each in ascending order. Calculate the sample correlation of these ranks. The resulting coefficient is called Spearman’s Rank Correlation Coefficient. If there are no tied ranks, an equivalent formula for computing this coefficient is
$$ r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)} $$
where the di are the differences of the ranked pairs
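The no-ties formula for Spearman's coefficient, r_s = 1 – 6Σd²/(n(n² – 1)), can be illustrated with a short sketch. The data are hypothetical and the function name `spearman_no_ties` is an assumption of this example; the function deliberately refuses tied values, because the shortcut formula is only equivalent to the rank correlation when no ranks are tied.

```python
def spearman_no_ties(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)); no ties allowed."""
    n = len(x)
    if len(y) != n:
        raise ValueError("x and y must have the same length")
    if len(set(x)) < n or len(set(y)) < n:
        raise ValueError("the shortcut formula assumes no tied values")

    def ranks(values):
        # 1-based ascending ranks
        order = sorted(range(n), key=lambda i: values[i])
        result = [0] * n
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result

    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical paired observations (illustration only)
x = [10, 20, 30, 40, 50]
y = [2.1, 1.5, 4.0, 3.2, 5.5]
r_s = spearman_no_ties(x, y)
print(round(r_s, 2))  # 0.8
```

Here the rank differences are –1, 1, –1, 1, 0, so Σd² = 4 and r_s = 1 – 24/120 = 0.8.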

Consider the null hypothesis
H0: no association in the population
To test against the alternative of positive association, the decision rule is
$$ \text{Reject } H_0 \text{ if } r_s > r_{s,\alpha} $$
To test against the alternative of negative association, the decision rule is
$$ \text{Reject } H_0 \text{ if } r_s < -r_{s,\alpha} $$
To test against the two-sided alternative of some association, the decision rule is
$$ \text{Reject } H_0 \text{ if } r_s < -r_{s,\alpha/2} \text{ or } r_s > r_{s,\alpha/2} $$

SESSION 11

SIMPLE REGRESSION

Correlation Analysis
Correlation analysis is used to measure strength of the association (linear relationship) between two variables
• Correlation is only concerned with strength of the relationship
• No causal effect is implied with correlation
The population correlation coefficient is denoted ρ (the Greek letter rho). The sample correlation coefficient is
$$ r = \frac{s_{xy}}{s_x s_y} $$
where
$$ s_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1} $$
Hypothesis Test for Correlation
To test the null hypothesis of no linear association,
$$ H_0: \rho = 0 $$
the test statistic follows the Student’s t distribution with (n – 2) degrees of freedom:
$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$
Decision Rules

Introduction to Regression Analysis
Regression analysis is used to:
1. Predict the value of a dependent variable based on the value of at least one independent variable
2. Explain the impact of changes in an independent variable on the dependent variable
Dependent variable: the variable we wish to explain (also called the endogenous variable)
Independent variable: the variable used to explain the dependent variable (also called the exogenous variable)
Linear Regression Model
The relationship between X and Y is described by a linear function. Changes in Y are assumed to be caused by changes in X. Linear regression population equation model
$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i $$
where β0 and β1 are the population model coefficients and ε is a random error term.
Simple Linear Regression Model


The simple linear regression equation provides an estimate of the population regression line
$$ \hat{y}_i = b_0 + b_1 x_i $$
Least Squares Estimators
b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared differences between y and ŷ:
$$ \min SSE = \min \sum e_i^2 = \min \sum (y_i-\hat{y}_i)^2 = \min \sum \left[y_i-(b_0+b_1 x_i)\right]^2 $$
Differential calculus is used to obtain the coefficient estimators b0 and b1 that minimize SSE
The slope coefficient estimator is
$$ b_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} = r\,\frac{s_y}{s_x} $$
And the constant or y-intercept is
$$ b_0 = \bar{y} - b_1\bar{x} $$
The regression line always goes through the mean x̄, ȳ
Finding the Least Squares Equation
The coefficients b0 and b1, and other regression results in this chapter, will be found using a computer
1. Hand calculations are tedious
2. Statistical routines are built into Excel
3. Other statistical analysis software can be used
Linear Regression Model Assumptions
• The true relationship form is linear (Y is a linear function of X, plus random error)
• The error terms, εi, are independent of the x values
• The error terms are random variables with mean 0 and constant variance, σ² (the constant variance property is called homoscedasticity)
• The random error terms, εi, are not correlated with one another, so that
$$ E[\varepsilon_i \varepsilon_j] = 0 \quad \text{for all } i \ne j $$
Interpretation of the Slope and the Intercept
• b0 is the estimated average value of y when the value of x is zero (if x = 0 is in the range of observed x values)
• b1 is the estimated change in the average value of y as a result of a one-unit change in x
Simple Linear Regression Example
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet). A random sample of 10 houses is selected
• Dependent variable (Y) = house price in $1000s
• Independent variable (X) = square feet
Sample Data for House Price Model

Graphical Presentation


Regression Using Excel

Excel Output

Graphical Presentation
House price model: scatter plot and regression line
house price = 98.24833 + 0.10977 (square feet)
Interpretation of the Intercept, b0
b0 is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of observed X values). Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet
Interpretation of the Slope Coefficient, b1
b1 measures the estimated change in the average value of Y as a result of a one-unit change in X. Here, b1 = .10977 tells us that the average value of a house increases by .10977($1000) = $109.77, on average, for each additional one square foot of size
Measures of Variation
$$ SST = SSR + SSE $$
$$ SST = \sum (y_i-\bar{y})^2, \qquad SSR = \sum (\hat{y}_i-\bar{y})^2, \qquad SSE = \sum (y_i-\hat{y}_i)^2 $$
where:
ȳ = Average value of the dependent variable
yi = Observed values of the dependent variable
ŷi = Predicted value of y for the given xi value
SST = total sum of squares
Measures the variation of the yi values around their mean, ȳ
SSR = regression sum of squares
Explained variation attributable to the linear relationship between x and y
SSE = error sum of squares
Variation attributable to factors other than the linear relationship between x and y

Coefficient of Determination, R²
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. The coefficient of determination is also called R-squared and is denoted as R²
$$ R^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}} $$
note:
$$ 0 \le R^2 \le 1 $$
Examples of Approximate r² Values


Excel Output

Correlation and R²
The coefficient of determination, R², for a simple regression is equal to the simple correlation squared
$$ R^2 = r^2 $$
Estimation of Model Error Variance
Estimator for the variance of the population model error is
$$ \hat{\sigma}^2 = s_e^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = \frac{SSE}{n-2} $$
Division by n – 2 instead of n – 1 is because the simple regression model uses two estimated parameters, b0 and b1, instead of one
$$ s_e = \sqrt{s_e^2} $$
is called the standard error of the estimate
Excel Output

Comparing Standard Errors
se is a measure of the variation of observed y values from the regression line. The magnitude of se should always be judged relative to the size of the y values in the sample data
i.e., se = $41.33K is moderately small relative to house prices in the $200 - $300K range

Inferences About the Regression Model
The variance of the regression slope coefficient (b1) is estimated by
$$ s_{b_1}^2 = \frac{s_e^2}{\sum (x_i-\bar{x})^2} = \frac{s_e^2}{(n-1)s_x^2} $$
where:
sb1 = Estimate of the standard error of the least squares slope
se = √(SSE / (n – 2)) = Standard error of the estimate
Excel Output

Comparing Standard Errors of the Slope
$s_{b_1}$ is a measure of the variation in the slope of regression lines from different possible samples
Inference about the Slope: t Test
t test for a population slope
• Is there a linear relationship between X and Y?
Null and alternative hypotheses
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
Test statistic
$$t = \frac{b_1 - \beta_1}{s_{b_1}}, \qquad \text{d.f.} = n - 2$$
where:
b1 = regression slope coefficient
β1 = hypothesized slope
sb1 = standard error of the slope
Inferences about the Slope: t Test Example
Estimated Regression Equation:
house price = 98.24833 + 0.10977 (square feet)
The slope of this model is 0.1098. Does square footage of the house affect its sales price?
H0: β1 = 0
H1: β1 ≠ 0
From Excel output:
Confidence Interval Estimate for the Slope
$$b_1 - t_{n-2,\alpha/2}\, s_{b_1} < \beta_1 < b_1 + t_{n-2,\alpha/2}\, s_{b_1}$$
Excel Printout for House Prices:
At 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858). Since the units of the house price variable are $1000s, we are 95% confident

that the average impact on sales price is between $33.70 and $185.80 per square foot of house size. This 95% confidence interval does not include 0.
Conclusion: There is a significant relationship between house price and square feet at the .05 level of significance.
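The quantities used in this section (b0, b1, R2, se, sb1 and the t statistic for the slope) can be computed directly from their definitions. The sketch below is only an illustration: the five data points are hypothetical (they are not the house price sample behind the Excel output), and the critical t value has to be read from a t table and passed in by the user.

```python
import math

def simple_regression(x, y):
    """Least squares fit of y = b0 + b1*x with the usual summary statistics."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    s_e = math.sqrt(sse / (n - 2))   # standard error of the estimate
    s_b1 = s_e / math.sqrt(sxx)      # standard error of the slope
    return {"b0": b0, "b1": b1, "r2": 1 - sse / sst,
            "s_e": s_e, "s_b1": s_b1, "t": b1 / s_b1}

def slope_interval(b1, s_b1, t_crit):
    """Confidence interval b1 +/- t_crit * s_b1 (t_crit taken from a t table)."""
    return b1 - t_crit * s_b1, b1 + t_crit * s_b1

# Hypothetical sample: size in square feet, price in $1,000s
sqft = [1000, 1500, 2000, 2500, 3000]
price = [200, 260, 310, 340, 420]
fit = simple_regression(sqft, price)
low, high = slope_interval(fit["b1"], fit["s_b1"], t_crit=3.182)  # d.f. = 3, 95%
print(fit, (low, high))
```

If the interval returned by `slope_interval` excludes 0, the t test of H0: β1 = 0 at the matching significance level rejects H0, which is the same logic as the conclusion above.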
F-Test for Significance
F Test statistic:
$$F = \frac{MSR}{s_e^2}$$
Where
$$MSR = \frac{SSR}{k}, \qquad s_e^2 = \frac{SSE}{n-k-1}$$
where F follows an F distribution with k numerator and (n – k – 1) denominator degrees of freedom (k = the number of independent variables in the regression model).
Excel Output
Prediction
The regression equation can be used to predict a value for y, given a particular x. For a specified value, xn+1, the predicted value is
$$\hat{y}_{n+1} = b_0 + b_1 x_{n+1}$$
Predictions Using Regression Analysis
Predict the price for a house with 2000 square feet:
$$\text{house price} = 98.25 + 0.1098(2000) = 317.85$$
The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850
Relevant Data Range
When using a regression model for prediction, only predict within the relevant range of data
Estimating Mean Values and Predicting Individual Values
Goal: Form intervals around y to express uncertainty about the value of y for a given xi
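A minimal sketch of the point prediction and of the two kinds of interval discussed here, using the standard simple-regression formulas. The coefficients 98.25 and 0.1098 are the rounded values quoted in these notes; the summary statistics passed to `interval_half_widths` (n, mean of x, sum of squared deviations, t value) are illustrative assumptions, not figures taken from the notes.

```python
import math

B0, B1 = 98.25, 0.1098  # rounded coefficients from the notes (price in $1,000s)

def predict_price(sqft):
    """Point prediction from the fitted line."""
    return B0 + B1 * sqft

def interval_half_widths(x_new, n, x_bar, sxx, s_e, t_crit):
    """Half-widths of the interval for the mean of y and for an individual y."""
    leverage = 1.0 / n + (x_new - x_bar) ** 2 / sxx
    mean_half = t_crit * s_e * math.sqrt(leverage)               # average Y given X
    individual_half = t_crit * s_e * math.sqrt(1.0 + leverage)   # individual Y given X
    return mean_half, individual_half

y_hat = predict_price(2000)
# Assumed summary statistics, for illustration only
mean_half, indiv_half = interval_half_widths(
    x_new=2000, n=10, x_bar=1715.0, sxx=1_571_500.0, s_e=41.33, t_crit=2.306)
print(y_hat, mean_half, indiv_half)
```

The extra 1 under the square root is why the interval for an individual house is always wider than the interval for the mean price.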


Confidence Interval for the Average Y, Given X
Confidence interval estimate for the expected value of y given a particular xn+1
$$\hat{y}_{n+1} \pm t_{n-2,\alpha/2}\, s_e \sqrt{\frac{1}{n} + \frac{(x_{n+1}-\bar{x})^2}{\sum (x_i-\bar{x})^2}}$$
Notice that the formula involves the term $(x_{n+1}-\bar{x})^2$, so the size of interval varies according to the distance xn+1 is from the mean
Prediction Interval for an Individual Y, Given X
Confidence interval estimate for an actual observed value of y given a particular xn+1
$$\hat{y}_{n+1} \pm t_{n-2,\alpha/2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1}-\bar{x})^2}{\sum (x_i-\bar{x})^2}}$$
This extra term (the 1 under the square root) adds to the interval width to reflect the added uncertainty for an individual case
Estimation of Mean Values: Example
Find the 95% confidence interval for the mean price of 2,000 square-foot houses
Predicted Price yi = 317.85 ($1,000s)
The confidence interval endpoints are 280.66 and 354.90, or from $280,660 to $354,900
Estimation of Individual Values: Example
Confidence Interval Estimate for yn+1
Find the 95% confidence interval for an individual house with 2,000 square feet
Predicted Price yi = 317.85 ($1,000s)
The confidence interval endpoints are 215.50 and 420.07, or from $215,500 to $420,070
Finding Confidence and Prediction Intervals in Excel
In Excel, use: PHStat | regression | simple linear regression. Check the “confidence and prediction interval for x=” box and enter the x-value and confidence level desired
Graphical Analysis
The linear regression model is based on minimizing the sum of squared errors. If outliers exist, their potentially large squared errors may have a strong influence on the fitted regression line. Be sure to examine your data graphically for outliers and extreme points. Decide, based on your model and logic, whether the extreme points should remain or be removed

SESSION 12

MULTIPLE REGRESSION

The Multiple Regression Model
Idea: Examine the linear relationship between 1 dependent (Y) & 2 or more independent variables (Xi)
Multiple Regression Model with k Independent Variables:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$
The coefficients of the multiple regression model are estimated using sample data

Multiple regression equation with k independent variables:
$$\hat{y}_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_k x_{ki}$$
Two variable model
Multiple regression equation:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$$
Estimating a Multiple Linear Regression Equation
Excel will be used to generate the coefficients and measures of goodness of fit for multiple regression
Excel: Tools / Data Analysis... / Regression
PHStat: PHStat / Regression / Multiple Regression…
Multiple Regression Output

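Standard Multiple Regression Assumptions
The values xi and the error terms εi are independent.
The error terms are random variables with mean 0 and a constant variance, σ2.
$$E[\varepsilon_i] = 0 \quad \text{and} \quad E[\varepsilon_i^2] = \sigma^2 \quad \text{for } i = 1, \ldots, n$$
(The constant variance property is called homoscedasticity)
The random error terms, εi, are not correlated with one another, so that
$$E[\varepsilon_i \varepsilon_j] = 0 \quad \text{for all } i \ne j$$
It is not possible to find a set of numbers, c0, c1, . . . , ck, not all zero, such that
$$c_0 + c_1 x_{1i} + c_2 x_{2i} + \cdots + c_k x_{ki} = 0$$
(This is the property of no linear relation for the Xj’s)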

Example: 2 Independent Variables
A distributor of frozen dessert pies wants to evaluate factors thought to influence demand
1. Dependent variable: Pie sales (units per week)
2. Independent variables: Price (in $), Advertising ($100’s)
Data are collected for 15 weeks
Pie Sales Example
The Multiple Regression Equation
$$\widehat{\text{Sales}} = 306.526 - 24.975\,(\text{Price}) + 74.131\,(\text{Advertising})$$
where
1. Sales is in number of pies per week
2. Price is in $
3. Advertising is in $100’s.
b1 = -24.975: sales will decrease, on average, by 24.975 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising
b2 = 74.131: sales will increase, on average, by 74.131 pies per week for each $100 increase in advertising, net of the effects of changes due to price

Coefficient of Determination, R2
Reports the proportion of total variation in y explained by all x variables taken together
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
This is the ratio of the explained variability to total sample variability
52.1% of the variation in pie sales is explained by the variation in price and advertising
Adjusted Coefficient of Determination
R2 never decreases when a new X variable is added to the model, even if the new variable is not an important predictor variable. This can be a disadvantage when comparing models
What is the net effect of adding a new variable?
1. We lose a degree of freedom when a new X variable is added
2. Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?
Adjusted R2 is used to correct for the fact that adding non-relevant independent variables will still reduce the error sum of squares
$$\bar{R}^2 = 1 - \frac{SSE/(n-K-1)}{SST/(n-1)}$$
(where n = sample size, K = number of independent variables)
Adjusted R2 provides a better comparison between multiple regression models with different numbers of independent variables. It penalizes excessive use of unimportant independent variables and is smaller than R2
Adjusted R2 = 44.2%: 44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and number of independent variables
Coefficient of Multiple Correlation
The coefficient of multiple correlation is the correlation between the predicted value and the observed value of the dependent variable
$$R = r(\hat{y}, y) = \sqrt{R^2}$$
It is the square root of the multiple coefficient of determination. Used as another measure of the strength of the linear relationship between the dependent variable and the independent variables. Comparable to the correlation between Y and X in simple regression.
Estimation of Error Variance
Consider the population regression model
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_K x_{Ki} + \varepsilon_i$$
The unbiased estimate of the variance of the errors is
$$s_e^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-K-1} = \frac{SSE}{n-K-1}$$
where $e_i = y_i - \hat{y}_i$. The square root of the variance, se, is called the standard error of the estimate
Standard Error, se
se = 47.463
The magnitude of this value can be compared to the average y value
Evaluating Individual Regression Coefficients
• Use t-tests for individual coefficients
• Shows if a specific independent variable is conditionally important
Hypotheses:
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist between xj and y)
Test Statistic:
$$t = \frac{b_j - 0}{s_{b_j}} \qquad (\text{d.f.} = n - K - 1)$$
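The adjusted R2 and the t statistic for an individual coefficient can be checked with a few lines of arithmetic. The sketch below uses the pie sales figures quoted in these notes (R2 = 52.1%, n = 15, K = 2, b1 = -24.975) together with the standard error 10.832 that appears in the notes' confidence interval example; because R2 is entered rounded to three decimals, the adjusted value comes out near 0.441 rather than exactly the reported 44.2%.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - K - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def t_statistic(b_j, s_bj, beta_null=0.0):
    """t = (b_j - hypothesized beta_j) / s_bj, with n - K - 1 degrees of freedom."""
    return (b_j - beta_null) / s_bj

adj = adjusted_r2(0.521, n=15, k=2)
t_price = t_statistic(-24.975, 10.832)
print(round(adj, 3), round(t_price, 3))
```

The form used in `adjusted_r2` is algebraically the same as 1 - [SSE/(n - K - 1)] / [SST/(n - 1)], since 1 - R2 = SSE/SST.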


Example: Evaluating Individual Regression Coefficients
t-value for Price is t = -2.306, with p-value .0398
t-value for Advertising is t = 2.855, with p-value .0145
Test on All Coefficients
• F-Test for Overall Significance of the Model
• Shows if there is a linear relationship between all of the X variables considered together and Y
• Use F test statistic
Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects Y)
F-Test for Overall Significance
Test statistic:
$$F = \frac{MSR}{s_e^2} = \frac{SSR/K}{SSE/(n-K-1)}$$
where F has K (numerator) and (n – K – 1) (denominator) degrees of freedom. The decision rule is
$$\text{Reject } H_0 \text{ if } F > F_{K,\,n-K-1,\,\alpha}$$
Confidence Interval Estimate for the Slope
Confidence interval limits for the population slope βj
$$b_j \pm t_{n-K-1,\alpha/2}\, s_{b_j}$$
Example: Form a 95% confidence interval for the effect of changes in price (x1) on pie sales:
-24.975 ± (2.1788)(10.832)
So the interval is -48.576 < β1 < -1.374
Example: Excel output also reports these interval endpoints:
Weekly sales are estimated to be reduced by between 1.37 and 48.58 pies for each increase of $1 in the selling price
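The interval arithmetic above, and an approximate value of the overall F statistic, can be reproduced as follows. The F value is computed from R2 through the identity F = (R2/K) / ((1 - R2)/(n - K - 1)), which follows from R2 = SSR/SST; it is an approximation here because the notes only report R2 rounded to 52.1%, and the SSR and SSE values themselves are not given.

```python
def slope_confidence_interval(b_j, s_bj, t_crit):
    """Limits b_j +/- t_crit * s_bj for a population slope."""
    half_width = t_crit * s_bj
    return b_j - half_width, b_j + half_width

def f_from_r2(r2, n, k):
    """Overall F statistic expressed through R^2."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

low, high = slope_confidence_interval(-24.975, 10.832, t_crit=2.1788)
f_stat = f_from_r2(0.521, n=15, k=2)
print(round(low, 3), round(high, 3), round(f_stat, 2))
```

The computed F would then be compared with the critical value of the F distribution with K = 2 and n - K - 1 = 12 degrees of freedom, read from a table.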


Tests on a Subset of Regression Coefficients
Consider a multiple regression model involving variables xj and zj, and the null hypothesis that the z variable coefficients are all zero:
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + \alpha_1 z_1 + \cdots + \alpha_r z_r + \varepsilon$$
$$H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_r = 0 \qquad H_1: \text{at least one } \alpha_j \ne 0$$
Goal: compare the error sum of squares for the complete model with the error sum of squares for the restricted model
• First run a regression for the complete model and obtain SSE
• Next run a restricted regression that excludes the z variables (the number of variables excluded is r) and obtain the restricted error sum of squares SSE(r)
• Compute the F statistic and apply the decision rule for a significance level α
$$F = \frac{(SSE(r) - SSE)/r}{s_e^2}, \qquad \text{reject } H_0 \text{ if } F > F_{r,\,n-K-r-1,\,\alpha}$$
Prediction
Given a population regression model
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_K x_{Ki} + \varepsilon_i$$
then given a new observation of a data point
(x1,n+1, x2,n+1, . . . , xK,n+1)
the best linear unbiased forecast of yn+1 is
$$\hat{y}_{n+1} = b_0 + b_1 x_{1,n+1} + b_2 x_{2,n+1} + \cdots + b_K x_{K,n+1}$$
It is risky to forecast for new X values outside the range of the data used to estimate the model coefficients, because we do not have data to support that the linear model extends beyond the observed range.
Residuals in Multiple Regression
Two variable model
Using The Equation to Make Predictions
Predict sales for a week in which the selling price is $5.50 and advertising is $350:
$$\widehat{\text{Sales}} = 306.526 - 24.975\,(5.50) + 74.131\,(3.5) = 428.62$$
Predicted sales is 428.62 pies
Note that Advertising is in $100’s, so $350 means that X2 = 3.5
Predictions in PHStat
Check the “confidence and prediction interval estimates” box
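The pie sales prediction can be reproduced with the fitted equation. In the sketch below the slopes -24.975 and 74.131 are the coefficients reported in these notes; the intercept 306.526 is an assumption, chosen because it is the value consistent with the stated prediction of 428.62 pies. The helper converts advertising from dollars to $100's, which is the step most easily missed.

```python
B0 = 306.526       # assumed intercept (consistent with the 428.62 prediction)
B_PRICE = -24.975  # pies per week per $1 of price
B_ADVERT = 74.131  # pies per week per $100 of advertising

def predict_sales(price_dollars, advertising_dollars):
    """Predicted weekly pie sales; advertising is converted to $100's."""
    advertising_hundreds = advertising_dollars / 100.0
    return B0 + B_PRICE * price_dollars + B_ADVERT * advertising_hundreds

predicted = predict_sales(5.50, 350)
print(round(predicted, 2))
```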


Nonlinear Regression Models
• The relationship between the dependent variable and an independent variable may not be linear
• Can review the scatter diagram to check for non-linear relationships
Example: Quadratic model
The second independent variable is the square of the first variable
Quadratic Regression Model
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$$
where:
β0 = Y intercept
β1 = regression coefficient for linear effect of X on Y
β2 = regression coefficient for quadratic effect on Y
εi = random error in Y for observation i
Linear vs. Nonlinear Fit
Quadratic Regression Model
Quadratic models may be considered when the scatter diagram takes on one of the following shapes:
Testing for Significance: Quadratic Effect
Compare the linear regression estimate
$$\hat{y} = b_0 + b_1 x_1$$
with quadratic regression estimate
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_1^2$$
Hypotheses
H0: β2 = 0 (the quadratic term does not improve the model)
H1: β2 ≠ 0 (the quadratic term improves the model)
The test statistic is
$$t = \frac{b_2 - \beta_2}{s_{b_2}}, \qquad \text{d.f.} = n - 3$$
Testing the Quadratic Effect
Compare adjusted R2 from simple regression to adjusted R2 from the quadratic model
Because R2 never decreases when a variable is added, the comparison should use adjusted R2: if adjusted R2 from the quadratic model is larger than adjusted R2 from the simple model, then the quadratic model is likely a better model
Example: Quadratic Model
Purity increases as filter time increases:


Simple regression results: y^ = -11.283 + 5.985 Time
t statistic, F statistic, and R2 are all high, but the residuals are not random:
Quadratic regression results:
y^ = 1.539 + 1.565 Time + 0.245 (Time)2
The Log Transformation
The Multiplicative Model:
$$Y = \beta_0 X_1^{\beta_1} X_2^{\beta_2} \varepsilon$$
$$\log Y = \log \beta_0 + \beta_1 \log X_1 + \beta_2 \log X_2 + \log \varepsilon$$
Interpretation of coefficients
For the multiplicative model:
$$\log Y_i = \log \beta_0 + \beta_1 \log X_{1i} + \log \varepsilon_i$$
When both dependent and independent variables are logged:
• The coefficient of the independent variable Xk can be interpreted as follows: a 1 percent change in Xk leads to an estimated bk percentage change in the average value of Y
• bk is the elasticity of Y with respect to a change in Xk
Dummy Variables
A dummy variable is a categorical independent variable with two levels:
• yes or no, on or off, male or female
• recorded as 0 or 1
Regression intercepts are different if the variable is significant. Assumes equal slopes for other variables. If more than two levels, the number of dummy variables needed is (number of levels - 1)
Dummy Variable Example
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$$
Let:
y = Pie Sales
x1 = Price
x2 = Holiday (X2 = 1 if a holiday occurred during the week)
(X2 = 0 if there was no holiday that week)


Interpreting the Dummy Variable Coefficient
Example:
$$\widehat{\text{Sales}} = b_0 + b_1\,(\text{Price}) + b_2\,(\text{Holiday})$$
Holiday:
1 If a holiday occurred during the week
0 If no holiday occurred
b2 = 15: on average, sales were 15 pies greater in weeks with a holiday than in weeks without a holiday, given the same price
Interaction Between Explanatory Variables
Hypothesizes interaction between pairs of x variables
• Response to one x variable may vary at different levels of another x variable
Contains two-way cross product terms
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2$$
Effect of Interaction
Given:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$$
Without interaction term, effect of X1 on Y is measured by β1. With interaction term, effect of X1 on Y is measured by β1 + β3 X2. Effect changes as X2 changes
Interaction Example
Suppose x2 is a dummy variable and the estimated regression equation is
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2$$
Slopes are different if the effect of x1 on y depends on x2 value
Significance of Interaction Term
The coefficient b3 is an estimate of the difference in the coefficient of x1 when x2 = 1 compared to when x2 = 0. The t statistic for b3 can be used to test the hypothesis
$$H_0: \beta_3 = 0 \qquad H_1: \beta_3 \ne 0$$
If we reject the null hypothesis we conclude that there is a difference in the slope coefficient for the two subgroups
Multiple Regression Assumptions
Errors (residuals) from the regression model:
$$e_i = y_i - \hat{y}_i$$
Assumptions:
1. The errors are normally distributed
2. Errors have a constant variance
3. The model errors are independent
Analysis of Residuals in Multiple Regression
These residual plots are used in multiple regression:
• Residuals vs. y^i
• Residuals vs. x1i
• Residuals vs. x2i
• Residuals vs. time (if time series data)
Use the residual plots to check for violations of regression assumptions

SESSION 13

REGRESSION ANALYSIS: HETEROSCEDASTICITY, MULTICOLLINEARITY, AND AUTOCORRELATION

The Stages of Model Building
Model Specification
Understand the problem to be studied
Select dependent and independent variables


Identify model form (linear, quadratic…)
Determine required data for the study
Coefficient Estimation
• Estimate the regression coefficients using the available data
• Form confidence intervals for the regression coefficients
• For prediction, goal is the smallest se
• If estimating individual slope coefficients, examine model for multicollinearity and specification bias
Model Verification
• Logically evaluate regression results in light of the model (i.e., are coefficient signs correct?)
• Are any coefficients biased or illogical?
• Evaluate regression assumptions (i.e., are residuals random and independent?)
• If any problems are suspected, return to model specification and adjust the model
Interpretation and Inference
• Interpret the regression results in the setting and units of your study
• Form confidence intervals or test hypotheses about regression coefficients
• Use the model for forecasting or prediction
Dummy Variable Models (More than 2 Levels)
Dummy variables can be used in situations in which the categorical variable of interest has more than two categories. Dummy variables can also be useful in experimental design
• Experimental design is used to identify possible causes of variation in the value of the dependent variable
• Y outcomes are measured at specific combinations of levels for treatment and blocking variables
• The goal is to determine how the different treatments influence the Y outcome
Consider a categorical variable with K levels. The number of dummy variables needed is one less than the number of levels, K – 1
Example:
y = house price ; x1 = square feet
If style of the house is also thought to matter:
Style = ranch, split level, condo
Three levels, so two dummy variables are needed
Let “condo” be the default category, and let x2 and x3 be used for the other two categories:
y = house price
x1 = square feet
x2 = 1 if ranch, 0 otherwise
x3 = 1 if split level, 0 otherwise
The multiple regression equation is:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$$
Interpreting the Dummy Variable Coefficients (with 3 Levels)
Experimental Design
Consider an experiment in which
• four treatments will be used, and
• the outcome also depends on three environmental factors that cannot be controlled by the experimenter
Let variable z1 denote the treatment, where z1 = 1, 2, 3, or 4. Let z2 denote the environment factor (the “blocking variable”), where z2 = 1, 2, or 3
• To model the four treatments, three dummy variables are needed
• To model the three environmental factors, two dummy variables are needed
Define five dummy variables, x1, x2, x3, x4, and x5
Let treatment level 1 be the default (z1 = 1)
• Define x1 = 1 if z1 = 2, x1 = 0 otherwise
• Define x2 = 1 if z1 = 3, x2 = 0 otherwise
• Define x3 = 1 if z1 = 4, x3 = 0 otherwise
Let environment level 1 be the default (z2 = 1)
• Define x4 = 1 if z2 = 2, x4 = 0 otherwise
• Define x5 = 1 if z2 = 3, x5 = 0 otherwise
The dummy variable values can be summarized in a table:
z1   x1   x2   x3        z2   x4   x5
1    0    0    0         1    0    0
2    1    0    0         2    1    0
3    0    1    0         3    0    1
4    0    0    1
The experimental design model can be estimated using the equation
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \beta_5 x_{5i} + \varepsilon_i$$
The estimated value for β2, for example, shows the amount by which the y value for treatment 3 exceeds the value for treatment 1
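The coding of the treatment and blocking variables into five dummy variables can be written as a small helper. This is a sketch of the definitions given above (level 1 is the default for both z1 and z2); the function name `encode_design` is hypothetical.

```python
def encode_design(z1, z2):
    """Map treatment z1 (1-4) and environment z2 (1-3) to dummies (x1, ..., x5)."""
    if z1 not in (1, 2, 3, 4) or z2 not in (1, 2, 3):
        raise ValueError("z1 must be 1-4 and z2 must be 1-3")
    x1, x2, x3 = int(z1 == 2), int(z1 == 3), int(z1 == 4)  # treatment dummies
    x4, x5 = int(z2 == 2), int(z2 == 3)                    # environment dummies
    return (x1, x2, x3, x4, x5)

# Print the full coding for every treatment/environment combination
for z1 in (1, 2, 3, 4):
    for z2 in (1, 2, 3):
        print(z1, z2, encode_design(z1, z2))
```

The default combination (z1 = 1, z2 = 1) maps to all zeros, so its expected outcome is the intercept β0, and every other coefficient is read as a difference from that baseline.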
Lagged Values of the Dependent Variable


In time series models, data is collected over time (weekly, quarterly, etc…). The value of y in time period t is denoted yt. The value of yt often depends on the value yt-1, as well as other independent variables xj:
$$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \cdots + \beta_K x_{Kt} + \gamma y_{t-1} + \varepsilon_t$$
A lagged value of the dependent variable is included as an explanatory variable
Interpreting Results in Lagged Models
An increase of 1 unit in the independent variable xj in time period t (all other variables held fixed), will lead to an expected increase in the dependent variable of
$$\beta_j \text{ in period } t, \quad \beta_j \gamma \text{ in period } t+1, \quad \beta_j \gamma^2 \text{ in period } t+2, \quad \beta_j \gamma^3 \text{ in period } t+3, \ \ldots$$
The total expected increase over all current and future time periods is $\beta_j / (1 - \gamma)$. The coefficients $\beta_0, \beta_1, \ldots, \beta_K, \gamma$ are estimated by least squares in the usual manner
Confidence intervals and hypothesis tests for the regression coefficients are computed the same as in ordinary multiple regression. (When the regression equation contains lagged variables, these procedures are only approximately valid. The approximation quality improves as the number of sample observations increases.)
Caution should be used when using confidence intervals and hypothesis tests with time series data.
• There is a possibility that the equation errors εi are no longer independent from one another.
• When errors are correlated the coefficient estimates are unbiased, but not efficient. Thus confidence intervals and hypothesis tests are no longer valid.
Specification Bias
Suppose an important independent variable z is omitted from a regression model. If z is uncorrelated with all other included independent variables, the influence of z is left unexplained and is absorbed by the error term, ε. But if there is any correlation between z and any of the included independent variables, some of the influence of z is captured in the coefficients of the included variables
If some of the influence of omitted variable z is captured in the coefficients of the included independent variables, then those coefficients are biased and the usual inferential statements from hypothesis tests or confidence intervals can be seriously misleading. In addition the estimated model error will include the effect of the missing variable(s) and will be larger
Multicollinearity
Collinearity: High correlation exists among two or more independent variables
This means the correlated variables contribute redundant information to the multiple regression model
Including two highly correlated explanatory variables can adversely affect the regression results
• No new information provided
• Can lead to unstable coefficients (large standard error and low t-values)
• Coefficient signs may not match prior expectations
Some Indications of Strong Multicollinearity
1. Incorrect signs on the coefficients
2. Large change in the value of a previous coefficient when a new variable is added to the model
3. A previously significant variable becomes insignificant when a new independent variable is added
4. The estimate of the standard deviation of the model increases when a variable is added to the model
Detecting Multicollinearity
Examine the simple correlation matrix to determine if strong correlation exists between any of the model independent variables
Multicollinearity may be present if the model appears to explain the dependent variable well (high F statistic and low se) but the individual coefficient t statistics are insignificant
Assumptions of Regression
Normality of Error: Error values (ε) are normally distributed for any given value of X
Homoscedasticity: The probability distribution of the errors has constant variance
Independence of Errors: Error values are statistically independent
Residual Analysis
$$e_i = y_i - \hat{y}_i$$
The residual for observation i, ei, is the difference between its observed and predicted value
Check the assumptions of regression by examining the residuals
1. Examine for linearity assumption
2. Examine for constant variance for all levels of X (homoscedasticity)
3. Evaluate normal distribution assumption
4. Evaluate independence assumption
Graphical Analysis of Residuals
Can plot residuals vs. X
Residual Analysis for Linearity


Residual Analysis for Homoscedasticity
Residual Analysis for Independence
Excel Residual Output
Heteroscedasticity vs. Homoscedasticity
Homoscedasticity
The probability distribution of the errors has constant variance
Heteroscedasticity
The error terms do not all have the same variance. The size of the error variances may depend on the size of the dependent variable value, for example
When heteroscedasticity is present
• least squares is not the most efficient procedure to estimate regression coefficients
• The usual procedures for deriving confidence intervals and tests of hypotheses are not valid
Tests for Heteroscedasticity
To test the null hypothesis that the error terms, εi, all have the same variance against the alternative that their variances depend on the expected values
Estimate the simple regression
$$e_i^2 = a_0 + a_1 \hat{y}_i$$
Let R2 be the coefficient of determination of this new regression
The null hypothesis is rejected if nR2 is greater than $\chi^2_{1,\alpha}$
where $\chi^2_{1,\alpha}$ is the critical value of the chi-square random variable with 1 degree of freedom and probability of error α
Autocorrelated Errors
Independence of Errors
Error values are statistically independent
Autocorrelated Errors
Residuals in one time period are related to residuals in another period. Autocorrelation violates a least squares regression assumption


• Leads to sb estimates that are too small (i.e., biased)
• Thus t-values are too large and some variables may appear significant when they are not
Autocorrelation
Autocorrelation is correlation of the errors (residuals) over time. Violates the regression assumption that residuals are random and independent
Here, residuals show a cyclic pattern, not random
The Durbin-Watson Statistic
The Durbin-Watson statistic is used to test for autocorrelation
H0: successive residuals are not correlated (i.e., Corr(εt, εt-1) = 0)
H1: autocorrelation is present
$$d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$
• The possible range is 0 ≤ d ≤ 4
• d should be close to 2 if H0 is true
• d less than 2 may signal positive autocorrelation, d greater than 2 may signal negative autocorrelation
Testing for Positive Autocorrelation
H0: positive autocorrelation does not exist
H1: positive autocorrelation is present
Calculate the Durbin-Watson test statistic = d
• d can be approximated by d = 2(1 – r), where r is the sample correlation of successive errors
Find the values dL and dU from the Durbin-Watson table
• (for sample size n and number of independent variables K)
Decision rule: reject H0 if d < dL; do not reject H0 if d > dU; the test is inconclusive if dL ≤ d ≤ dU
Negative Autocorrelation
Negative autocorrelation exists if successive errors are negatively correlated. This can occur if successive errors alternate in sign
Decision rule for negative autocorrelation: reject H0 if d > 4 – dL
Testing for Positive Autocorrelation
Example with n = 25:
Here, n = 25 and there is k = 1 independent variable
Using the Durbin-Watson table, dL = 1.29 and dU = 1.45
d = 1.00494 < dL = 1.29, so reject H0 and conclude that significant positive autocorrelation exists
Therefore the linear model is not the appropriate model to forecast sales
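The Durbin-Watson statistic is the sum of squared differences of successive residuals divided by the sum of squared residuals. The sketch below computes it from a list of residuals and applies the usual three-way decision for positive autocorrelation; the short residual lists are hypothetical, while d = 1.00494, dL = 1.29 and dU = 1.45 are the values from the n = 25 example in these notes.

```python
def durbin_watson(residuals):
    """d = sum (e_t - e_{t-1})^2 / sum e_t^2, with 0 <= d <= 4."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

def positive_autocorrelation_decision(d, d_lower, d_upper):
    """Decision for H0: no positive autocorrelation, using table bounds dL and dU."""
    if d < d_lower:
        return "reject H0"
    if d > d_upper:
        return "do not reject H0"
    return "inconclusive"

# Hypothetical residuals: alternating signs push d toward 4, long runs toward 0
d_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
d_persistent = durbin_watson([1.0, 1.0, 1.0, 1.0])
decision = positive_autocorrelation_decision(1.00494, d_lower=1.29, d_upper=1.45)
print(d_alternating, d_persistent, decision)
```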


Dealing with Autocorrelation
Suppose that we want to estimate the coefficients of the regression model
$$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \cdots + \beta_K x_{Kt} + \varepsilon_t$$
where the error term εt is autocorrelated
Two steps:
(i) Estimate the model by least squares, obtaining the Durbin-Watson statistic, d, and then estimate the autocorrelation parameter using
$$r = 1 - \frac{d}{2}$$
(ii) Estimate by least squares a second regression with
• dependent variable (yt – r yt-1)
• independent variables (x1t – r x1,t-1), (x2t – r x2,t-1), . . ., (xKt – r xK,t-1)
The parameters $\beta_1, \beta_2, \ldots, \beta_K$ are estimated regression coefficients from the second model. An estimate of β0 is obtained by dividing the estimated intercept for the second model by (1 - r). Hypothesis tests and confidence intervals for the regression coefficients can be carried out using the output from the second model
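The data transformation in the two-step procedure can be sketched as follows. The code estimates r from a Durbin-Watson value, forms the transformed series zt – r·zt-1 that would be fed to the second least squares regression, and recovers β0 from the second model's intercept. The second regression itself is not run here, and the series and the intercept value 50.0 are hypothetical; d = 1.00494 is the value from the autocorrelation example in these notes.

```python
def estimate_r(d):
    """Autocorrelation estimate from the Durbin-Watson statistic: r = 1 - d/2."""
    return 1.0 - d / 2.0

def quasi_difference(series, r):
    """Transformed series z_t - r*z_{t-1}; one observation is lost at the start."""
    return [series[t] - r * series[t - 1] for t in range(1, len(series))]

def original_intercept(second_model_intercept, r):
    """Estimate of beta_0: intercept of the second model divided by (1 - r)."""
    return second_model_intercept / (1.0 - r)

r = estimate_r(1.00494)
y_star = quasi_difference([10.0, 12.0, 15.0, 14.0], r)   # hypothetical y series
x_star = quasi_difference([1.0, 2.0, 3.0, 4.0], r)       # hypothetical x series
b0_hat = original_intercept(50.0, r)                     # hypothetical intercept
print(round(r, 5), y_star, x_star, round(b0_hat, 3))
```

The same `quasi_difference` call is applied to the dependent variable and to every independent variable, always with the same r.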
SESSION 14

TIME-SERIES ANALYSIS AND FORECASTING

Index Numbers
• Index numbers allow relative comparisons over time
• Index numbers are reported relative to a Base Period Index
• Base period index = 100 by definition
• Used for an individual item or measurement
Consider observations over time on the price of a single item
• To form a price index, one time period is chosen as a base, and the price for every period is expressed as a percentage of the base period price
• Let p0 denote the price in the base period
• Let p1 be the price in a second period
• The price index for this second period is
$$100 \left( \frac{p_1}{p_0} \right)$$
Index Numbers: Example
Airplane ticket prices from 1995 to 2003:
Aggregate Price Indexes
An aggregate index is used to measure the rate of change from a base period for a group of items
Unweighted aggregate price index
Unweighted aggregate price index for period t for a group of K items:
$$100 \left( \frac{\sum_{i=1}^{K} p_{ti}}{\sum_{i=1}^{K} p_{0i}} \right)$$
i = item
t = time period
K = total number of items
Unweighted Aggregate Price Index: Example

Weighted Aggregate Price Indexes
A weighted index weights the individual prices by some measure of the quantity sold. If the weights are based on base period quantities the index is called a Laspeyres price index.
The Laspeyres price index for period t is the total cost of purchasing the quantities traded in the base period at prices in period t, expressed as a percentage of the total cost of purchasing these same quantities in the base period. The Laspeyres quantity index for period t is the total cost of the quantities traded in period t, based on the base period prices, expressed as a percentage of the total cost of the base period quantities
Laspeyres Price Index
Laspeyres price index for time period t:
$$100 \left( \frac{\sum_{i=1}^{K} q_{0i}\, p_{ti}}{\sum_{i=1}^{K} q_{0i}\, p_{0i}} \right)$$
Laspeyres Quantity Index
$$100 \left( \frac{\sum_{i=1}^{K} q_{ti}\, p_{0i}}{\sum_{i=1}^{K} q_{0i}\, p_{0i}} \right)$$
The Runs Test for Randomness
The runs test is used to determine whether a pattern in time series data is random. A run is a sequence of one or more occurrences above or below the median. Denote observations above the median with “+” signs and observations below the median with “-” signs
• Consider n time series observations
• Let R denote the number of runs in the sequence
• The null hypothesis is that the series is random
• Appendix Table 14 gives the smallest significance level for which the null hypothesis can be rejected (against the alternative of positive association between adjacent observations) as a function of R and n
If the alternative is a two-sided hypothesis on nonrandomness,
• the significance level must be doubled if it is less than 0.5
• if the significance level, α, read from the table is greater than 0.5, the appropriate significance level for the test against the two-sided alternative is 2(1 - α)
Counting Runs
Runs Test Example
n = 18 and there are R = 6 runs
Use Appendix Table 14
n = 18 and R = 6
• the null hypothesis can be rejected (against the alternative of positive association between adjacent observations) at the 0.044 level of significance
• Therefore we reject that this time series is random using α = 0.05
Runs Test: Large Samples
Consider the null hypothesis H0: The series is random
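For large samples the number of runs is usually standardized and compared with the standard normal distribution. The sketch below assumes the common form z = (R – n/2 – 1) / sqrt((n² – 2n) / (4(n – 1))), which applies when runs are counted above and below the median; the formula is stated here as an assumption because the notes' own expression was not preserved. It counts runs in a sign sequence and evaluates z for n = 100 and R = 45, the values of the filling-process example in these notes.

```python
import math

def count_runs(signs):
    """Number of runs in a sequence of symbols such as '+' and '-'."""
    if not signs:
        return 0
    return 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)

def runs_z(n, r):
    """Large-sample runs statistic (assumed form, runs about the median)."""
    return (r - n / 2 - 1) / math.sqrt((n * n - 2 * n) / (4 * (n - 1)))

def two_sided_reject(z, z_crit=1.96):
    """Two-sided test of randomness; 1.96 corresponds to alpha = 0.05."""
    return abs(z) > z_crit

runs_small = count_runs("++--+")       # hypothetical short sequence
z = runs_z(100, 45)
reject = two_sided_reject(z)
print(runs_small, round(z, 3), reject)
```

With these inputs z is about -1.21, which lies inside ±1.96, so under the assumed formula randomness would not be rejected at α = 0.05 against the two-sided alternative.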


If the alternative hypothesis is positive association between adjacent observations, the decision rule is:
$$\text{Reject } H_0 \text{ if } z = \frac{R - \frac{n}{2} - 1}{\sqrt{\frac{n^2 - 2n}{4(n-1)}}} < -z_{\alpha}$$
If the alternative is a two-sided hypothesis of nonrandomness, the decision rule is:
$$\text{Reject } H_0 \text{ if } z < -z_{\alpha/2} \text{ or } z > z_{\alpha/2}$$
Example: Large Sample Runs Test
A filling process over- or under-fills packages, compared to the median
OOO U OO U O UU OO UU OOOO UU O UU OOO UUU
OOOO UU OO UUU O U OO UUUUU OOO U O UU
OOO U OOOO UUU O UU OOO U OO UU O U OO UUU
O UU OOOO UUU OOO
n = 100 (53 overfilled, 47 underfilled)
R = 45 runs
H0: Fill amounts are random
H1: Fill amounts are not random
Test using α = 0.05
Time-Series Data
Numerical data ordered over time
The time intervals can be annually, quarterly, daily, hourly, etc.
The sequence of the observations is important
Example:
Time-Series Plot
A time-series plot is a two-dimensional plot of time series data
• the vertical axis measures the variable of interest
• the horizontal axis corresponds to the time periods
U.S. Inflation Rate
Time-Series Components
Trend Component
Long-run increase or decrease over time (overall upward or downward movement). Data taken over a long period of time
Trend can be upward or downward
Trend can be linear or non-linear
Seasonal Component
• Short-term regular wave-like patterns
• Observed within 1 year
• Often monthly or quarterly


Cyclical Component
• Long-term wave-like patterns
• Regularly occur but may vary in length
• Often measured peak to peak or trough to trough

Irregular Component
Unpredictable, random, “residual” fluctuations
Due to random variations of
• Nature
• Accidents or unusual events
“Noise” in the time series

Time-Series Component Analysis
• Used primarily for forecasting
• Observed value in time series is the sum or product of components
Additive Model
$X_t = T_t + S_t + C_t + I_t$
Multiplicative model (linear in log form)
$X_t = T_t \, S_t \, C_t \, I_t$
where
Tt = Trend value at period t
St = Seasonality value for period t
Ct = Cyclical value at time t
It = Irregular (random) value for period t

Smoothing the Time Series
Calculate moving averages to get an overall impression of the pattern of movement over time. This smooths out the irregular component.
Moving Average: averages of a designated number of consecutive time series values

(2m+1)-Point Moving Average
• A series of arithmetic means over time
• Result depends upon choice of m (each average uses 2m+1 data values)
Examples:
For a 5 year moving average, m = 2
For a 7 year moving average, m = 3
Replace each xt with
$x_t^* = \dfrac{1}{2m+1} \sum_{j=-m}^{m} x_{t+j}$   (t = m+1, m+2, ..., n-m)

Moving Averages
Example: Five-year moving average
First average: $(x_1 + x_2 + x_3 + x_4 + x_5) / 5$
Second average: $(x_2 + x_3 + x_4 + x_5 + x_6) / 5$

[Table: Example: Annual Data]

Calculating Moving Averages
Each moving average is for a consecutive block of (2m+1) years
Let m = 2
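A minimal sketch of the (2m+1)-point moving average, for illustration only. The function name and the annual values are made up, not taken from the text; with m = 2 each result is the mean of five consecutive observations.

```python
def moving_average(x, m):
    """Return the (2m+1)-point moving averages, one per full window of x."""
    width = 2 * m + 1
    # Window i covers x[i], ..., x[i + width - 1]; its mean belongs to the middle period.
    return [sum(x[i:i + width]) / width for i in range(len(x) - width + 1)]


annual = [10, 12, 11, 15, 14, 18, 17, 21, 20]  # hypothetical annual values
print(moving_average(annual, m=2))             # five-year moving averages
```

The first m and last m periods have no moving average, so the smoothed series is 2m values shorter than the original.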

Annual vs. Moving Average
[Figure: Annual vs. Moving Average]
The 5-year moving average smooths the data and shows the underlying trend

Centered Moving Averages
Let the time series have period s, where s is an even number, i.e., s = 4 for quarterly data and s = 12 for monthly data
To obtain a centered s-point moving average series $x_t^*$:
1.) Form the s-point moving averages
$x_{t+0.5}^* = \dfrac{1}{s} \sum_{j=-(s/2)+1}^{s/2} x_{t+j}$   (t = s/2, s/2 + 1, ..., n - s/2)
2.) Form the centered s-point moving averages
$x_t^* = \dfrac{x_{t-0.5}^* + x_{t+0.5}^*}{2}$   (t = s/2 + 1, s/2 + 2, ..., n - s/2)
Centered moving averages are used when an even number of values is used in the moving average. Average periods of 2.5 or 3.5 don’t match the original periods, so we average two consecutive moving averages to get centered moving averages.

Calculating the Ratio-to-Moving Average
• Now estimate the seasonal impact
• Divide the actual sales value by the centered moving average for that period: $x_t / x_t^*$ (often expressed as a percentage, $100 \cdot x_t / x_t^*$)

Interpreting Seasonal Indexes
Suppose we get these seasonal indexes:
[Table: seasonal indexes]
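As an illustration of the two-step centered moving average and the ratio-to-moving-average calculation, here is a small sketch. The function names and the quarterly sales figures are hypothetical, and the ratio is left as a plain quotient rather than a percentage.

```python
def centered_moving_average(x, s):
    """Centered s-point moving averages for even s, aligned with x.

    Positions without a full window on both sides are None.
    """
    n = len(x)
    # Step 1: plain s-point averages; average i covers x[i], ..., x[i + s - 1]
    # and therefore falls halfway between two observation periods.
    plain = [sum(x[i:i + s]) / s for i in range(n - s + 1)]
    # Step 2: average consecutive pairs so each result lines up with a period.
    centered = [None] * n
    for i in range(len(plain) - 1):
        centered[i + s // 2] = (plain[i] + plain[i + 1]) / 2
    return centered


def ratio_to_moving_average(x, s):
    """Actual value divided by its centered moving average (None where undefined)."""
    cma = centered_moving_average(x, s)
    return [None if c is None else v / c for v, c in zip(x, cma)]


quarterly = [23, 40, 25, 27, 32, 48, 33, 37, 37, 50, 40, 42]  # hypothetical sales
print(ratio_to_moving_average(quarterly, s=4))
```

A ratio above 1 marks a quarter that ran above its centered average, and a ratio below 1 a quarter that ran below it; averaging the ratios for the same quarter across years is one common way to arrive at a seasonal index.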


Exponential Smoothing
A weighted moving average
• Weights decline exponentially
• Most recent observation weighted most
Used for smoothing and short term forecasting (often one or two periods into the future)
The weight (smoothing coefficient) is α
• Subjectively chosen
• Ranges from 0 to 1
• Smaller α gives more smoothing, larger α gives less smoothing
The weight is:
• Close to 0 for smoothing out unwanted cyclical and irregular components
• Close to 1 for forecasting
Exponential smoothing model
$\hat{x}_1 = x_1$
$\hat{x}_t = (1 - \alpha)\,\hat{x}_{t-1} + \alpha\, x_t$   (t = 2, 3, ..., n)
where:
$\hat{x}_t$ = exponentially smoothed value for period t
$\hat{x}_{t-1}$ = exponentially smoothed value already computed for period t - 1
xt = observed value in period t
α = weight (smoothing coefficient), 0 < α < 1

Exponential Smoothing Example
[Figure: Sales vs. Smoothed Sales]
Fluctuations have been smoothed
NOTE: the smoothed value in this case is generally a little low, since the trend is upward sloping and the weighting factor is only .2

Forecasting Time Period (t + 1)
The smoothed value in the current period (t) is used as the forecast value for next period (t + 1)
At time n, we obtain the forecasts of future values, Xn+h of the series
$\hat{x}_{n+h} = \hat{x}_n$   (h = 1, 2, 3, ...)

Exponential Smoothing in Excel
Use tools / data analysis / exponential smoothing
The “damping factor” is (1 - α)
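The same calculation can be sketched outside Excel. The snippet below assumes the recursion in which α weights the current observation and (1 - α), the damping factor, weights the previous smoothed value, starting from the first observation. The function name and sales values are hypothetical.

```python
def exponential_smoothing(x, alpha):
    """Smoothed series with xhat_1 = x_1 and xhat_t = (1 - alpha) * xhat_{t-1} + alpha * x_t."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    smoothed = [float(x[0])]  # assumed starting value: the first observation
    for value in x[1:]:
        smoothed.append((1 - alpha) * smoothed[-1] + alpha * value)
    return smoothed


sales = [23, 40, 25, 27, 32, 48, 33, 37, 37, 50]  # hypothetical sales
smoothed = exponential_smoothing(sales, alpha=0.2)
forecast_next = smoothed[-1]  # the last smoothed value is the forecast for period n + 1
print(smoothed, forecast_next)
```

With α = 0.2 the smoothed series reacts slowly, which is why it tends to lag behind an upward-trending series.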


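Forecasting with the Holt-Winters Method: Nonseasonal Series
To perform the Holt-Winters method of forecasting:
Obtain estimates of level $\hat{x}_t$ and trend $T_t$ as
$\hat{x}_t = (1 - \alpha)(\hat{x}_{t-1} + T_{t-1}) + \alpha\, x_t$
$T_t = (1 - \beta)\, T_{t-1} + \beta\,(\hat{x}_t - \hat{x}_{t-1})$
where α and β are smoothing constants whose values are fixed between 0 and 1.
Standing at time n, we obtain the forecasts of future values, Xn+h of the series by
$\hat{x}_{n+h} = \hat{x}_n + h\, T_n$

Forecasting with the Holt-Winters Method: Seasonal Series
Assume a seasonal time series of period s. The Holt-Winters method of forecasting uses a set of recursive estimates from the historical series. These estimates utilize a level factor, α, a trend factor, β, and a multiplicative seasonal factor, γ.
The recursive estimates are based on the following equations
$\hat{x}_t = (1 - \alpha)(\hat{x}_{t-1} + T_{t-1}) + \alpha\, \dfrac{x_t}{F_{t-s}}$
$T_t = (1 - \beta)\, T_{t-1} + \beta\,(\hat{x}_t - \hat{x}_{t-1})$
$F_t = (1 - \gamma)\, F_{t-s} + \gamma\, \dfrac{x_t}{\hat{x}_t}$
where $\hat{x}_t$ is the smoothed level of the series, Tt is the smoothed trend of the series, and Ft is the smoothed seasonal adjustment for the series.
After the initial procedures generate the level, trend, and seasonal factors from a historical series, we can use the results to forecast future values h time periods ahead from the last observation Xn in the historical series.
The forecast equation is
$\hat{x}_{n+h} = (\hat{x}_n + h\, T_n)\, F_{n+h-s}$
where the seasonal factor, Ft, is the one generated for the most recent seasonal time period.

Autoregressive Models
Used for forecasting
Takes advantage of autocorrelation
• 1st order - correlation between consecutive values
• 2nd order - correlation between values 2 periods apart
• pth order autoregressive model:
$x_t = \gamma + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_p x_{t-p} + \varepsilon_t$
Let Xt (t = 1, 2, . . ., n) be a time series. A model to represent that series is the autoregressive model of order p written above, where γ, φ1, φ2, . . ., φp are fixed parameters and the εt are random variables that have mean 0, constant variance, and are uncorrelated with one another.
The parameters of the autoregressive model are estimated through a least squares algorithm, as the values of γ, φ1, φ2, . . ., φp for which the sum of squares
$SS = \sum_{t=p+1}^{n} \left( x_t - \gamma - \phi_1 x_{t-1} - \phi_2 x_{t-2} - \dots - \phi_p x_{t-p} \right)^2$
is a minimum

Forecasting from Estimated Autoregressive Models
Consider time series observations x1, x2, . . . , xn. Suppose that an autoregressive model of order p has been fitted to these data:
$x_t = \hat{\gamma} + \hat{\phi}_1 x_{t-1} + \hat{\phi}_2 x_{t-2} + \dots + \hat{\phi}_p x_{t-p} + \varepsilon_t$
Standing at time n, we obtain forecasts of future values of the series from
$\hat{x}_{n+h} = \hat{\gamma} + \hat{\phi}_1 \hat{x}_{n+h-1} + \hat{\phi}_2 \hat{x}_{n+h-2} + \dots + \hat{\phi}_p \hat{x}_{n+h-p}$   (h = 1, 2, 3, ...)
where for j > 0, $\hat{x}_{n+j}$ is the forecast of Xn+j standing at time n, and for j ≤ 0, $\hat{x}_{n+j}$ is simply the observed value of Xn+j.

Autoregressive Model: Example and Solution
The Office Concept Corp. has acquired a number of office units (in thousands of square feet) over the last eight years.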
Develop the second order autoregressive model.

Develop the 2nd order table

Use Excel to estimate a regression model

Use the second-order equation to forecast number of units for 2007:

• Choose p
• Form a series of “lagged predictor” variables $x_{t-1}, x_{t-2}, \dots, x_{t-p}$
• Run a regression model using all p variables
• Test model for significance
• Use model for forecasting
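The steps above can be sketched without Excel. The snippet below is a minimal illustration that fits an autoregressive model by ordinary least squares through the normal equations, using only the standard library. The function names and the eight data values are hypothetical (they are not the Office Concept figures), and the significance test step is omitted.

```python
def solve_linear(a, b):
    """Solve a * z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def fit_ar(x, p):
    """Least-squares estimates [gamma, phi_1, ..., phi_p] for an AR(p) model."""
    # One row per usable period t: [1, x_{t-1}, ..., x_{t-p}] predicts x_t.
    rows = [[1.0] + [x[t - k] for k in range(1, p + 1)] for t in range(p, len(x))]
    targets = [x[t] for t in range(p, len(x))]
    size = p + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(size)]
    return solve_linear(xtx, xty)


def forecast_ar(x, coef, h):
    """Forecast h periods ahead, feeding each forecast back in as a lagged value."""
    history = list(x)
    p = len(coef) - 1
    for _ in range(h):
        lags = [history[-k] for k in range(1, p + 1)]  # x_{t-1}, ..., x_{t-p}
        history.append(coef[0] + sum(c * v for c, v in zip(coef[1:], lags)))
    return history[len(x):]


units = [4.0, 3.0, 4.0, 6.0, 5.0, 7.0, 8.0, 9.0]  # hypothetical eight-year series
coef = fit_ar(units, p=2)
print(coef, forecast_ar(units, coef, h=1))
```

In practice a library routine such as `numpy.linalg.lstsq` would normally replace the hand-written solver; the pure-Python version is used here only to keep the sketch self-contained.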
