BUSINESS STATISTICS II notes (1)

Uploaded by

wd.muthoni

SCHOOL OF BUSINESS

BUSINESS STATISTICS II


LESSON ONE: THE NORMAL DISTRIBUTION

1.0 Introduction
1.1 Distribution shapes
1.3 Normal distributions
1.4 The standard normal probability
1.5 Standard normal probability given area
1.6 The probability for any normal curve
1.7 Distribution of sample means

LESSON TWO: ESTIMATION AND CONFIDENCE INTERVAL

2.0 Introduction
2.1 Point Estimation
2.2 Interval Estimation
2.2.1 Interval Estimation of Population Mean (σ Known)
2.2.2 Interval Estimation for Difference of Two Means
2.2.3 Interval Estimation of Population Mean (σ Unknown)

LESSON THREE: HYPOTHESIS TESTING

3.0 Introduction
3.1 The Rationale for Hypothesis Testing
3.2 General Procedure for Hypothesis Testing
3.3 Errors in Hypothesis Testing
3.4 Hypothesis Testing for Small Samples
3.5 Chi Square Test

LESSON FOUR: CORRELATION ANALYSIS

4.0 Introduction
4.1 Product-Moment Correlation Coefficient
4.2 Spearman’s Rank Correlation Coefficient
4.3 Coefficient of Determination
4.4 The Correlation Matrix
4.5 Correlation and Causation

LESSON FIVE: REGRESSION ANALYSIS

5.0 Introduction
5.1 The Linear Bivariate Regression Relationship
5.2 Regression Equations
5.3 Regression Coefficients
5.3.1 Properties of the Regression Coefficients
5.4 Standard Error of Estimate
5.5 Coefficient of Determination

LESSON SIX: ANALYSIS OF VARIANCE (ANOVA)

6.0 Introduction
6.1 One-Way ANOVA
6.2 Two-Way ANOVA

LESSON SEVEN: TIME SERIES ANALYSIS

7.0 Introduction
7.1 Terminologies in Time Series
7.2 Time Series Trend
7.3 Seasonal Variation and Forecasting

LESSON EIGHT: NON-PARAMETRIC METHODS

8.0 Introduction
8.1 Runs Test for Randomness
8.2 Sign Test
8.3 Mann-Whitney U-Test
8.4 Wilcoxon Signed-Rank Test
8.5 Wilcoxon Rank-Sum Test
8.6 Kruskal-Wallis Analysis of Variance by Ranks
8.7 Spearman Coefficient of Rank Correlation

REFERENCES

Chapter 6: The Normal Distribution

Learning Objectives
Upon successful completion of Chapter 6, you will be able to:

– Identify shapes of distributions as symmetrical or skewed.


– Identify the properties of the normal distribution.
– Find the area under the standard normal distribution, given z-values.
– Find probabilities for a normally distributed variable by transforming it into a standard score.
– Given area or probability for a normal distribution, find the z-value and the x-value.

I. Introduction
• Many continuous variables have distributions that are bell-shaped and are called
approximately normally distributed variables.

• A normal distribution is also known as the bell curve or the Gaussian distribution.

II. Distribution Shapes


A. Negatively Skewed or Left Skewed Distribution
The majority of the data values fall to the right
of the mean or cluster at the upper end of the
distribution.

Remember: the tail is to the LEFT

B. Positively Skewed or Right Skewed Distribution


The majority of the data values fall to the left of
the mean or cluster at the lower end of the
distribution.

Remember: the tail is to the RIGHT

Dr. Janet Winter, [email protected] Stat 200 Page 1


C. Symmetrical Distribution
In a symmetrical distribution, the data values
are evenly distributed on both sides of the mean.

III. Normal Distributions


The normal distribution is a symmetrical, continuous, bell-shaped distribution of a variable.

A. Equation for a Normal Distribution

\[ y = \frac{e^{-(x-\mu)^2 / (2\sigma^2)}}{\sigma\sqrt{2\pi}} \]

Where:
e ≈ 2.718
π ≈ 3.14
μ = population mean
σ = population standard deviation

Notice: The shape and position of the normal distribution curve depend on 2 parameters:
the mean and the standard deviation.


B. Normal Distribution Properties


• Bell-shaped
• The mean, median, and mode are equal and located at the center of the distribution.
• Uni-modal (only 1 mode).
• Symmetrical about the mean.
• The curve is continuous – i.e., there are no gaps or holes. For each value of X, there is a
corresponding value of Y.
• The curve never touches the x-axis. Theoretically, no matter how far in either direction
the curve extends, it never meets the x-axis, but it gets increasingly closer to the x-axis.
• The total area under the normal distribution curve is equal to 1.00 or 100%.
• .50 or 50% of the area lies to the left of the mean.
• .50 or 50% of the area lies to the right of the mean
• P(a < z < b) is the area below the normal curve, above the z-axis between z = a and z = b

C. Empirical Rule for All Bell-shaped Distributions


• Approximately 68% of the data values fall within 1 standard deviation of the mean.
• Approximately 95% of the data values fall within 2 standard deviations of the mean.
• Approximately 99.7% of the data values fall within 3 standard deviations of the mean.
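
The three percentages in the empirical rule can be checked numerically. The sketch below is an illustration only; it assumes Python's standard-library `statistics.NormalDist`, and the helper name `area_within` is made up for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {area_within(k):.4f}")
```

The printed areas round to roughly 68%, 95%, and 99.7%, which is where the rule's figures come from.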

THE EMPIRICAL RULE:


D. Types of Normal Distributions


• Standard Normal with mean 0 and standard deviation 1: use given values for z with the
tabled probability values.
• Normal Distributions with mean different from 0 and any value for the standard
deviation. A different table of probabilities is needed for each paired mean and standard
deviation. To avoid this problem, transform all normal distributions to the standard
normal using z-scores:

I. Find the z-score

\[ z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} \]

\[ z = \frac{x - \mu}{\sigma} \quad \text{or} \quad z = \frac{x - \bar{x}}{s} \]

II. Use the z-values to find the appropriate probability from the standard normal table.

Note: the z-score is the number of standard deviations that a particular X value is away
from the mean. If z is positive, x is above the mean. If z is negative, x is below the mean.

E. Standard Normal Distribution


• Mean is 0
• Standard deviation is 1
• Variance is 1

Notation: N(μ, σ) OR N(0, 1) for the standard normal.

• The probability to the left of z is found in Table E, located on the inside cover of the
textbook. Be sure the table in your textbook functions this way.

The area under the curve is more important than the frequencies because the area
corresponds to the probability!

The area under the normal curve = probability


IV. Standard Normal Probability


A. Type I: Area or Probability to the Left of z
I. Method:

Use Table E giving the area or probability to the left of z:

1. Use the z column to find the row for the units and tenths.
2. Move to the appropriate hundredths column across the top of the table.
3. The intersection of the row for the units and tenths with the column for hundredths is the
probability to the left of z.

II. Example:
Find the area under the normal distribution curve.

P (z < 1.34) =

• Sketch the curve.


• Mark the mean.
• Shade the area.
• Use the table to find the area to the
left of 1.34

Note: < points left and the area to the left of 1.34 is shaded.

P ( z < 1.34) = .9099 is at the intersection of the z-column row 1.3 and the top column
labeled .04.
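
The table lookup can be cross-checked in software. This is a minimal sketch assuming Python's standard-library `statistics.NormalDist`; the variable name `p_left` is chosen only for this illustration.

```python
from statistics import NormalDist

# P(z < 1.34): area to the left of z under the standard normal curve.
p_left = NormalDist().cdf(1.34)
print(round(p_left, 4))
```

Rounded to four decimals, the result agrees with the Table E entry used above.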

Question 1
Find P (z < 2.16) =
• Sketch the curve.
• Mark the mean.
• Shade the area.
• Use the table to find the area to the
left of 2.16


B. Type II: Area to the Right of Any z-Value


1. Look up the z-value in the table to find the area to the left of z.
2. Subtract this area from 1.00 to find the area to the right of z.
I. Method:
1. Draw the normal curve labeling the mean.
2. Shade the area requested.
3. Find the probability to the left of the z-value.
4. Perform the appropriate arithmetic operation using 1.0 when needed.
1.0 is the area under the entire normal probability curve because the sum of
all probability is 1.00.

II. Example:
P (z > 2.83) =

1. Find the area to the left of 2.83 because it is given in the table. P (z < 2.83) = .9977
2. Subtract the area on the left (.9977) from the total area (1.00) to find the area on the
right: P (z > 2.83) = 1.00 - .9977 = .0023
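
The same subtraction can be sketched in Python, again assuming the standard-library `statistics.NormalDist` (the variable names are illustrative only):

```python
from statistics import NormalDist

# Area to the left of z = 2.83, as a table would give it.
p_left = NormalDist().cdf(2.83)

# Area to the right is the complement, because the total area is 1.00.
p_right = 1.0 - p_left
print(round(p_left, 4), round(p_right, 4))
```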

Question 2
P (z > -2.45) =

Note: > points to the right. We need to find the area to the right of -2.45.


C. Type III: Area Between Two z-values


I. Method
1. Look up both z-values to find the areas to their left.
2. Subtract the smaller area from the larger area to find the area between the two z-values.

–z1 0 z2

II. Example:
Find the area between z = 1.51 and z = 2.12:
1. Sketch the curve
2. Mark the mean
3. Shade the area
4. Find the area to the left of each z separately
5. Subtract the smaller area from the larger
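
Steps 4 and 5 can be sketched numerically as follows. This is a rough check rather than part of the original notes; it assumes Python's `statistics.NormalDist`, and the names `area_left_of` and `area_between` are invented for the example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def area_left_of(z):
    """Area under the standard normal curve to the left of z."""
    return std_normal.cdf(z)

# Step 4: area to the left of each z-value separately.
left_of_151 = area_left_of(1.51)
left_of_212 = area_left_of(2.12)

# Step 5: subtract the smaller area from the larger.
area_between = left_of_212 - left_of_151
print(round(left_of_151, 4), round(left_of_212, 4), round(area_between, 4))
```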

Question 3
Find the area between
z = 0.79 and z = 1.28:


Question 4
Find the area between z = 1.50 and z = 2.50:

Question 5
Find the area between z = -2.16 and z = 0:

Question 6
P (z > -2.56) =

a) .4948
b) .0052
c) .9948
d) -.4948

Question 7
P (-1.76 < z < 2.45) =

a) .9537
b) .0321
c) -.0321
d) None of the above


Question 8
P (1.87 < z < 2.76) =

a) .9664
b) .0278
c) -.0278
d) None of the above

Question 9
P (z > 2.61) =

a) .4955
b) .9955
c) .0045
d) -.0045

D. Review: Finding Probability or Area for z-scores


1. Sketch the normal curve locating z approximately on the left or right of zero.
2. Shade the area requested in the problem.

Type I: To find the probability less than a z-value, use Table E.


Type II: To find the probability greater than z, subtract the probability less than z from 1.00.
Type III: To find the probability between 2 z-values, subtract the smaller area from the
larger area.

Question 10
Calculate each probability using the method listed above. The first one has been completed
for you.

1. P(z > 2.56) = .0052


2. P(-1.87 < z < 0) =
3. P(0 < z < 2.34) =
4. P(-1.76 < z < 2.45) =
5. P(z > 2.61) =

V. For Standard Normal Probability Given Area/Probability, Find z


(Reverse/Backwards of IV)
The normal distribution table may also be used to determine a z-score if we are given the area
(working backwards).


A. Example:
What is the z-score associated with the 85th percentile?

Solution:

In Table E, the Normal Probability Table, find the “area” entry that is closest to 0.8500:

z      0.03     0.04
1.0    0.8485   0.8508

• The target area 0.8500 falls between 0.8485 and 0.8508; the closest entry is 0.8508.
• The z-score that corresponds to this area is 1.04.
• The 85th percentile in a standard normal distribution is 1.04.
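
Working backwards from an area to a z-score corresponds to the inverse of the cumulative distribution. The sketch below assumes Python's `statistics.NormalDist` and its `inv_cdf` method; it is an illustration, not the table method taught here.

```python
from statistics import NormalDist

# z-score with 85% of the area to its left (the 85th percentile).
z_85 = NormalDist().inv_cdf(0.85)
print(round(z_85, 2))
```

Software interpolates between table entries, so the unrounded value differs slightly from the table answer, but it rounds to the same two-decimal z-score.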

B. Given Probability, Find Z


Method:
1. Sketch the normal curve approximating the location for z on the left or right of 0 depending
on the problem.
2. Find the probability or area to the left of z.
3. Use Table E to locate the area entry inside the table.
4. To determine the z value, add the z row label for the integer and tenths digits to the column
label for the hundredths digits.


Question 11
Find the z value that corresponds to the given area (an area of .8962 lies to the right of z).
If you’d like, review the method discussed above before starting.

Question 12
Find the z value that corresponds to the given area or find the z-values that bound the
center 90% of the normal curve.

VI. Probability for Any Normal Curve


Normal Distributions N(μ, σ)

• Since each normally distributed variable has a different mean and standard deviation, the
shape and location of these curves will vary.
• A different table of probabilities would be needed for each mean and standard
deviation pair.
• To avoid this problem, transform any normal distribution to the standard normal
distribution.
• Find the probability for the standard normal distribution and you have the probability for
the original normal distribution.

A. Finding the Probability for Any Normally Distributed x value


I. Method:
1. Find the z-score for any value. The z-value is the number of standard deviations that a
particular X value is away from the mean.

\[ z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} \quad \text{or} \quad z = \frac{x - \mu}{\sigma} \]

\[ z = \frac{x - \bar{x}}{s} \]

2. Find the standard normal probability using the appropriate methods.
II. Example:
Heights of 2-year-old boys are normally distributed with mean 28 inches and standard
deviation 2.4 inches. What is the chance a 2-year-old boy is less than 26 inches tall?
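
The worked solution for this example did not survive in these notes, so the following is only a sketch of how the two-step method could be applied to it. It assumes Python 3.9+ (`statistics.NormalDist.zscore`); the variable names are illustrative.

```python
from statistics import NormalDist

heights = NormalDist(mu=28, sigma=2.4)   # inches, taken from the example

# Step 1: z-score for x = 26, i.e. (26 - 28) / 2.4
z = heights.zscore(26)

# Step 2: standard normal probability to the left of z.
p_table = NormalDist().cdf(round(z, 2))  # z rounded to 2 decimals, as in a table lookup
p_exact = heights.cdf(26)                # same probability without rounding z

print(f"z = {z:.2f}, table-style P = {p_table:.4f}, unrounded P = {p_exact:.4f}")
```

Rounding z to two decimals, as a printed table requires, changes the probability only in the third decimal place.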

III. Example:
The average daily jail population in the United States is 618,319. If the distribution is normal
and the standard deviation is 50,200, find the probability that on a randomly selected day,
the jail population is:
1. greater than 700,000;

P(x > 700,000) = P(z > 1.63) = 1 - .9484 = .0516 or 5.16 %

2. between 500,000 and 600,000.


P(500,000 < x < 600,000) = P(-2.36 < z < -0.36) = .3594 - .0091 = .3503 or 35.03%
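
Both parts of the jail-population example can be reproduced with a short script. This is a sketch under the assumption that z-scores are rounded to two decimals, as they would be for a table lookup; it uses Python's `statistics.NormalDist`, and the helper `z_rounded` is hypothetical.

```python
from statistics import NormalDist

mu, sigma = 618_319, 50_200      # mean and standard deviation from the example
std_normal = NormalDist()

def z_rounded(x):
    """z-score of x, rounded to two decimals like a table lookup."""
    return round((x - mu) / sigma, 2)

# Part 1: P(x > 700,000)
z_700 = z_rounded(700_000)
p_greater = 1 - std_normal.cdf(z_700)

# Part 2: P(500,000 < x < 600,000)
z_500, z_600 = z_rounded(500_000), z_rounded(600_000)
p_between = std_normal.cdf(z_600) - std_normal.cdf(z_500)

print(z_700, round(p_greater, 4))
print(z_500, z_600, round(p_between, 4))
```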


Question 13
The average credit card debt for college seniors is $3262. If the debt is normally distributed
with a standard deviation of $1100, find these probabilities:

1. That an individual senior owes at least $1000;
2. That the senior owes more than $4000;
3. That the senior owes between $3000 and $4000.

Use: \( z = \dfrac{x - \mu}{\sigma} \)

IV. Notation Reminder


If x is a normal random variable with the mean μ and standard deviation σ, this is often
denoted: X ~ N(μ, σ) .

Example: Suppose x is a normal random variable with μ = 35 and σ = 6. A convenient
notation to identify this random variable is: x ~ N(35, 6).

B. Given Normal Probability, find the x value


We also need to find the x values for N(μ, σ) when probabilities are given.

I. Method:
a) Given the area or probability, find the probability or area under the normal curve to
the left of x.
b) Use Table E to locate the area to the left of x inside the table and determine z (see
Section V, part B).
c) Find x where:
\[ z = \frac{x - \mu}{\sigma} \]
\[ z\sigma = x - \mu \]
\[ x = z \cdot \sigma + \mu \]

II. Example:
If the tallest 10% and the shortest 10% of the population is considered abnormal, what is
the range for normal heights of women if heights for women are N (63.6 in, 2.5 in)?

Normal height is between 60.4 and 66.8 inches.
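
The cut-offs in this example are the 10th and 90th percentiles of N(63.6, 2.5). A minimal sketch, assuming Python's `statistics.NormalDist` and its `inv_cdf` method (variable names are illustrative):

```python
from statistics import NormalDist

heights = NormalDist(mu=63.6, sigma=2.5)   # women's heights in inches

# The shortest 10% lie below the 10th percentile,
# the tallest 10% lie above the 90th percentile.
low = heights.inv_cdf(0.10)
high = heights.inv_cdf(0.90)

print(f"normal heights: {low:.1f} to {high:.1f} inches")
```

`inv_cdf` combines steps b) and c): it finds z for the given area and converts it back with x = z ∙ σ + μ.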


III. Example:
The price of a new home in Blair County is normally distributed with a mean price of
$145,500 and a standard deviation of $1,500. Find the minimum and maximum
price to attract the middle 90% of this market.

The middle 90% is split into 2 equal parts by the mean, so the area to the left of –z is .05
and the area to the left of +z is .45 + .50 = .9500.

1. Find the area in Table E closest to 0.9500:

z      0.04     0.05
1.6    0.9495   0.9505

• 0.9500 is exactly half-way between 0.9495 and 0.9505.
Therefore, z = 1.645 and –z = –1.645 by symmetry.
• z = –1.645 and z = 1.645 bound the middle 90% of a normal distribution.

2. Find x where: x = z ∙ σ + μ
z = ±1.645 μ = 145,500 σ = 1500

Left answer: x = –1.645 ∙ 1500 + 145,500 = 143,032.5
Right answer: x = +1.645 ∙ 1500 + 145,500 = 147,967.5

$143,032.50 and $147,967.50 bound the middle 90% of the market prices for these new
homes.


Question 14
Test scores for the police academy are normally distributed with mean 95 and standard
deviation 15. To qualify, a candidate must score in the top 10%. Find the lowest possible
score to qualify.

VII. Distribution of Sample Means


• A sampling distribution of sample means is a distribution obtained by using the means for
all random samples of the same size taken from the same population.
• Sampling error is the difference between the sample measure and the corresponding population
measure due to the fact that the sample is not a perfect representation of the population.

A. Properties of Distribution of Sample Means


I. The mean of the sample means will be equal to the population mean.
II. The standard deviation of the sample means will be smaller than the standard deviation
of the population, and will be equal to the population standard deviation divided by the
square root of the sample size.

B. The Central Limit Theorem (formalizing the properties)


• As the sample size n increases, the distribution of the sample means taken with
replacement from a population with mean μ and standard deviation σ will approach a
normal distribution.
• The mean of the sample means equals the population mean, or: \( \mu_{\bar{x}} = \mu \)
• The standard deviation of the sample means equals \( \sigma_{\bar{x}} = \sigma / \sqrt{n} \) and is called the
standard error of the mean.
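
A small simulation can make these properties concrete. The sketch below is an illustration under stated assumptions: the population is a made-up uniform distribution on [0, 1) (chosen because it is clearly not normal), the sample size and number of samples are arbitrary, and the seed is fixed only so the run is repeatable.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)            # fixed seed for a repeatable run

# Hypothetical population: uniform on [0, 1), so mu = 0.5 and sigma = sqrt(1/12).
population_mu = 0.5
population_sigma = sqrt(1 / 12)

n = 30                            # size of each sample
num_samples = 2000                # how many sample means to collect

sample_means = [
    mean([rng.random() for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)
standard_error = population_sigma / sqrt(n)   # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.4f} (population mean {population_mu})")
print(f"sd of sample means:   {sd_of_means:.4f} (sigma/sqrt(n) = {standard_error:.4f})")
```

The mean of the simulated sample means should sit close to the population mean, and their standard deviation close to σ/√n, even though the population itself is not bell-shaped.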


C. Individual vs. Sample Mean


The central limit theorem is used to answer questions about sample means just like the
normal distribution can be used to answer questions about individual values.
• For individuals: \( z = \dfrac{x - \mu}{\sigma} \)
• For sample means: \( z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}} \)
Note: Finite Population Correction Factor is not part of this course although it is a topic in
the textbook.

Question 15
A.C. Nelson reported that children between 2 and 5 years old watch an average of 14 hours
of TV per week. Assuming the variable (hrs of TV watched) is normally distributed with a
standard deviation of 3, find the chance that:
1. My neighbor John, who is 4, will watch 20 or more hours of TV this week.
2. The average number of hours watched by the 25 members of John’s play group is 20 hours or
more.

Question 16
The average cholesterol content of a certain brand of eggs is 215 milligrams, and the
standard deviation is 15 milligrams. Assume the variable is normally distributed.
1. If a single egg is selected, find the probability that the cholesterol content will be less than
210 milligrams.
2. If a dozen eggs are selected, what is the chance their average cholesterol is 217 or less?
Cholesterol for eggs is N(215, 15).

VIII. TI-83 Calculator Functions


A. Standard Normal Probability on the TI-83
I. Method: normalcdf(L, R)
1. Use the DISTR menu, accessible with (2nd) VARS.
2. Use normalcdf(left endpoint, right endpoint) = normalcdf(lower value, higher value).
3. Press ENTER.
normalcdf(L, R)
(Note: normalpdf is NOT used in this course!)
P(a < z < b) = normalcdf(a, b)

II. Example:
P(2.34 < z < 2.78) = normalcdf(2.34, 2.78) = 0.0069

Question 17
P(-1.34 < z < 3.45) =

Question 18
P(z > 2.34) =

Question 19
P(z < -1.79) =

B. Normal Probability on the TI-83


I. Method: normalcdf(L, R, μ, σ)
1. Use the DISTR menu, accessible with (2nd) VARS.
2. Use normalcdf(left endpoint, right endpoint, mean, standard deviation).
3. Press ENTER.
normalcdf(L, R, μ, σ)
(Note: normalpdf is NOT used in this course!)

II. Example:
For a normal random variable with mean 50 and standard deviation 2, find the
proportion of scores greater than 45.
normalcdf (45, 1000, 50, 2) = .9938
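
For readers without a TI-83, both calculator forms can be mimicked in Python. The function below is a hypothetical stand-in written for this illustration (it is not a built-in); it assumes the standard-library `statistics.NormalDist` and keeps the calculator's convention of using a large number such as 1000 for an open-ended tail.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under the N(mu, sigma) curve between lower and upper.

    With only two arguments it uses the standard normal (mu = 0, sigma = 1),
    mirroring the two-argument calculator form.
    """
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

# Standard normal example: P(2.34 < z < 2.78)
p_standard = normalcdf(2.34, 2.78)

# General example: mean 50, standard deviation 2, proportion of scores above 45
p_general = normalcdf(45, 1000, 50, 2)

print(round(p_standard, 4), round(p_general, 4))
```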

Question 20
For z, find P(z < 2.87)

Question 21
For a normal random variable x with mean 12 and variance 6, find the
probability that x is more than 16.

IX. Summary
• Many variables such as heights, weights, and temperatures have normally distributed data.
• The standard normal distribution has a mean of 0 and a standard deviation of 1.
• All normal distributions are bell-shaped with equal mean, median, and mode.
• For individual data values from a normal distribution use: \( z = \dfrac{x - \mu}{\sigma} \)
• For group averages with n ≥ 30, use: \( z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}} \)


Answer: Question 1
P ( z < 2.16) = .9846 is at the intersection of the Z-column row 2.1 and the top column labeled .06
1. Sketch the curve.
2. Mark the mean.
3. Shade the area.
4. Use the table to find the area to the left of 2.16

Answer: Question 2
Find the area to the right of -2.45

P(z > -2.45) = 1 - .0071 = .9929

Answer: Question 3
Find the area between z = 0.79 and z = 1.28:

P(.79 < z < 1.28) = .8997 - .7852 = .1145


Answer: Question 4
Find the area between z = 1.50 and z = 2.50:

P(1.50 < z < 2.50) = .9938 - .9332 = .0606

Answer: Question 5
Find the area between z = -2.16 and z = 0:

P(-2.16 < z < 0) = .5000 - .0154 = .4846

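Answer: Question 6
P(z > -2.56) =
a.) .4948
b.) .0052
c.) .9948
d.) -.4948
Answer: c.) P(z > -2.56) = 1 - .0052 = .9948

Answer: Question 7
P(-1.76 < z < 2.45) =
a.) .9537
b.) .0321
c.) -.0321
d.) None of the above.
Answer: a.) P(-1.76 < z < 2.45) = .9929 - .0392 = .9537

Answer: Question 8
P(1.87 < z < 2.76) =
a.) .9664
b.) .0278
c.) -.0278
d.) None of the above.
Answer: b.) P(1.87 < z < 2.76) = .9971 - .9693 = .0278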


Answer: Question 9
P(z > 2.61) =
a.) .4955
b.) .9955
c.) .0045
d.) -.0045
Answer: c.) P(z > 2.61) = 1 - .9955 = .0045

Answer: Question 10
Calculate each probability. The first one has been completed for you.
1. P(z > 2.56) = .0052
2. P(-1.87 < z < 0) = .5 - .0307 = .4693
3. P(0 < z < 2.34) = .9904 - .5000 = .4904
4. P(-1.76 < z < 2.45) = .9929 - .0392 = .9537
5. P(z > 2.61) = 1 - .9955 = .0045

Answer: Question 11
Find the z value that corresponds to the given area.

Method:
1. Sketch
2. The area or probability to the left of z is: 1 - .8962 = .1038
3. .1038 is located inside Table E at the intersection of row -1.2 and column .06.
4. z = -1.2 + -.06 = -1.26

Thus, P(z > -1.26) = 0.8962


Answer: Question 12
Find the z-values that correspond to the given area or find the z-values that will bound the center
90% of the normal curve.

The 90% is split into 2 equal 0.45 parts by the mean.

Find the area in Table E closest to .9500 = .5000 + .4500

z      0.04     0.05
1.6    0.9495   0.9505

0.9500 is exactly half-way between 0.9495 and 0.9505. Therefore, z = 1.645.

Using symmetry, z = -1.645 and z = 1.645 bound the middle 90% of a normal distribution.

Answer: Question 13
The average credit card debt for college seniors is $3262. If the debt is normally distributed with a
standard deviation of $1100, find these probabilities:

1. That an individual senior owes at least $1000;

P(x ≥ 1000) = P(z > -2.06) = 1 - .0197 = .9803

2. That the senior owes more than $4000;

P(x > 4000) = P(z > 0.67) = 1 - .7486 = 0.2514 or 25.14%

3. That the senior owes between $3000 and $4000;

P(3000 < x < 4000) = P(-0.24 < z < 0.67) = .7486 - .4052
= 0.3434 or 34.34%


Answer: Question 14
Test scores for the police academy are normally distributed with mean 95 and standard deviation 15.
To qualify, a candidate must score in the top 10%. Find the lowest possible score to qualify.

x=z∙σ+μ
z = 1.28
μ = 95
σ = 15

x = 1.28 ∙ 15 + 95
x = 114.2
114.2 is the lowest score for a candidate to qualify.

Answer: Question 15
A. C. Nelson reported that children between 2 and 5 years old watch an average of 14 hours of TV
per week. Assuming the variable (hrs of TV watched) is normally distributed with a standard
deviation of 3, find the chance that:
1. My neighbor John, who is 4, will watch 20 or more hours of TV this week.
- for an individual, use N(14, 3)
P(x > 20) = P (z > 2) = 1 - .9772 = .0228
2. The average number of hours watched by the 25 members of John’s play group is 20 hours or
more.
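
The working for part 2 is missing from these notes. One way to sketch it, assuming the sample-mean formula z = (x̄ − μ)/(σ/√n) from section VII and Python's `statistics.NormalDist`, is shown below; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 14, 3, 25          # hours per week, from the question

# Part 1: one child, x ~ N(14, 3)
p_individual = 1 - NormalDist(mu, sigma).cdf(20)

# Part 2: mean of 25 children, x-bar ~ N(14, 3 / sqrt(25))
standard_error = sigma / sqrt(n)              # 3 / 5 = 0.6
z_mean = (20 - mu) / standard_error           # about 10 standard errors above the mean
p_group = 1 - NormalDist(mu, standard_error).cdf(20)

print(f"P(x > 20)     = {p_individual:.4f}")
print(f"z for x-bar   = {z_mean:.1f}")
print(f"P(x-bar > 20) = {p_group:.4f}")
```

A z-value near 10 lies far beyond the range of Table E, so the probability that the group average reaches 20 hours is effectively zero, in contrast to the roughly 2% chance for a single child.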

Answer: Question 16
The average cholesterol content of a certain brand of eggs is 215 milligrams, and the standard
deviation is 15 milligrams. Assume the variable is normally distributed.
1. If a single egg is selected, find the probability that the cholesterol content will be less than
210 milligrams.

2. If a dozen eggs are selected, what is the chance their average cholesterol is 217 or less?
Cholesterol for eggs is N(215, 15).
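
The worked answers for this question are also missing here. The sketch below shows one way to compute both parts without rounding z, assuming Python's `statistics.NormalDist`; a table lookup with z rounded to two decimals will differ slightly in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 215, 15               # milligrams, from the question

# Part 1: a single egg, x ~ N(215, 15)
p_single = NormalDist(mu, sigma).cdf(210)

# Part 2: mean of 12 eggs, x-bar ~ N(215, 15 / sqrt(12))
standard_error = sigma / sqrt(12)
p_dozen = NormalDist(mu, standard_error).cdf(217)

print(f"P(x < 210)      = {p_single:.4f}")
print(f"P(x-bar <= 217) = {p_dozen:.4f}")
```

Because the question states that cholesterol content is itself normally distributed, the sample mean is normal even for n = 12, so the sample-mean formula can be used with a sample smaller than 30.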


Answer: Question 17
P(-1.34 < z < 3.45) = normalcdf(-1.34, 3.45) = .9096

Answer: Question 18
P(z > 2.34) = normalcdf(2.34, 1000) = .0096

Answer: Question 19
P(z < -1.79) = normalcdf(-1000, -1.79) = .0367

Answer: Question 20
For z, find P (z < 2.87)
normalcdf (-1000, 2.87) = .9979

Answer: Question 21
For a normal random variable x with mean 12 and variance 6, find the probability that x is more
than 16.

normalcdf (16, 1000, 12, √6) = 0.0512


LESSON TWO

ESTIMATION AND CONFIDENCE INTERVAL

Learning Objectives
At the end of this lesson you should be able to:

• Differentiate between point estimation and interval estimation
• Compute the margin of error associated with a sample mean and a proportion
• Construct and interpret confidence interval estimates to know the precision of the
estimate of a population mean and proportion

2.0 Introduction
In most practical situations there is not enough information to calculate exact population
parameters (such as µ, σ, p), so statistical inference about the characteristics of a
population or process relies on the best estimate of each value obtained from the
corresponding sample statistic (such as x̄, s, and p̄). Using sample statistics to draw
conclusions about population characteristics is one of the fundamental applications of
statistical inference in business and economics. A few applications are given below:

• A production manager needs to determine the proportion of items being
manufactured that do not meet quality standards.
• A telephone service company may be interested to know the average length of a
long distance telephone call and its standard deviation.
• A company needs to understand consumer awareness of its product.
• Any service centre needs to determine the average amount of time a customer
spends in a queue.

In all such cases, a decision-maker needs to examine the following two concepts, which are
useful for drawing statistical inference about unknown population or process
parameters based upon random samples:

i. Estimation: using a sample statistic to estimate an unknown parameter value.
ii. Hypothesis testing: testing a claim or belief about an unknown parameter value.

In this lesson, we shall discuss the methods used to estimate unknown population
parameters and then to determine the range of values (confidence interval) likely to
contain the parameter value.

There are two types of estimation for a population parameter:

i. Point estimation: one particular value
ii. Interval estimation: an interval having its centre at the point estimate

For estimating a parameter value, it is important to know:

i. a point estimate,
ii. the amount of possible error in the point estimate, that is, an
interval likely to contain the parameter value, and
iii. the statement or degree of confidence that the interval contains the
parameter value.

Together, these three pieces of information make up a confidence interval or
interval estimate.

2.1 Point Estimation


In point estimation, a single statistic (such as x̄, s, or p̂) is calculated from the sample to
provide the best estimate of the true value of the corresponding population parameter (such
as µ, σ, or p). Such a statistic is termed a point estimator, and the value
of the statistic is termed a point estimate. For example, we may find that 10
percent of the items in a random sample taken from a day’s production are defective.
Thus, until the next sample of items is drawn and examined, we may continue
manufacturing on the assumption that any day’s production contains 10 percent defective
items.

Properties of a Point Estimator


Before any statistical inference is drawn, it is essential to resolve the following two
important issues:
i. selection of an appropriate statistic to serve as the best estimator of a
population parameter
ii. The nature of the sampling distribution of this selected statistic. Since the
sample statistic value varies from sample to sample, the accuracy of a
given estimator also varies from sample to sample. This means that there
is no certainty of the accuracy achieved for the sample one happens to
draw. Although in practice only one sample is selected at any given time,
we should judge the accuracy of an estimator based on its average value
over all possible samples of equal size. Hence, we prefer to choose the
estimator whose ‘average accuracy’ is close to the value of population
parameter being estimated. The criteria for selecting an estimator are:

• unbiasedness
• consistency
• efficiency

As different sample statistics can be used as point estimators of different
population parameters, the following general notation will be used in this
section:

θ = population parameter (such as µ, σ, or p) of interest being estimated

θ̂ = sample statistic (such as x̄, s, or p̂) or point estimator of θ

where θ is the Greek letter ‘theta’.

Unbiasedness
The value of a statistic measured from a given sample is likely to be above or below the
actual value of the population parameter of interest due to sampling error. Thus it is desirable
that the expected value, or mean, of the statistic over all possible random samples
is equal to the population parameter being estimated. If this is true, then the sample
statistic is said to be an unbiased estimator of the population parameter. Hence, the
sample statistic θ̂ is said to be an unbiased estimator of the population parameter θ,
provided E(θ̂) = θ, where E(θ̂) is the expected value or mean of the sample statistic θ̂.

When E(θ̂) ≠ θ for the sampling distribution of θ̂, then θ̂ is said to be a biased
estimator of θ, as illustrated below:

Figure 2.0: Sampling Distribution of Biased and Unbiased Point Estimators

[Figure: sampling distributions of two statistics, θ̂1 and θ̂2, plotted against the parameter θ. The gap between θ and E(θ̂) of the off-centre distribution is the bias.]

Consistency
A point estimator is said to be consistent if its value θ̂ tends to become closer to the
population parameter θ as the sample size increases. For example, the standard error of
the sampling distribution of the mean, σx̄ = σ/√n, tends to become smaller as the sample size n
increases. Thus the sample mean x̄ is a consistent estimator of the population mean µ.
Similarly, the sample proportion p̂ is a consistent estimator of the population proportion
p because its standard error, σp̂ = √(pq/n) with q = 1 − p, also shrinks as n increases.
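
The shrinking standard error can be seen numerically. The sketch below is illustrative only: the function names and the values σ = 15 and p = 0.5 are assumptions made for the example, and the proportion formula √(pq/n) is the one used for the confidence limits of a proportion later in this lesson.

```python
from math import sqrt

def standard_error_mean(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)

def standard_error_proportion(p, n):
    """Standard error of the sample proportion: sqrt(p * q / n), q = 1 - p."""
    return sqrt(p * (1 - p) / n)

# Hypothetical population values, chosen only for illustration.
sigma, p = 15, 0.5

for n in (25, 100, 400, 1600):
    se_mean = standard_error_mean(sigma, n)
    se_prop = standard_error_proportion(p, n)
    print(f"n = {n:5d}   SE(mean) = {se_mean:.3f}   SE(proportion) = {se_prop:.4f}")
```

Each time the sample size is multiplied by four, both standard errors are halved, which is the sense in which the estimators are consistent.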


Efficiency
For the same population, out of two unbiased point estimators, the one with the
smaller standard deviation is said to be more efficient because it provides an estimate closer to
the population parameter: there is less variation in the
sampling distribution of the statistic. For example, for a simple random sample of size n,
if θ̂1 and θ̂2 are two unbiased point estimators of the population parameter θ, then the
relative efficiency of θ̂1 to θ̂2 is given by

Relative efficiency = σθ̂1 / σθ̂2

where σθ̂1 and σθ̂2 are the standard deviations (standard errors) of the two statistics.

The figure below shows the sampling distributions of two statistics, θ̂1 and θ̂2, which are
being considered for estimation of the population parameter θ. Since the standard deviation
(or error) of statistic θ̂2 is less than that of θ̂1, values of θ̂2 are more likely
to provide an estimate that is close to the parameter θ for a given sample. The statistic θ̂1
tends to have a larger estimation error both above and below the parameter θ. Thus the
estimates obtained from statistic θ̂2 are more consistently close to θ than those of θ̂1.

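Figure 2.1: Sampling Distribution of two Unbiased Estimators

[Figure: sampling distributions of statistic θ̂1 and statistic θ̂2, both centred on the parameter θ; the distribution of θ̂2 is narrower than that of θ̂1.]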

2.2 Interval Estimation


In most practical problems, a point estimate does not provide information about how
close the estimate is to the population parameter unless it is accompanied by a statement of
the possible sampling errors involved, based on the sampling distribution of the statistic. It is
therefore important to know the precision of an estimate before relying on it to make
decisions. Thus, decision makers prefer to use an interval estimate that is likely to contain
the population parameter value. However, it is also important to state how confident one is that the
interval estimate actually contains the parameter value. An interval estimate of a
population parameter is therefore a confidence interval with a statement of confidence
that the interval contains the parameter value.

The confidence interval estimate of a population parameter is obtained by applying the
formula:

Point estimate ± Margin of error

where margin of error = zc × standard error of the particular statistic, and

zc = critical value of the standard normal variable that
represents the confidence level (probability of being correct),
such as 0.90, 0.95, and so on.

2.2.1 Interval estimation of Population Mean (σ Known)


Suppose the population mean µ is unknown and the true population standard deviation σ
is known. Then for a large sample size (n ≥ 30), the interval estimate of the population
mean µ is given by:

x̄ ± zα/2 σx̄   or   x̄ ± zα/2 (σ/√n)

or

x̄ − zα/2 (σ/√n) ≤ µ ≤ x̄ + zα/2 (σ/√n)

where zα/2 is the z value that leaves an area α/2 in each of the right and left tails of the standard
normal probability distribution, and (1 − α) is the level of confidence, as shown in figure
2.2 below:


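Figure 2.2: Sampling Distribution of Mean

[Figure: normal curve with a central area of 95% confidence, made up of Part A (47.5%) and Part B (47.5%) on either side of the centre, and two tails of α/2 = 2.5% each. The horizontal axis marks cumulative areas of 0%, 2.5%, 50%, 97.5% and 100%.]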

Figure 2.2 shows a 95% level of confidence (two tailed test). The desire is to estimate the
mean with 95% confidence. This area will be composed of two equal segments (A and B)
each containing an area of 47.5% (i.e. 50%-2.5% for A and 97.5%-50% for B).
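
The interval for a known σ can be sketched as a small Python function. This is a minimal illustration under stated assumptions: the function name `z_interval` and the sample figures (x̄ = 50, σ = 10, n = 100) are hypothetical, and the critical value zα/2 is taken from `statistics.NormalDist` rather than from a printed z-table.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is known."""
    alpha = 1 - confidence
    # z value leaving an area of alpha/2 in the right tail.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / sqrt(n)          # margin of error
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: mean 50, known sigma 10, n = 100, 95% confidence.
low, high = z_interval(sample_mean=50, sigma=10, n=100, confidence=0.95)
print(f"95% confidence interval: {low:.2f} to {high:.2f}")
```

Raising the confidence level widens the interval, while a larger sample narrows it, because the margin of error is zα/2 · σ/√n.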

2.2.2 Interval estimation for difference of two means


If all possible samples of large sizes n1 and n2 are drawn from two different populations,
then the sampling distribution of the difference between the two sample means x̄1 and x̄2 is approximately
normal with mean (μ1 – μ2) and standard deviation

σx̄1−x̄2 = √(σ1²/n1 + σ2²/n2)

For a desired confidence level, the confidence interval limits for the difference of population means
(μ1 – μ2) are given by

(x̄1 − x̄2) ± zα/2 σx̄1−x̄2

Illustration
The strength of wire produced by company A has a mean of 4500 kg and a standard
deviation of 200 kg. Company B has a mean of 4000 kg and a standard deviation of
300 kg. A sample of 50 wires of company A and 100 of company B are selected at
random to test strength. Find 99% confidence limits on the difference in average
strength of populations of wires produced by the two companies.


Solution
The following information is given:

Company A: x̄1 = 4500, σ1 = 200, n1 = 50
Company B: x̄2 = 4000, σ2 = 300, n2 = 100

Therefore: x̄1 − x̄2 = 4500 − 4000 = 500, and zα/2 = 2.576

σx̄1−x̄2 = √(σ1²/n1 + σ2²/n2) = √(200²/50 + 300²/100) = √(800 + 900) = 41.23

The required 99% confidence interval limits are given by

(x̄1 − x̄2) ± zα/2 σx̄1−x̄2 = 500 ± 2.576(41.23) = 500 ± 106.20

Hence, the difference in the average strength of wires produced
by the two companies is likely to fall in the interval 393.80 to 606.20 kg, with 99% confidence.
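
The arithmetic of this solution can be checked with a short Python sketch. It uses the figures from the solution (means 4500 and 4000, standard deviations 200 and 300, sample sizes 50 and 100); the variable names are illustrative, and the critical value is computed with `statistics.NormalDist` instead of being read from a table.

```python
from math import sqrt
from statistics import NormalDist

# Figures used in the solution above.
mean_a, sigma_a, n_a = 4500, 200, 50
mean_b, sigma_b, n_b = 4000, 300, 100

difference = mean_a - mean_b

# Standard error of the difference between two sample means.
se_diff = sqrt(sigma_a**2 / n_a + sigma_b**2 / n_b)

# Critical z for 99% confidence (area 0.005 in each tail).
z = NormalDist().inv_cdf(1 - 0.01 / 2)

margin = z * se_diff
low, high = difference - margin, difference + margin

print(f"standard error = {se_diff:.2f}")
print(f"99% limits     = {low:.2f} to {high:.2f}")
```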

2.2.3 Interval Estimation of Population Mean (σ Unknown)


In practice, the standard deviation of the population, σ, is often not known. In the large sample
case, the sample standard deviation s provides a good estimate of the population standard deviation
σ, and we use the z-table to find zα/2, the value that leaves an area α/2 in the right tail of the standard normal probability
distribution curve. Hence the interval estimate of a population mean for a large sample
(n > 30) with confidence coefficient 1 − α is given by

x̄ ± zα/2 sx̄ = x̄ ± zα/2 (s/√n)

When the population standard deviation is not known and the sample size is small, the
procedure of interval estimation of population mean is based on a probability distribution
known as the t-distribution. This distribution is very similar to the normal distribution.
However, the t-distribution has more area in the tails and less in the centre than does the
normal distribution. The t-distribution depends on a parameter known as the degrees of
freedom. As the number of degrees of freedom increases, the t-distribution gradually
approaches the normal distribution, and the sample standard deviation s becomes a better
estimate of a population standard deviation σ.


The interval estimate of a population mean when the sample size is small (n ≤ 30), with
confidence coefficient (1 − α), is given by

x̄ ± tα/2 (s/√n)   or   x̄ − tα/2 (s/√n) ≤ µ ≤ x̄ + tα/2 (s/√n)

where tα/2 is the critical value of the t-test statistic providing an area α/2 in the right tail of the
t-distribution with n − 1 degrees of freedom, and

s = √( Σ(xi − x̄)² / (n − 1) )

The critical value of t for the given degrees of freedom can be obtained from the table of the
t-distribution. The procedure for confidence interval estimation of the population mean μ,
according to whether the population standard deviation is known and whether the sample size is large or small, is
summarized in the table below.

Confidence interval for μ

Sample size    Population standard deviation    Interval estimate of population mean μ

Large          σ assumed known                  x̄ ± zα/2 (σ/√n)
Large          σ estimated by s                 x̄ ± zα/2 (s/√n)
Small          σ assumed known                  x̄ ± zα/2 (σ/√n)
Small          σ estimated by s                 x̄ ± tα/2 (s/√n)

Illustration
The personnel department of an organization would like to estimate the family dental
expenses of its employees to determine the feasibility of providing a dental insurance
plan. A random sample of 10 employees reveals the following family dental expenses (in
thousands) in the previous year:

11, 37, 25, 62, 51, 21, 18, 43, 32, 20.


Set up 99% confidence interval of average family dental expenses for the employees of
this organization.

Solution
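The calculations for sample mean and standard deviation are shown below:

Calculations for x̄ and s

Variable x    (x − x̄)    (x − x̄)²
11            −21         441
37              5          25
25             −7          49
62             30         900
51             19         361
21            −11         121
18            −14         196
43             11         121
32              0           0
20            −12         144
Σ 320           0        2358

From the table, the sample mean is

x̄ = Σx/n = 320/10 = Ksh 32.

The sample standard deviation is

s = √( Σ(x − x̄)² / (n − 1) ) = √(2358/9) = √262 = Ksh 16.19

so the standard error of the mean is s/√n = 16.19/√10 = Ksh 5.12.

With n − 1 = 9 degrees of freedom and α/2 = 0.005, the t-table gives tα/2 = 3.250, so the
99% confidence interval is

x̄ ± tα/2 (s/√n) = 32 ± 3.250(5.12) = 32 ± 16.64

Hence the mean expenses per family are likely to fall between Ksh 15.36 and
Ksh 48.64 thousand (15.36 ≤ µ ≤ 48.64).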
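
As a cross-check, the sample standard deviation, the standard error and a 99% t-interval can be computed directly from the ten observations. The sketch below is an illustration only: the variable names are made up, and because the Python standard library has no t-distribution, the critical value 3.250 (area 0.005 in the right tail, 9 degrees of freedom) is assumed to have been read from a t-table.

```python
from math import sqrt
from statistics import mean, stdev

# Family dental expenses (in thousands) from the illustration.
expenses = [11, 37, 25, 62, 51, 21, 18, 43, 32, 20]

n = len(expenses)
x_bar = mean(expenses)        # sample mean
s = stdev(expenses)           # sample standard deviation, divisor n - 1
std_error = s / sqrt(n)       # standard error of the mean

# Assumption: t value for alpha/2 = 0.005 with n - 1 = 9 degrees of
# freedom, taken from a printed t-table.
t_critical = 3.250

margin = t_critical * std_error
low, high = x_bar - margin, x_bar + margin

print(f"mean = {x_bar}, s = {s:.2f}, standard error = {std_error:.2f}")
print(f"99% confidence interval: {low:.2f} to {high:.2f}")
```

Note the distinction between s, which describes the spread of individual families' expenses, and s/√n, which describes the precision of the sample mean; the interval is built from the latter.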

Confidence Limits for Population proportion


If sampling is from an infinite population, or from a finite population with
replacement, the confidence limits for the population proportion are given by:

p ± zc √(pq/n)

where: p is the proportion of successes in the sample of size n
zc is the table value for the required level of confidence
q is (1 − p)
n is the sample size
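
A sketch of these confidence limits in Python follows. The function name `proportion_interval` and the sample (120 successes out of 300) are hypothetical, chosen only to illustrate the formula; zc is obtained from `statistics.NormalDist` rather than from a table.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Confidence limits p ± zc * sqrt(p * q / n) for a proportion."""
    p = successes / n                 # sample proportion of successes
    q = 1 - p
    zc = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = zc * sqrt(p * q / n)
    return p - margin, p + margin

# Hypothetical sample: 120 successes in a sample of 300.
low, high = proportion_interval(successes=120, n=300, confidence=0.95)
print(f"95% confidence limits: {low:.4f} to {high:.4f}")
```

To estimate a count in the whole population, as some of the exercises below ask, the two limits can be multiplied by the population size.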

Exercises
1. Explain the term ‘estimation’ as applied in the field of statistics.
2. Differentiate between point estimation and interval estimation.
3. What criteria are observed when selecting an estimator?
4. What is unbiasedness?
5. What is the difference between consistency and efficiency?

6. In a consignment of 10,000 tennis balls, 400 were drawn at random and
examined. It was found that 20 of these were defective. How many defective balls
can you expect to find in the whole consignment at the 95% level of confidence?

7. A random sample of 50 persons was interviewed to find their preference between
two brands of tea. 35 of those interviewed preferred brand A to brand B. Find the
95% confidence interval for the proportion of persons who prefer brand A.

8. A sample of 16 observations has been taken from a population in which the
random variable is normally distributed. The sample mean is 50 and the sample
standard deviation is 10. Determine the 95% confidence interval for the
population mean.

9. A factory is producing 50,000 pairs of shoes daily. From a sample of 500 pairs,
20% were found to be of substandard quality. Estimate the number of pairs that
can be reasonably expected to be spoiled in the daily production and assign limits
at the 5% level of significance.

10. After an intensive advertisement campaign for polish, the manufacturers wanted to
know how many of the possible customers had read the advertisement. They
selected a random sample of 50 customers and found that only 15 of them had
read the advertisement. Find the 95% confidence interval for the
proportion of customers who had read the advertisement.


LESSON THREE

HYPOTHESIS TESTING

Lesson Objectives
At the end of this lesson you should be able to:

• Understand the rationale for hypothesis testing
• Know the general procedures for hypothesis testing
• Differentiate between type one and type two errors
• Understand the Chi-Square test

3.0 Introduction
The statistician draws inference about the population parameters using the knowledge of
point estimate and interval estimate. Statistical inference is therefore based on the sample
statistic. A statistical hypothesis is a claim or belief about an unknown population
parameter value. The methodology that enables a decision maker to draw inference about
population characteristics by analysing the difference between the value of a sample
statistic and the corresponding hypothesized parameter value is called hypothesis
testing. For example, a drug manufacturing company plans to test the effectiveness of a new
drug against a disease on the belief that 95% of all persons suffering from the disease get
cured. The company will then draw a random sample of patients suffering from the
disease and administer the drug. The number of sampled patients that get cured will
determine the success of the drug. If the percentage of those who get cured is more than
95%, then the drug is likely to be successful.

3.1 The Rationale for Hypothesis Testing


When a random sample is drawn from a population, it is assumed that it will resemble the
population. Based on this assumption, we make an estimate of the unknown population
parameter by using a sample statistic. When a claim is made about the specific value of a
population parameter, it is expected that the corresponding sample statistic will be close
to the hypothesised parameter value. This is possible only if the hypothesised parameter value is
correct and the sample statistic turns out to be a good estimator of the parameter. The
statistic used to test a hypothesis is called a test statistic.

In statistical analysis, the difference between the value of the statistic and the
hypothesised parameter is assessed in terms of probability: we ask whether a
difference of that size is significant, or could reasonably occur by chance, when the hypothesised value is correct.
The probability that a particular level of deviation occurs by chance can be calculated
from the known sampling distribution of the test statistic.


The probability level at which the decision maker concludes that the observed difference
between the value of the test statistic and the hypothesised parameter value cannot be due to
chance is called the level of significance of the test. For example, at the five per cent level a decision-maker
treats as significant any difference so large that it would occur by chance no more than five per cent of the time
if the hypothesis were correct.

3.2 General Procedure for Hypothesis Testing


To test the validity of a claim about the population parameter, a sample is drawn from the
population and analysed. The results of the analysis are used to decide whether the claim
is true or not. The steps of general procedure for any hypothesis testing are summarised
below:

Step One: State the Null Hypothesis (H0) and Alternative Hypothesis (H1)
The null hypothesis H0 refers to the hypothesised numerical value or range of values of
the population parameter. Theoretically hypothesis testing requires that the null
hypothesis be considered true until it is proved false on the basis of results observed from
the sample data. The null hypothesis is always expressed in the form of an equation
making a claim regarding the specific value of the population parameter. That is:

H0: μ = μ0

Where μ is population mean and μ0 represents hypothesised parameter value.

An alternative hypothesis, H1, is the logical opposite of the null hypothesis, that is, an
alternative hypothesis must be true when the null hypothesis is found to be false. The
alternative hypothesis is written as:

H1: μ ≠ μ0

Step Two: State the Level of Significance α (alpha) for the Test
The level of significance, usually denoted by α, is specified before the samples are drawn,
so that the results obtained do not influence the choice of the decision-maker. It is
the probability of wrongly concluding that the null hypothesis H0 is false, that is,
the probability of rejecting H0 when it is true.

Step Three: Establish Critical or Rejection Region


The sample space of an experiment, which corresponds to the area under the sampling
distribution curve of the test statistic, is divided into two mutually exclusive regions called
the acceptance region and the rejection or critical region.


Figure 3.0: Areas of acceptance and Rejection of H0 (two-tailed test)

[Figure: sampling distribution centred on μ = μ0, with the acceptance region (1 − α) in the middle and a rejection region beyond zα/2 in each tail, where H0 is rejected.]

If the value of the test statistic falls into the acceptance region, the null hypothesis is
accepted; otherwise it is rejected. The rejection region consists of all values of the test
statistic that are not likely to occur if the null hypothesis is true. On the other hand, these values
are more likely to occur if the null hypothesis is false. The value of the sample statistic that
separates the regions of acceptance and rejection is called the critical value.

The size of the rejection region is directly related to the level of precision needed to make a
decision about a population parameter. Decision rules concerning the null hypothesis are as
follows:
• If p-value ≤ α, reject H0
• If p-value > α, accept H0

Here the p-value is the probability, computed on the assumption that H0 is true, of
obtaining a test statistic at least as extreme as the one observed. In other words, if this
probability is no greater than α, reject H0; otherwise accept it.

Step Four: Calculate the suitable Test Statistic


The value of the test statistic is calculated from the distribution of the sample statistic by using
the following formula:

Test statistic = [value of sample statistic – value of hypothesised parameter] /
standard error of sample statistic

For example, z = (x̄ − μ)/σx̄ = (x̄ − μ)/(σ/√n), when σ is known and n > 30.

The choice of a probability distribution of a sample statistic is guided by the sample size
n and the value of population standard deviation σ as shown below:


Choice of probability distribution

Sample size n    Population standard deviation σ
                 Known                  Unknown
n > 30           normal distribution    normal distribution
n ≤ 30           normal distribution    t-distribution

Step Five: Reach a Conclusion


Compare the calculated value of the test statistic with the critical value (also called the
standard table value of the test statistic). The decision rule for the null hypothesis is as follows:

|z cal| > z table: reject H0

|z cal| ≤ z table: accept H0

The same decision rules hold for the t-distribution associated with the sampling
distribution of means or proportions when the sample size is small.
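
The five steps above can be sketched as a single Python function for the case where σ is known. This is a minimal illustration, not a prescribed implementation: the function name `z_test`, its `tail` argument and the sample figures in the usage example are assumptions made for the sketch, and critical values come from `statistics.NormalDist` rather than a printed table.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alpha=0.05, tail="two"):
    """Return (z_cal, z_table, reject_h0) for a z test of H0 about mu0."""
    # Step 4: test statistic = (statistic - hypothesised value) / standard error.
    z_cal = (sample_mean - mu0) / (sigma / sqrt(n))

    # Step 3: critical value and rejection region for the chosen alpha.
    standard_normal = NormalDist()
    if tail == "two":                       # H1: mu != mu0
        z_table = standard_normal.inv_cdf(1 - alpha / 2)
        reject_h0 = abs(z_cal) > z_table
    elif tail == "right":                   # H1: mu > mu0
        z_table = standard_normal.inv_cdf(1 - alpha)
        reject_h0 = z_cal > z_table
    elif tail == "left":                    # H1: mu < mu0
        z_table = standard_normal.inv_cdf(alpha)
        reject_h0 = z_cal < z_table
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")

    # Step 5: the caller reads the decision from reject_h0.
    return z_cal, z_table, reject_h0

# Hypothetical data: H0: mu = 200, sigma = 16, a sample of 50 with mean 203.5,
# tested at the 0.01 level with a two-tailed alternative.
z_cal, z_table, reject_h0 = z_test(203.5, 200, 16, 50, alpha=0.01, tail="two")
print(f"z cal = {z_cal:.2f}, z table = {z_table:.2f}, reject H0: {reject_h0}")
```

Steps one and two (stating the hypotheses and choosing α) are represented by the arguments `mu0`, `tail` and `alpha`, which must be fixed before the sample is examined.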

3.3 Errors in Hypothesis Testing


Hypothesis testing is meant to assist the decision maker to arrive at a decision given the
level of significance. Hypotheses are designed in such a way that when one is accepted,
the other is rejected. However, in practice, it is possible to reject
a hypothesis that is in fact true, or to accept one that should have been rejected. This leads to two
types of error:

Type I Error
A Type I error occurs when the decision maker rejects the null hypothesis (H0) when it is
true. The Greek letter α designates the probability of this error.

Type II Error
A Type II error occurs when one accepts the null hypothesis when it is
false. The probability of committing this error is designated by the
Greek letter beta (β).

The following table summarizes the decisions the researcher could make and the possible
consequences.

                   Researcher
Null Hypothesis    Accepts H0          Rejects H0

H0 is true         correct decision    Type I error
H0 is false        Type II error       correct decision


Illustration
A firm manufacturing bottles would commit a Type II error if, unknown to the
manufacturer, an incoming shipment of raw materials contained 15% substandard
materials when the acceptable limit was only 7%. This could happen if, out of a sample of 50
items, two items (4%) were found to be substandard. According to the standard
procedure, the shipment would be accepted because the sample has passed the test (the proportion of bad
items in the sample is less than 7%).

In practice, the decision-maker cannot study every item or individual in the
population. Thus there is a possibility of two types of errors: (1) rejecting the shipment
when the substandard items are less than 7%; and (2) accepting the shipment when the
substandard items are more than 7%. The latter error can be more costly than the former
in real life situations.

One Tailed Test


A one tailed test is one where the rejection region is either to the right or to the left.

Illustration
Suppose that the packaging department at Home Economics Corporation is concerned
that some boxes are overweight. The cereal is packaged in 500-gram boxes. The null
hypothesis will be:

H0: μ ≤ 500.

This is read, “The population mean μ is equal to or less than 500”. The alternative
hypothesis is therefore:

H1: μ > 500.

This is read, “The population mean μ is greater than 500”. The inequality sign in the
alternative hypothesis (>) points to the region of rejection in the right (upper) tail. It
should be noted that the null hypothesis includes the equal sign while the alternate
hypothesis has no equal sign. The packaging problem can be represented using the
diagram below:


Figure 3.1: Sampling Distribution of Statistic (right-tailed test at α = 0.05)

[Figure: normal curve with the rejection region in the right tail beyond zα = 1.65 (α = 0.05); H0 is not rejected to the left of this value.]

The above illustration showed a problem where the rejection region was to the right. In
some problems, the rejection region is in the opposite direction (i.e. to the left).

Illustration
Consider the problem of an automobile manufacturer, large automobile leasing
companies, and other organisations that purchase large quantities of tyres. They want the
tyres to average 40,000 miles of wear under normal usage. They will therefore reject a
shipment of tyres if accelerated life tests show that the life of tyres is significantly below
40,000 miles on average. They will however gladly accept a shipment if the mean life is
greater than 40,000 miles. The procedure to test this problem is to have a sample and test
the null and alternate hypotheses. The tests will be as follows:

H0: μ ≥ 40,000.
H1: μ < 40,000.

The diagram below shows this relationship


Figure 3.2: Sampling Distribution of Statistic (left-tailed test at α = 0.05)

[Figure: normal curve with the rejection region in the left tail below zα = -1.65 (α = 0.05); H0 is not rejected to the right of this value.]

Note: the z values for any level of significance will depend on whether the test is a
two-tailed or a one-tailed test. For example:

Level of significance (α)    z-value(s), two-tailed    z-value(s), one-tailed
0.10                         -1.65 and +1.65           -1.28 or +1.28
0.05                         -1.96 and +1.96           -1.65 or +1.65

One way to determine the location of the rejection region is to look at the direction in
which the inequality sign in the alternate hypothesis is pointing. When the alternate
hypothesis points >, it is a right tailed test; when it points < it becomes a left tailed test.
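
Critical z values for any level of significance can be computed rather than looked up. The sketch below assumes a hypothetical helper named `critical_z` and uses the inverse of the standard normal distribution from the Python standard library; it returns the positive critical value, and the sign is chosen according to the tail of the test.

```python
from statistics import NormalDist

def critical_z(alpha, tails):
    """Positive critical z value for a one-tailed (1) or two-tailed (2) test."""
    if tails == 2:
        # alpha is split equally between the two tails.
        return NormalDist().inv_cdf(1 - alpha / 2)
    if tails == 1:
        # all of alpha sits in a single tail.
        return NormalDist().inv_cdf(1 - alpha)
    raise ValueError("tails must be 1 or 2")

for alpha in (0.10, 0.05, 0.01):
    two = critical_z(alpha, tails=2)
    one = critical_z(alpha, tails=1)
    print(f"alpha = {alpha:.2f}   two-tailed: ±{two:.3f}   one-tailed: {one:.3f}")
```

The exact value for α = 0.05 in one tail is about 1.645, which tables commonly round to 1.65 (or 1.64).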

Two Tailed Test


If no direction is specified in the alternate hypothesis, a two-tailed test is applied. An
example of a pair of hypotheses in which the alternate hypothesis has no direction could be:

• H0: There is no difference between the mean incomes of males and mean
incomes of females.
• H1: There is a difference between the mean incomes of males and mean
incomes of females.

If the null hypothesis is rejected and the alternate hypothesis accepted in this test, the
mean incomes of men could be greater than that of females or vice versa. To
accommodate these two possibilities, the five percent rejection is divided into two areas
of the sampling distribution (2.5 % each) (refer to figure 3.3).

Illustration
Makini Steel Company manufactures and assembles desks at several plants in Western
Kenya. The weekly production of model A325 desk at department Z3 is normally
distributed, with a mean of 200 and a standard deviation of 16. Due to market expansion,
new production methods have been introduced and new employees hired. The CEO
would like to investigate whether there has been an overall change in the weekly
production of the model A325. The CEO further informs you that he is willing to accept a
one percent error in the results of the investigation.

Required:
Assist the CEO in the investigation.

Solution
Statistical hypothesis testing will be used to investigate whether the production has
changed from 200 per week.

Step 1
The null hypothesis is “the population mean is 200.” The alternate hypothesis is “the
mean is different from 200” or “the mean is not 200”

H0: μ = 200.
H1: μ ≠ 200.

This is a two-tailed test because the alternate hypothesis does not state a direction. In
other words, it does not state whether the mean production is greater than 200 or less than
200. The CEO only wants to find out whether the production rate is different from 200.

Step 2
As noted, the one percent error translates to a 0.01 level of significance. This is α, the
probability of committing a Type I error, that is, the probability of rejecting a true
hypothesis.

Step 3
The test statistic for this type of problem is z. Transforming the production data into
standard units permits the use of z values not only in this problem but also in other
hypothesis testing problems.

z = (x̄ − μ) / (σ/√n)

Step 4
The decision rule is formulated by finding the critical value of z from the table. Since this is
a two-tailed test, half of 0.01, or 0.005, is in each tail. The area where H0 is not rejected,
located between the two tails, is therefore 0.99.


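Figure 3.3: Areas of acceptance and Rejection of H0 (two-tailed test)

[Figure: sampling distribution centred on μ = μ0, with an acceptance region of 1 − α = 0.99 (0.4950 on each side of the centre) and a rejection region of α/2 = 0.005 in each tail, where H0 is rejected.]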

The decision rule is therefore: reject the null hypothesis and accept the alternate hypothesis
(which states that the population mean is not 200) if the computed value of z does not fall
in the region between –2.58 and +2.58. Do not reject the null hypothesis if z falls between
–2.58 and +2.58.

Step 5
Take a sample from the population (weekly production), compute z, and, based on the
decision rule, arrive at a decision to reject or not to reject H0.

Illustration
Thomas discount store chain issues its own credit card. The credit manager wants to
know whether the mean monthly unpaid balance is more than Sh 400. The level of
significance is set at 0.05. A random check of 172 unpaid balances revealed the sample
mean is 407 and the standard deviation of the sample is 38. Should the credit manager
conclude the population mean is greater than Sh400, or is it reasonable that the difference
of Sh7 (407-400) is due to chance?

Solution

H0: μ ≤ 400
H1: μ > 400

This test is a one tailed test (right tailed test)

From the table, the critical value of z is 1.65. The computed value of z is 2.42, found by
using the formula

z = (x̄ − μ)/(s/√n) = (407 − 400)/(38/√172) = 7/2.8975 = 2.42

where the sample standard deviation s is used in place of σ because the sample is large.

Because the computed value of the test statistic (2.42) is larger than the critical value
(1.65), the null hypothesis is rejected. The credit manager can conclude that the mean
unpaid balance is greater than Sh400.

The p-value provides additional insight into the decision. The p-value is the probability of
finding a test statistic as large as or larger than that obtained when the null hypothesis is
true. The probability of a z value between 0 and 2.42 is 0.4922. Therefore, the probability
of a z value greater than 2.42 is 0.5 − 0.4922 = 0.0078. That is, the probability of finding
a z value of 2.42 or larger when the null hypothesis is true is 0.78%. It is therefore unlikely
that the null hypothesis is true.
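
The test statistic and p-value for this illustration can be reproduced with a few lines of Python. The sketch uses the sample mean of 407 that appears in the computation, with s = 38 and n = 172; the variable names are illustrative, and the normal probabilities come from `statistics.NormalDist` rather than a z-table.

```python
from math import sqrt
from statistics import NormalDist

# Figures from the credit card illustration.
sample_mean, mu0 = 407, 400
s, n = 38, 172
alpha = 0.05

# Test statistic for H0: mu <= 400 against H1: mu > 400.
z_cal = (sample_mean - mu0) / (s / sqrt(n))

# Right-tailed p-value: probability of a z value at least this large
# when the null hypothesis is true.
p_value = 1 - NormalDist().cdf(z_cal)

reject_h0 = p_value <= alpha

print(f"z = {z_cal:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Comparing the p-value with α leads to the same decision as comparing the computed z with the critical value, since both describe the same tail area.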

3.4 Hypothesis testing for small samples


Although many business applications involve large sample sizes, certain situations make
selection of large samples impractical due to time and cost. For example, an automobile
manufacturer may want to test whether its cars can withstand a head-on crash at 15 km per
hour with a new bumper fitted to the front of the car. This test is very costly, so the
company can afford to crash only a few cars. If the sample size is less than 30, the z-test is not
applicable.

In 1908 W.S. Gosset developed a method for dealing with small samples. He showed that if
we used the same procedures of the z-test for small samples, then a Type I error would be
made more often. For example, when α = 0.05, using the standardised normal
distribution for large samples means making a Type I error (rejecting the null hypothesis when
in fact it is true) 5% of the time. But in small samples, this error will be committed
more than 5% of the time. The reason is that in repeated experiments with small
samples, the value of the sample standard deviation (s) tends to be quite variable.

The variation between the sample mean and the population mean is measured in units of
the standard error of the mean,

Zσx̄ where σx̄ = σ/√n.

For large samples, the population standard deviation can be approximated by the sample
standard deviation, so that

σx̄ = s/√n.

This approximation is not valid for small samples because of wide fluctuations in the values
of the sample standard deviations.

Based upon this variation, Gosset came up with different sets of critical scores called
t-scores (the Student's t distribution). These t-scores are used in place of Z scores. The
larger the sample size, the closer the t-score is to the Z-score.


The t-distribution is useful not only when the sample size is small but also when the
population standard deviation is not known. If the population standard deviation is
known, then even a small sample from a normal population can be tested using the
normal distribution. In addition, the population from which the small sample is taken must
be distributed close to normal. While the mean of a large sample from any population is
approximately normally distributed, a small sample must come from a normal or near
normal population in order for a t-test to be used. The t-scores should not be used if
small samples come from a population which is distributed in a non-normal pattern.

The t-distribution has the following characteristics:


1. Like Z, it is a continuous distribution.
2. Like Z, it is symmetrical and bell shaped.
3. Unlike Z, it is not just one distribution, but a family of distributions.
4. It is flatter at the centre and more spread out than the Z-curve, so it is higher at the tail
ends. However, as the sample size increases, the t-distribution approaches the
Z-distribution.

Inference concerning a population mean (small samples)


For a small sample from a normal population when the population standard deviation is
not known, the t statistic is computed as:

$$ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} $$

where x̄ is the sample mean,
μ is the population mean under the null hypothesis,
s is the sample standard deviation, and
n is the sample size.

The confidence interval for the population mean is given by:

$$ \bar{x} \pm t\,s_{\bar{x}} = \bar{x} \pm t\,\frac{s}{\sqrt{n}} $$

Illustration
A claim is made that McNtany college students have a mean IQ of 120. To test this claim, a
random sample of 10 students was taken and their IQ scores were recorded as follows:

Student 1 2 3 4 5 6 7 8 9 10
IQ 105 110 120 125 100 130 120 115 125 130


At 0.05 level of significance, test the validity of this claim.

Solution
Since the sample size is small (10) and the population standard deviation is not known, we can
use the t-distribution with 10 - 1 = 9 degrees of freedom at α = 0.05.

x        x̄      x - x̄    (x - x̄)2
100     118     -18      324
105     118     -13      169
110     118      -8       64
115     118      -3        9
120     118       2        4
120     118       2        4
125     118       7       49
125     118       7       49
130     118      12      144
130     118      12      144
Total 1180               960

$$ \bar{x} = \frac{\sum x}{n} = \frac{1180}{10} = 118 $$

$$ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{960}{9}} = 10.33 $$

H0: μ = 120
H1: μ ≠ 120

Using the t-test:

$$ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{118 - 120}{10.33/\sqrt{10}} = -0.612 $$

From the table, the critical t-score for a two-tailed test at α = 0.05 and 9 degrees of freedom
is 2.26. Since |t| = 0.612 is less than 2.26, we do not reject the null hypothesis.
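
As a check on the arithmetic, the same one-sample t-test can be sketched in Python using only the standard library. The critical value is the table figure quoted in the text rather than something the code derives, and the variable names are illustrative.

```python
import math
import statistics

iq_scores = [105, 110, 120, 125, 100, 130, 120, 115, 125, 130]
claimed_mean = 120   # population mean under the null hypothesis
critical_t = 2.26    # two-tailed table value at alpha = 0.05 and 9 df (quoted in the text)

n = len(iq_scores)
sample_mean = statistics.mean(iq_scores)
s = statistics.stdev(iq_scores)  # sample standard deviation (divides by n - 1)
t = (sample_mean - claimed_mean) / (s / math.sqrt(n))

print(f"mean = {sample_mean}, s = {s:.2f}, t = {t:.3f}")
print("reject H0" if abs(t) > critical_t else "do not reject H0")
```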

Comparing two population means (small independent samples)


A t-test can be used for comparison of two population means in order to establish
whether there is any significant difference between the two population means, provided
the following conditions are met:
1. Each population is approximately normally distributed.
2. The sample taken from each population is small (less than 30), but the
samples do not have to be equal in size.
3. The two samples should be unrelated (independent of each other).
4. The population standard deviations are unknown but are nearly equal to each
other.


The t-statistic for comparison of two population means is similar to the procedure of
using the z-statistic for comparison of two population means. Two additional elements
are considered when using t-test. These are:

1. The number of degrees of freedom is the sum of the degrees of freedom for each
sample: df = (n1 – 1) + (n2 – 1) = n1 + n2 – 2.
2. The two standard deviations s1 and s2, calculated from two samples of size n1 and
n2 respectively, are pooled to form a single estimate (sp) of the population
standard deviation, where sp is calculated as

$$ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $$

The t-statistic is calculated as follows:

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} $$

where x̄1 is the mean of the first sample,
x̄2 is the mean of the second sample,
n1 is the size of the first sample,
n2 is the size of the second sample, and
sp is the pooled estimate of the population standard deviation.

The calculated t-statistic is compared with the critical t-score from the table at a
given level of significance and (n1 + n2 - 2) degrees of freedom (df). A decision is then made
on whether or not to reject the null hypothesis.

Illustration
A study was carried out to find out whether there is any significant difference in the
amount of money carried by KU male and female students on any given day. A random
sample of 8 male students and 10 female students was selected and the amount of money
they had was recorded. After calculating the sample means and sample standard
deviations, the following results were obtained:

        Male Students    Female Students
n       8                10
x̄       Sh205            Sh170
s       Sh20             Sh15


Solution
H0: μ1 = μ2
H1: μ1 ≠ μ2

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} $$

Calculating sp2:

$$ s_p^2 = \frac{(8 - 1)(20)^2 + (10 - 1)(15)^2}{(8 - 1) + (10 - 1)} = \frac{2800 + 2025}{16} = 301.56 $$

Therefore,

$$ t = \frac{205 - 170}{\sqrt{301.56\left(\frac{1}{8} + \frac{1}{10}\right)}} = \frac{35}{8.237} = 4.25 $$

The critical value of t from the table at α = 0.05 and 16 degrees of freedom is 2.12. Since the
calculated t-value (4.25) is greater than the critical value, we reject the null hypothesis and
conclude that there is a significant difference between the two means.
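
The pooled-variance calculation can be reproduced with the sketch below. It works from the summary figures given in the illustration (n, mean and standard deviation per group); the critical value is the table figure quoted in the text, and all names are illustrative.

```python
import math

# Summary statistics from the illustration
n1, mean1, s1 = 8, 205.0, 20.0    # male students
n2, mean2, s2 = 10, 170.0, 15.0   # female students
critical_t = 2.12                 # two-tailed table value at alpha = 0.05 and 16 df

df = (n1 - 1) + (n2 - 1)
# Pooled variance: each sample variance weighted by its degrees of freedom
pooled_variance = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
t = (mean1 - mean2) / standard_error

print(f"df = {df}, pooled variance = {pooled_variance:.2f}")
print(f"standard error = {standard_error:.3f}, t = {t:.2f}")
print("reject H0" if abs(t) > critical_t else "do not reject H0")
```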

3.5 Chi Square (χ2) Test


The Chi Square (χ2) test is used for data measured on a nominal or ordinal scale.
The nominal scale of measurement deals with data which can only be classified into
categories such as male and female, illiterate and literate, tall and short, etc. There is no
particular order for these groupings and all categories are mutually exclusive (e.g. one
cannot be male and female at the same time).

Chi Square is used to analyse qualitative data such as opinions, habits, etc. and it has the
following properties:
1. It involves squared observations and hence it is never negative. Its value is
always greater than or equal to zero.
2. The distribution is not symmetrical. It is skewed to the right so that its skewness is
positive. As the number of degrees of freedom increases, Chi Square approaches a
symmetric distribution.
3. Similar to t distribution, there is a family of Chi Square distributions.

The degrees of freedom (df) for χ2 are based on the number of categories:
df = k - 1, where k is the number of categories. For example, if a sample of 100 students is
divided into freshmen, juniors, and seniors, there will be k - 1 = 3 - 1 = 2 degrees of
freedom.

The χ2 test examines whether there is a significant difference between the observed number of
responses in each category and the expected number of responses for that category under
the assumption of the null hypothesis. The objective is to find how well the distribution of
observed frequencies fo fits the distribution of expected frequencies fe.


$$ \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} $$

where fo is the observed frequency in a given category and
fe is the expected frequency in the same category under the
assumption of the null hypothesis.

Illustration
Sixty children were asked to state the ice-cream flavour they preferred among the three
available categories of Vanilla, Strawberry, and Chocolate

Flavour Number of children


Vanilla 17
Strawberry 24
Chocolate 19
Total 60

Determine whether children favour any particular flavour compared to other flavours.

Solution
The null hypothesis states that there is no difference among the tastes of children for the
three flavours. Under this hypothesis, equal numbers of children are expected to prefer
each flavour. Therefore, 20 should prefer Vanilla, another 20 should prefer Strawberry, and
the last 20 should prefer Chocolate.

The table of expected frequencies is shown below:

Flavour Observed frequency fo Expected frequency fe


Vanilla 17 20
Strawberry 24 20
Chocolate 19 20
Total 60 60

Calculating the χ2 we get,

$$ \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = \frac{(17 - 20)^2}{20} + \frac{(24 - 20)^2}{20} + \frac{(19 - 20)^2}{20} $$

$$ = \frac{9}{20} + \frac{16}{20} + \frac{1}{20} = 1.3 $$


At α = 0.05 and 3 - 1 = 2 df, the critical χ2 is 5.991. Since our computed value of χ2 (1.3) is less
than the critical value of χ2, we cannot reject the null hypothesis. The data do not show that
children favour any particular flavour.
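
The goodness-of-fit calculation above can be sketched in Python as follows. The expected frequency is taken as the total divided equally among the categories, as the null hypothesis in this illustration assumes; the critical value is the table figure quoted in the text, and the names are illustrative.

```python
observed = {"Vanilla": 17, "Strawberry": 24, "Chocolate": 19}
critical_chi2 = 5.991  # table value at alpha = 0.05 and 2 df (quoted in the text)

total = sum(observed.values())
k = len(observed)        # number of categories
expected = total / k     # equal preference under the null hypothesis
df = k - 1

# Sum of (observed - expected)^2 / expected over all categories
chi2 = sum((fo - expected) ** 2 / expected for fo in observed.values())

print(f"expected frequency per flavour = {expected}")
print(f"chi-square = {chi2:.2f} with {df} df")
print("reject H0" if chi2 > critical_chi2 else "do not reject H0")
```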

Exercises
1. What is test of significance?
2. Explain the procedure generally followed in testing of hypothesis.
3. Differentiate between type I and type II errors.
4. Define the standard error of a statistic. How is it helpful in testing hypotheses and
decision making?
5. Distinguish between the null and alternate hypothesis.

6. In a sample of 1000 persons in an Estate, 660 are found to be consumers of maize
and the rest consumers of rice. Can it be concluded that both food articles are
equally popular?

7. Random samples of 100 bolts manufactured by machine X and 50 bolts from
machine Y showed 10 and 6 defective bolts respectively. Is there a significant
difference in the performance of the two machines?

8. A machine puts out 20 imperfect articles in a sample of 1000. After the machine
is overhauled, it puts out 5 imperfect articles in a sample of 300. Has the machine
improved?

9. A manufacturing company requires that the mean length of the rods it produces
should be 8.6 inches. The standard deviation of these rods is 0.3 inch. The
manufacturer would like to see if the process is working correctly by taking a
random sample of 36. There is no indication of whether the rods may be
too short or too long.
i. Establish null and alternative hypothesis for this problem
ii. If the random sample yields an average length of 8.7 inches, which
hypothesis should be accepted?

10. Bidii investment is an investment organisation that deals with portfolio
management. They claim that they can predict with 80 percent accuracy. Out of
their prediction of 40 stocks, 28 are correct. Is it worthwhile to engage the
services of this firm?

11. The time required by a doctor to treat patients follows a normal distribution curve.
A sample of six patients revealed the following times of treatment in number of
days.

10, 8, 11, 5, 9, and 7.

Construct a 99% confidence interval for the true mean estimate of the time
required for treatment in number of days.


12. A large manufacturing company producing motors claims that orders of motors
are filled on average in 10 days. A random sample of 8 files showed that the
orders were filled in

13, 9, 17, 14, 11, 18, 9, and 13 days.

Assuming α to be 0.05, is the claim of 10 days credible?

13. A machine operator is in charge of two machines (A) and (B). Machine A has
been in service longer than machine B. The operator is interested in finding out
whether both machines take on average the same amount of time in producing the
same product. The operator records the following data on time taken by each
machine in minutes:

Machine A 30 32 28 29 30 25 31 30 28 29
Machine B 33 32 31 30 32 31 31 33 32 32

Test if there is a significant difference in the times for machine A and Machine B
in producing the product. Assume both samples are normally distributed and α =
0.01.

14. In consultation with the professors teaching the introductory business course, the
Chairman proposed that the students' grades should be based on the bell-shaped curve
and that the following grading policy be adopted.

Top 10% A
Next 20% B
Middle 40% C
Next 20% D
Bottom 10% F

Since a common exam is given to a large number of students in various sections
of the course, the chairman wants to check if his policy is being followed. A
random sample of 50 grades from the grade sheets showed the following results:

Grade Frequency
A 10
B 10
C 10
D 14
F 6

The chairman wants you to analyse the data and inform him whether all
professors are following his policy. Assume α = 0.05


LESSON FOUR

CORRELATION ANALYSIS

Lesson Objective
At the end of this lesson you should be able to:

• Explain the product-moment correlation coefficient
• Explain Spearman’s rank correlation coefficient
• Understand the coefficient of determination
• Distinguish between correlation and causation

4.0 Introduction
Correlation is a technique used to measure the strength of the relationship between two
variables. There are two techniques for measuring correlation:
• the product moment method; and
• the rank method.

The purpose of regression analysis is to identify a relationship for a given set of bivariate
data. However, regression alone cannot tell us how good the relationship between X
and Y is. Correlation can be used to provide a measure of how well the regression line ‘fits’
the given set of data.

4.1 Product-Moment Correlation Coefficient


The Product-Moment Correlation Coefficient (r), or simply the correlation coefficient,
is a measure of the degree of linear relationship between two variables, usually labelled X
and Y. While in regression the emphasis is on predicting one variable from the other, in
correlation the emphasis is on the degree to which a linear model may describe the
relationship between two variables. In regression the interest is directional: one variable
is predicted and the other is the predictor. In correlation the interest is non-directional: the
relationship itself is the critical aspect.

The computation of the correlation coefficient is most easily accomplished with the aid of
a statistical calculator. However, it is important to know how the figure is obtained
without the aid of mechanical tools.

The product moment correlation coefficient formula is given by:

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$

The correlation coefficient may take on any value between minus one and plus one:

$$ -1.00 \le r \le +1.00 $$

The sign of the correlation coefficient (+, -) defines the direction of the relationship,
either positive or negative. A positive correlation coefficient means that as the value of
one variable increases, the value of the other variable increases; as one decreases the
other decreases. A negative correlation coefficient indicates that as one variable
increases, the other decreases, and vice-versa.

The absolute value of the correlation coefficient measures the strength of the
relationship. A correlation coefficient of r = 0.50 indicates a stronger degree of linear
relationship than one of r = 0.40. Likewise, a correlation coefficient of r = -0.50 shows a
greater degree of relationship than one of r = 0.40. A correlation coefficient of zero
(r = 0.0) indicates the absence of a linear relationship, and correlation coefficients of
r = +1.0 and r = -1.0 indicate a perfect linear relationship.

Determining Correlation Coefficient


The correlation coefficient may be understood by using the Scattergraph or by
calculation.

Scatterplots
The scatterplots presented below illustrate how the correlation coefficient changes as the
linear relationship between the two variables is altered. When r = 0.0 the points scatter
widely about the plot, the majority falling roughly in the shape of a circle. As the linear
relationship increases, the circle becomes more and more elliptical in shape until the
limiting case is reached (r = 1.00 or r = -1.00) and all the points fall on a straight line.

A number of scatterplots and their associated correlation coefficients are presented
below. It should be noted that although the direction of the relationship can be identified
visually from a scatterplot, one can only guess the magnitude of the correlation. It is
possible to infer that the correlation is negative or positive, but one cannot say, for
example, that the correlation is 0.6 or -0.4 simply by examining the scattergraph.


Illustration – Scattergraphs showing correlation between X and Y

[Graph (a): strongly positive correlation. Weekly Maintenance Cost: Cost (Shs) against Machine Age]

[Graph (b): weak positive correlation. Weekly Maintenance Cost: Cost (Shs) against Machine Age]

[Graph (c): no correlation. Weekly Maintenance Cost: Cost (Shs) against Machine Age]

[Graph (d): near perfectly negative correlation. Factory Worker Performance: Errors Made against Weeks of Experience]

Calculation of Correlation Coefficient


The following formula is used to calculate the correlation coefficient:

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$

Illustration
The following data relate the weekly maintenance cost (Shs) to the age (in months) of
ten machines of similar type in a manufacturing company. You are required to calculate
the product moment correlation coefficient between age and cost.

Machine 1 2 3 4 5 6 7 8 9 10
Age 5 10 15 20 30 30 30 50 50 60
Cost 190 240 250 300 310 335 300 300 350 395

Solution
• Let the machine age be represented by x and the cost by y.
• Tabulate the data and work out the calculations as shown below:


          x       y       xy      x2       y2
          5     190      950      25    36100
         10     240     2400     100    57600
         15     250     3750     225    62500
         20     300     6000     400    90000
         30     310     9300     900    96100
         30     335    10050     900   112225
         30     300     9000     900    90000
         50     300    15000    2500    90000
         50     350    17500    2500   122500
         60     395    23700    3600   156025
Total   300    2970    97650   12050   913050

From the table: n = 10, ∑xy = 97650, ∑x = 300, ∑y = 2970, ∑x2 = 12050 and ∑y2 = 913050.

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$

$$ = \frac{(10 \times 97650) - (300 \times 2970)}{\sqrt{10 \times 12050 - 300^2}\,\sqrt{10 \times 913050 - 2970^2}} $$

$$ = \frac{85500}{174.64 \times 556.42} = 0.880 $$

The result shows a strong positive correlation between machine maintenance cost and
age of machine.
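
The calculation above can be sketched as a small Python function that applies the product moment formula directly to the raw data. The function name pearson_r is illustrative; only the standard library is used.

```python
import math

def pearson_r(xs, ys):
    """Product moment correlation coefficient from the raw-sums formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Machine age (months) and weekly maintenance cost (Shs) from the illustration
age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]

r = pearson_r(age, cost)
print(f"r = {r:.3f}")
```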

Correlation can exist in such a way that increases in the value of one variable tend to be
associated with increases in the value of the other. This is known as positive or direct
correlation. In positive correlation, r will take the value between 0 and 1. When r = 1, it
signifies a perfect positive correlation.

Negative correlation exists when increases in the value of one variable tend to be
associated with decreases in the value of the other (and vice versa). This is known as
negative (or inverse) correlation. In this case, the correlation coefficient,
r, will take a value between 0 and -1, with r = -1 signifying ‘perfect’ negative
correlation.


4.2 Spearman’s Rank Correlation Coefficient


An alternative method of measuring correlation, based on the ranks of the sizes of item
values, is available and known as rank correlation. The measure of rank correlation most
commonly used is known as Spearman’s rank correlation coefficient and the procedure
for obtaining it is as follows:
• rank the x values (to give rx values)
• rank the y values (to give ry values)
• for each pair of ranks, calculate d2 = (rx – ry)2
• calculate ∑d2
• compute the correlation coefficient r using the formula:

$$ r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$

Where n is the number of bivariate pairs.

Illustration:
Find the rank correlation between the rent and rates values shown below:

Rates (x) 1.68 1.46 1.57 13.37 3.18 1.95 1.07 1.71 1.22 6.46
Rent (y) 3.81 4.19 4.87 22.85 6.47 6.48 2.66 6.49 5.33 15.23

Solution

x y rx ry d d2
1.68 3.81 5 2 3 9
1.46 4.19 3 3 0 0
1.57 4.87 4 4 0 0
13.37 22.85 10 10 0 0
3.18 6.47 8 6 2 4
1.95 6.48 7 7 0 0
1.07 2.66 1 1 0 0
1.71 6.49 6 8 -2 4
1.22 5.33 2 5 -3 9
6.46 15.23 9 9 0 0
N = 10 Total 26

From the table, n = 10 and ∑d2 = 26.

Therefore,

$$ r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 26}{10(10^2 - 1)} = 1 - \frac{156}{990} = 0.842 $$

Thus, there is a high positive correlation between rent and rates.
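
The ranking procedure can be sketched in Python as below. The helper ranks is illustrative and assumes there are no tied values, which holds for this data set; tied values would need averaged ranks, which this sketch does not handle.

```python
def ranks(values):
    """Rank values from smallest (rank 1) to largest; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

rates = [1.68, 1.46, 1.57, 13.37, 3.18, 1.95, 1.07, 1.71, 1.22, 6.46]
rent = [3.81, 4.19, 4.87, 22.85, 6.47, 6.48, 2.66, 6.49, 5.33, 15.23]

n = len(rates)
rx, ry = ranks(rates), ranks(rent)
# Squared difference between the two ranks of each pair, summed
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
r_s = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

print("rx:", rx)
print("ry:", ry)
print(f"sum of d squared = {sum_d2}, rank correlation = {r_s:.3f}")
```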

4.3 Coefficient of Determination (R2)


Correlation measures whether there is any relationship between two variables. This
relationship can either be positive or negative. Although correlation tells us about
the existence of a relationship between two variables, it does not tell us the extent to which
one variable can predict the other. The coefficient of determination (R2) is a statistic that
measures how much of the variation in the dependent variable can be explained by the
independent variable.

$$ R^2 = r^2 \quad \text{or} \quad R^2 = \frac{a\sum y + b\sum xy - n\bar{y}^2}{\sum y^2 - n\bar{y}^2} $$

where a and b are the intercept and slope of the least-squares regression line of y on x.

Variance Interpretation
The coefficient of determination (R2) is the proportion of variance in Y that can be
accounted for by knowing X. Conversely, it is the proportion of variance in X that can be
accounted for by knowing Y.

One of the most important properties of variance is that it may be partitioned into
separate additive parts. For example, consider shoe size. The theoretical distribution of
shoe size may be presented as follows:

Note: The shaded region represents females.


If the scores in this distribution were partitioned into two groups, one for males and one
for females, the distributions could be represented as follows:

If one knows the sex of an individual, one knows something about that person's shoe size,
because the shoe sizes of males are on average somewhat larger than those of females.
However, the variance within each distribution, male and female, is variance that cannot
be predicted on the basis of sex (this is known as error variance). For example, even if
one knows that female sizes lie between 3.5 and 11, one cannot know the exact size of a
particular female.

Therefore, total variance is the sum of the variance that can be predicted and the error
variance, or variance that cannot be predicted. This relationship is summarized below:

$$ S^2_{total} = S^2_{predicted} + S^2_{error} $$

The correlation coefficient squared is equal to the ratio of predicted to total variance:

$$ r^2 = \frac{S^2_{predicted}}{S^2_{total}} $$

This formula may be rewritten in terms of the error variance, rather than the predicted
variance as follows:

$$ r^2 = 1 - \frac{S^2_{error}}{S^2_{total}} $$

This formula illustrates the actual components of r2 = R2 = coefficient of determination.
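
A rough numerical check of these relationships is sketched below, using the machine age and maintenance cost data from section 4.1. It assumes a least-squares line y = a + bx is fitted to the data (a and b are illustrative names for its intercept and slope), then computes the coefficient of determination in two ways: from the sums formula, and as one minus the ratio of error variation to total variation.

```python
age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]                 # x
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]     # y

n = len(age)
sum_x, sum_y = sum(age), sum(cost)
sum_xy = sum(x * y for x, y in zip(age, cost))
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in cost)
mean_x, mean_y = sum_x / n, sum_y / n

# Least-squares slope (b) and intercept (a) of the line y = a + bx
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = mean_y - b * mean_x

# Route 1: sums formula, (a*sum_y + b*sum_xy - n*mean_y^2) / (sum_y2 - n*mean_y^2)
r2_sums = (a * sum_y + b * sum_xy - n * mean_y ** 2) / (sum_y2 - n * mean_y ** 2)

# Route 2: one minus (error variation / total variation)
error_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(age, cost))
total_ss = sum((y - mean_y) ** 2 for y in cost)
r2_error = 1 - error_ss / total_ss

print(f"a = {a:.2f}, b = {b:.4f}")
print(f"R squared (sums formula)  = {r2_sums:.4f}")
print(f"R squared (error variance) = {r2_error:.4f}")
```

Both routes should agree with the square of the correlation coefficient found earlier for the same data (0.880 squared, roughly 0.774).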

4.4 The Correlation Matrix


A convenient way of summarizing a large number of correlation coefficients is to put
them in a single table, called a correlation matrix. A Correlation Matrix is a table of all
possible correlation coefficients between a set of variables. For example, suppose a
questionnaire is framed as follows:

1. AGE - What is your age? _____


2. KNOW - Number of correct answers to a Geography question which consisted of
correctly locating 10 districts on the map of Kenya.

3. VISIT - How many provinces of Kenya (excluding your home province) have you
visited? _____

4. DRIVE – Do you have a driving licence? _____

5. SEX - 1 = Male, 2 = Female

The above questionnaire will produce data matrix as follows

Age Know Visit Drive Sex


18 9 5 1 1
18 6 3 0 0
18 6 5 1 1
18 9 2 0 0
45 6 4 1 0
18 5 3 0 1
18 5 4 0 0

Since there are five questions on the example questionnaire, there are 5 * 5 = 25
possible correlation coefficients to be computed. Each computed correlation is then
placed in a table with the variables as both rows and columns, at the intersection of the row
and column variable names. For example, one could calculate the correlation between
AGE and KNOW, AGE and VISIT, AGE and DRIVE, AGE and SEX, KNOW and
VISIT, etc., and place them in a table of the following form.

Age Know Visit Drive Sex


Age 1.00 -0.15 0.11 0.47 -0.35
Know -0.15 1.00 -0.07 0.23 0.05
Visit 0.11 -0.07 1.00 0.80 0.52
Drive 0.47 0.23 0.80 1.00 0.42
Sex -0.35 0.05 0.52 0.42 1.00

One would not need to calculate all possible correlation coefficients. Only ten
calculations (those below the leading diagonal) are necessary. This reduction in the
number of required calculations is due to the following facts:
1. The correlation of a variable with itself is always one (as shown by all the figures
in the leading diagonal).
2. The correlation coefficient is non-directional. That is, it doesn't make any
difference whether the correlation is computed between AGE and KNOW with
AGE as X and KNOW as Y or KNOW as X and AGE as Y. For this reason the
correlation matrix is symmetrical around the diagonal. Therefore all the figures
above the diagonal are redundant. The overall effect is to reduce the number of
correlation coefficients from 25 to 10: 25 (total) - 5 (diagonal) - 10 (redundant
because of symmetry) = 10 unique correlation coefficients.
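
The counting argument can be illustrated with a short Python sketch that builds a correlation matrix by computing only the unique pairs and filling in the rest by symmetry. It uses the seven rows of the data matrix shown above purely as sample input, so the printed coefficients are not expected to match the illustrative table; the function and variable names are assumptions for this example.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Product moment correlation coefficient from the raw-sums formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    return numerator / math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

# Columns of the data matrix shown in the text (seven respondents)
data = {
    "Age":   [18, 18, 18, 18, 45, 18, 18],
    "Know":  [9, 6, 6, 9, 6, 5, 5],
    "Visit": [5, 3, 5, 2, 4, 3, 4],
    "Drive": [1, 0, 1, 0, 1, 0, 0],
    "Sex":   [1, 0, 1, 0, 0, 1, 0],
}
names = list(data)
unique_pairs = list(combinations(names, 2))  # each pair of variables once

# Diagonal is 1 by definition; off-diagonal cells are filled in both directions
matrix = {a: {a: 1.0} for a in names}
for a, b in unique_pairs:
    r = pearson_r(data[a], data[b])
    matrix[a][b] = r
    matrix[b][a] = r

print(f"{len(names) ** 2} cells, {len(unique_pairs)} coefficients actually computed")
for a in names:
    print(f"{a:<6}" + " ".join(f"{matrix[a][b]:6.2f}" for b in names))
```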


Cautions about Interpreting Correlation Coefficients


When using correlation coefficients, it is important to investigate the data for its
appropriateness and to rule out the presence of any outliers.

Appropriate Data Type


Calculators and computers will produce a correlation coefficient regardless of whether or
not the numbers are "meaningful" in a measurement sense. The appropriate data type is
the one that has been measured on the interval scale. Nominal and ordinal data are not
appropriate. Interval property is rarely, if ever, fully satisfied in real applications.
Consequently, there is some difference of opinion among statisticians about when it is
appropriate to assume the interval property.

Effect of Outliers
An outlier is a score that falls outside the range of the rest of the scores on the scatterplot.
For example, if age is a variable and the sample is a statistics class, an outlier would be a
retired individual. Depending upon where the outlier falls, the correlation coefficient may
be increased or decreased.

An outlier which falls near where the regression line would normally fall will
unnecessarily increase the size of the correlation coefficient. On the other hand, an outlier
that falls some distance away from the original regression line would unnecessarily
decrease the size of the correlation coefficient.

The effect of the outliers is somewhat muted when the sample size is fairly large. The
smaller the sample size, the greater the effect of the outlier. When a researcher encounters
an outlier, a decision must be made whether to include it in the data set. It may be that the
respondent was deliberately giving wrong answers, or simply did not understand the
question on the questionnaire. On the other hand, it may be that the outlier is real and
simply different.

Identifying and Remedies for Outliers


The best way of spotting an outlier is by drawing the scattergraph. The decision whether
to include or not include an outlier remains with the researcher; he or she must justify
deleting any data. It is also suggested that the correlation coefficient be computed and
reported both with and without the outlier if there is any doubt about whether or not it is
real data.
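
Reporting the coefficient with and without a suspect point can be sketched as follows. The example reuses the machine age and maintenance cost data from section 4.1 and adds one hypothetical outlier (a 60-month-old machine costing only Shs 100 per week) that is not part of the original data; it is included only to show how far a single point away from the trend can pull r.

```python
import math

def pearson_r(xs, ys):
    """Product moment correlation coefficient from the raw-sums formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    return numerator / math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]

# Hypothetical outlier, invented for this illustration only
outlier_age, outlier_cost = 60, 100

r_without = pearson_r(age, cost)
r_with = pearson_r(age + [outlier_age], cost + [outlier_cost])

print(f"r without the outlier = {r_without:.3f}")
print(f"r with the outlier    = {r_with:.3f}")
```

With only ten genuine observations, one point far from the trend changes the coefficient considerably, which is why both figures are worth reporting when the status of the point is in doubt.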

4.5 Correlation and Causation


No discussion of correlation would be complete without a discussion of causation. It is
possible for two variables to be related (correlated), but not have one variable cause
another.


For example, suppose there is a high correlation between the number of ice-creams sold
and the number of drowning deaths. Does that mean that one should not buy ice-cream
before one swims? Not necessarily. Both of the above variables are related to a common
variable, the heat of the day. The hotter it is, the more ice-cream is sold and the more
people go swimming, and thus the more drowning deaths occur. This is an example of
correlation without causation.

Much of the early evidence that cigarette smoking causes cancer was correlational. It may
be that people who smoke are more nervous and nervous people are more susceptible to
cancer. It may also be that smoking does indeed cause cancer. The cigarette companies
made the former argument, while some doctors made the latter. In this case rationality
demands that one believe the relationship is causal and therefore not smoke (because it
is probable that the cigarette companies are interested in making money). Sociologists are
very much concerned with the question of correlation and causation because much of
their data is correlational. They developed a branch of correlational analysis, called path
analysis, precisely to determine causation from correlations. Before a correlation may
imply causation, certain requirements must be met. These requirements include:

i. The causal variable must temporally precede the variable it causes, and

ii. Certain relationships between the causal variable and other variables must
be met.

If a high correlation was found between the age of the teacher and
the students' grades, it does not necessarily mean that older teachers are
more experienced, teach better, and give higher grades. Neither does it
necessarily imply that older teachers are soft touches, don't care, and give
higher grades. Some other explanation might also account for the results. The
correlation means only that older teachers tend to give higher grades and younger
teachers lower grades. It does not explain why this is the case.

Exercises
1. What is correlation?
2. What is correlation coefficient?
3. Why does correlation coefficient always lie between -1 and +1?
4. Name and explain two types of correlation.
5. With the aid of a scattergraph, illustrate and explain:
i. positive correlation
ii. no correlation
iii. negative correlation

6. Differentiate between correlation and causation.
7. What precautions should be taken when interpreting correlation?
8. Plot a scattergraph using the data below and calculate the product moment
correlation coefficient.

x 1 2 3 4 6 8 9 11 14
y 1 2 2 4 4 5 7 8 9


9. A cost accountant has derived the total cost against output (figures in thousands)
of standard size boxes from a factory over a period of ten weeks, yielding the
following data.

Output 20 2 4 23 18 14 10 8 13 8
Cost 60 25 26 66 49 48 35 18 40 33

Calculate:
i. the product moment correlation coefficient
ii. coefficient of determination and interpret the results

10. On ten different days chosen at random, the following values were obtained for
the share price of a particular company together with the value of the ESE index
on that day.

Price 77 46 80 76 65 71 60 75 76 88
Index 319 315 387 339 383 340 340 356 358 398

Calculate Spearman’s rank correlation coefficient and determine whether the
ESE index can be relied on as an indicator of price.


LESSON FIVE

REGRESSION ANALYSIS

Lesson Objectives
At the end of this lesson you should be able to:

• Define the term “regression analysis”
• Understand the linear bivariate regression relationship
• Formulate regression equations
• Calculate regression coefficients
• Understand the properties of the regression coefficients
• Explain the standard error of estimate
• Understand how to calculate the coefficient of determination

5.0 Introduction
Once it has been established that two variables are closely related, we can investigate
how well one variable can be predicted from the value of the other. For example, if we
know that advertising and sales are correlated, we may find the expected amount of sales
for a given advertising expenditure. This is done using the technique of regression
analysis.

Regression is the act of returning or going back. The term regression was first used in
1877 by Francis Galton while studying the relationship between the heights of fathers and
sons. His findings showed that the heights of fathers have a direct relationship with the
heights of sons, but the average height of sons of tall fathers was less than the average
height of their fathers, while the average height of sons of short fathers was more than
the average height of their fathers. In other words, the heights of sons tended to go back
(regress) towards the overall average.

Regression analysis is a branch of statistical theory that is widely used in most scientific
areas. In economics it is the basic technique for measuring the relationship among
economic variables that constitute the essence of economic theory.

The regression analysis helps in three important ways:


1. It provides estimates of values. The device used to accomplish the estimation
procedure is the regression line which describes the average relationship existing
between X and Y variables.
2. The second goal of regression analysis is to obtain the measure of error involved
in using the regression line as a basis for estimations. For this purpose, the
standard error of estimate is calculated. If the line fits the data closely, good
estimates can be made of the Y variable. On the other hand, if there is a great deal of
scatter of the observations around the fitted regression line, the line will not
produce accurate estimates of the dependent variable.


3. With the help of regression analysis, the measure of the degree of association or
correlation can be obtained. The coefficient of determination calculated for this
purpose measures the strength of relationship that exists between the variables. It
assesses the proportion of variance that has been accounted for by the regression
equation.

The tool of regression analysis can be extended to three or more variables. However, in
this lesson, we shall confine ourselves to problems of two variables only (simple
regression analysis).

Difference Between Correlation and Regression


1. Correlation measures the degree of relationship between X and Y, whereas
regression analysis studies the nature of the relationship between X and Y.
2. The cause and effect relationship is more clearly indicated by regression analysis
than by correlation. Correlation merely shows that there is a relationship between two
variables, while regression requires us to specify which variable is treated as
dependent and which as independent.

5.1 The Linear Bivariate Regression Relationship


In regression analysis, we usually proceed by observing the sample data and using the
results obtained as estimates of the corresponding population relationship. To make valid
inferences, one must assume some population model. For a bivariate population, there are
many possible models that can be constructed to describe the mutual variations of the two
variables. The simple linear regression model is one of the models used to describe the
relationship between two variables. This model is constructed based on the following
assumptions:
1. The value of the dependent variable Y depends upon the value of the
independent variable X. The independent variable X is assumed to take fixed
values that are selected and controlled by the experimenter. The requirement
that the independent variable assumes fixed values is not a critical one. Useful
results can still be obtained by regression analysis in the case where both X and Y
are random variables.
2. The average relationship between X and Y can be adequately described by a
linear equation Y = a + bX, where a is the intercept and b is the slope coefficient
(which can be positive or negative).
3. Associated with each value of X there is a sub-population of Y. The distribution of
the sub-population may be assumed to be normal, or left unspecified in the sense
that it is unknown. In any event, the distribution of each sub-population of Y is
conditional on the value of X.
4. The mean of each sub-population of Y is called the expected value of Y for a given X:
E(Y/X) = μyx. Furthermore, under the assumption of a linear relationship between
X and Y, all values of E(Y/X) or μyx must fall on a straight line,
i.e. E(Y/X) = μyx = a + bX,

which is the population regression equation for the bivariate linear model. In this
equation a and b are called the population regression coefficients.


5. An individual value in each sub-population of Y may be expressed as

Y = E(Y/X) + e

where e is the deviation of a particular value of Y from μyx and is called the error
term or the stochastic disturbance term. The errors are assumed to be independent
random variables because the Y's are independent random variables. The
expectation of these errors is zero: E(e) = 0. Moreover, if the Y's are normal
variables, the errors can also be assumed to be normal.

6. It is assumed that the variances of all sub-populations, called the variances of the
regression, are identical.

Regression Lines
If we take the case of two variables X and Y, two regression lines can be obtained:
the regression line of X on Y and the regression line of Y on X. The regression line of Y
on X gives the most probable values of Y given the values of X. On the other hand, the
regression line of X on Y gives the most probable values of X given the values of Y. Thus,
we have two regression lines. However, when there is a perfect correlation between
the two variables, the two regression lines will coincide (i.e. we will have only one
line). The further the regression lines are from each other, the lesser the degree of
correlation, and vice versa. If the variables are independent, r is zero and the lines of
regression are at right angles (i.e. parallel to the X-axis and the Y-axis).

It should be noted that the regression lines cut each other at the point of the averages of X
and Y, i.e. if from the point where the regression lines cut each other a
perpendicular is drawn to the X-axis, we will get the mean value of X, and if from
that point a horizontal line is drawn to the Y-axis, we will get the mean value of Y.
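
The intersection property can be checked numerically. The following is a minimal Python sketch, not part of the original notes. The data values are hypothetical, and the slopes are computed with the least-squares formulas in deviation form, which are developed later in this lesson.

```python
def regression_lines(xs, ys):
    """Return (a_yx, b_yx, a_xy, b_xy) for Y = a_yx + b_yx*X and X = a_xy + b_xy*Y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b_yx = sxy / sxx  # slope of the regression line of Y on X
    b_xy = sxy / syy  # slope of the regression line of X on Y
    return mean_y - b_yx * mean_x, b_yx, mean_x - b_xy * mean_y, b_xy


def intersection(a_yx, b_yx, a_xy, b_xy):
    """Point where the two lines cross (undefined when b_yx * b_xy == 1, i.e. r = +1 or -1)."""
    # Substitute X = a_xy + b_xy*Y into Y = a_yx + b_yx*X and solve for Y.
    y = (a_yx + b_yx * a_xy) / (1 - b_yx * b_xy)
    x = a_xy + b_xy * y
    return x, y


xs = [2, 4, 6, 8, 10]  # hypothetical X values (mean 6)
ys = [3, 7, 5, 10, 9]  # hypothetical Y values (mean 6.8)
cross_x, cross_y = intersection(*regression_lines(xs, ys))
print(cross_x, cross_y)  # compare with the means of X and Y
```

Up to floating-point rounding, the crossing point should coincide with the pair of means, whatever data are used.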

5.2 Regression equations


Regression equations are algebraic expressions of regression lines. Since there are
two regression lines, there are two regression equations: the regression equation of X
on Y is used to describe the variations in the values of X for given changes in Y and
the regression equation of Y on X is used to describe the variations in the values of Y
for given changes in X.

Regression Equation of Y on X
This equation is expressed as follows:

Ye = a + bX

Where Ye is the dependent variable to be estimated and X is the independent variable.

In this equation, ‘a’ and ‘b’ are two unknown constants which determine the position
of the line. The constants are called parameters of the line. If the value of either or both
of them is changed, another line is determined. The parameter ‘a’ determines the


level of the fitted line (the distance of the line directly above or below the origin). The
parameter ‘b’ determines the slope of the line (the change in Y for a unit change in X).

If the values of the constants a and b are obtained, the line is completely determined.
The challenge is how to obtain these parameters. The method of least squares is used
for this purpose. It chooses a and b so as to minimize the sum of squares of the
vertical deviations of the actual Y values from the estimated Y values. The line fitted
through the data points in this way is the line of best fit.

With the help of algebra and differential calculus, it can be shown that the following
two equations, if solved simultaneously, will yield values of the parameters a and b
such that the least squares requirement is fulfilled.

ΣY = Na + bΣX
ΣXY = aΣX + bΣX²

These equations are usually called the normal equations. In the equations, ΣX, ΣY,
ΣXY and ΣX² are totals computed from the observed pairs of values of the
two variables X and Y to which the least squares estimating line is to be fitted, and N
is the total number of observed pairs of values.

Regression Equation of X on Y
The regression equation of X on Y is expressed as follows:

Xe = a + bY

To determine the values of a and b the following two normal equations are to be
solved simultaneously.

ΣX = Na + bΣY
ΣXY = aΣY + bΣY²

Illustration
Calculate the regression equations of X on Y and Y on X from the following data:

X: 1 2 3 4 5
Y: 2 5 3 8 7


Solution
Calculation of regression equations
X Y X² Y² XY
1 2 1 4 2
2 5 4 25 10
3 3 9 9 9
4 8 16 64 32
5 7 25 49 35
Σ 15 25 55 151 88

Regression equation of X on Y is given by

Xe = a + bY

ΣX = Na + bΣY
ΣXY = aΣY + bΣY²

Substituting the values we get:

15 = 5a + 25b
88 = 25a + 151b

Solving the equations, we get


a = 0.5 and b = 0.5

The required regression equation of X on Y will be given by:

X = 0.5 + 0.5Y

Regression equation of Y on X is given by

Ye = a + bX

ΣY = Na + bΣX
ΣXY = aΣX + bΣX²

Substituting the values we get:

25 = 5a + 15b
88 = 15a + 55b

Solving the equations, we get


a = 1.1 and b = 1.3

The required regression equation of Y on X will be given by:

Y = 1.1 + 1.3X
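
The two pairs of normal equations in this illustration can also be solved mechanically. The sketch below is an illustrative Python version, not part of the original notes. The helper name `solve_normal_equations` is hypothetical, and it applies Cramer's rule to the 2 × 2 system.

```python
def solve_normal_equations(n, sum_u, sum_v, sum_uv, sum_uu):
    """Solve  sum_v = n*a + b*sum_u  and  sum_uv = a*sum_u + b*sum_uu  for (a, b).

    u is the predictor and v is the variable being estimated.
    """
    det = n * sum_uu - sum_u ** 2
    if det == 0:
        raise ValueError("all predictor values are identical; the line is undefined")
    a = (sum_v * sum_uu - sum_u * sum_uv) / det
    b = (n * sum_uv - sum_u * sum_v) / det
    return a, b


xs = [1, 2, 3, 4, 5]
ys = [2, 5, 3, 8, 7]
n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)
sum_yy = sum(y * y for y in ys)

a_yx, b_yx = solve_normal_equations(n, sum_x, sum_y, sum_xy, sum_xx)  # Y on X
a_xy, b_xy = solve_normal_equations(n, sum_y, sum_x, sum_xy, sum_yy)  # X on Y
print(f"Y = {a_yx:.1f} + {b_yx:.1f}X")
print(f"X = {a_xy:.1f} + {b_xy:.1f}Y")
```

Swapping the roles of the two variables is all that is needed to move from the regression of Y on X to the regression of X on Y.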


Illustration 2
After investigations, it has been found that the demand for automobiles in a town depends
mainly upon the number of families residing in that town. Below are given figures for the
sales of automobiles in five cities for the year 2004, and the number of families
residing in those cities.

City    No. of families (‘000s)    Sales of automobiles (‘000s)
A 70 25.2
B 75 28.6
C 80 30.2
D 60 22.3
E 90 35.4

Find a linear equation of Y on X by the least squares method and estimate the sales for the
year 2006 for city A, which is estimated to have 300,000 families, assuming the same
relationship holds true.

Solution
Calculation of regression equation

City X Y X² XY
A 70 25.2 4900 1764
B 75 28.6 5625 2145
C 80 30.2 6400 2416
D 60 22.3 3600 1338
E 90 35.4 8100 3186
Σ 375 141.7 28625 10849

Regression equation of Y on X is given by

Y = a + bX

ΣY = Na + bΣX
ΣXY = aΣX + bΣX²

Substituting the values we get:

141.7= 5a + 375b
10849 = 375a + 28625b

Solving the equations, we get


a = -4.885 and b = 0.443


The required regression equation of Y on X will be given by:

Y = -4.885 + 0.443X

The estimated sales for city A in year 2006 will be

-4.885 + 0.443 * 300
= -4.885 + 132.9
= 128.015

Since both variables are measured in thousands, the expected sales for city A in year
2006 would be about 128,015 automobiles, given a population of 300,000 families.
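
As a cross-check, the same estimate can be reproduced with the closed-form solution of the normal equations. This is a minimal Python sketch, not part of the original notes, and the function name `predict_sales` is hypothetical. Note that 300 thousand families lies far outside the observed range of 60 to 90 thousand, so the estimate rests entirely on the assumption that the same linear relationship continues to hold.

```python
families = [70, 75, 80, 60, 90]         # X, thousands of families
sales = [25.2, 28.6, 30.2, 22.3, 35.4]  # Y, thousands of automobiles

n = len(families)
sum_x = sum(families)
sum_y = sum(sales)
sum_xy = sum(x * y for x, y in zip(families, sales))
sum_xx = sum(x * x for x in families)

# Closed-form solution of the two normal equations.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = (sum_y - b * sum_x) / n


def predict_sales(families_thousands):
    """Estimated sales (thousands of automobiles) for a given number of families (thousands)."""
    return a + b * families_thousands


estimate = predict_sales(300)
print(round(a, 3), round(b, 3), round(estimate, 3))
```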

Deviations taken from arithmetic means of X and Y


The calculations by the direct method discussed above are quite cumbersome when the
values of X and Y are large. The work can be simplified if, instead of dealing with the actual
values of X and Y, we take the deviations of the X and Y series from their respective means.
In such a case the equation Y = a + bX is changed to:

Y − Ȳ = byx (X − X̄)

The value of byx can be obtained as follows:

byx = ∑xy/∑x²

where x = (X − X̄) and y = (Y − Ȳ).

The two normal equations, when written in terms of x and y, become

∑y = Na + b∑x…………………………….(i)
∑xy = a∑x + b∑x²…………………………(ii)

Since ∑x = ∑y = 0,

Equation (i) reduces to Na = 0; therefore, a = 0.

Equation (ii) reduces to ∑xy = b∑x²; therefore, b or byx = ∑xy/∑x².

After obtaining the value of byx, the regression equation can be written in
terms of X and Y by substituting (Y − Ȳ) for y and (X − X̄) for x.
Similarly, the regression equation X = a + bY is reduced to (X − X̄)
= bxy (Y − Ȳ), and the value of bxy can be obtained by

bxy = ∑xy/∑y²


Illustration
The table below contains recorded data showing the test scores made by salesmen on an
intelligence test and their weekly sales:

Salesmen: 1 2 3 4 5 6 7 8 9 10
Test score: 40 70 50 60 80 50 90 40 60 60
Sales ‘000s 2.5 6.0 4.0 5.0 4.0 2.5 5.5 3.0 4.5 3.0

Calculate the regression line of sales on test scores and estimate the probable weekly
sales volume if a salesman makes a score of 100.

Solution
Let the sales be denoted by Y and the test scores by X. The regression equation
will be the regression of Y on X, since the logic is that test scores can influence sales and
not the other way round.

Therefore Y − Ȳ = byx (X − X̄)

Calculation of regression line


Salesman   Test score X   x = X − X̄   x²     Sales Y   y = Y − Ȳ   y²     xy
1          40             -20         400    2.5       -1.5        2.25   30
2          70             10          100    6.0       2.0         4.0    20
3          50             -10         100    4.0       0           0      0
4          60             0           0      5.0       1.0         1.0    0
5          80             20          400    4.0       0           0      0
6          50             -10         100    2.5       -1.5        2.25   15
7          90             30          900    5.5       1.5         2.25   45
8          40             -20         400    3.0       -1.0        1.00   20
9          60             0           0      4.5       0.5         0.25   0
10         60             0           0      3.0       -1.0        1.0    0
N = 10  ∑  600            0           2400   40        0           14     130

X̄ = ∑X/N = 600/10 = 60; Ȳ = ∑Y/N = 40/10 = 4

byx = ∑xy/∑x² = 130/2400 = 0.054

Y-4 = 0.054 (X-60)

Y = 0.76 + 0.054X

When X is 100, Y would be

Y = 0.76 + 0.054(100)

= 6.16


Thus the most probable weekly sales volume if the salesman makes a
score of 100 is Sh6,160.00.
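
The deviation method used in this illustration can be sketched in Python as follows. This is an illustrative sketch, not part of the original notes. It keeps the slope unrounded, so the estimate comes out at about 6.17 rather than the 6.16 obtained above with the slope rounded to 0.054.

```python
scores = [40, 70, 50, 60, 80, 50, 90, 40, 60, 60]         # X, test scores
sales = [2.5, 6.0, 4.0, 5.0, 4.0, 2.5, 5.5, 3.0, 4.5, 3.0]  # Y, weekly sales in thousands

n = len(scores)
mean_x = sum(scores) / n
mean_y = sum(sales) / n

# Deviations from the arithmetic means: x = X - mean of X, y = Y - mean of Y.
x_dev = [x - mean_x for x in scores]
y_dev = [y - mean_y for y in sales]

sum_xy = sum(x * y for x, y in zip(x_dev, y_dev))
sum_xx = sum(x * x for x in x_dev)

b_yx = sum_xy / sum_xx             # slope of the regression of Y on X
intercept = mean_y - b_yx * mean_x  # from Y - mean_y = b_yx * (X - mean_x)
prediction = intercept + b_yx * 100
print(round(b_yx, 4), round(intercept, 2), round(prediction, 2))
```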

Deviations Taken From Assumed Mean


When the actual means of the X and Y variables are fractions, the calculations can be
simplified by taking the deviations from assumed means, dx = X − A and dy = Y − B,
where A and B are convenient values chosen near the means. The value of b will be
calculated as follows:

Regression equation of X on Y: (X − X̄) = bxy (Y − Ȳ)

where: bxy = [N∑dxdy − ∑dx∑dy]/[N∑dy² − (∑dy)²]

Regression equation of Y on X: (Y − Ȳ) = byx (X − X̄)

where: byx = [N∑dxdy − ∑dx∑dy]/[N∑dx² − (∑dx)²]

Once the values of bxy and byx are determined from the above formulae, the regression
equations can be written down.

Illustration
A company wants to assess the impact of R&D expenditure on its annual profit. The
following data represents the information of the last eight years:
Year                       2004  2003  2002  2001  2000  1999  1998  1997
R&D expenditure (‘000s)    9     7     5     10    4     5     3     2
Profits                    45    42    41    60    30    34    25    20

Required:
Estimate the regression equation and predict the profits for 2006 if the amount allocated
for R&D is Sh10,000.


Solution
Let R&D expenditure be denoted by X and profits by Y.

Calculation of regression line

Year     X    dx = X − 6   dx²   Y     dy = Y − 37   dy²    dxdy
1997     2    -4           16    20    -17           289    68
1998     3    -3           9     25    -12           144    36
1999     5    -1           1     34    -3            9      3
2000     4    -2           4     30    -7            49     14
2001     10   4            16    60    23            529    92
2002     5    -1           1     41    4             16     -4
2003     7    1            1     42    5             25     5
2004     9    3            9     45    8             64     24
N = 8 ∑  45   -3           57    297   1             1125   238

Fitting the regression equation of Y on X:

Y − Ȳ = byx (X − X̄)

Ȳ = ∑Y/N = 297/8 = 37.125

X̄ = ∑X/N = 45/8 = 5.625

byx = [N∑dxdy − ∑dx∑dy]/[N∑dx² − (∑dx)²]

= [8 * 238 – (-3)(1)]/[8 * 57 – (-3)²]

= (1904 + 3)/(456 – 9)

= 1907/447 = 4.266

Y – 37.125 = 4.266 (X-5.625)

Y – 37.125 = 4.266X – 23.996

Y = 13.129 + 4.266X

When X = 10, Y shall be:

Y = 13.129 + 4.266 (10)

= 55.789

Therefore the expenditure of Sh10,000 on R&D is likely to result in profits of Sh55,789.
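
The reason the assumed-mean method works is that the slope does not depend on which values the deviations are measured from. The Python sketch below is illustrative only and not part of the original notes. The function name `slope_from_assumed_means` is hypothetical, and the data are taken from the problem statement of this illustration. It computes byx once with the assumed means 6 and 37 and once with deviations taken from zero.

```python
def slope_from_assumed_means(xs, ys, assumed_x, assumed_y):
    """b_yx = [N*sum(dx*dy) - sum(dx)*sum(dy)] / [N*sum(dx^2) - (sum(dx))^2]."""
    n = len(xs)
    dx = [x - assumed_x for x in xs]
    dy = [y - assumed_y for y in ys]
    numerator = n * sum(p * q for p, q in zip(dx, dy)) - sum(dx) * sum(dy)
    denominator = n * sum(p * p for p in dx) - sum(dx) ** 2
    return numerator / denominator


rd_spend = [2, 3, 5, 4, 10, 5, 7, 9]        # X, 1997 to 2004
profits = [20, 25, 34, 30, 60, 41, 42, 45]  # Y, 1997 to 2004

b_assumed = slope_from_assumed_means(rd_spend, profits, 6, 37)
b_raw = slope_from_assumed_means(rd_spend, profits, 0, 0)
print(round(b_assumed, 3), round(b_raw, 3))
```

Both calls should return the same slope, 1907/447, which is why any convenient pair of assumed means may be used.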


5.3 Regression Coefficients


The quantity b in the regression equations is called the regression coefficient or slope
coefficient. Since there are two regression equations, there are also two regression
coefficients:
1. regression coefficient of X on Y
2. regression coefficient of Y on X

Regression coefficient of X on Y
The regression coefficient of X on Y is represented by the symbol bxy or b1. It measures
the change in X corresponding to a unit change in Y. This coefficient is given by:

bxy = r σx/σy

When deviations are taken from the means of X and Y, the regression coefficient is
obtained by

bxy = ∑xy/∑y²

When deviations are taken from assumed means, the value of bxy is obtained as follows:

bxy = [N∑dxdy − ∑dx∑dy]/[N∑dy² − (∑dy)²]

Regression coefficient of Y on X
The regression coefficient of Y on X is represented by the symbol byx or b2. It measures
the change in Y corresponding to a unit change in X. This coefficient is given by:

byx = r σy/σx

When deviations are taken from the means of X and Y, the regression coefficient is
obtained by

byx = ∑xy/∑x²

When deviations are taken from assumed means, the value of byx is obtained as follows:

byx = [N∑dxdy − ∑dx∑dy]/[N∑dx² − (∑dx)²]
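
The form involving r and the standard deviations gives the same coefficients as the deviation form. A minimal Python sketch, not part of the original notes, using the five pairs of values from the first illustration in section 5.2:

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 5, 3, 8, 7]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

sigma_x = math.sqrt(sxx / n)      # standard deviation of X
sigma_y = math.sqrt(syy / n)      # standard deviation of Y
r = sxy / math.sqrt(sxx * syy)    # coefficient of correlation

b_yx = r * sigma_y / sigma_x      # regression coefficient of Y on X
b_xy = r * sigma_x / sigma_y      # regression coefficient of X on Y
print(round(b_yx, 3), round(b_xy, 3))
```

The results should agree with the slopes 1.3 and 0.5 obtained earlier from the normal equations.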

5.3.1 Properties of the Regression Coefficients


1. The coefficient of correlation is the geometric mean of the two regression
coefficients:

$$r = \pm\sqrt{b_{xy} \cdot b_{yx}}$$

Proof: bxy = r σx/σy; byx = r σy/σx. Therefore, bxy * byx = (r σx/σy) * (r σy/σx) = r²


2. If one of the regression coefficients is greater than unity, the other must be less
than unity. This is due to the fact that the coefficient of correlation cannot exceed one
in absolute value. For example, if bxy = 2.4 and byx = 1.6, r would be

√(2.4 * 1.6) = √3.84 = 1.96, which is greater than unity.

For r not to exceed unity, byx should not exceed 1/2.4 = 0.41667. In other words, byx must
not be greater than the reciprocal of bxy, and vice versa, in all cases.

3. Both the regression coefficients will have the same sign. It is not possible to have
one regression coefficient having a minus sign and the other one having a plus
sign.

4. The coefficient of correlation will have the same sign as that of the regression
coefficient. If the regression coefficients have a negative sign, r will also have a
negative sign and if the regression coefficients have a positive sign, r would also
be positive. For example,

If bxy = -0.2 and byx = -0.8,

r = −√(0.2 * 0.8) = −√0.16 = -0.4

5. The average value of the two regression coefficients is greater than or equal to the value
of the coefficient of correlation in absolute terms (i.e. ignoring the negative signs). In
symbols, (bxy + byx)/2 ≥ r. In the example of paragraph (4) above,

(bxy + byx)/2

= (0.2 + 0.8)/2 = 0.5

Therefore, 0.5 > r of 0.4.

6. Regression coefficients are independent of change of origin but not scale.


Illustration
On the basis of the data recorded on supply and price for nine years, calculate the
regression coefficients and the value of r:

Year Supply Price


1 80 145
2 82 140
3 86 130
4 91 124
5 83 133
6 85 127
7 89 120
8 96 110
9 93 116

Solution
Let the price be denoted by Y and supply by X

Calculation of regression coefficients

Year     Supply X   dx = X − 90   dx²   Price Y   dy = Y − 127   dy²    dxdy
1        80         -10           100   145       18             324    -180
2        82         -8            64    140       13             169    -104
3        86         -4            16    130       3              9      -12
4        91         1             1     124       -3             9      -3
5        83         -7            49    133       6              36     -42
6        85         -5            25    127       0              0      0
7        89         -1            1     120       -7             49     7
8        96         6             36    110       -17            289    -102
9        93         3             9     116       -11            121    -33
N = 9 ∑  785        -25           301   1145      2              1006   -469

byx = [N∑dxdy − ∑dx∑dy]/[N∑dx² − (∑dx)²]

= [9 * (-469) – (-25)(2)]/[9 * 301 – (-25)²]

= [-4221 + 50]/[2709 – 625]

= -4171/2084 = -2.001

bxy = [N∑dxdy − ∑dx∑dy]/[N∑dy² − (∑dy)²]

= [9 * (-469) – (-25)(2)]/[9 * 1006 – (2)²]


= [-4221 + 50]/[9054 – 4]

= -4171/9050 = -0.461

$$r = \pm\sqrt{b_{xy} \cdot b_{yx}}$$

Since both regression coefficients are negative, r is negative:

r = −√(2.001 * 0.461) = -0.96
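
The properties listed in section 5.3.1 can be verified on the supply and price data of this illustration. The following Python sketch is illustrative only and not part of the original notes. It computes both coefficients from deviations about the true means and then checks the sign and averaging properties.

```python
import math

supply = [80, 82, 86, 91, 83, 85, 89, 96, 93]          # X
price = [145, 140, 130, 124, 133, 127, 120, 110, 116]  # Y
n = len(supply)
mean_x = sum(supply) / n
mean_y = sum(price) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(supply, price))
sxx = sum((x - mean_x) ** 2 for x in supply)
syy = sum((y - mean_y) ** 2 for y in price)

b_yx = sxy / sxx  # regression coefficient of Y on X
b_xy = sxy / syy  # regression coefficient of X on Y

# r is the geometric mean of the two coefficients and carries their common sign.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

same_sign = (b_yx < 0) == (b_xy < 0)
mean_not_below_r = (abs(b_yx) + abs(b_xy)) / 2 >= abs(r)
print(round(b_yx, 3), round(b_xy, 3), round(r, 2), same_sign, mean_not_below_r)
```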

5.4 Standard Error of Estimate


In order to show how good and representative a regression line is as a description
of the average relationship between two series, we look for a measure of
dispersion about it. If we have a wide scatter or variation of the dots about the
regression line, then it would have to be considered a poor representative of the
relationship. The more closely the dots cluster around the line, the more
representative it is and the better the estimate based on the equation for this line.
And if the dots should all lie on the regression line (a hypothetical situation), then
there is no variation about the line and the correlation is perfect.

The variation about the line of average relationship can be measured in a
manner similar to the measurement of the variation of items about an average. The
measure used is called the standard error of estimate, and it is similar to the
standard deviation.

Just as the standard deviation
measures the scatter of observations in a frequency distribution around the mean
of that distribution, the standard error of estimate measures the scatter of the observed
values of Y around the corresponding computed values of Y on the regression
line. It is computed like a standard deviation, being the square root of the mean
of the squared deviations. The deviations here, however, are not the deviations of the items
from the arithmetic mean; they are the vertical distances of every dot from
the line of average relationship. Each dot indicates an observed Y value, and the
point on the line directly above or below it indicates the corresponding computed value Ye.

The deviation of each dot from the regression line is symbolised by Y − Ye. Thus
the square root of the mean of the squared deviations is

$$S_{yx} = \sqrt{\frac{\sum (Y - Y_e)^2}{N - 2}}$$

Another formula, which is more convenient for computation, is given as:


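$$S_{yx} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{N - 2}}$$

Similarly, we can calculate Sxy as:

$$S_{xy} = \sqrt{\frac{\sum (X - X_e)^2}{N - 2}}$$

$$S_{xy} = \sqrt{\frac{\sum X^2 - a\sum X - b\sum XY}{N - 2}}$$

In each formula, a and b are the constants of the corresponding regression equation
(Y on X for Syx, X on Y for Sxy).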

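When the coefficient of correlation is known, the standard error of estimate can be
calculated easily with the help of the following formulae (which correspond to using N
rather than N − 2 as the divisor):

$$S_{yx} = \sigma_y \sqrt{1 - r^2}$$

$$S_{xy} = \sigma_x \sqrt{1 - r^2}$$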

The standard error of estimate measures the accuracy of the estimated figures. The
smaller the value of standard error of estimate, the closer will be the dots to the
regression line and the better the estimates based on the equation for this line. If the standard
error of estimate is zero, then there is no variation about the line and the correlation will be
perfect.
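
As a small worked check, the standard error of estimate of Y on X can be computed both from the definition and from the shortcut formula. The Python sketch below is illustrative only and not part of the original notes. It reuses the line Y = 1.1 + 1.3X fitted in the first illustration of section 5.2.

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 5, 3, 8, 7]
a, b = 1.1, 1.3  # fitted line Y = 1.1 + 1.3X from the earlier illustration
n = len(xs)

# Definition: squared vertical deviations of each Y from the computed value Ye.
squared_residuals = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s_yx_direct = math.sqrt(squared_residuals / (n - 2))

# Shortcut: sum(Y^2) - a*sum(Y) - b*sum(XY) gives the same sum of squares.
shortcut = (
    sum(y * y for y in ys)
    - a * sum(ys)
    - b * sum(x * y for x, y in zip(xs, ys))
)
s_yx_shortcut = math.sqrt(shortcut / (n - 2))

print(round(s_yx_direct, 3), round(s_yx_shortcut, 3))
```

The two routes agree only when a and b are the least-squares values; the shortcut does not hold for an arbitrary line.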

5.5 Coefficient of Determination


The ratio of the unexplained variation to the total variation represents the proportion of
variation in Y that is not explained by regression on X. Subtracting this proportion
from 1.0 gives the proportion of variation in Y that is explained by regression on X. The
statistic used to express this proportion is called the coefficient of determination
and is denoted by R². It is given by the following formula:

R² = 1 – (Variation in Y remaining after regression on X)/(Total variation in Y)

= 1 – error sum of squares (ESS)/total sum of squares (TSS)

= 1 – ESS/TSS


Some books give R² as ESS/TSS. In such a case ESS does not stand for the error sum of
squares but for the explained sum of squares. It is therefore important to know what
the abbreviation ESS stands for in every context.

R² therefore shows the proportion of the variation in the dependent variable that is
explained by the independent variable. The value of R² is always calculated as a proportion
and cannot exceed unity. When interpreted, the proportion is read as a percentage. For
example, if R² is 0.64, the interpretation will be “64 percent of the variation in the dependent
variable is explained by the independent variable. The remaining 36 percent is due to other factors.”

Illustration:
Given the following bivariate data, fit the regression lines of Y on X and X on Y.
• Predict Y if X is 10
• Predict X if Y is 2.5
• Calculate R²

X Y
-1 -6
5 1
3 0
2 0
1 1
1 2
7 1
3 5

Solution

Deviations are taken from assumed means of 3 for X and 2 for Y.

X      dx = X − 3   dx²   Y     dy = Y − 2   dy²   dxdy
-1     -4           16    -6    -8           64    32
5      2            4     1     -1           1     -2
3      0            0     0     -2           4     0
2      -1           1     0     -2           4     2
1      -2           4     1     -1           1     2
1      -2           4     2     0            0     0
7      4            16    1     -1           1     -4
3      0            0     5     3            9     0
∑ 21   -3           45    4     -12          84    30

i) Regression of Y on X

Y − Ȳ = byx (X − X̄)

byx = [N∑dxdy − ∑dx∑dy]/[N∑dx² − (∑dx)²]


= [8 * 30 – (-3)(-12)]/[8 * 45 – (-3)²]

= [240 - 36]/[360 - 9]

= 204/351 = 0.581

Ȳ = ∑Y/N = 4/8 = 0.5

X̄ = ∑X/N = 21/8 = 2.625

Y – 0.5 = 0.581(X – 2.625)

Y – 0.5 = 0.581X – 1.525

Y = 0.581X – 1.025

If X is 10,

Y = 0.581(10) – 1.025 = 4.785.

ii) Regression of X on Y

X − X̄ = bxy (Y − Ȳ)

bxy = [N∑dxdy − ∑dx∑dy]/[N∑dy² − (∑dy)²]

= [8 * 30 – (-3)(-12)]/[8 * 84 – (-12)²]

= [240 - 36]/[672 - 144]

= 204/528 = 0.386

X – 2.625 = 0.386(Y – 0.5)

X – 2.625 = 0.386Y – 0.193

X = 0.386Y + 2.432

If Y is 2.5, X = 0.386(2.5) + 2.432

= 3.397

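iii) Coefficient of determination (R²)

$$R^2 = \frac{\left(n\sum xy - \sum x \sum y\right)^2}{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)} \quad \text{or} \quad R^2 = \frac{a\sum y + b\sum xy - n\bar{y}^2}{\sum y^2 - n\bar{y}^2}$$

where x and y denote the actual values of the two variables.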


X Y xy x² y²

-1 -6 6 1 36
5 1 5 25 1
3 0 0 9 0
2 0 0 4 0
1 1 1 1 1
1 2 2 1 4
7 1 7 49 1
3 5 15 9 25
∑ 21 4 36 99 68

R² = [(8 * 36) – (21 * 4)]²/[(8 * 99 – 21²) * (8 * 68 – 4²)]

= (288 – 84)²/[(792 – 441) * (544 – 16)]

= 204²/(351 * 528) = 41616/185328 = 0.225
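
The link between this formula and the definition R² = 1 – ESS/TSS (with ESS as the error sum of squares) can be checked numerically. The following Python sketch is illustrative only and not part of the original notes. It fits the least-squares line of Y on X to the data of this illustration and computes R² in both ways.

```python
xs = [-1, 5, 3, 2, 1, 1, 7, 3]
ys = [-6, 1, 0, 0, 1, 2, 1, 5]
n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)
sum_yy = sum(y * y for y in ys)

# Least-squares line of Y on X from the normal equations.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = (sum_y - b * sum_x) / n

mean_y = sum_y / n
error_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
total_ss = sum((y - mean_y) ** 2 for y in ys)                   # total variation
r2_from_ss = 1 - error_ss / total_ss

# Formula in terms of the raw sums, as used in the illustration.
r2_from_sums = (n * sum_xy - sum_x * sum_y) ** 2 / (
    (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
)
print(round(r2_from_ss, 3), round(r2_from_sums, 3))
```

Both values should agree, so only about 22 percent of the variation in Y is explained by X for these data.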

Exercises
1. Explain the concept of regression and point out its usefulness in dealing with
business problems.

2. Distinguish between correlation and regression in studying the interdependence of
two variables.

3. What is linear regression and when is it used?

4. The following data give the hardness (X) and the tensile strength (Y) of 7 samples
of metal in certain units. Find the linear regression equation of Y on X.

1 2 3 4 5 6 7
X 146 152 158 164 170 176 182
Y 75 78 77 89 82 85 86

5. The average daily wage for working class individuals in Industrial Area is Sh150
and for those in Westlands is Sh180. Their standard deviations are Sh20
and Sh30 respectively. The coefficient of correlation is 0.67. Find the most likely
wage in Westlands corresponding to a wage of Sh200 in Industrial Area.


6. There are two series of index numbers, D for disposable personal income and S for
a salary of the company. The mean and standard deviation of the D series are
120 and 15 respectively, and of the S series 115 and 10 respectively. The
coefficient of correlation between the two series is 0.75. From the given
information, obtain a linear equation for estimating the values of S for different values
of D. Can the same equation be used for estimating
values of D given S?

7. The following marks have been obtained by a group of students in statistics:

1 2 3 4 5 6 7 8 9 10
Paper I 80 45 55 56 58 60 65 68 70 75
Paper II 82 56 50 48 60 62 64 65 70 75

Required:
Regress paper II on paper I

8. What are the precautionary measures that an analyst should consider before using
regression to solve business problems?

9. An industrial engineer collected the following data on experience and
performance rating of 8 operators:

Operators 1 2 3 4 5 6 7 8
Experience (yrs) 16 12 18 4 3 10 5 12
Performance 87 88 89 68 58 80 70 85

Estimate the performance rating of an operator having:
i. 9 years of experience
ii. 15 years of experience

10. A financial analyst obtained the following information relating to the return on
security A and that of market portfolio M for the past 8 years:

Year 1 2 3 4 5 6 7 8
Return on A 10 15 18 14 16 16 18 4
Return on M 12 14 13 10 9 13 14 7

i. Develop an estimating equation that best describes these data.
ii. Find the coefficient of determination and interpret it.
iii. Calculate the standard error of estimate.


LESSON SIX

ANALYSIS OF VARIANCE (ANOVA)

Lesson Objectives
At the end of this lesson you should be able to:

• Explain the term ANOVA as used in statistics
• Differentiate between a one way and a two way ANOVA
• Know how to interpret the ANOVA table

6.0 Introduction
Analysis of variance is a technique employed by a researcher to find whether three or
more sample means are statistically significantly different from each other or instead can
be regarded as derived from the same population. ANOVA (analysis of variance) is a test
of the null hypothesis that population means are equal. The following are cases where
ANOVA can be used:
• Testing whether plumbers, electricians and carpenters all have about the same
average income
• Testing the effectiveness of different promotional devices used to improve sales
• Testing whether production volumes differ between shifts in a factory

Being a test of difference among 3 or more means, ANOVA is an extension of the t-test,
which is used to test for difference between 2 means. If we reject the null hypothesis, we
still must determine which sample means differ from the others.

ANOVA is therefore a method for testing the hypothesis that the sample means of several
groups are derived from the same population. In many studies there are several sources
of variation. For example, when studying crime rates in different regions of
a country, ANOVA would allow us to separate the effects of region
from the effects of community size on crime rates. If only one of these effects is
examined, the process is known as "one-way ANOVA".

F - Test
The F test statistic is computed by dividing MSbetween by MSwithin (MSbetween /
MSwithin). It is a ratio of two estimates of variance. The F-test can be used to test the null
hypothesis that none of the variance in the dependent variable is due to group effects. In
order to do this, two assumptions are made:
1. The groups are independently drawn from normal distributions
2. The population variances of the groups are identical (this
assumption is termed homoscedasticity; when population variances differ,
they are termed heteroscedastic)


Research hypotheses often involve inferences from sample data about the equality of
means of two populations, in which case the t or z distributions are appropriate to test for
significant differences. If comparisons involve assessment of sameness vs. difference in
three or more means, the F distribution and ANOVA are used instead. Using the term
"analysis of variance" for a test of differences between means may seem a little confusing. This
seemingly misleading term is explained by the fact that the goal of ANOVA is to determine
whether there is a difference among a set of means, but because there are more than two
means under consideration, the way to make this judgment is to evaluate the variance
among those means compared with the variance within each sub-sample. To make these
comparisons, it is necessary to adjust for differences in the number of cases on which
the variances being compared are based. The Between SS is divided by its degrees of freedom
(k - 1), where k is the number of groups; similarly, the Within SS is divided by its degrees
of freedom (N - k), where N is the total number of observations. If the F ratio is large
enough to warrant rejecting the null hypothesis, then the differences among sub-sample
means are large relative to the average variance within sub-samples.

Use of the F distribution to test for differences among three or more means requires
the assumptions that random, independent samples are drawn from normal populations
that have the same variance. In actual practice, however, the F-test has been found to
work well even when these assumptions are not met, unless the departures from those
assumptions are very large. The F distribution is not symmetric, and its shape depends
upon the degrees of freedom associated with the numerator and denominator. If the
computed F-ratio is larger than the critical value (found in an F-distribution table in the
back of most statistics books) associated with a particular alpha level (e.g., 0.05, 0.01,
0.001), then we can reject the null hypothesis and conclude that there are group effects.

In order to use an F distribution table you must calculate the degrees of freedom for the
mean sum of squares (both the MS "between" and "within"). After calculating these
values, go to the F distribution table. The degrees of freedom for the numerator
(MSbetween) are located across the top of the table. The degrees of freedom for the
denominator (MSwithin) are located down, along the left-hand side of the F distribution
table. Find the critical value associated with the degrees of freedom for the numerator
and denominator by finding the intersection of the two in the F distribution table. If the
computed F-ratio is larger than the critical value associated with a particular alpha level,
then we can reject the null hypothesis and conclude that there are group effects.

6.1 One-Way ANOVA


The one-way ANOVA is used if we are testing the hypothesis that several populations
(represented by samples) have equal means. A one-way ANOVA partitions the total sum of
squares into two components: (1) the sum of squares lying between the means of the
group categories, the "between" sum of squares, or SSbetween; and (2) the sum of squared
deviations from the group means, the "within" sum of squares, or SSwithin. This division
relies on the fact that the same value can be simultaneously added to and subtracted from
an expression without changing its value. The "between-group" sum of squares summarizes
the effects of the independent classification variable under study. However, cases may
differ within a specific group because of random factors such as sampling variation, or
effects of unobserved causal variables. The "within-group" sum of squares reflects
these unmeasured factors.

After computing the sums of squares (SSbetween, SSwithin, and SStotal) and the degrees of
freedom, the next step in ANOVA is to compute the mean squares corresponding to
SSbetween and SSwithin. The two mean squares each estimate a variance. MSbetween
estimates the variance due to group effects. MSwithin estimates the variance due to
error. If no group effects occur, then the two estimates should be about equal. However, if
a significant group effect exists, the "between-group" variance, called the mean square
between (MSbetween), will be larger than the "within-group" variance, called the mean
square within (MSwithin). Finally, MSbetween and MSwithin are used to calculate the
F-statistic. The F test statistic is computed by dividing MSbetween by MSwithin
(MSbetween / MSwithin).
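The partition described above can be written compactly. The block below is a sketch in standard textbook notation, which is an assumption here since the notes do not define these symbols: x_ij is observation i in group j, n_j is the size of group j, x̄_j is the mean of group j, x̄ is the grand mean, k is the number of groups and N is the total number of observations.

```latex
\begin{aligned}
SS_{total}   &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}\right)^2 \\
SS_{between} &= \sum_{j=1}^{k} n_j \left(\bar{x}_j - \bar{x}\right)^2 \\
SS_{within}  &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}_j\right)^2 \\
SS_{total}   &= SS_{between} + SS_{within} \\
F            &= \frac{MS_{between}}{MS_{within}}
              = \frac{SS_{between}/(k-1)}{SS_{within}/(N-k)}
\end{aligned}
```

The fourth line follows from adding and subtracting the group mean inside each squared deviation; the cross-product term sums to zero within every group.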

Illustration
To test the significance of variation in retail prices of potatoes in three Kenyan towns
(Nairobi, Nyeri and Eldoret), four vendors were chosen at random in each city and the
prices recorded in Shillings.

Vendor 1 2 3 4
Nairobi 16 8 12 14
Nyeri 14 10 10 6
Eldoret 4 10 8 8

Do the data indicate that the prices in the three towns are significantly different?

Solution
Let us assume the null hypothesis that there is no significant difference in the prices of a
commodity in the three towns.

Sample 1 (Nairobi)    Sample 2 (Nyeri)    Sample 3 (Eldoret)
x1      x1²           x2      x2²         x3      x3²
16      256           14      196         4       16
8       64            10      100         10      100
12      144           10      100         8       64
14      196           6       36          8       64
∑ 50    660           40      432         30      244

T = Sum of all observations in the three samples
= 50 + 40 + 30
= 120


Correction Factor (CF)

CF = T²/n

= 120²/12 = 1200

Total Sum of Squares (SST)

= sum of squares – CF

= (660 + 432 + 244) – 1200 = 136

Sum of squares between samples (SSB)

= ∑(Tj²/nj) – CF, where Tj is the total of sample j and nj is its size

= (50²/4 + 40²/4 + 30²/4) – 1200

= 5000/4 – 1200 = 50

Sum of Squares Within (SSW)

SSW = SST – SSB

= 136 – 50 = 86

Degrees of Freedom
v1 = k – 1 = 3 – 1 = 2

v2 = n – k = 12 – 3 = 9

Mean Square Between (MSB)

MSB = SSB/v1 = 50/2 = 25

Mean Square Within (MSW)

MSW = SSW/v2 = 86/9 = 9.55

F – Test

F = MSB/MSW = 25/9.55 = 2.617


The results of a one-way analysis of variance are usually presented in an ANOVA
summary table. This table provides easy access to all of the information needed to
interpret the hypothesis test.

ANOVA table (for potato prices in three towns)

Source of Variation    Sum of squares    Degrees of freedom    Mean squares    Test-statistic
Between Samples        50                2                     25              F = 25/9.55 = 2.617
Within Samples         86                9                     9.55
Total                  136               11

The table value for v1 = 2, v2 = 9, and 0.05 level of significance is 4.26. Since the calculated
value of F is less than the critical (table) value, the null hypothesis is not rejected. Hence we
conclude that there is no significant difference in the prices of potatoes in Nairobi, Nyeri
and Eldoret.
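The hand calculation above can be checked with a short script. This is a minimal sketch in plain Python that follows the same correction-factor route as the illustration; the function name `one_way_anova` and the variable names are illustrative choices, not part of the notes.

```python
def one_way_anova(groups):
    """Return (ssb, ssw, sst, df_between, df_within, f_ratio) for a list of samples."""
    all_values = [x for g in groups for x in g]
    n = len(all_values)                  # total number of observations
    k = len(groups)                      # number of samples (groups)
    grand_total = sum(all_values)        # T
    cf = grand_total ** 2 / n            # correction factor T^2 / n
    sst = sum(x * x for x in all_values) - cf               # total sum of squares
    ssb = sum(sum(g) ** 2 / len(g) for g in groups) - cf    # between samples
    ssw = sst - ssb                                          # within samples
    df_between, df_within = k - 1, n - k
    f_ratio = (ssb / df_between) / (ssw / df_within)
    return ssb, ssw, sst, df_between, df_within, f_ratio


prices = {
    "Nairobi": [16, 8, 12, 14],
    "Nyeri": [14, 10, 10, 6],
    "Eldoret": [4, 10, 8, 8],
}
ssb, ssw, sst, df1, df2, f_ratio = one_way_anova(list(prices.values()))
print(f"SSB={ssb:.2f} SSW={ssw:.2f} SST={sst:.2f} df=({df1}, {df2}) F={f_ratio:.3f}")
```

Carried at full precision the ratio is about 2.616, which agrees with the hand result of 2.617 up to the rounding of MSW. The critical value (4.26) still has to be taken from an F table, since the sketch does not compute it.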

ANOVA is normally used when a classification results in three or more groups, but this
does not mean that it cannot be used where there are only two groups. When two groups
are compared, ANOVA and the t-test give identical results, but for reporting purposes
researchers use the t-test.

The one-way ANOVA model is very useful for instances when there is a single variable
that classifies observations into groups. However, for two or more variables that classify
observations into groups we must turn to the two-way ANOVA.

6.2 Two-Way ANOVA


In one-way ANOVA we worked with one nominal scale variable (the subsidiaries, or
columns). The activities performed with one-way ANOVA will help us understand the
two-way, or more complex N-way ANOVA. In a two-way analysis of variance we will
obtain the total variation and explain as much of the total variance as we can from our
first variable.

In a two-way analysis of variance we have three estimates of the variance and an estimate
based on the total sum of squares. These estimates will be used to make two separate F
tests. The numerators of the two F tests will contain the estimates based on the
between-columns and between-rows sums of squares respectively. The error term will be
the denominator of each of the F tests. Each F test tests for a relationship between
the interval-scale variable and one of the nominal-scale variables. We could perform an
N-way analysis of variance by controlling for more variables in a similar manner.

A standard two-way ANOVA table is shown below:

Source of Variation    Sum of squares    Degrees of freedom    Mean squares                 Test-Statistic
Between Columns        SSC               c-1                   MSC = SSC/(c-1)              F1 = MSC/MSE
Between Rows           SSR               r-1                   MSR = SSR/(r-1)              F2 = MSR/MSE
Residual error         SSE               (c-1)(r-1)            MSE = SSE/[(c-1)(r-1)]
Total                  SST               n-1

Illustration
To study the performance of three detergents and three different water temperatures, the
following readings were obtained with specially designed equipment:

Detergents performance table

Water Temperature Detergent A Detergent B Detergent C


Cold water 57 55 67
Warm water 49 52 68

Hot water 54 46 58

Perform a two-way analysis of variance, using 5% level of significance

Solution
Let the null hypotheses be that there is no significant difference in performance among
the three detergents, and no significant difference in performance due to water
temperature. For simplicity, we will code the data by subtracting 50 from each observation.


Detergents performance table (coded)

Water Temperature    x1    x1²    x2    x2²    x3    x3²    Row sum
Cold water           7     49     5     25     17    289    29
Warm water           -1    1      2     4      18    324    19
Hot water            4     16     -4    16     8     64     8
∑                    10    66     3     45     43    677    56

T = sum of all observations = 56

Correction Factor CF = T²/n = 56²/9 = 348.44

SSC = sum of squares between detergents (columns)

= (10²/3 + 3²/3 + 43²/3) – CF

= (33.33 + 3 + 616.33) – 348.44 = 304.22

SSR = sum of squares between water temperatures (rows)

= (29²/3 + 19²/3 + 8²/3) – CF

= (280.33 + 120.33 + 21.33) – 348.44 = 73.55

SST = Total sum of squares

= ∑x² – CF

= (66 + 45 + 677) – 348.44 = 439.56

SSE = SST – (SSC + SSR) = 439.56 – (304.22 + 73.55) = 61.79

MSC = SSC/(c – 1) = 304.22/(3 – 1) = 304.22/2 = 152.11

MSR = SSR/(r – 1) = 73.55/(3 – 1) = 73.55/2 = 36.775

MSE = SSE/[(c – 1)(r – 1)] = 61.79/(2 × 2) = 61.79/4 = 15.447


Two-way ANOVA Table

Source of Variation    Sum of squares    Degrees of freedom    Mean squares    Variance Ratio
Between Columns        304.22            2                     152.11          F1 = 152.11/15.447 = 9.847
Between Rows           73.55             2                     36.775          F2 = 36.775/15.447 = 2.380
Residual error         61.79             4                     15.447
Total                  439.56            8

i. Since the calculated value of F1 = 9.847 at df1 = 2, df2 = 4 and α (level of
significance) = 0.05 is greater than the table value of F = 6.94, the null hypothesis
is rejected. Hence we conclude that there is a significant difference between the
performances of the three detergents.

ii. Since the calculated value of F2 = 2.380 at df1 = 2, df2 = 4 and α = 0.05 is less
than the table value of F = 6.94, the null hypothesis is not rejected. Hence we
conclude that water temperature does not make a significant difference in the
performances of the three detergents.
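The two-way calculation (one observation per cell) can be sketched in the same way. The code below is a plain-Python illustration applied to the coded detergent data; the function name `two_way_anova` and the dictionary keys are hypothetical names chosen for this example.

```python
def two_way_anova(table):
    """Two-way ANOVA with one observation per cell.

    `table` is a list of rows; each row holds one value per column.
    """
    r, c = len(table), len(table[0])
    n = r * c
    total = sum(sum(row) for row in table)
    cf = total ** 2 / n                                   # correction factor
    sst = sum(x * x for row in table for x in row) - cf   # total SS
    ssr = sum(sum(row) ** 2 / c for row in table) - cf    # between rows
    col_totals = [sum(row[j] for row in table) for j in range(c)]
    ssc = sum(t ** 2 / r for t in col_totals) - cf        # between columns
    sse = sst - ssc - ssr                                 # residual error
    msc = ssc / (c - 1)
    msr = ssr / (r - 1)
    mse = sse / ((c - 1) * (r - 1))
    return {"SSC": ssc, "SSR": ssr, "SSE": sse, "SST": sst,
            "F_columns": msc / mse, "F_rows": msr / mse}


# Rows: cold, warm, hot water; columns: detergents A, B, C (data coded by subtracting 50).
coded = [[7, 5, 17],
         [-1, 2, 18],
         [4, -4, 8]]
result = two_way_anova(coded)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

At full precision the ratios come out at roughly 9.85 for detergents and 2.38 for water temperature, in line with the table above up to rounding. Because sums of squares are unchanged when a constant is subtracted from every observation, running the function on the uncoded readings gives the same result.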

Exercises
1. What is analysis of variance (ANOVA) and when is it used?
2. How is ANOVA used to solve business problems?
3. Briefly describe the procedure followed in ANOVA.
4. Explain the difference between one-way and two-way ANOVA.
5. Discuss the application of F-test in ANOVA.

6. In order to determine whether there are significant differences in the durability of
three makes of bicycles, sample sizes of n=5 are selected from each make and the
frequency of repair during the first year of purchase is observed. The following
results were recorded:


Make
X Y Z
5 8 7
6 10 3
8 11 5
9 12 4
7 4 1

Is there any significant difference in the durability of the three types of bicycles?

7. A car rental agency is deciding which of five different brands of tyre to
purchase. Standard equipment used to record the distances covered gave the
following data on the kilometres endured by each brand of tyre.

Tyre Brand
A B C D E
36000 46000 35000 45000 41000
37000 39000 42000 36000 39000
42000 35000 37000 39000 37000
48000 37000 43000 35000 35000
47000 48000 38000 32000 38000

Advise the company on whether it should be indifferent when purchasing the
tyres, given that tyre prices are not significantly different.

8. Research was conducted to test soil fertility. Each of three blocks of land was
subdivided into four equal parcels of land. An equal number of Sukuma-Wiki
seedlings was grown on each parcel and the yields were as follows:

Block
Parcel yield (kgs) A B C
1 50 40 30
2 90 70 50
3 110 80 80
4 100 100 90

Test whether the productivity of the parcels of land is significantly different.


LESSON SEVEN

TIME SERIES ANALYSIS

Lesson Objectives
At the end of this lesson you should be able to:

• Understand the terminology of time series
• Understand the various methods of calculating a time series trend
• Understand the seasonal variation component of a time series
• Understand how time series are used in forecasting

7.0 Introduction
Statistical data that are recorded over time are known as time series. What is required of
such data is an understanding of the context in which the data originate and of the nature
of their variation in the short and the long term. For example, one may want to ask:
• Why are sales varying from one month to the other?
• Why are purchases fluctuating from time to time?

The answers to the above questions could be investigated with the help of time series
knowledge. Time series enables the structure of data to be understood, trends to be
identified and forecasts made.

A time series is the name given to the values of some statistical variables measured over a
uniform set of time points. Any business, large or small, will need to keep records of
items like sales, purchases, stock values, and VAT. When these records accumulate over
time, they form a time series.

The framework within which time series are analysed is called a time series model. Time
series is a wide and complex statistical area and so are the models used to analyze time
series data. In this lesson, we shall confine ourselves to the basics of time series and basic
time series models. For this reason, two basic models will be considered:
• The additive model.
• The multiplicative model.

7.1 Terminologies in Time Series

Time Series Cycle


Normally, time series data has a general pattern which broadly repeats itself. Each
repetition of this pattern is called a cycle. For example, the sales of a supermarket
exhibit a definite 7-day cycle (assuming the supermarket is open from Monday to
Sunday); the fees collected at Kenyatta University could exhibit a three-period cycle
under the assumption that the university operates two normal semesters and one
trimester.

Time series models


Time series models are mathematical techniques which are used to explain the
relationship of time series data mathematically (usually in equation form). In order to
explain the movements of time series data, models can be constructed which describe
how various components combine to form individual data values. For example, a
marketing manager can use the following time series model to explain the expense
claimed by salesmen.

y = f + t

Where: y = total weekly expenses
f = fixed expenses such as meals and accommodation
t = travelling expenses

Description of Time Series Components


Trend- the underlying long term tendency of data.

Seasonal Variation- short term cyclic fluctuations in the data about the trend.
The season can range from as short as one day to a long period like one year or a
decade.

Residual variation- these are other factors not explained by either the trend or
the seasonal factors. They consist of random factors or disturbances (such as
weather conditions, breakdowns, etc) and long-term cyclic factors (such as
standard trade cycles, economic recessions etc)

The major difference between seasonal factors and random factors is that the
former can be anticipated while the latter cannot. For example, a bar operator on
Nairobi's Biashara Street can know the average quantities of drinks normally
consumed on a Friday ('members day'), but cannot anticipate a riot in the city
centre that forces customers to rush home early.

Time Series Analysis


Time series analysis involves the evaluation and extraction of the components of a model.
The components of the model are isolated into particular series that are understandable
and explainable. Common components of time series models are:
• Trends
• Extraneous factors (such as seasonal factors)
• Residual components (error terms)


Graphing Time Series


When a time series is represented graphically, the graph normally used is known as a
historigram. A historigram is a line diagram that shows how the items of interest
(sales, purchases, expenses, prices etc.) fluctuate with time. For example, the following
historigram shows the monthly share price of a certain stock traded in Kenya in the year
2004:

[Figure: Company BTA Monthly Share Price (2004). Line chart; vertical axis: Price (0 to 350); horizontal axis: Month (1 to 12).]

From the above chart, can you notice any of the following?
• A trend
• A cycle

If you were a trader and knew the trend of this share price in advance, what would you do
assuming that you are:
• A buyer
• A seller

7.2 Time Series Trend


A time series can be additive or multiplicative. A time series is additive when it is of the
following structure:

y=t+s+r


Where: y is a given time series value
t is the trend component
s is the seasonal component
r is the residual component

The multiplicative time series model is of the following structure:

y = t × S × R

Where: y is a given time series value
t is the trend component
S is the seasonal component
R is the residual component

The small t in both models shows that the trend component is the same no matter
which of the two models is used. The seasonal and residual components depend on
which model is being used.

The trend is the core component of the additive time series model about which the two
other components, seasonal (s) and residual (r) variation, fluctuate. This component is
found by identifying separate trend (t) values, each corresponding to a time point. There
are several ways of obtaining trend values for a given time series.

Techniques for Measuring Trends


There are three techniques of extracting a trend from a set of time series values:

a) The Method of Semi-Averages


This is the simplest technique. It involves calculating two average points (x, y) which,
when plotted on a chart, are joined to form a straight line. The method of semi-averages
for obtaining the trend in a time series is illustrated below:

Illustration
The following sales (in ‘000s) were recorded for a firm. From this data, obtain a semi-average
trend.

Day    Week 1 sales    Week 2 sales
Mon    250             260
Tue    320             380
Wed    340             410
Thu    520             670
Fri    410             420


Note that the data is ordered according to time of occurrence, which is common for a
time series.

To obtain a trend using semi-average method, four steps are followed:

Step 1:
Split the data into an upper and a lower group
For the data given:
The lower group is week 1 (250, 320, 340, 520, and 410)
The upper group is week 2 (260, 380, 410, 670, and 420)

Step 2:
Find the mean value of each group
The mean of the lower group (L) is:

(250 + 320 + 340 + 520 + 410)/5

= 1840/5 = 368.

The mean of the upper group (U) is:

(260 + 380 + 410 + 670 + 420)/5

= 2140/5 = 428.

Step 3:
Plot each mean on a graph against the appropriate time point.
An appropriate time point can always be taken as the median time point of the respective
group. Thus (L) would be plotted against Wednesday of week 1 and (U) against
Wednesday of week 2.

Step 4:
The line joining the two plotted points is the required trend.
Note: it is important that the two groups in question have an equal number of data values.
If the given data contains an odd number of data values, the middle value can
be ignored (for purposes of obtaining the trend line).

Once a trend line has been obtained, the trend values corresponding to each time point
can be read off from the graph.
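Instead of reading trend values off a graph, the line through the two semi-average points can be computed directly. The sketch below assumes the time points are coded x = 1, 2, ..., n (a coding not used in the illustration itself) and applies the four steps to the weekly sales data; the function name is illustrative.

```python
def semi_average_trend(values):
    """Return (slope, intercept) of the semi-average trend line.

    Time points are assumed to be coded x = 1, 2, ..., len(values).
    """
    n = len(values)
    half = n // 2
    lower = values[:half]                # Step 1: lower group
    upper = values[n - half:]            # upper group (middle value dropped if n is odd)
    mean_lower = sum(lower) / half       # Step 2: group means
    mean_upper = sum(upper) / half
    x_lower = (1 + half) / 2             # Step 3: median time point of each group
    x_upper = (n - half + 1 + n) / 2
    # Step 4: straight line through (x_lower, mean_lower) and (x_upper, mean_upper)
    slope = (mean_upper - mean_lower) / (x_upper - x_lower)
    intercept = mean_lower - slope * x_lower
    return slope, intercept


sales = [250, 320, 340, 520, 410, 260, 380, 410, 670, 420]
slope, intercept = semi_average_trend(sales)
trend_values = [intercept + slope * x for x in range(1, len(sales) + 1)]
print(slope, intercept)
print(trend_values)
```

For the sales data the two plotted points are (3, 368) and (8, 428), so the trend rises by 12 (‘000s) per day and passes through 368 on Wednesday of week 1 and 428 on Wednesday of week 2.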

Illustration (data to be referred to in subsequent lesson illustrations)

The following set of data shows passenger movements between the towns of Nairobi and
Eldoret.


Year Quarter Passengers in ‘000s


1 1 220
1 2 500
1 3 790
1 4 320
2 1 290
2 2 520
2 3 820
2 4 380
3 1 320
3 2 580
3 3 910
3 4 410

Required:
Using the data given above:
i. Use the method of semi average to obtain and plot a trend line.
ii. Draw up the table showing the original data (y) values against the trend (t)
values obtained from the graph.

Solution
i. Split the data into lower (L) and upper (U) groups in equal proportion.

Year Quarter Passengers in ‘000s


1 1 220
1 2 500
1 3 790
1 4 320
2 1 290
2 2 520
∑ 2,640
mean 440

Year Quarter Passengers in ‘000s


2 3 820
2 4 380
3 1 320
3 2 580
3 3 910
3 4 410
∑ 3,420
mean 570

The lower mean (440) is plotted at the median time point of the lower group, midway
between Year 1 Q3 and Year 1 Q4. The upper mean (570) is plotted at the median time
point of the upper group, midway between Year 3 Q1 and Year 3 Q2.


The plotted graph is shown below:

[Figure: Passengers (Nairobi-Eldoret) Route. Line chart; vertical axis: Passengers '000s (0 to 700); horizontal axis: Quarter (Y1Q1 to Y3Q4).]

ii. The trend values when read from the graph and compared with the original
values will be as follows:
Passengers in ‘000s
Year Quarter Original Values Trend Values
1 1 220 390
1 2 500 410
1 3 790 430
1 4 320 450
2 1 290 470
2 2 520 490
2 3 820 520
2 4 380 540
3 1 320 560
3 2 580 580
3 3 910 600
3 4 410 620

b) The Method of Least Squares Regression


The technique of least squares regression was explained and demonstrated in lesson four.
In order to use the method of least squares to obtain a trend line, the following steps are
followed:


Step 1
Take the physical time points as values of the independent variable x.

Step 2
Take the data values themselves as values of the dependent variable y.

Step 3
Calculate the least squares regression line of y on x, y = a + bx

Step 4
Translate the regression line y = a + bx into the trend line t = a + bx, where any given
value of the time point x will yield a corresponding value of the trend, t.

Using the data of the previous illustration:


Let y be the number of passengers in a quarter
Let x be the time points (quarters) coded from 1 to 12 for the three years

x       y       xy       x²
1       220     220      1
2       500     1000     4
3       790     2370     9
4       320     1280     16
5       290     1450     25
6       520     3120     36
7       820     5740     49
8       380     3040     64
9       320     2880     81
10      580     5800     100
11      910     10010    121
12      410     4920     144
∑ 78    6060    41830    650

The regression line is y = a + bx

a and b are calculated as follows:

b = (n∑xy – ∑x∑y) / (n∑x² – (∑x)²)

= (12 × 41830 – 78 × 6060) / (12 × 650 – 78²)

= 29280/1716 = 17.06


a = ∑y/n – b(∑x/n)

= 6060/12 – (17.06 × 78)/12

= 505 – 110.89 = 394.11

When the regression equation is translated into a trend equation, we get:

t = a + bx

t = 394 + 17x (values of a and b rounded)

Applying the trend equation, the trend values for each quarter will be:

x: 1 2 3 4 5 6 7 8 9 10 11 12

t: 411 428 445 462 479 496 513 530 547 564 581 598
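The least squares trend can also be computed in a few lines. This is a sketch in plain Python using the same formulas for a and b and the same coding of quarters as x = 1 to 12; the function name `least_squares_trend` is an illustrative choice.

```python
def least_squares_trend(y):
    """Fit t = a + b*x by least squares, with time points x = 1, 2, ..., len(y)."""
    n = len(y)
    x = list(range(1, n + 1))
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


passengers = [220, 500, 790, 320, 290, 520, 820, 380, 320, 580, 910, 410]
a, b = least_squares_trend(passengers)
trend = [a + b * x for x in range(1, len(passengers) + 1)]
print(f"a = {a:.2f}, b = {b:.2f}")
print([round(t) for t in trend])
```

Because the script keeps a and b at full precision (about 394.09 and 17.06), individual trend values can differ by one unit from the table above, which uses the rounded equation t = 394 + 17x.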

c) The Method of Moving Averages

The moving average method is the most commonly used method for identifying a trend
and involves the calculation of a set of averages. The trend, when obtained and charted,
consists of straight line segments. The moving averages overlap one another as the
averages range from one cycle to another. The number of values in each set is always the
same and is known as the period of the moving average.

To demonstrate the technique, consider the following set of a five period moving
average:

Original values:12 10 11 11 9 11 10 10 11 10

Moving totals: 53 52 52 51 51 52

Moving average 10.6 10.4 10.4 10.2 10.2 10.4

The table below shows how these averages are calculated based on a five period moving
average:


Trend calculation: Moving average method

Original values:   12    10    11    11    9     11    10    10    11    10
Trend position:                10.6  10.4  10.4  10.2  10.2  10.4

Values averaged          Total    Average (trend)
12  10  11  11  9        53       10.6
10  11  11  9   11       52       10.4
11  11  9   11  10       52       10.4
11  9   11  10  10       51       10.2
9   11  10  10  11       51       10.2
11  10  10  11  10       52       10.4

The Average column gives the trend values.

Note that the trend value is positioned on the middle of the values forming that average.
For example the first trend value (10.6) is the average of 12, 10, 11, 11, and 9 and
therefore 10.6 is placed under the first 11 (middle value).
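The moving totals and averages in the table can be reproduced with a one-line function. The sketch below is plain Python; `moving_average` is an illustrative name, and the reported position is simply the middle time point of each window (valid for an odd period).

```python
def moving_average(values, period):
    """Return the list of moving averages of the given period."""
    return [sum(values[i:i + period]) / period
            for i in range(len(values) - period + 1)]


data = [12, 10, 11, 11, 9, 11, 10, 10, 11, 10]
period = 5
averages = moving_average(data, period)

# For an odd period, each average sits at the middle time point of its window.
first_position = (period + 1) // 2          # time point 3 for a five-period average
for offset, value in enumerate(averages):
    print(f"time point {first_position + offset}: {value:.1f}")
```

With ten observations and a period of five, the function returns six averages located at time points 3 to 8, which is why the first two and last two time points have no trend value.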

Note:
• Trend values corresponding to the first two and last two time points in the table are
missing. This is one of the major disadvantages of moving averages.
• The period of the moving average must coincide with the length of the natural cycle of
the series. For example:
- Moving averages for a trend based on annual quarters must be based on
a four-period moving average.
- Total monthly sales of a business for a number of years would be
described by a moving average trend of period 12.
- The daily sales of a supermarket would be characterised by a seven-period
moving average.
• Each moving average trend value must correspond with an appropriate time point
(the median of the corresponding periods). When moving averages have an
even-numbered period, the values should be centred.

Centering of Moving Averages


Consider the following time series data with four period cycle:

Time point 1 2 3 4 5 6 7 8 9 10
Averages (4p) 13.0 13.3 13.3 13.8 14.5 14.5 15.0
Centering 13.2 13.3 13.6 14.2 14.5 14.8

Note: centering has reduced the number of trend values from seven to six. As explained
earlier, this is a common disadvantage of moving average.
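Centring amounts to averaging each adjacent pair of even-period moving averages. A minimal sketch, applied to the four-period averages listed above (the function name is illustrative):

```python
def centre_moving_averages(averages):
    """Average each adjacent pair of moving averages (used when the period is even)."""
    return [(first + second) / 2 for first, second in zip(averages, averages[1:])]


four_period = [13.0, 13.3, 13.3, 13.8, 14.5, 14.5, 15.0]
centred = centre_moving_averages(four_period)
print([round(value, 2) for value in centred])
```

The first four-period average covers time points 1 to 4 and so falls at time point 2.5; averaging it with the next one (centred at 3.5) places the first centred value at time point 3. The six results agree with the Centering row above after rounding to one decimal place.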


7.3 Seasonal Variation and Forecasting


Seasonal (short term cyclic) variation is present in many time series. For example, more
umbrellas will be sold during the rainy seasons, more foodstuffs will be sold in
December, more farm inputs will be sold between February and July, etc.

Seasonal values are normally expressed as deviations from the trend values. They show
by how much a particular season will tend to increase or decrease the underlying trend.
The seasonal component is expressed using either an additive model or a multiplicative
model.

The Additive Model


Given the original time series (y) values, together with the trend (t) values, the procedure
for calculating the seasonal variation is given as follows.

Step 1
Calculate, for each point, the value of y – t (the difference between the original value and
trend).

Step 2
For each season in turn, find the average (arithmetic mean) of the y-t values.

Step 3
If the averages do not sum to zero, adjust one or more of them so that their total is zero.
The values so obtained are the appropriate seasonal variation values, i.e. the ‘s’ figures in
the additive model y = t + s + r.

The Multiplicative Model


Given the original time series (y) values, together with the trend (t) values, the procedure
for calculating the seasonal variation is given as follows.

Step 1
Calculate, for each point, the value of [1 + (y – t)/t] (the difference between the original
value and trend expressed as a proportion of the trend).

Step 2
For each season in turn, find the average (arithmetic mean) of the above proportional
changes.

Step 3
If the averages do not sum to the number of seasons (for example, 4 for quarterly data),
adjust one or more of them so that they do.

Note: the process of adjusting the mean to zero is known as differencing. In practice, this
process is complex and requires the use of computer programs to differentiate the data.
Econometrics books deal with differencing in a more advanced way.


Illustration – Additive Model


The sales of a company (y, in ‘000s) are given below, together with previously calculated
trend (t). The subsequent calculations to find the seasonal variation are shown, below:

Step 1
y t y-t
Yr 1 Q1 20 23 -3
2 15 29 -14
3 60 34 26
4 30 39 -9
Yr 2 Q1 35 45 -10
2 25 50 -25
3 100 55 45
4 50 61 -11

Step 2
Deviations (y – t )
Q1 Q2 Q3 Q4 Sum
Year 1 -3 -14 26 -9
Year 2 -10 -25 45 -11
Totals -13 -39 71 -20
Averages -6.5 -19.5 35.5 -10.0 -0.5

Step 3
Since the averages sum to -0.5 (and not zero), it is necessary to adjust one or more of
them accordingly. In this case, since the difference is so small, only one will be adjusted.
In order to make the smallest percentage error, the largest value (35.5) is changed to 36.0.
This adjustment is shown in the following table:

                   Q1      Q2       Q3      Q4       Sum
Initial s values   -6.5    -19.5    35.5    -10.0
Adjustment         0       0        +0.5    0
Adjusted values    -6.5    -19.5    36.0    -10.0    0

The interpretation of the figures is that the average seasonal effect for quarter one, for
instance, is to deflate the trend by 6.5 (‘000s) and that of quarter three is to inflate the
trend by 36 (‘000s).
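The three steps of the additive procedure can be sketched as follows. The code assumes the series starts in the first season and, as in the illustration, puts the whole adjustment on the seasonal value of largest magnitude; the function name `additive_seasonal` is illustrative.

```python
def additive_seasonal(y, t, seasons):
    """Return (averages, adjusted) seasonal values for the additive model y = t + s + r."""
    deviations = [yi - ti for yi, ti in zip(y, t)]            # Step 1: y - t
    averages = []
    for s in range(seasons):                                  # Step 2: mean per season
        season_values = deviations[s::seasons]
        averages.append(sum(season_values) / len(season_values))
    discrepancy = sum(averages)                               # Step 3: force the total to zero
    adjusted = list(averages)
    largest = max(range(seasons), key=lambda i: abs(adjusted[i]))
    adjusted[largest] -= discrepancy
    return averages, adjusted


sales = [20, 15, 60, 30, 35, 25, 100, 50]
trend = [23, 29, 34, 39, 45, 50, 55, 61]
averages, adjusted = additive_seasonal(sales, trend, seasons=4)
print(averages)
print(adjusted)
```

For the illustration the initial averages are -6.5, -19.5, 35.5 and -10.0 (summing to -0.5), and the adjustment raises the third-quarter value to 36.0 so that the total is zero. Spreading the discrepancy equally across all seasons would be an alternative design, but it is not the one used in these notes.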

Illustration –Multiplicative Model


The sales of a company (y, ‘000s) are given below, together with a previously calculated
trend (t). The subsequent calculations to find the seasonal variation are shown below:


Step 1
y t (y-t)/t s = 1 + (y-t)/t
Yr 1 Q1 20 23 -0.13 0.87
2 15 29 -0.48 0.52
3 60 34 0.76 1.76
4 30 39 -0.23 0.77
Yr 2 Q1 35 45 -0.22 0.78
2 25 50 -0.50 0.50
3 100 55 0.82 1.82
4 50 61 -0.18 0.82

Step 2
Deviations [ 1 + (y-t)/t ]
Q1 Q2 Q3 Q4 Sum
Year 1 0.87 0.52 1.76 0.77
Year 2 0.78 0.50 1.82 0.82
Totals 1.65 1.02 3.58 1.59
Averages 0.82 0.51 1.79 0.79 3.91

Step 3
Since the averages sum to 3.91 (and not 4), it is necessary to add 0.09 to one or more of
them accordingly. In this case, since the difference is so small, only one will be adjusted.
In order to make the smallest percentage error, the largest value (1.79) is changed to 1.88.
This adjustment is shown in the following table:

Q1 Q2 Q3 Q4 Sum
Initial s values 0.82 0.51 1.79 0.79
Adjustment 0 0 +0.09 0
Adjusted values 0.82 0.51 1.88 0.79 4

The interpretation of the figures is that the average seasonal effect for quarter one, for
instance, is to deflate the trend by 18% (1 – 0.82) and that of quarter three is to inflate the
trend by 88% (1.88 – 1).
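The multiplicative procedure differs only in using ratios and in forcing the seasonal factors to sum to the number of seasons. The sketch below again assumes the series starts in the first season and places the whole adjustment on the largest factor, as in the illustration; the function name is illustrative.

```python
def multiplicative_seasonal(y, t, seasons):
    """Return (averages, adjusted) seasonal factors for the model y = t * S * R."""
    ratios = [yi / ti for yi, ti in zip(y, t)]        # Step 1: y/t, equal to 1 + (y - t)/t
    averages = []
    for s in range(seasons):                          # Step 2: mean ratio per season
        season_values = ratios[s::seasons]
        averages.append(sum(season_values) / len(season_values))
    shortfall = seasons - sum(averages)               # Step 3: factors must sum to `seasons`
    adjusted = list(averages)
    largest = max(range(seasons), key=lambda i: adjusted[i])
    adjusted[largest] += shortfall
    return averages, adjusted


sales = [20, 15, 60, 30, 35, 25, 100, 50]
trend = [23, 29, 34, 39, 45, 50, 55, 61]
averages, adjusted = multiplicative_seasonal(sales, trend, seasons=4)
print([round(v, 3) for v in averages])
print([round(v, 3) for v in adjusted])
```

The script works with unrounded ratios, so its factors differ slightly from the hand calculation, which rounds to two decimals at every step: the adjusted third-quarter factor comes out near 1.87 rather than 1.88.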

Adjusting Data for Seasonality


Seasonal values are used to adjust the original data in order to arrive at values that
exclude the seasonal component. The effect of seasonal adjustment is to smooth away
seasonal fluctuations, and the remaining values answer the question:
• What would happen if there were no seasonal fluctuations?

Adjusting the additive model:


The adjustment is performed by subtracting the appropriate seasonal figure from each of
the original time series values and representing the values as (y – s).


For the additive model, the adjusted data will be as below:

               y       s      y – s (seasonally adjusted values)
Yr 1  Q1      20     -6.5     26.5
      Q2      15    -19.5     34.5
      Q3      60     36.0     24.0
      Q4      30    -10.0     40.0
Yr 2  Q1      35     -6.5     41.5
      Q2      25    -19.5     44.5
      Q3     100     36.0     64.0
      Q4      50    -10.0     60.0

Adjusting the Multiplicative Model:


The adjustment is performed by dividing each of the original time series values by the
seasonal proportion sp, thus representing the values as (y/sp).

For the multiplicative model, the adjusted data will be as below:

               y      sp      y/sp (seasonally adjusted values)
Yr 1  Q1      20     0.82     24.3
      Q2      15     0.51     29.5
      Q3      60     1.88     31.9
      Q4      30     0.79     37.8
Yr 2  Q1      35     0.82     42.6
      Q2      25     0.51     49.2
      Q3     100     1.88     53.2
      Q4      50     0.79     63.0

Note:
• The seasonally adjusted time series for an additive model is obtained by
subtracting the seasonality (SAV = y – s)
• The seasonally adjusted time series for a multiplicative model is obtained by
dividing the values by the seasonality proportion (SAV = y/sp)
• The majority of economic time series data published by the Central Bureau of
Statistics (CBS) are presented in terms of both ‘actual’ and ‘seasonally adjusted’
figures.
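
The two adjustment rules can be sketched in Python as below. The function name and argument layout are assumptions made for illustration only. With seasonal proportions rounded to two decimals, some multiplicative results may differ in the last digit from a hand-worked table, depending on how the factors were rounded.

```python
# Quarterly sales from the illustration and the seasonal values used above.
y = [20, 15, 60, 30, 35, 25, 100, 50]
s_additive = [-6.5, -19.5, 36.0, -10.0]      # additive seasonal values (Q1..Q4)
s_multiplicative = [0.82, 0.51, 1.88, 0.79]  # seasonal proportions (Q1..Q4)


def seasonally_adjust(values, seasonal, model):
    """Return seasonally adjusted values: y - s (additive) or y / sp (multiplicative)."""
    adjusted = []
    for i, value in enumerate(values):
        s = seasonal[i % len(seasonal)]  # pick the factor for this quarter
        if model == "additive":
            adjusted.append(value - s)
        elif model == "multiplicative":
            adjusted.append(value / s)
        else:
            raise ValueError("model must be 'additive' or 'multiplicative'")
    return adjusted


additive_sav = seasonally_adjust(y, s_additive, "additive")
multiplicative_sav = [
    round(v, 1) for v in seasonally_adjust(y, s_multiplicative, "multiplicative")
]

print(additive_sav)        # [26.5, 34.5, 24.0, 40.0, 41.5, 44.5, 64.0, 60.0]
print(multiplicative_sav)
```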

Forecasting using time series


A particular use of time series is to forecast or project possible future economic
values. Business life would be much easier if future amounts of sales, expenditure,
production units, etc. were known in advance. Since we cannot predict these values
accurately, the best we can do is to estimate them. Forecasting can be performed at
different levels such as:
• Use of personal judgment
• Use of expert judgment
• Use of structured forecasting such as regression analysis or time series.


Any forecasts made using structured methods should be treated with caution, since there
is no guarantee that patterns based on past data will recur.

Forecasting using time series involves the following steps:

• Estimate a trend value for the time point.
• Identify the seasonal variation value appropriate to the time point.
• Add or multiply these values, depending on whether the model is additive or
multiplicative:
- Additive: y = t + s
- Multiplicative: y = t * s
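
As a minimal sketch of the last step, the function below combines a trend estimate with a seasonal value. The trend value of 66 is purely hypothetical (it is not taken from the illustrations); the seasonal values are the quarter-one figures used earlier in this lesson.

```python
def forecast(trend, seasonal, model="additive"):
    """Combine a trend estimate with a seasonal value: t + s or t * s."""
    if model == "additive":
        return trend + seasonal
    if model == "multiplicative":
        return trend * seasonal
    raise ValueError("model must be 'additive' or 'multiplicative'")


# Hypothetical trend estimate for a future first quarter (an assumption).
trend_q1 = 66.0

additive_forecast = forecast(trend_q1, -6.5, "additive")
multiplicative_forecast = round(forecast(trend_q1, 0.82, "multiplicative"), 2)

print(additive_forecast)        # 59.5
print(multiplicative_forecast)  # 54.12
```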

Note:
Because of the complexity of residual variation, which is beyond the scope of
this lesson, the residual component has deliberately been excluded. The
magnitude of the residual component shows how reliable the forecasted values
are. The higher the residual amount, the lower the reliability of the forecasted
values, and vice versa.

Exercises
1. What is a time series?
2. Describe the difference between additive and multiplicative time series.
3. What is residual variation in time series?
4. What is a historigram?
5. What are the three most common techniques for obtaining a time series trend?
6. Describe the stages involved in obtaining a time series trend using the method of
semi-averages.
7. What is a seasonal variation?
8. What precautions should be taken when using time series data?

9. Use the data below to construct a historigram.

Sales of Company X (‘000s)


Qtr1 Qtr2 Qtr3 Qtr4
Year 2001 19 31 62 9
2002 20 32 65 17
2003 24 36 78 14
2004 24 39 83 20
2005 25 42 85 24

10. Calculate the trend values using the method of semi-averages for the following
data:

13, 12, 16, 14, 18, 12, 14, 13, 18, and 13.


11. The data below relate to rate receipts (in millions) for a local authority, with
corresponding trend values in brackets.

2002 2003 2004


Quarter 1 2.8 (3.3) 3.0 (3.7) 3.0 (4.2)
Quarter 2 4.2 (3.4) 4.2 (3.9) 4.7 (4.3)
Quarter 3 3.0 (3.5) 3.5 (4.0) 3.6 (4.4)
Quarter 4 4.6 (3.6) 5.0 (4.1) 5.3 (4.5)

Assuming an additive model:


i. Calculate the seasonal variation
ii. Estimate the receipts for 2005.

12. The following data describe personal savings as a percentage of earned income
for a particular region of Kenya.

2002 2003 2004


Quarter 1 10.1 12.6 11.9
Quarter 2 8.6 7.6 8.7
Quarter 3 8.0 7.6 8.3
Quarter 4 5.8 6.2 7.2

Use the additive and multiplicative models to seasonally adjust the above
percentages and forecast the percentage savings for quarter one of 2005.


LESSON EIGHT

NON-PARAMETRIC METHODS

Lesson Objectives
At the end of this lesson you should be able to:

• Understand the importance of non-parametric tests
• Explain the runs test for randomness
• Know when the sign test is applicable
• Know how to carry out a Mann-Whitney U-Test
• Understand the Wilcoxon signed-rank test and rank-sum test
• Calculate the Kruskal-Wallis test statistic
• Calculate the Spearman coefficient of rank correlation

8.0 Introduction
The primary purpose of statistical analysis is to draw conclusions about
population parameters based on samples selected from the populations. Certain
assumptions are made when using the sampled data, such as the assumptions that
populations are normally distributed, that the samples are random and that
observations are independent of each other. Additionally, in testing the difference
between two sample means, the t-test requires that the samples be drawn from
populations that are normally distributed with equal variances. Similarly, the F-test,
which tests for significant differences among several population means, is based on
the same assumptions.

Nonparametric tests, on the other hand, do not depend on the shape of the distribution
of the population; hence they are known as distribution-free tests, and they are
applicable to ordinal-level data. For example, responses to questions can be ranked
from high to low or vice versa. Individuals or other items can also be ranked on an
ordinal scale, such as a three-star hotel or a two-star army officer. It is, however,
important to note that nonparametric tests should only be applied where it is not
possible to use parametric tests. In some sensitive disciplines, such as medicine and
the military, nonparametric statistics may not be accepted. Nonparametric tests can be
carried out by applying several techniques. Some of these techniques are:
i. Runs test for randomness
ii. The sign test
iii. Mann-Whitney U-Test
iv. The Wilcoxon signed-rank test
v. The Wilcoxon rank-sum test
vi. Kruskal-Wallis analysis of variance by ranks
vii. Spearman coefficient of rank correlation


8.1 Runs Test for Randomness


Many statistical tests require that the sample be random. Randomness can be checked by a
simple runs test. For example, if we toss a coin 20 times and the results show 8 heads in a
row, one can question the presence of randomness. A run is a succession of identical
symbols, as shown below:

Assume the twenty trials of tossing the coin result in the following results:

H H T T T H H H T H H T T H H T H H T T

The above experiment can be grouped into the following 10 runs:

HH TTT HHH T HH TT HH T HH TT

Too few runs or too many runs indicate the absence of randomness. For example, in
a sequence of 10 tosses, ten runs (HTHTHTHTHT) indicate non-randomness;
likewise, two runs (HHHHHTTTTT) are non-random.

The runs test is always a two-tailed test (either too few or too many runs can lead to
rejection of the null hypothesis of randomness). Using the example of tossing the coin 20
times:

Let N = n1 + n2 = 20


n1 = number of Hs = 11
n2 = number of Ts = 9
r = number of runs = 10
α = 0.05 (level of significance)

From the tables, the critical values are r1 = 6 and r2 = 16. Since the experiment produced
10 runs, which lies between these values, we cannot reject the null hypothesis of randomness.
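
The counting in this example can be checked with a short Python sketch. The critical values 6 and 16 are simply the table values quoted above; treating a run count equal to either bound as grounds for rejection is an assumption about the table's convention.

```python
from itertools import groupby

# The twenty coin tosses from the example, written as one string.
tosses = "HHTTTHHHTHHTTHHTHHTT"

# A run is a maximal block of identical symbols; groupby splits the
# sequence into exactly those blocks.
runs = ["".join(group) for _, group in groupby(tosses)]

n1 = tosses.count("H")  # number of heads
n2 = tosses.count("T")  # number of tails
r = len(runs)           # number of runs

# Critical values read from the runs-test table (quoted in the text).
lower, upper = 6, 16
reject_randomness = r <= lower or r >= upper

print(runs)
print(n1, n2, r)          # 11 9 10
print(reject_randomness)  # False
```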

8.2 Sign Test


The sign test is based on the sign of the difference between two related observations. A
plus sign is usually designated for a positive difference and a minus sign for a negative
difference. If, for example, sales increase from Sh30,000 to Sh50,000 in the month of
April, we record the difference of 20,000 as a plus (+) sign. If, on the other hand, sales
drop from Sh50,000 to Sh45,000 in the month of May, we record the 5,000 difference as
a minus (-) sign.

The sign test has many applications. To illustrate, suppose Taster’s Choice markets
two kinds of coffee in two jars (one premium and the other regular). The market
researcher wants to know whether customers prefer the premium or the regular.
A preference for the premium coffee can be coded (+) and a preference for the regular
coded (-). If the population of customers does not have a preference, the number of (+)
signs is expected to be about equal to the number of (-) signs.


The hypotheses for the sign test will be:

H0: p ≤ 0.5
H1: p > 0.5

where p is the proportion of plus signs in the population. This is a one-tailed test because
the alternative hypothesis points in one direction. The binomial distribution is used as the
distribution of the test statistic. The sign test meets all the binomial assumptions, namely:
1. There are only two outcomes: a success and a failure
2. For each trial, the probability of success is assumed to be 0.5
3. The total number of trials is fixed
4. Each trial is independent
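
Because the number of plus signs follows a binomial distribution with probability 0.5 under the null hypothesis, the one-tailed probability can be computed directly. The sketch below uses hypothetical counts (12 plus signs out of 15 customers with a preference); these figures are not from the coffee example and serve only to show the calculation.

```python
from math import comb


def sign_test_upper_tail(n_plus, n):
    """P(X >= n_plus) when X ~ Binomial(n, 0.5), i.e. the one-tailed p-value."""
    return sum(comb(n, k) for k in range(n_plus, n + 1)) / 2 ** n


# Hypothetical outcome: 12 plus signs among 15 non-tied observations.
p_value = sign_test_upper_tail(12, 15)
reject_h0 = p_value < 0.05  # 0.05 significance level, one-tailed

print(round(p_value, 4))  # 0.0176
print(reject_h0)          # True
```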

8.3 Mann-Whitney U-Test


The Mann-Whitney U-Test (or simply U-test) is superior to the sign test in that it utilises
more of the information in the sample. The test is used to determine whether two
independent samples are drawn from the same population. It is a good substitute for the
t-test when the stringent conditions of the parametric test are not met.

Illustration
It is required to test whether the quality of education in Statistics in private schools is
similar to that in public schools. Samples of 15 and 12 statistics students are picked
at random from private schools and public schools respectively. The students are then
given the same exam and the following results obtained:

Student 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Private scores 73 75 83 77 72 69 56 80 68 60 84 61 64 71 86
Public scores 70 78 79 81 65 63 74 83 67 76 88 48 - - -

The U-test can be used to test the null hypothesis that there is no significant difference
between the performance in Statistics of private and public schools at the 5%
significance level.


Private scores   Rank of 27 (R1)   Public scores   Rank of 27 (R2)
     73                14               70               11
     75                16               78               19
     83                23               79               20
     77                18               81               22
     72                13               65                7
     69                10               63                5
     56                 2               74               15
     80                21               83               24
     68                 9               67                8
     60                 3               76               17
     84                25               88               27
     61                 4               48                1
     64                 6
     71                12
     86                26
Rank sum          R1 = 202                          R2 = 176

The value of U statistic for each group is calculated as follows:

U1 = n1n2 + [n1(n1 + 1)/2] – R1

U2 = n1n2 + [n2(n2 + 1)/2] – R2

Once the U values are calculated, the lower value is selected for testing purposes. From
the data:

n1 = 15
n2 = 12
R1 = 202
R2 = 176

Substituting into the two equations:

U1 = (12)(15) + [15(15 + 1)/2] – 202

= 180 + 120 – 202

= 98

U2 = (12)(15) + [12(12 + 1)/2] – 176

= 180 + 78 – 176

= 82

The lower value is selected for the test: U = U2 = 82


Under the null hypothesis that the two sets of scores come from the same population, the
sampling distribution of the statistic U has the following mean and standard deviation:

mean E(U) = n1n2/2

= (12 * 15)/2

= 90

standard deviation σu = √[ n1n2 (n1 + n2 + 1)/12]

= √ [12 * 15(12 + 15 + 1)/12]

= √ (180 * 28/12)

= 20.5

If n1 ≥ 8 and n2 ≥ 8, then the sampling distribution of U can be approximated closely with a
normal curve so that a z-test can be performed.

Z = [U – E(U)]/ σu

= (82 -90)/20.5

= -0.39
Since the absolute value of z (0.39) is less than the critical value of z at α = 0.05
(1.96), we cannot reject the null hypothesis. In other words, the evidence does not
suggest that there is any significant difference in the quality of education in Statistics
between private and public schools.

Note:
• In order to assume that the sampling distribution of the statistic U is normally
distributed, the values of both n1 and n2 should be equal to or more than eight.
• The score of 83 occurs in both samples and was given the ranks 23 and 24 above.
Strictly, when there are ties between observations, the identical scores are each
assigned the mean of their tied ranks (here 23.5), which would change the rank sums
to 202.5 and 175.5 without altering the conclusion.
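
The arithmetic of the illustration can be reproduced with the sketch below, which starts from the sample sizes and rank sums used in the worked solution (n1 = 15, n2 = 12, R1 = 202, R2 = 176). The function name is an assumption made for illustration.

```python
from math import sqrt


def mann_whitney_from_rank_sums(n1, n2, r1, r2):
    """Return U1, U2, the smaller U and its normal-approximation z value."""
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    u = min(u1, u2)                             # the lower U is used for the test
    mean_u = n1 * n2 / 2                        # E(U)
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # standard deviation of U
    z = (u - mean_u) / sd_u
    return u1, u2, u, z


u1, u2, u, z = mann_whitney_from_rank_sums(15, 12, 202, 176)
reject_h0 = abs(z) > 1.96  # two-tailed test at alpha = 0.05

print(u1, u2, u)    # 98.0 82.0 82.0
print(round(z, 2))  # -0.39
print(reject_h0)    # False
```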

8.4 Wilcoxon signed-rank test


In 1945, Frank Wilcoxon developed a nonparametric test, based on the differences in
dependent samples, for which the normality assumption is not required. This test is called
the Wilcoxon signed-rank test. The following example shows its application.

Jasho café restaurant in Nairobi offers a full dinner menu, but its specialty is chicken.
Recently, Barigei Mos, the owner of the restaurant, developed a new spicy flavour to
improve the chicken meal. Before replacing the current flavour, he wants to conduct some
tests to be sure that the patrons will like the spicy flavour better.

To begin, Barigei selects a random sample of 15 people. The sampled patrons are each given
two pieces of cooked chicken, one without spice and one spiced. Participants are
then asked to rate the overall taste of the two pieces on a scale of 0 to 20.
A value near 20 indicates that the participant liked the flavour, whereas a score near 0
indicates that the participant did not like the flavour. The results are recorded below:

Participant   Spiced flavour   Current flavour
1 14 12
2 8 16
3 6 2
4 18 4
5 20 12
6 16 16
7 14 5
8 6 16
9 19 10
10 18 10
11 16 13
12 18 2
13 4 13
14 7 14
15 16 4

Solution
The samples are dependent, or related, because each participant rates both pieces.
Thus, if we compute the difference between the ratings of the two pieces of chicken, the
result shows how strongly the participant favours one over the other. If we choose to
subtract the current flavour score from the spicy flavour score, a positive difference
shows that the participant favours the spicy chicken.

Hypotheses:
H0: there is no difference in the ratings of the two flavours.
H1: the spicy flavour is rated higher.

This is a one-tailed test because Barigei will change to the spicy flavour only if the
participants prefer it. The significance level is 0.05.

Steps
1. Compute the difference between the spicy and the current flavour scores.
2. If the difference is zero, drop that participant.
3. Determine the absolute differences (ignore the minus signs).
4. Rank the absolute differences from the smallest to the largest. Tied differences
are each assigned the average of the ranks they would otherwise occupy.
5. Separate the ranks according to the signs of the original differences (positives
(R+) on one side and negatives (R-) on the other).
6. Total the R+ and R- ranks.
7. Use the smaller of the two rank sums to test for significance. If the table value is
more than the smaller rank sum, reject the null hypothesis; otherwise do not reject it.

Participant   Spicy   Current   Diff   Absolute   Rank    R+    R-
              score   score            diff
 1             14      12         2       2         1      1
 2              8      16        -8       8         6            6
 3              6       2         4       4         3      3
 4             18       4        14      14        13     13
 5             20      12         8       8         6      6
 6             16      16         0       -         -      -     -
 7             14       5         9       9         9      9
 8              6      16       -10      10        11           11
 9             19      10         9       9         9      9
10             18      10         8       8         6      6
11             16      13         3       3         2      2
12             18       2        16      16        14     14
13              4      13        -9       9         9            9
14              7      14        -7       7         4            4
15             16       4        12      12        12     12
Total                                                     75    30

From the table, the critical value for α = 0.05 (one-tailed) and n = 14 is 25. Since the
smaller rank sum (30) is greater than 25, the null hypothesis cannot be rejected. The
conclusion is that there is no significant difference between the ratings of the spicy and
current flavours.
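
The steps of the signed-rank test can be sketched in Python as follows, using the fifteen pairs of scores from the example. The helper `average_ranks` is written here for illustration (it is not a library function), and the critical value 25 is simply the table value quoted in the text.

```python
def average_ranks(values):
    """Rank values from smallest to largest; ties share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


spicy = [14, 8, 6, 18, 20, 16, 14, 6, 19, 18, 16, 18, 4, 7, 16]
current = [12, 16, 2, 4, 12, 16, 5, 16, 10, 10, 13, 2, 13, 14, 4]

# Steps 1-2: differences, dropping participants with a zero difference.
diffs = [s - c for s, c in zip(spicy, current) if s != c]

# Steps 3-4: rank the absolute differences, averaging tied ranks.
ranks = average_ranks([abs(d) for d in diffs])

# Steps 5-6: total the ranks of positive and negative differences.
r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)

# Step 7: compare the smaller rank sum with the table value.
smaller = min(r_plus, r_minus)
critical = 25  # table value for n = 14, alpha = 0.05 (from the text)
reject_h0 = smaller < critical

print(len(diffs), r_plus, r_minus)  # 14 75.0 30.0
print(reject_h0)                    # False
```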

8.5 Wilcoxon rank-sum test


The Wilcoxon rank-sum test is based on the average of ranks. The data are ranked as if
they came from a single sample. If the null hypothesis is true, the ranks will be evenly
distributed between the two samples and the average ranks of the two samples will be
about the same.

If each of the samples contains at least eight observations, the standard normal
distribution is used for the test statistic. The formula is:

Wilcoxon rank-sum test (z)

Z = [w – n1(n1 + n2 + 1)/2] / √[n1n2(n1 + n2 + 1)/12]

where: n1 is the number of observations from the first population
n2 is the number of observations from the second population
w is the sum of ranks from the first population


Illustration
The CEO of Makati Ltd recently noted an increase in the amount of competitor’s exports
to the USA. He is interested in determining whether there are more exports to the USA
than to the UK. After conducting market intelligence, he obtains the following monthly
results:

Month
1 2 3 4 5 6 7 8 9
USA 11 15 10 18 11 20 24 22 25
UK 13 14 10 8 16 9 17 21

Can you conclude that there are more exports to the USA than to the UK?

Solution
        USA                 UK
Export     Rank     Export     Rank
  11        5.5       13         7
  15        9         14         8
  10        3.5       10         3.5
  18       12          8         1
  11        5.5       16        10
  20       13          9         2
  24       16         17        11
  22       15         21        14
  25       17
Total      96.5                 56.5

z = [96.5 – 9(9 + 8 + 1)/2] / √[9 * 8(9 + 8 + 1)/12]

= (96.5 – 81)/10.39

= 1.49

The computed z value (1.49) is less than 1.65, the critical value for a one-tailed test at
the 0.05 significance level; hence the null hypothesis is not rejected. Therefore, we cannot
conclude that there are more exports to the USA than to the UK.

Note: the z value is read as an absolute figure. Where the computed z score is negative,
ignore the minus sign.
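
The ranking and the z calculation for the export example can be sketched as below. The helper `average_ranks` is written for this illustration (it is not a library function), and the critical value 1.65 is the one used in the text.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from smallest to largest; ties share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


usa = [11, 15, 10, 18, 11, 20, 24, 22, 25]
uk = [13, 14, 10, 8, 16, 9, 17, 21]
n1, n2 = len(usa), len(uk)

# Rank all observations as if they came from a single sample.
ranks = average_ranks(usa + uk)
w = sum(ranks[:n1])       # sum of ranks for the first sample (USA)
w_other = sum(ranks[n1:]) # sum of ranks for the second sample (UK)

z = (w - n1 * (n1 + n2 + 1) / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
reject_h0 = z > 1.65  # one-tailed test

print(w, w_other)   # 96.5 56.5
print(round(z, 2))  # 1.49
print(reject_h0)    # False
```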

8.6 Kruskal-Wallis analysis of variance by ranks


W.H. Kruskal and W.A. Wallis reported a nonparametric test in 1952 requiring only
ordinal-level (ranked) data. No assumptions about the shape of the population are
required by the test. For this test to be applied, the selected samples must be independent.


For example, if samples from three groups – executives, staff, and supervisors – are to be
selected and interviewed, the response of one group (say, executives) must in no way
affect the responses of others.

To compute the Kruskal-Wallis test statistic:

1. The samples are combined;
2. The combined values are ordered from low to high; and
3. The ordered values are replaced by ranks, starting with one for the smallest value.

Kruskal-Wallis test (H)

H = 12/[n(n + 1)] [ (∑R1)²/n1 + (∑R2)²/n2 + ... + (∑Rk)²/nk ] – 3(n + 1)

Where: k is the number of populations
∑R1, ∑R2, ..., ∑Rk are the sums of ranks of the samples
n1, n2, ..., nk are the sizes of the samples
n is the combined number of observations for all samples

Illustration
A management seminar consisting of a large number of executives from manufacturing,
finance, and marketing is to be conducted. Before scheduling the seminar sessions the
seminar leader is interested in whether the three groups are equally knowledgeable about
management principles. A sample of seven manufacturing managers, eight from finance
and six from marketing were tested and their scores recorded as below.

Manufacturing   Finance   Marketing
     56           103        42
     39            87        38
     48            51        89
     38            95        75
     73            68        35
     50            42        61
     62           107
                   89

Considering the scores as a single population, the marketing executive with a score of
35 is the lowest, so it is ranked 1. There are two scores of 38. To resolve this tie, each
score is given a rank of 2.5, [(2+3)/2]. This process is continued for all scores. The
highest score is 107, and that finance executive is given a rank of 21. The scores,
the ranks and the sum of the ranks for each of the three samples are given in the table
below:


Mfg     R1       Fin     R2       Mktg    R3
 56     10       103     20        42      5.5
 39      4        87     16        38      2.5
 48      7        51      9        89     17.5
 38      2.5      95     19        75     15
 73     14        68     13        35      1
 50      8        42      5.5      61     11
 62     12       107     21
                  89     17.5
Total   57.5             121               52.5

Solving for H:

H = 12/[n(n + 1)] [ (∑R1)²/n1 + (∑R2)²/n2 + (∑R3)²/n3 ] – 3(n + 1)

= 12/[21(21 + 1)] [ (57.5)²/7 + (121)²/8 + (52.5)²/6 ] – 3(21 + 1)

= 5.736

The critical value is 5.991, the chi-square value for k – 1 = 2 degrees of freedom at the
0.05 significance level. Because the computed H (5.736) is less than 5.991, the null
hypothesis is not rejected. There is no evidence of a difference among the manufacturing,
finance, and marketing executives’ knowledge of management principles.
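
The value of H can be checked with the sketch below, which works from the rank sums and sample sizes of the illustration (57.5, 121 and 52.5 for samples of 7, 8 and 6). The function name is an assumption made for illustration.

```python
def kruskal_wallis_h(rank_sums, sizes):
    """Kruskal-Wallis H from the rank sum and size of each sample."""
    n = sum(sizes)  # combined number of observations
    weighted = sum(r ** 2 / size for r, size in zip(rank_sums, sizes))
    return 12 / (n * (n + 1)) * weighted - 3 * (n + 1)


h = kruskal_wallis_h([57.5, 121, 52.5], [7, 8, 6])
critical = 5.991  # chi-square, 2 degrees of freedom, alpha = 0.05
reject_h0 = h > critical

print(round(h, 3))  # 5.736
print(reject_h0)    # False
```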

8.7 Spearman coefficient of rank correlation


Charles Spearman, a British statistician, introduced the measure of correlation for
ordinal-level data. This measure allows one to measure the relationship between sets of
ranked data. For example, two staff members in the R&D office are asked to rank 10
research proposals. The question one will ask is whether the two staff members will
assign the same ranks to the proposals, assuming that the proposals will be graded as
most worthy and least worthy for funding.

Spearman’s coefficient of rank correlation, denoted by rs, provides a measure of the
association. The coefficient is computed as follows:

Spearman’s coefficient of rank correlation (rs)

rs = 1 – 6∑d² / [n(n² – 1)]

Where: d is the difference between the ranks of each pair
n is the number of paired observations


Like the coefficient of correlation, the coefficient of rank correlation can assume any
value from –1.00 up to +1.00. A value of –1.00 indicates a perfect negative correlation;
+1.00 indicates a perfect positive correlation; and 0 indicates no correlation.

Illustration
The following table shows the ranks assigned by an executive (E) and a supervisor (S) on
the future performance of selected college graduates.

            Executive (E)   Supervisor (S)   Rank    Rank    Difference
Graduate    rating x        rating y         (E)     (S)       d        d²
 1             8               4              3.5     3        0.5      0.25
 2            10               4              6.5     3        3.5     12.25
 3             9               4              5       3        2.0      4.00
 4             4               3              1       1        0        0
 5            12               6             10.5     7        3.5     12.25
 6            11               9              8.5    10.5     -2.0      4.00
 7            11               9              8.5    10.5     -2.0      4.00
 8             7               6              2       7       -5.0     25.00
 9             8               6              3.5     7       -3.5     12.25
10            13               9             12      10.5      1.5      2.25
11            10               5              6.5     5        1.5      2.25
12            12               9             10.5    10.5      0        0
Total                                                          0.00    78.50

rs = 1 – 6∑d² / [n(n² – 1)]

= 1 – (6 * 78.5) / [12(144 – 1)]

= 0.726

The value of 0.726 indicates a strong positive association between the ratings of the
executive and the supervisor. Graduates who received a high rating from the
executive also tended to receive a high rating from the supervisor.

Hypotheses:
H0: the rank correlation in the population is zero
H1: the rank correlation in the population is greater than zero.

The test statistic is t, with n – 2 degrees of freedom:

t = rs √[(n – 2)/(1 – rs²)]

= 0.726 √[(12 – 2)/(1 – 0.726²)]

= 3.338

The computed t value (3.338) is greater than the table value of 1.812 (one-tailed, α = 0.05,
10 degrees of freedom); hence we reject the null hypothesis and conclude that there is a
positive correlation between the executive’s ratings and the supervisor’s ratings.
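
The whole calculation, from the ratings to rs and t, can be sketched as follows. The helper `average_ranks` is written for this illustration (it is not a library function). Because the sketch carries rs unrounded into the t formula, its t value is about 3.33, slightly different from the hand calculation that uses the rounded 0.726.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from smallest to largest; ties share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


# Ratings of the twelve graduates by the executive (x) and the supervisor (y).
x = [8, 10, 9, 4, 12, 11, 11, 7, 8, 13, 10, 12]
y = [4, 4, 4, 3, 6, 9, 9, 6, 6, 9, 5, 9]
n = len(x)

rank_x = average_ranks(x)
rank_y = average_ranks(y)

# Sum of squared rank differences, then Spearman's coefficient.
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# t statistic with n - 2 degrees of freedom.
t = rs * sqrt((n - 2) / (1 - rs ** 2))
reject_h0 = t > 1.812  # one-tailed table value quoted in the text

print(sum_d2)        # 78.5
print(round(rs, 3))  # 0.726
print(reject_h0)     # True
```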

Exercises
1. What is a nonparametric test?
2. Explain what a sign test entails.
3. Why does the sign test involve a one-tailed test?
4. Discuss the various types of non-parametric tests.

5. A random sample of seven young professional couples who own homes is
selected. The size of their homes is then compared with that of their parents’ homes
(average of the parents’ homes). At the 0.05 significance level, can you conclude
that the couples live in larger homes?

1 2 3 4 5 6 7
Couple’s home (sq ft.) 1725 1310 1670 1520 12890 1880 1530
Avg. Parent’s home (sq ft.) 1175 1120 1420 1640 1360 1750 1440

6. The following observations were selected from the populations that are not
necessarily normally distributed. Use 0.05 significance level, a two tailed test, and
the Wilcoxon rank-sum test to determine whether there is a difference between
the two populations.

1 2 3 4 5 6 7 8
Population A 38 45 56 57 61 69 70 79
Population B 26 31 35 42 51 52 57 62

7. Under what conditions should the Kruskal-Wallis test be used instead of:

i. Analysis of variance.
ii. Wilcoxon rank-sum test.


8. The following data were obtained from three populations that were not
necessarily normal.

1 2 3 4 5 6
Sample A 50 54 59 59 65
Sample B 48 49 49 52 56 57
Sample C 39 41 44 47 51

i. State the null hypothesis


ii. Using 0.05 significance level, state the decision rule.
iii. Compute the value of test statistic
iv. What is your decision on the null hypothesis?

9. The Cynmos TV network staff wants to pre-test a questionnaire to be mailed to
several thousand viewers. One question involves the ranking of male and female
senior citizens with respect to popularity of certain prime-time programs. The
composite rankings of a small group of senior citizens are:
Program 1 2 3 4 5
Male Ranking 1 4 3 2 5
Female Ranking 5 1 2 4 3

i. Draw a scatter diagram. Let the males be X


ii. Compute Spearman’s rank-order correlation coefficient and interpret the
results.


