BUSINESS STATISTICS II notes (1)
SCHOOL OF BUSINESS
BUSINESS STATISTICS II
1.0 Introduction
1.1 Distribution shapes
1.3 Normal distributions
1.4 The standard normal probability
1.5 Standard normal probability given area
1.6 The probability for any normal curve
1.7 Distribution of sample means
Learning Objectives
Upon successful completion of Chapter 6, you will be able to:
I. Introduction
• Many continuous variables have distributions that are bell-shaped and are called
approximately normally distributed variables.
• A normal distribution is also known as the bell curve or the Gaussian distribution.
C. Symmetrical Distribution
In a symmetrical distribution, the data values
are evenly distributed on both sides of the mean.
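The equation of the normal distribution curve is:
$y = \dfrac{e^{-(x-\mu)^2/(2\sigma^2)}}{\sigma\sqrt{2\pi}}$
Where: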
e ≈ 2.718
π ≈ 3.14
μ = population mean
σ = population standard deviation
Notice: The shape and position of the normal distribution curve depend on 2 parameters:
the mean and the standard deviation.
Note: the z score is the number of standard deviations that a particular x value is away
from the mean. If z is positive, x is above the mean. If z is negative, x is below the mean.
• The probability to the left of z is found in Table E, located on the inside cover of the
textbook. Be sure the table in your textbook functions this way.
The area under the curve is more important than the frequencies because the area
corresponds to the probability!
1. Use the z column to find the row for the units and tenths.
2. Move to the appropriate hundredths column across the top of the table.
3. The Intersection of the row for the units and tenths with the column for hundredths is the
probability to the left of z.
II. Example:
Find the area under the normal distribution curve.
P (z < 1.34) =
Note: < points left and the area to the left of 1.34 is shaded.
P ( z < 1.34) = .9099 is at the intersection of the z-column row 1.3 and the top column
labeled .04.
Question 1
Find P (z < 2.16) =
• Sketch the curve.
• Mark the mean.
• Shade the area.
• Use the table to find the area to the
left of 2.16
II. Example:
P (z > 2.83) =
1. Find the area to the left of 2.83 because it is given in the table. P (z < 2.83) = .9977
2. Subtract the area on the left (.9977) from the total area (1.00) to find the area on the
right: P (z > 2.83) = 1.00 - .9977 = .0023
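The two table lookups above (left tail and right tail) can be cross-checked numerically. This is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` is used in place of Table E; the variable names are illustrative only.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
Z = NormalDist(mu=0.0, sigma=1.0)

# Area to the left of z = 1.34 (the quantity Table E reports).
left_of_1_34 = Z.cdf(1.34)

# Area to the right of z = 2.83 is the total area (1) minus the area to the left.
right_of_2_83 = 1.0 - Z.cdf(2.83)

print(f"P(z < 1.34) = {left_of_1_34:.4f}")
print(f"P(z > 2.83) = {right_of_2_83:.4f}")
```

Small differences from the table in the fourth decimal place can occur because the table entries are rounded.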
Question 2
P (z > -2.45) =
Note: > points to the right. We need to find the area to the right of -2.45.
[Figure: standard normal curve with –z1, 0, and z2 marked on the horizontal axis]
II. Example:
Find the area between z = 1.51 and z = 2.12:
1. Sketch the curve
2. Mark the mean
3. Shade the area
4. Find the area to the left of each z separately
5. Subtract the smaller area from the larger
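The five steps above reduce to one subtraction of two left-tail areas. The sketch below assumes `statistics.NormalDist` as a stand-in for the table; `area_between` is a hypothetical helper name, not part of the course material.

```python
from statistics import NormalDist

def area_between(z_low, z_high):
    """Area under the standard normal curve between two z values."""
    Z = NormalDist()  # mean 0, standard deviation 1
    # Find the area to the left of each z, then subtract the smaller from the larger.
    return Z.cdf(z_high) - Z.cdf(z_low)

area = area_between(1.51, 2.12)
print(f"P(1.51 < z < 2.12) = {area:.4f}")
```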
Question 3
Find the area between
z = 0.79 and z = 1.28:
Question 4
Find the area between z = 1.50 and z = 2.50:
Question 5
Find the area between z = -2.16 and z = 0:
Question 6
P (z > -2.56) =
a) .4948
b) .0052
c) .9948
d) -.4948
Question 7
P (-1.76 < z < 2.45) =
a) .9537
b) .0321
c) -.0321
d) None of the above
Question 8
P (1.87 < z < 2.76) =
a) .9664
b) .0278
c) -.0278
d) None of the above
Question 9
P (z > 2.61) =
a) .4955
b) .9955
c) .0045
d) -.0045
Question 10
Calculate each probability using the method listed above. The first one has been completed
for you.
A. Example:
What is the z-score associated with the 85th percentile?
In Table E, the Normal Probability Table, find the “area” entry that is closest to 0.8500:
z 0.00 0.01 0.02 0.03 0.04 0.05
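Going from an area back to a z value is the inverse of the table lookup. As a rough cross-check, and assuming `statistics.NormalDist.inv_cdf` is an acceptable substitute for searching inside the table, the 85th percentile can be sketched as:

```python
from statistics import NormalDist

# inv_cdf takes the area to the left and returns the z value that bounds it.
z_85 = NormalDist().inv_cdf(0.85)

print(f"z for the 85th percentile = {z_85:.4f}")
```

The table method gives the nearest two-decimal z, so it may differ slightly from this unrounded value.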
Question 11
Find the z value that corresponds to the given area. If you’d like, review the method
discussed above before starting.
Question 12
Find the z value that corresponds to the given area or find the z-values that bound the
center 90% of the normal curve.
• Since each normally distributed variable has a different mean and standard deviation, the
shape and location of these curves will vary.
• A different table of probabilities would be needed for each mean and standard
deviation pair.
• To avoid this problem, transform any normal distribution to the standard normal
distribution.
• Find the probability for the standard normal distribution and you have the probability for
the original normal distribution.
Dr. Janet Winter, [email protected] Stat 200 Page 11
$z = \dfrac{x - \bar{x}}{s}$
2. Find the standard normal probability using the appropriate methods.
II. Example:
3. Heights of 2 year old boys are normally distributed with mean 28 inches and standard
deviation 2.4. What is the chance a 2 year old boy is less than 26 inches tall?
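The standardise-then-look-up method can be sketched for this example as follows. It is an illustration only, assuming `statistics.NormalDist` replaces the table; the variable names are not from the notes.

```python
from statistics import NormalDist

mu, sigma = 28.0, 2.4   # mean and standard deviation of heights (inches)
x = 26.0                # height of interest

# Step 1: convert x to a z score.
z = (x - mu) / sigma

# Step 2: find the standard normal probability to the left of z.
p_less_than_26 = NormalDist().cdf(z)

print(f"z = {z:.2f}, P(x < 26) = {p_less_than_26:.4f}")
```

Calling `NormalDist(mu, sigma).cdf(x)` directly gives the same probability without the intermediate z score.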
III. Example:
The average daily jail population in the United States is 618,319. If the distribution is normal
and the standard deviation is 50,200, find the probability that on a randomly selected day,
the jail population is:
1. greater than 700,000;
Question 13
The average credit card debt for college seniors is $3262. If the debt is normally distributed
with a standard deviation of $1100, find these probabilities:
I. Method:
a) Given the area or probability, find the probability or area under the normal curve to
the left of x.
b) Use Table E to locate the area to the left of x inside table and determine z. (See page 10)
c) Find x where:
$z = \dfrac{x - \mu}{\sigma}$
$z\sigma = x - \mu$
$x = z \cdot \sigma + \mu$
II. Example:
If the tallest 10% and the shortest 10% of the population are considered abnormal, what is
the range for normal heights of women if heights for women are N (63.6 in, 2.5 in)?
III. Example:
The price of a new home in Blair County is normally distributed with a mean price of
$145,500 and a standard deviation of $1,500. Find the minimum and maximum
price to attract the middle 90% of this market.
The middle 90% is split into 2 equal parts by the mean, or the area to the left of –z is .05
and the area to the left of +z is .45+.50 =.9500.
2. Find x where: x = z ∙ σ + μ
z = ±1.645, μ = 145,500, σ = 1,500
x = 143,032.50 or x = 147,967.50
143,032.50 and 147,967.50 bound the middle 90% of the market prices for these new
homes.
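The calculation x = z ∙ σ + μ can be reproduced without the table. The sketch below assumes μ = 145,500 and σ = 1,500, the values used in the computation above, and uses `statistics.NormalDist.inv_cdf` to find the prices that cut off 5% in each tail.

```python
from statistics import NormalDist

prices = NormalDist(mu=145_500, sigma=1_500)

# The middle 90% leaves 5% in each tail.
low = prices.inv_cdf(0.05)    # area to the left is .05
high = prices.inv_cdf(0.95)   # area to the left is .95

print(f"middle 90% of prices: {low:,.2f} to {high:,.2f}")
```

The result differs from the hand calculation by a few cents because the table value z = 1.645 is rounded.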
Question 14
Test scores for the police academy are normally distributed with mean 95 and standard
deviation 15. To qualify, a candidate must score in the top 10%. Find the lowest possible
score to qualify.
Question 15
A.C. Nelson reported that children between 2 and 5 years old watch an average of 14 hours
of TV per week. Assuming the variable (hrs of TV watched) is normally distributed with a
standard deviation of 3, find the chance that:
1. My neighbor John, who is 4, will watch 20 or more hours of TV this week.
2. The average number of hours watched by the 25 members of John’s play group is 20 hours or
more.
Question 16
The average cholesterol content of a certain brand of eggs is 215 milligrams, and the
standard deviation is 15 milligrams. Assume the variable is normally distributed.
1. If a single egg is selected, find the probability that the cholesterol content will be less than
210 milligrams.
2. If a dozen eggs are selected, what is the chance their average cholesterol is 217 or less?
Cholesterol for eggs is N(215, 15).
II. Example:
P(2.34 < z < 2.78) = normalcdf(2.34, 2.78) = 0.0069
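For readers without the calculator, a rough Python equivalent of `normalcdf` can be sketched as below. The function is a hypothetical re-creation that copies the calculator's argument order (lower, upper, mean, standard deviation); it is not the calculator's own implementation.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under the N(mu, sigma) curve between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

# Standard normal example from the text.
print(f"{normalcdf(2.34, 2.78):.4f}")

# A non-standard curve: pass the mean and standard deviation as well.
print(f"{normalcdf(45, 1000, 50, 2):.4f}")
```

As on the calculator, a very large bound such as 1000 or -1000 stands in for an open-ended tail.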
Question 17
P(-1.34 < z < 3.45) =
Question 18
P(z > 2.34) =
Question 19
P(z < -1.79) =
II. Example:
For a normal random variable with mean 50 and standard deviation 2, find the
proportion of scores greater than 45.
normalcdf (45, 1000, 50, 2) = .9938
Question 20
For z, find P(z < 2.87)
Question 21
For a normal random variable x with mean 12 and variance 6, find the
probability that x is more than 16.
IX. Summary
• Many variables such as heights, weights, and temperatures have normally distributed data.
• The standard normal distribution has a mean of 0 and a standard deviation of 1.
• All normal distributions are bell-shaped with equal mean, median, and mode.
• For individual data values from a normal distribution use: $z = \dfrac{x - \mu}{\sigma}$
• For group averages with n ≥ 30, use: $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
Answer: Question 1
P ( z < 2.16) = .9846 is at the intersection of the Z-column row 2.1 and the top column labeled .06
1. Sketch the curve.
2. Mark the mean.
3. Shade the area.
4. Use the table to find the area to the left of 2.16
Answer: Question 2
Find the area to the right of -2.45
Answer: Question 3
Find the area between z = 0.79 and z = 1.28:
Answer: Question 4
Find the area between z = 1.50 and z = 2.50:
Answer: Question 5
Find the area between z = -2.16 and z = 0:
Answer: Question 6
P(z > -2.56) =
a.) .4948
b.) .0052
c.) .9948
d.) -.4948
Answer: Question 7
P(-1.76 < z < 2.45) =
a.) .9537
b.) .0321
c.) -.0321
d.) None of the above.
Answer: Question 8
P(1.87 < z < 2.76) =
a.) .9664
b.) .0278
c.) -.0278
d.) None of the above.
Answer: Question 9
P(z > 2.61) =
a.) .4955
b.) .9955
c.) .0045
d.) -.0045
Answer: Question 10
Calculate each probability. The first one has been completed for you.
1. P(z > 2.56) = .0052
2. P(-1.76 < z < 2.45) = .9929 - .0392 = .9537
3. P(-1.87 < z < 0) = .5 - .0307 = .4693
4. P(z > 2.61) = 1 - .9955 = .0045
5. P(0 < z < 2.34) = .9904 - .5000 = .4904
Answer: Question 11
Find the z value that corresponds to the given area.
Method:
1. Sketch
2. The area or probability to the left of z is: 1 - .8962 = .1038
3. .1038 is located inside Table E at the intersection of row -1.2 and column .06.
4. z = -1.2 + -.06 = -1.26
Answer: Question 12
Find the z-values that correspond to the given area or find the z-values that will bound the center
90% of the normal curve.
Answer: Question 13
The average credit card debt for college seniors is $3262. If the debt is normally distributed with a
standard deviation of $1100, find these probabilities:
P(3000 < x < 4000) = P(-0.24 < z < 0.67) = .7486 - .4052
= 0.3434 or 34.34%
Answer: Question 14
Test scores for the police academy are normally distributed with mean 95 and standard deviation 15.
To qualify, a candidate must score in the top 10%. Find the lowest possible score to qualify.
x=z∙σ+μ
z = 1.28
μ = 95
σ = 15
x = 1.28 ∙ 15 + 95
x = 114.2
114.2 is the lowest score for a candidate to qualify.
Answer: Question 15
A. C. Nelson reported that children between 2 and 5 years old watch an average of 14 hours of TV
per week. Assuming the variable (hrs of TV watched) is normally distributed with a standard
deviation of 3, find the chance that:
1. My neighbor John, who is 4, will watch 20 or more hours of TV this week.
- for an individual, use N(14, 3)
P(x > 20) = P (z > 2) = 1 - .9772 = .0228
2. The average number of hours watched by the 25 members of John’s play group is 20 hours or
more.
Answer: Question 16
The average cholesterol content of a certain brand of eggs is 215 milligrams, and the standard
deviation is 15 milligrams. Assume the variable is normally distributed.
1. If a single egg is selected, find the probability that the cholesterol content will be less than
210 milligrams.
2. If a dozen eggs are selected, what is the chance their average cholesterol is 217 or less?
Cholesterol for eggs is N(215, 15).
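The difference between a single value and a group average can be sketched in code. This illustration assumes that, because the population itself is normal, the mean of 12 eggs is normal with standard deviation σ/√n; the variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 215.0, 15.0   # cholesterol per egg, N(215, 15)

# Part 1: a single egg uses the population standard deviation.
p_single = NormalDist(mu, sigma).cdf(210)

# Part 2: the average of n eggs uses the standard error sigma / sqrt(n).
n = 12
standard_error = sigma / sqrt(n)
p_dozen = NormalDist(mu, standard_error).cdf(217)

print(f"P(x < 210) for one egg      = {p_single:.4f}")
print(f"P(mean <= 217) for 12 eggs  = {p_dozen:.4f}")
```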
Answer: Question 17
P(-1.34 < z < 3.45) = normalcdf(-1.34, 3.45) = .9096
Answer: Question 18
P(z > 2.34) = normalcdf(2.34, 1000) = .0096
Answer: Question 19
P(z < -1.79) = normalcdf(-1000, -1.79) = .0367
Answer: Question 20
For z, find P (z < 2.87)
normalcdf (-1000, 2.87) = .9979
Answer: Question 21
For a normal random variable x with mean 12 and variance 6, find the probability that x is more
than 16.
LESSON TWO
Learning Objectives
At the end of this lesson you should be able to:
2.0 Introduction
Statistical inference about the characteristics of a population or process is needed
because there is rarely enough information to calculate exact population parameters (such as
µ, σ, and p). The best estimate of each parameter is therefore made from the corresponding sample
statistic (such as x̄, s, and p̂). Using a sample statistic to draw conclusions
about a population characteristic is one of the fundamental applications of statistical
inference in business and economics. A few applications are given below:
In all such cases, a decision-maker needs to examine the following two concepts, which are
useful for drawing statistical inferences about unknown population or process
parameters based upon random samples:
In this lesson, we shall discuss the methods used to estimate unknown population
parameters and then to determine the range of values (confidence interval) likely to
contain the parameter value.
unbiasedness
consistency
efficiency
Unbiasedness
The value of a statistic measured from a given sample is likely to be above or below the
actual value of the population parameter of interest due to sampling error. It is therefore desirable
that the expected value, or mean, of the statistic over all possible random samples
is equal to the population parameter being estimated. If this is true, then the sample
statistic is said to be an unbiased estimator of the population parameter. That is, the
sample statistic θ̂ is an unbiased estimator of the population parameter θ
provided E(θ̂) = θ, where E(θ̂) is the expected value or mean of the sample statistic.
[Figure: sampling distributions of statistic 1 and statistic 2, marking θ, E(θ̂), and the bias]
Consistency
A point estimator is said to be consistent if its value tends to become closer to the
population parameter θ as the sample size increases. For example, the standard error of the
sampling distribution of the mean, $\sigma_{\bar{x}} = \sigma/\sqrt{n}$, becomes smaller as the sample size n
increases. Thus the sample mean x̄ is a consistent estimator of the population mean µ.
Similarly, the sample proportion p̂ is a consistent estimator of the population proportion
p because its standard error, $\sigma_{\hat{p}} = \sigma/\sqrt{n}$, also becomes smaller as n increases.
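How the standard error shrinks with sample size can be seen with a few numbers. The sketch below uses a hypothetical population standard deviation of 10; only the formula σ/√n comes from the text.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma divided by the square root of n."""
    return sigma / sqrt(n)

sigma = 10.0  # hypothetical population standard deviation

# Quadrupling the sample size halves the standard error.
for n in (25, 100, 400):
    print(f"n = {n:3d}  standard error = {standard_error(sigma, n):.2f}")
```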
Efficiency
For the same population, of two unbiased point estimators, the one with the
smaller standard deviation is said to be more efficient because it provides an estimate closer to
the population parameter: there is less variation in the
sampling distribution of that statistic. For example, for a simple random sample of size n,
if θ̂1 and θ̂2 are two unbiased point estimators of the population parameter θ, then the
relative efficiency of θ̂1 to θ̂2
is given by
$\text{Relative efficiency} = \dfrac{\sigma^2(\hat{\theta}_1)}{\sigma^2(\hat{\theta}_2)}$
The figure below shows the sampling distributions of two statistics, θ̂1 and θ̂2, which are
being considered for estimation of the population parameter θ. Since the standard deviation
(or error) of statistic θ̂2 is less than that of θ̂1, values of θ̂2 are more likely
to provide an estimate that is close to parameter θ for a given sample. The statistic θ̂1
tends to have a larger estimation error both above and below the parameter θ. Thus the
estimates obtained from statistic θ̂2 are more consistently close to θ than those of θ̂1.
[Figure: sampling distributions of statistic 2 and statistic 1 about the parameter θ]
possible sampling errors involved based on the sampling distribution of the statistic. It is
therefore important to know the precision of an estimate before relying on it to make
decisions. Thus, decision makers prefer to use an interval estimate that is likely to contain
the population parameter value. However, it is also important to state how confident one is that the
interval estimate actually contains the parameter value. An interval estimate of a
population parameter is therefore a confidence interval accompanied by a statement of confidence
that the interval contains the parameter value.
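$\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}}$ or $\bar{x} \pm z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$
Or $\bar{x} - z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$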
Where zα/2 is the z value representing an area α/2 in the right and left tails of the standard
normal probability distribution, and (1- α) is the level of confidence as shown in figure
2.2 below:
[Figure 2.2: standard normal curve. Part A (47.5%) + Part B (47.5%) = 95% confidence; Tail 1 = α/2 = 2.5%; Tail 2 = α/2 = 2.5%]
Figure 2.2 shows a 95% level of confidence (two tailed test). The desire is to estimate the
mean with 95% confidence. This area will be composed of two equal segments (A and B)
each containing an area of 47.5% (i.e. 50%-2.5% for A and 97.5%-50% for B).
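The interval x̄ ± zα/2 σ/√n can be sketched as a small function. The sample figures below (mean 50, σ = 8, n = 64) are hypothetical and chosen only to make the arithmetic easy to follow; `z_confidence_interval` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence limits for a population mean when sigma is known."""
    alpha = 1.0 - confidence
    # z value that leaves an area of alpha/2 in the right tail.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

low, high = z_confidence_interval(x_bar=50.0, sigma=8.0, n=64)
print(f"95% confidence interval: {low:.2f} to {high:.2f}")
```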
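$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
For a desired confidence level, the confidence interval limits for the difference between the population means
(μ1 – μ2) are given by
$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\,\sigma_{\bar{x}_1 - \bar{x}_2}$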
Illustration
The strength of wire produced by company A has a mean of 4500 kg and a standard
deviation of 200 kg. Company B has a mean of 4000 kg and a standard deviation of
300 kg. A sample of 50 wires of company A and 100 of company B are selected at
random to test strength. Find 99% confidence limits on the difference in average
strength of populations of wires produced by the two companies.
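Solution
The following information is given:
Company A: $\bar{x}_1 = 4500$, $\sigma_1 = 200$, $n_1 = 50$
Company B: $\bar{x}_2 = 4000$, $\sigma_2 = 300$, $n_2 = 100$
Therefore: $\bar{x}_1 - \bar{x}_2 = 4500 - 4000 = 500$; and $z_{\alpha/2} = 2.576$
$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}} = \sqrt{\dfrac{200^2}{50} + \dfrac{300^2}{100}} = 41.23$
The 99% confidence limits are therefore $500 \pm 2.576(41.23) = 500 \pm 106.21$, that is, from 393.79 kg to 606.21 kg.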
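The standard error and the resulting limits can be checked in a few lines. This sketch uses the figures listed in the solution (means 4500 and 4000, standard deviations 200 and 300, sample sizes 50 and 100) and assumes `statistics.NormalDist` for the 99% z value.

```python
from math import sqrt
from statistics import NormalDist

x1, s1, n1 = 4500.0, 200.0, 50    # company A
x2, s2, n2 = 4000.0, 300.0, 100   # company B

# Standard error of the difference between two sample means.
se_diff = sqrt(s1**2 / n1 + s2**2 / n2)

# z value leaving 0.5% in each tail for 99% confidence.
z = NormalDist().inv_cdf(0.995)

diff = x1 - x2
low, high = diff - z * se_diff, diff + z * se_diff

print(f"standard error = {se_diff:.2f}")
print(f"99% limits: {low:.1f} to {high:.1f}")
```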
$\bar{x} \pm z_{\alpha/2}\,s_{\bar{x}}$ or $\bar{x} \pm z_{\alpha/2}\,\dfrac{s}{\sqrt{n}}$
When the population standard deviation is not known and the sample size is small, the
procedure of interval estimation of the population mean is based on a probability distribution
known as the t-distribution. This distribution is very similar to the normal distribution.
However, the t-distribution has more area in the tails and less in the centre than does the
normal distribution. The t-distribution depends on a parameter known as the degrees of
freedom. As the number of degrees of freedom increases, the t-distribution gradually
approaches the normal distribution, and the sample standard deviation s becomes a better
estimate of the population standard deviation σ.
The interval estimate of a population mean when the sample size is small (n ≤ 30), with
confidence coefficient (1 − α), is given by
$\bar{x} \pm t_{\alpha/2}\,\dfrac{s}{\sqrt{n}}$ or $\bar{x} - t_{\alpha/2}\,\dfrac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2}\,\dfrac{s}{\sqrt{n}}$
where tα/2 is the critical value of the t-test statistic providing an area α/2 in the right tail of the
t-distribution with n − 1 degrees of freedom, and
$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
The critical value of t for the given degrees of freedom can be obtained from the table of
t-distribution. The procedure of the confidence interval estimation of population mean μ
when the population standard deviation is unknown and sample size is large or small, is
summarized on the table below
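Sample size   Standard deviation    Interval estimate
Large         σ assumed known       $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$
Large         σ estimated by s      $\bar{x} \pm z_{\alpha/2}\,s/\sqrt{n}$
Small         σ assumed known       $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$
Small         σ estimated by s      $\bar{x} \pm t_{\alpha/2}\,s/\sqrt{n}$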
Illustration
The personnel department of an organization would like to estimate the family dental
expenses of its employees to determine the feasibility of providing a dental insurance
plan. A random sample of 10 employees reveals the following family dental expenses (in
thousands) in the previous year:
11, 37, 25, 62, 51, 21, 18, 43, 32, 20.
Set up a 99% confidence interval for the average family dental expenses of the employees of
this organization.
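Solution
The calculations for the sample mean and standard deviation are shown below:
$\bar{x} = \sum x / n = 320/10 = 32$
$s = \sqrt{\sum (x - \bar{x})^2/(n - 1)} = \sqrt{2358/9} = 16.19$
The standard error is $s/\sqrt{n} = 16.19/\sqrt{10} = 5.12$, and the table value of t for 9 degrees of
freedom with α/2 = 0.005 is 3.250, so the limits are $32 \pm 3.250(5.12) = 32 \pm 16.64$.
Hence the mean expenses per family are likely to fall between Ksh. 15.36 and
Ksh. 48.64 (in thousands). (15.36 ≤ μ ≤ 48.64)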
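As a cross-check, the small-sample interval x̄ ± tα/2 s/√n can be recomputed directly from the ten observations. This is a sketch under one stated assumption: the critical value t = 3.250 (9 degrees of freedom, α/2 = 0.005) is taken from a standard t table and hard-coded, since the Python standard library has no t-distribution.

```python
from math import sqrt
from statistics import mean, stdev

expenses = [11, 37, 25, 62, 51, 21, 18, 43, 32, 20]  # in thousands

n = len(expenses)
x_bar = mean(expenses)
s = stdev(expenses)            # sample standard deviation, divisor n - 1
standard_error = s / sqrt(n)

T_CRITICAL = 3.250  # assumed t-table value for df = 9, 99% confidence

low = x_bar - T_CRITICAL * standard_error
high = x_bar + T_CRITICAL * standard_error

print(f"mean = {x_bar}, s = {s:.2f}, standard error = {standard_error:.2f}")
print(f"99% confidence interval: {low:.2f} to {high:.2f}")
```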
$p \pm z_c \sqrt{\dfrac{pq}{n}}$
Exercises
1. Explain the term ‘estimation’ as applied in the field of statistics.
2. Differentiate between point estimation and interval estimation.
3. What criteria are observed when selecting an estimator?
4. What is unbiasedness?
5. What is the difference between consistency and efficiency?
9. A factory is producing 50,000 pairs of shoes daily. From a sample of 500 pairs,
20% were found to be of substandard quality. Estimate the number of pairs that
can be reasonably expected to be spoiled in the daily production and assign limits
at 5% level of significance.
LESSON THREE
HYPOTHESIS TESTING
Lesson Objectives
At the end of this lesson you should be able to:
3.0 Introduction
The statistician draws inference about the population parameters using the knowledge of
point estimate and interval estimate. Statistical inference is therefore based on the sample
statistic. A statistical hypothesis is a claim or belief about an unknown population
parameter value. The methodology that enables a decision maker to draw inference about
population characteristics by analysing the difference between the value of sample
statistic and the corresponding hypothesized parameter value is called hypothesis
testing. For example, a drug manufacturing company plans to test the efficiency of a new
drug against a disease on the belief that 95% of all persons suffering from the disease get
cured. The company will then draw a random sample of patients suffering from the
disease and administer the drug. The number of sampled patients that get cured will
determine the success of the drug. If the percentage of those who get cured is more than
95%, then the drug is likely to be successful.
In statistical analysis, whether the difference between the value of the statistic and the
hypothesised parameter value is significant is judged in terms of a given level of probability,
assuming that the hypothesised value is correct.
The probability that a particular level of deviation occurs by chance can be calculated
from the known sampling distribution of the test statistic.
The probability level at which the decision maker concludes that observed difference
between the value of test statistic and hypothesised parameter value cannot be due to
chance is called the level of significance of the test. For example, a decision-maker may
feel that a difference that could occur by chance five per cent of the time is not significant
even if the hypothesis is correct.
Step One: State the Null Hypothesis (H0) and Alternative Hypothesis (H1)
The null hypothesis H0 refers to the hypothesised numerical value or range of values of
the population parameter. Theoretically hypothesis testing requires that the null
hypothesis be considered true until it is proved false on the basis of results observed from
the sample data. The null hypothesis is always expressed in the form of an equation
making a claim regarding the specific value of the population parameter. That is:
H0: μ = μ0
An alternative hypothesis, H1, is the logical opposite of the null hypothesis, that is, an
alternative hypothesis must be true when the null hypothesis is found to be false. The
alternative hypothesis is written as:
H1: μ ≠ μ0
Step Two: State the Level of Significance α (alpha) for the Test
The level of significance, usually denoted by α, is specified before the samples are drawn,
so that the results obtained should not influence the choice of the decision-maker. It is
specified in terms of the level of probability of null hypothesis H0 being wrong. That is,
the probability of rejecting H0 when it is true.
[Figure: two-tailed test. H0 is rejected beyond zα/2 in each tail; acceptance region (1 – α) in the centre, around 0]
If the value of the test statistic falls into the acceptance region, the null hypothesis is
accepted; otherwise it is rejected. The rejection region consists of all values of the test
statistic that are unlikely to occur if the null hypothesis is true. On the other hand, these values
are more likely to occur if the null hypothesis is false. The value of the sample statistic that
separates the regions of acceptance and rejection is called the critical value.
The size of the rejection region is directly related to the level of precision needed to make a
decision about a population parameter. The decision rule concerning the null hypothesis is as
follows:
Prob (H0 is true) ≤ α, reject H0
In other words, if the probability of H0 being true is less than or equal to α, reject H0; otherwise
accept it.
For example, $z = \dfrac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$; σ is known, n > 30
The choice of a probability distribution of a sample statistic is guided by the sample size
n and the value of population standard deviation σ as shown below:
The same decision rules hold for the t-distribution associated with the sampling
distribution of means or proportions when the sample size is small.
Type I Error
A Type I error occurs when the decision maker rejects the null hypothesis (H0) when it is
true. The Greek letter α designates the probability of this error.
Type II Error
The probability of committing another type of error, called type II error, is designated by
Greek letter beta (β). Type II error occurs when one accepts the null hypothesis when it is
false.
The following table summarizes the decisions the researcher could make and the possible
consequences.
                     Researcher
Null Hypothesis      Accepts H0          Rejects H0
H0 is true           Correct decision    Type I error
H0 is false          Type II error       Correct decision
Illustration
The firm manufacturing bottles would commit a Type II error if, unknown to the
manufacturer, an incoming shipment of raw materials contained 15% substandard
materials although only 7% was expected. This could happen if, out of a sample of 50
items, two items (4%) were found to be substandard. According to the standard
procedure, the shipment would be accepted because the sample has passed the test (the proportion of bad
items in the sample is less than 7%).
In actual sense, the decision-maker cannot study every item or individual in the
population. Thus there is a possibility of two types of errors: (1) rejecting the shipment
when the substandard items are less than 7%; and (2) accepting the shipment when the
substandard items are more than 7%. The latter error can be more costly than the former
in real life situations.
Illustration
Suppose that the packaging department at Home Economics Corporation is concerned
that some boxes are overweight. The cereal is packaged in 500-gram boxes. The null
hypothesis will be:
H0: μ ≤ 500.
This is read, “The population mean μ is equal to or less than 500”. The alternative
hypothesis is therefore:
H1: μ > 500.
This is read, “The population mean μ is greater than 500”. The inequality sign in the
alternative hypothesis (>) points to the region of rejection in the right (upper) tail. It
should be noted that the null hypothesis includes the equal sign while the alternate
hypothesis has no equal sign. The packaging problem can be represented using the
diagram below:
[Figure: right-tailed test, with the region labelled "Do not reject H0" to the left of the rejection region]
The above illustration showed a problem where the rejection region was to the right. In
some problems, the rejection region is in the opposite direction (i.e. to the left).
Illustration
Consider the problem of an automobile manufacturer, large automobile leasing
companies, and other organisations that purchase large quantities of tyres. They want the
tyres to average 40,000 miles of wear under normal usage. They will therefore reject a
shipment of tyres if accelerated life tests show that the life of tyres is significantly below
40,000 miles on average. They will however gladly accept a shipment if the mean life is
greater than 40,000 miles. The procedure to test this problem is to have a sample and test
the null and alternate hypotheses. The tests will be as follows:
H0: μ ≥ 40,000.
H1: μ < 40,000.
[Figure: left-tailed test, with the region labelled "Do not reject H0" to the right of the rejection region]
Note: the z values for any level of significance depend on whether the test is a two-tailed
or a one-tailed test. For example:
Level of significance (α)   z-value(s), two-tailed   z-value(s), one-tailed
0.10                        -1.65 and +1.65          -1.28 or +1.28
0.05                        -1.96 and +1.96          -1.65 or +1.65
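Critical z values like those tabulated above can be derived from the level of significance. The following sketch assumes `statistics.NormalDist.inv_cdf` in place of the printed table; `critical_z` is a hypothetical helper name.

```python
from statistics import NormalDist

def critical_z(alpha, two_tailed):
    """Positive critical z value for a given level of significance."""
    # A two-tailed test splits alpha equally between the two tails.
    tail_area = alpha / 2.0 if two_tailed else alpha
    return NormalDist().inv_cdf(1.0 - tail_area)

for alpha in (0.10, 0.05):
    two = critical_z(alpha, two_tailed=True)
    one = critical_z(alpha, two_tailed=False)
    print(f"alpha = {alpha:.2f}  two-tailed: +/-{two:.3f}  one-tailed: {one:.3f}")
```

Printed tables usually round these values to two decimals, so 1.645 typically appears as 1.65.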
One way to determine the location of the rejection region is to look at the direction in
which the inequality sign in the alternate hypothesis is pointing. When the alternate
hypothesis points >, it is a right tailed test; when it points < it becomes a left tailed test.
If the null hypothesis is rejected and the alternate hypothesis accepted in this test, the
mean incomes of men could be greater than that of females or vice versa. To
accommodate these two possibilities, the five percent rejection is divided into two areas
of the sampling distribution (2.5 % each) (refer to figure 3.3).
Illustration
Makini Steel Company manufactures and assembles desks at several plants in Western
Kenya. The weekly production of model A325 desk at department Z3 is normally
distributed, with a mean of 200 and a standard deviation of 16. Due to market expansion,
new production methods have been introduced and new employees hired. The CEO
would like to investigate whether there has been an overall change in the weekly
production of the model A325. The CEO further informs you that he is willing to accept a
one percent error in the results of investigation.
Required:
Assist the CEO in the investigation.
Solution
Statistical hypothesis testing will be used to investigate whether the production has
changed from 200 per week.
Step 1
The null hypothesis is “the population mean is 200.” The alternate hypothesis is “the
mean is different from 200” or “the mean is not 200”
H0: μ = 200.
H1: μ ≠ 200.
This is a two-tailed test because the alternate hypothesis does not state a direction. In
other words, it does not state whether the mean production is greater than 200 or less than
200. The CEO only wants to find out whether the production rate is different from 200.
Step 2
As noted, the one percent error translates to a 0.01 level of significance. This is α, the
probability of committing a Type I error, that is, the probability of rejecting a true
hypothesis.
Step 3
The test statistic for this type of problem is z. Transforming the production data into
standard units permits the use of z values not only in this problem but also in other
hypothesis testing problems.
$z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
Step 4
The decision rule is formulated by finding the critical value of z from the table. Since this is
a two-tailed test, half of 0.01, or 0.005, is in each tail. The area where H0 is not rejected,
located between the two tails, is therefore 0.99.
The decision rule is therefore: reject the null hypothesis and accept the alternate hypothesis
(which states that the population mean is not 200) if the computed value of z does not fall
in the region between –2.58 and +2.58. Do not reject the null hypothesis if z falls between
–2.58 and +2.58.
Step 5
Take a sample from the population (weekly production), compute z, and, based on the
decision rule, arrive at a decision on whether to reject H0.
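The five steps can be sketched end to end in code. The notes do not give a sample, so the sample mean of 203.5 from 50 weeks used below is purely hypothetical; the hypothesised mean (200), σ (16), and critical value (2.58) come from the illustration.

```python
from math import sqrt

MU_0 = 200.0       # hypothesised mean weekly production (H0)
SIGMA = 16.0       # population standard deviation
Z_CRITICAL = 2.58  # two-tailed critical value at the 0.01 level

def two_tailed_z_test(sample_mean, n):
    """Return the computed z and whether H0 is rejected."""
    z = (sample_mean - MU_0) / (SIGMA / sqrt(n))
    reject_h0 = abs(z) > Z_CRITICAL
    return z, reject_h0

# Hypothetical sample: 50 weeks with a mean of 203.5 desks.
z, reject_h0 = two_tailed_z_test(sample_mean=203.5, n=50)
print(f"z = {z:.2f}, reject H0: {reject_h0}")
```

With these made-up figures z lies between –2.58 and +2.58, so H0 would not be rejected.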
Illustration
Thomas discount store chain issues its own credit card. The credit manager wants to
know whether the mean monthly unpaid balance is more than Sh 400. The level of
significance is set at 0.05. A random check of 172 unpaid balances revealed the sample
mean is Sh 407 and the standard deviation of the sample is Sh 38. Should the credit manager
conclude the population mean is greater than Sh 400, or is it reasonable that the difference
of Sh 7 (407 − 400) is due to chance?
Solution
H0: μ ≤ 400
H1: μ > 400
From the table, the critical value of z is 1.65. The computed value of z is 2.42, found by
using the formula
$z = \dfrac{\bar{x} - \mu}{s/\sqrt{n}} = \dfrac{407 - 400}{38/\sqrt{172}} = \dfrac{7}{2.8975} = 2.42$
Because the computed value of the test statistic (2.42) is larger than the critical value
(1.65), the null hypothesis is rejected. The credit manager can conclude that the mean
unpaid balance is greater than Sh400.
The p-value provides additional insight into the decision. The p-value is the probability of
finding a test statistic as large as or larger than that obtained when the null hypothesis is
true. The probability of a z value between 0 and 2.42 is 0.4922. Therefore, the probability
of a z value greater than 2.42 is 0.5 − 0.4922 = 0.0078. That is, the probability of finding
a z value of 2.42 or larger when the null hypothesis is true is 0.78%. It is therefore unlikely
that the null hypothesis is true.
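The test statistic and p-value for the unpaid-balance example can be reproduced as follows. This sketch uses the sample mean of 407 that appears in the computation, and assumes `statistics.NormalDist` for the tail area; because z is not rounded to 2.42 first, the p-value differs slightly in the fourth decimal.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 400.0               # hypothesised mean under H0
x_bar, s, n = 407.0, 38.0, 172
z_critical = 1.65          # one-tailed critical value at the 0.05 level

# Test statistic for a large sample with sigma estimated by s.
z = (x_bar - mu_0) / (s / sqrt(n))

# Right-tailed p-value: area to the right of the computed z.
p_value = 1.0 - NormalDist().cdf(z)

reject_h0 = z > z_critical
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```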
In 1908, W.S. Gosset developed a method for dealing with small samples. He showed that if
the same procedures as the z-test were used for small samples, then a Type I error would be
made more often. For example, when α = 0.05, using the standardised normal
distribution for large samples means making a Type I error (rejecting the null hypothesis when
in fact it is true) 5% of the time. But in small samples, this error will be committed
more than 5% of the time. The reason is that in repeated experiments with small
samples, the value of the sample standard deviation (s) tends to be quite variable.
The variation between the sample mean and the population mean is given by
$\sigma_{\bar{x}} = s/\sqrt{n}$.
This relationship is not valid for small samples because of wide fluctuations in the values
of the sample standard deviations.
Based upon this variation, Gosset came up with different sets of critical scores called t-
scores (or student-t distribution). These t-scores are to be used in place of Z scores. The
larger the sample size, the closer will be the value of t-score to Z-score.
The t distribution is useful not only when the sample size is small but also when the
population standard deviation is not known. If the population standard deviation is
known, then even a small sample from a normal population can be handled with the normal
distribution. In addition, the population from which the small sample is taken must
be distributed close to normal. While the mean of a large sample from any population can be
approximated by the normal distribution, a small sample must come from a normal or
near-normal population in order for a t-test to be used. The t-scores should not be used if
small samples come from a population that is distributed in a non-normal pattern.
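Test statistic:
$$ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} $$
Confidence interval for the mean:
$$ \bar{x} \pm t\,s_{\bar{x}}, \quad \text{that is,} \quad \bar{x} \pm t\,\frac{s}{\sqrt{n}} $$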
Illustration
A claim is made that McNtany college students have an IQ of 120. To test this claim, a
random sample of 10 students was taken and their IQ scores are recorded as follows:
Student 1 2 3 4 5 6 7 8 9 10
IQ 105 110 120 125 100 130 120 115 125 130
Solution
Since sample size is small (10) and population standard deviation is not known, we can
use the t-distribution with 10-1 degrees of freedom at α = 0.05.
x      x̄      x - x̄    (x - x̄)²
100    118    -18      324
105    118    -13      169
110    118    -8       64
115    118    -3       9
120    118    2        4
120    118    2        4
125    118    7        49
125    118    7        49
130    118    12       144
130    118    12       144
Σ 1180                 960
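H0: μ = 120
H1: μ ≠ 120
The sample mean is x̄ = 1180/10 = 118 and the sample standard deviation is
$$ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{960}{9}} = 10.33 $$
so the test statistic is
$$ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{118 - 120}{10.33/\sqrt{10}} = -0.61 $$
From the table, the critical t-score at α = 0.05 and 9 degrees of freedom is 2.26. Since the
absolute value of the computed t (0.61) is less than 2.26, we do not reject the null hypothesis.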
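The same test can be sketched in Python using only the standard library. This is an illustrative check, not part of the original solution; the critical value 2.26 is simply the table value quoted in the text, and the variable names are our own.

```python
import math
import statistics

iq = [105, 110, 120, 125, 100, 130, 120, 115, 125, 130]
mu0 = 120                      # claimed population mean
t_critical = 2.26              # table value quoted in the text (alpha = 0.05, 9 df)

n = len(iq)
x_bar = statistics.mean(iq)    # 1180 / 10
s = statistics.stdev(iq)       # sqrt(960 / 9), the sample standard deviation
t = (x_bar - mu0) / (s / math.sqrt(n))

reject = abs(t) > t_critical   # two-tailed decision rule
print(f"mean = {x_bar}, s = {s:.2f}, t = {t:.3f}, reject H0: {reject}")
```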
The t-statistic for comparison of two population means is similar to the procedure of
using the z-statistic for comparison of two population means. Two additional elements
are considered when using t-test. These are:
1. The number of degrees of freedom is the sum of the degrees of freedom for each
sample. [df = (n1 – 1) +( n2 – 1)].
2. The two standard deviations s1 and s2, calculated from two samples of size n1 and
n2 respectively, are pooled together to form a single estimate (sp) of the population
standard deviation, where sp is calculated as
$$ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $$
The test statistic is
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} $$
Where: x̄1 is the mean of the first sample
x̄2 is the mean of the second sample
The calculated t-value is compared with the critical t-score from the table at a
given level of significance and (n1 + n2 - 2) degrees of freedom (df). A decision is then made
on whether or not to reject the null hypothesis.
Illustration
A study was carried out to find out whether there is any significant difference in the
amount of money carried by KU male and female students on any given day. A random
sample of 8 male students and 10 female students was selected and the amount of money
they had was recorded. After calculating the sample means and sample standard
deviations, the following results were obtained:
Solution
H0: μ1 = μ2
H1: μ1 ≠ μ2
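$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} $$
Calculating sp
$$ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)}} $$
Therefore, with sp² = 301.56,
$$ t = \frac{205 - 170}{\sqrt{301.56\left(\frac{1}{8} + \frac{1}{10}\right)}} = \frac{35}{8.237} = 4.25 $$
The critical value of t from the table at α = 0.05 and df = 16 is 2.12; hence we reject the
null hypothesis, since the calculated t-value is higher.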
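A minimal Python sketch of the pooled t calculation follows. The function names pooled_variance and pooled_t are our own, and because the individual sample standard deviations are not shown here, the example starts from the pooled variance 301.56 used in the solution.

```python
import math

def pooled_variance(n1, s1, n2, s2):
    """Pooled estimate of the common variance from two sample standard deviations."""
    return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

def pooled_t(mean1, mean2, sp2, n1, n2):
    """t statistic for the difference between two means, given pooled variance sp2."""
    return (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Values used in the solution: means 205 and 170, pooled variance 301.56, n1 = 8, n2 = 10.
t = pooled_t(205, 170, 301.56, 8, 10)
df = 8 + 10 - 2
print(f"t = {t:.2f} with {df} degrees of freedom")
```

When s1 and s2 are available, pooled_variance(n1, s1, n2, s2) supplies the sp2 argument.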
Chi Square is used to analyse qualitative data such as opinions, habits, etc. and it has the
following properties:
1. It involves squared observations and hence it is always positive. Its value is
always greater than or equal to zero.
2. The distribution is not symmetrical. It is skewed to the right so that its skewness is
positive. As the number of degrees of freedom increases, Chi Square approaches a
symmetric distribution.
3. Similar to t distribution, there is a family of Chi Square distributions.
The number of degrees of freedom (df) for χ2 is based on the number of categories:
df = k - 1, where k is the number of categories. For example, if a sample of 100 students is
divided into freshmen, juniors, and seniors, there will be (k - 1 = 3 - 1) = 2 degrees of
freedom.
The χ2 test determines whether there is a significant difference between the observed number
of responses in each category and the expected number of responses for that category under
the assumption of the null hypothesis. The objective is to find how well the distribution of
observed frequencies fo fits the distribution of expected frequencies fe.
$$ \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} $$
Illustration
Sixty children were asked to state the ice-cream flavour they preferred among the three
available categories of Vanilla, Strawberry, and Chocolate
Determine whether children favour any particular flavour compared to other flavours.
Solution
The null hypothesis states that there is no difference in the children's preferences among
the three flavours. Under this hypothesis, equal numbers of children are expected to prefer
each flavour. Therefore 20 should prefer Vanilla, another 20 should prefer Strawberry and
the last 20 should prefer Chocolate.
$$ \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = 1.3 $$
At α = 0.05 and 3 - 1 = 2 df, the critical χ2 is 5.991. Since our computed value of χ2 is less
than the critical value of χ2, we cannot reject the null hypothesis.
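The goodness-of-fit calculation can be sketched in Python as below. Note the assumption: the observed counts in this sketch are hypothetical, since the table of observed frequencies is not reproduced here. They were chosen only so that they total 60 children and are consistent with the statistic of 1.3 quoted above.

```python
def chi_square(observed, expected):
    """Sum of (fo - fe)^2 / fe over all categories."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

# HYPOTHETICAL observed counts (not the original table); they total 60 children.
observed = [17, 24, 19]          # Vanilla, Strawberry, Chocolate (assumed values)
expected = [20, 20, 20]          # equal preference under the null hypothesis

chi2 = chi_square(observed, expected)
df = len(observed) - 1           # k - 1 categories
critical = 5.991                 # table value quoted in the text (alpha = 0.05, 2 df)
print(f"chi-square = {chi2:.2f}, df = {df}, reject H0: {chi2 > critical}")
```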
Exercises
1. What is test of significance?
2. Explain the procedure generally followed in testing of hypothesis.
3. Differentiate between type I and type II errors.
4. Define the standard error of a statistic. How is it helpful in testing hypotheses and
decision making?
5. Distinguish between the null and alternate hypothesis.
8. A machine puts out 20 imperfect articles in a sample of 1000. After the machine
is overhauled, it puts out 5 imperfect articles in a sample of 300. Has the machine
improved?
9. A manufacturing company requires that the mean length of the rods it produces
should be 8.6 inches. The standard deviation of these rods is 0.3 inch. The
manufacturer would like to see if the process is working correctly by taking a
random sample of 36. There is no indication of whether or not the rods may be
too short or too long.
i. Establish null and alternative hypothesis for this problem
ii. If the random sample yields an average length of 8.7 inches, which
hypothesis should be accepted?
11. The time required by a doctor to treat patients follows a normal distribution curve.
A sample of six patients revealed the following times of treatment in number of
days.
Construct a 99% confidence interval for the true mean estimate of the time
required for treatment in number of days.
12. A large manufacturing company producing motors claims that orders of motors
are filled on average in 10 days. A random sample of 8 files showed that the
orders were filed in
13. A machine operator is in charge of two machines (A) and (B). Machine A has
been in service longer than machine B. The operator is interested in finding out
whether both machines take on average the same amount of time in producing the
same product. The operator records the following data on time taken by each
machine in minutes:
Machine A 30 32 28 29 30 25 31 30 28 29
Machine B 33 32 31 30 32 31 31 33 32 32
Test if there is a significant difference in the times for machine A and Machine B
in producing the product. Assume both samples are normally distributed and α =
0.01.
14. In consultation with the professors teaching the introductory business course, the
Chairman proposed that the students' grades should be based on the bell-shaped curve
and the following grading policy adopted.
Top 10% A
Next 20% B
Middle 40% C
Next 20% D
Bottom 10% F
Grade Frequency
A 10
B 10
C 10
D 14
F 6
The chairman wants you to analyse the data and inform him whether all
professors are following his policy. Assume α = 0.05
LESSON FOUR
CORRELATION ANALYSIS
Lesson Objective
At the end of this lesson you should be able to:
4.0 Introduction
Correlation is a technique used to measure the strength of the relationship between two
variables. There are two techniques for measuring correlation:
the product moment method; and
the rank method.
The purpose of regression analysis is to identify a relationship for a given set of bivariate
data. However, regression alone cannot tell us how good the relationship between X
and Y is. Correlation can be used to provide a measure of how well the regression line ‘fits’
the given set of data.
The computation of the correlation coefficient is most easily accomplished with the aid of
a statistical calculator. However, it is important to know how the figure is obtained
without the aid of mechanical tools.
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$
The correlation coefficient may take on any value between plus and minus one.
$$ -1.00 \le r \le +1.00 $$
The sign of the correlation coefficient (+, -) defines the direction of the relationship,
either positive or negative. A positive correlation coefficient means that as the value of
one variable increases, the value of the other variable increases; as one decreases the
other decreases. A negative correlation coefficient indicates that as one variable
increases, the other decreases, and vice-versa.
Taking the absolute value of the correlation coefficient measures the strength of the
relationship. A correlation coefficient of r = 0.50 indicates a stronger degree of linear
relationship than one of r = 0.40. Likewise a correlation coefficient of r = - 0.50 shows a
greater degree of relationship than one of r = 0.40. Thus a correlation coefficient of zero
(r = 0.0) indicates the absence of a linear relationship and correlation coefficients of
r = +1.0 and r = -1.0 indicate a perfect linear relationship.
Scatterplots
The scatterplots presented below illustrate how the correlation coefficient changes as the
linear relationship between the two variables is altered. When r = 0.0 the points scatter
widely about the plot, the majority falling roughly in the shape of a circle. As the linear
relationship increases, the circle becomes more and more elliptical in shape until the
limiting case is reached (r = 1.00 or r = -1.00) and all the points fall on a straight line.
[Figures: four scatterplots showing different degrees of linear relationship. Horizontal
axes: Machine Age (three plots) and Weeks of Experience (one plot).]
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$
Illustration
The following data relate to the weekly maintenance cost (Shs) to the age (in months) of
ten machines of similar type in a manufacturing company. You are required to calculate
the product moment correlation coefficient between age and cost.
Machine 1 2 3 4 5 6 7 8 9 10
Age 5 10 15 20 30 30 30 50 50 60
Cost 190 240 250 300 310 335 300 300 350 395
Solution
Let machine age be represented by x and cost by y.
x      y      xy       x²       y²
5      190    950      25       36100
10     240    2400     100      57600
15     250    3750     225      62500
20     300    6000     400      90000
30     310    9300     900      96100
30     335    10050    900      112225
30     300    9000     900      90000
50     300    15000    2500     90000
50     350    17500    2500     122500
60     395    23700    3600     156025
Σ 300  2970   97650    12050    913050
n 10
sum xy 97650
sum x 300
sum y 2970
sum x sq 12050
sum y sq 913050
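$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$
$$ r = \frac{10(97650) - (300)(2970)}{\sqrt{\left[10(12050) - 300^2\right]\left[10(913050) - 2970^2\right]}} = \frac{85500}{\sqrt{30500 \times 309600}} = 0.880 $$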
The result shows a strong measure of correlation between machine maintenance cost and
age of machine.
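The product moment calculation can be reproduced with a short Python sketch. The function name pearson_r is our own choice; the code simply applies the formula above to the ten machines.

```python
import math

age  = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]

def pearson_r(x, y):
    """Product moment correlation coefficient from the raw sums."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
    return num / den

r = pearson_r(age, cost)
print(f"r = {r:.3f}")
```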
Correlation can exist in such a way that increases in the value of one variable tend to be
associated with increases in the value of the other. This is known as positive or direct
correlation. In positive correlation, r will take the value between 0 and 1. When r = 1, it
signifies a perfect positive correlation.
Negative correlation exists when increases in the value of one variable tend to be
associated with decreases in the value of the other (and vice versa). In this type of case
the correlation is said to be negative (or inverse). In this case, the correlation coefficient,
r, will take a value of between 0 and -1, with r = -1 signifying ‘perfect’ negative
correlation.
$$ r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$
Illustration:
Find the rank correlation between the rent and rates values shown below:
Rates (x) 1.68 1.46 1.57 13.37 3.18 1.95 1.07 1.71 1.22 6.46
Rent (y) 3.81 4.19 4.87 22.85 6.47 6.48 2.66 6.49 5.33 15.23
Solution
x y rx ry d d²
1.68 3.81 5 2 3 9
1.46 4.19 3 3 0 0
1.57 4.87 4 4 0 0
13.37 22.85 10 10 0 0
3.18 6.47 8 6 2 4
1.95 6.48 7 7 0 0
1.07 2.66 1 1 0 0
1.71 6.49 6 8 -2 4
1.22 5.33 2 5 -3 9
6.46 15.23 9 9 0 0
N = 10 Total 26
Therefore,
$$ r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$
= 1 – 6(26)/[10(10² – 1)]
= 1 – 156/990
= 0.842.
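The ranking and the rank correlation can be sketched in Python as follows. The helper names ranks and spearman are our own, and the simple ranking used here assumes there are no tied values, which holds for this data set.

```python
rates = [1.68, 1.46, 1.57, 13.37, 3.18, 1.95, 1.07, 1.71, 1.22, 6.46]
rent  = [3.81, 4.19, 4.87, 22.85, 6.47, 6.48, 2.66, 6.49, 5.33, 15.23]

def ranks(values):
    """Rank 1 = smallest value. Assumes no tied values, as in this data set."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(x, y):
    """Rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n**2 - 1))

r_s = spearman(rates, rent)
print(f"rank correlation = {r_s:.3f}")
```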
Coefficient of determination (R²) = r², or
$$ R^2 = \frac{a\sum y + b\sum xy - n\bar{y}^2}{\sum y^2 - n\bar{y}^2} $$
Variance Interpretation
The coefficient of determination (R2) is the proportion of variance in Y that can be
accounted for by knowing X. Conversely, it is the proportion of variance in X that can be
accounted for by knowing Y.
One of the most important properties of variance is that it may be partitioned into
separate additive parts. For example, consider shoe size. The theoretical distribution of
shoe size may be presented as follows:
If the scores in this distribution were partitioned into two groups, one for males and one
for females, the distributions could be represented as follows:
If one knows the sex of an individual, one knows something about that person's shoe size,
because the shoe sizes of males are on average somewhat larger than those of females.
However, the variance within each distribution, male and female, is variance that cannot
be predicted on the basis of sex (this is known as error variance). For example, even if
one knows female sizes to be between 3.5 and 11, one cannot know the exact size of a
particular female.
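Therefore, total variance is the sum of the variance that can be predicted and the error
variance, or variance that cannot be predicted. This relationship is summarized below:
$$ s^2_{\text{total}} = s^2_{\text{predicted}} + s^2_{\text{error}} $$
The correlation coefficient squared is equal to the ratio of predicted to total variance:
$$ r^2 = \frac{s^2_{\text{predicted}}}{s^2_{\text{total}}} $$
This formula may be rewritten in terms of the error variance, rather than the predicted
variance, as follows:
$$ r^2 = 1 - \frac{s^2_{\text{error}}}{s^2_{\text{total}}} $$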
3. VISIT - How many provinces of Kenya (excluding your home province) have you
visited? _____
Since there are five questions on the example questionnaire there are 5 * 5 = 25 different
possible correlation coefficients to be computed. Each computed correlation is then
placed in a table with variables as both rows and columns, at the intersection of the row
and column variable names. For example, one could calculate the correlation between
AGE and KNOW, AGE and VISIT, AGE and DRIVE, AGE and SEX, KNOW and
VISIT, etc., and place them in a table of the following form.
One would not need to calculate all possible correlation coefficients. Only the ten
calculations (in bold) are necessary. This reduction in the number of required calculations
is due to the following facts:
1. The correlation of a variable with itself is always one (as shown by all the figures
on the leading diagonal).
2. The correlation coefficient is non-directional. That is, it doesn't make any
difference whether the correlation is computed between AGE and KNOW with
AGE as X and KNOW as Y or KNOW as X and AGE as Y. For this reason the
correlation matrix is symmetrical around the diagonal. Therefore all the figures
above the diagonal are unnecessary. The overall effect is to reduce the number of
correlation coefficients from 25 to 10: 25 (total) - 5 (diagonal) - 10 (redundant
because of symmetry) = 10 unique correlation coefficients.
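The count of unique coefficients can be confirmed with a small Python sketch that lists every unordered pair of the five variables named above. The variable names come from the text; everything else is illustrative.

```python
from itertools import combinations

variables = ["AGE", "KNOW", "VISIT", "DRIVE", "SEX"]

# Each unordered pair of different variables needs exactly one calculation.
pairs = list(combinations(variables, 2))

total_cells = len(variables) ** 2            # 5 * 5 cells in the full table
diagonal = len(variables)                    # a variable with itself, always 1
redundant = (total_cells - diagonal) // 2    # mirror images across the diagonal
print(len(pairs), total_cells - diagonal - redundant)
```

In general, k variables give k(k - 1)/2 unique correlation coefficients.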
Effect of Outliers
An outlier is a score that falls outside the range of the rest of the scores on the scatterplot.
For example, if age is a variable and the sample is a statistics class, an outlier would be a
retired individual. Depending upon where the outlier falls, the correlation coefficient may
be increased or decreased.
An outlier which falls near where the regression line would normally fall will
unnecessarily increase the size of the correlation coefficient. On the other hand, an outlier
that falls some distance away from the original regression line would unnecessarily
decrease the size of the correlation coefficient.
The effect of the outliers is somewhat muted when the sample size is fairly large. The
smaller the sample size, the greater the effect of the outlier. When a researcher encounters
an outlier, a decision must be made whether to include it in the data set. It may be that the
respondent was deliberately giving wrong answers, or simply did not understand the
question on the questionnaire. On the other hand, it may be that the outlier is real and
simply different.
For example, suppose there is a high correlation between the number of ice-creams sold
and the number of drowning deaths. Does that mean that one should not buy ice-cream
before one swims? Not necessarily. Both of the above variables are related to a common
variable, the heat of the day. The hotter it is, the more ice-cream is sold and the more
people go swimming, and thus the more drowning deaths occur. This is an example of
correlation without causation.
Much of the early evidence that cigarette smoking causes cancer was correlational. It may
be that people who smoke are more nervous and nervous people are more susceptible to
cancer. It may also be that smoking does indeed cause cancer. The cigarette companies
made the former argument, while some doctors made the latter. In this case, reason
suggests treating the relationship as causal and therefore not smoking (bearing in mind
that the cigarette companies have a financial interest in the matter). Sociologists are
very much concerned with the question of correlation and causation because much of
their data is correlational. They developed a branch of correlational analysis, called path
analysis, precisely to determine causation from correlations. Before a correlation may
imply causation, certain requirements must be met. These requirements include:
i. The causal variable must temporally precede the variable it causes, and
ii. Certain relationships between the causal variable and other variables must
be met. If a high correlation was found between the age of the teacher and
the students' grades, it does not necessarily mean that older teachers are
more experienced, teach better, and give higher grades. Neither does it
necessarily imply that older teachers are soft touches, don't care, and give
higher grades. Some other explanation might also explain the results. The
correlation means that older teachers give higher grades; younger teachers
give lower grades. It does not explain why it is the case.
Exercises
1. What is correlation?
2. What is correlation coefficient?
3. Why does correlation coefficient always lie between -1 and +1?
4. Name and explain two types of correlation.
5. With the aid of a scattergraph, illustrate and explain:
i. positive correlation
ii. no correlation
iii. negative correlation
x 1 2 3 4 6 8 9 11 14
y 1 2 2 4 4 5 7 8 9
9. A cost accountant has derived the total cost against output (figures in thousands)
of standard size boxes from a factory over a period of ten weeks, yielding the
following data.
Output 20 2 4 23 18 14 10 8 13 8
Cost 60 25 26 66 49 48 35 18 40 33
Calculate:
i. the product moment correlation coefficient
ii. coefficient of determination and interpret the results
10. On ten different days chosen at random, the following values were obtained for
the share price of a particular company together with the value of the ESE index
on that day.
Price 77 46 80 76 65 71 60 75 76 88
Index 319 315 387 339 383 340 340 356 358 398
Calculate Spearman’s rank correlation coefficient and determine whether the
ESE index can be relied upon as an indicator of price.
LESSON FIVE
REGRESSION ANALYSIS
Lesson Objectives
At the end of this lesson you should be able to:
5.0 Introduction
Once it has been established that two variables are closely related, investigations can
be carried out to establish the predictability of one variable given the value of the other
variable. For example, if we know that advertising and sales are correlated, we may find
out the expected amount of sales for a given advertising expenditure. This is possible by
employing the technique of regression analysis.
Regression is the act of returning or going back. The term regression was first used in
1877 by Francis Galton while studying the relationship between the heights of fathers and
sons. His findings showed that the height of fathers has a direct relationship with the
height of sons, but the average height of sons of tall fathers was less than the average
height of the fathers, while the average height of sons of short fathers was more than the
height of the short fathers.
Regression analysis is a branch of statistical theory that is widely used in most scientific
areas. In economics it is the basic technique for measuring the relationship among
economic variables that constitute the essence of economic theory.
3. With the help of regression analysis, the measure of the degree of association or
correlation can be obtained. The coefficient of determination calculated for this
purpose measures the strength of relationship that exists between the variables. It
assesses the proportion of variance that has been accounted for by the regression
equation.
The tool of regression analysis can be extended to three or more variables. However, in
this lesson, we shall confine ourselves to problems of two variables only (simple
regression analysis)
μyx = a + bX
which is the population regression equation for the bivariate linear model. In this
equation a and b are called the population regression coefficients.
An individual value of Y can then be written as
Y = a + bX + e
where e is the deviation of a particular value of Y from μyx and is called the error
term or the stochastic disturbance term. The errors are assumed to be independent
random variables because the Y’s are random variables and independent. The
expectations of these errors are zero; E(e) = 0. Moreover, if the Y’s are normal
variables, the errors can also be assumed to be normal.
Regression Lines
If we take the case of two variables X and Y, two regression lines can be obtained:
the regression line of X on Y and the regression line of Y on X. The regression line of Y
on X gives the most probable values of Y given the values of X. On the other hand, the
regression line of X on Y gives the most probable values of X given the values of Y. Thus,
we have two regression lines. However, when there is a perfect correlation between
the two variables, the two regression lines will coincide (i.e. we will have only one
line). The further the regression lines are from each other, the lesser the degree of
correlation, and vice versa. If the variables are independent, r is zero and the lines of
regression are at right angles (i.e. parallel to the X-axis and Y-axis).
It should be noted that the regression lines cut each other at the point of average of X
and Y, i.e. if from the point where both the regression lines cut each other a
perpendicular is drawn on the X-axis, we will get the mean value of X, and if from
that point a horizontal line is drawn on the Y-axis, we will get the mean value of Y.
Regression Equation of Y on X
This equation is expressed as follows:
Ye = a + bX
In this equation, ‘a’ and ‘b’ are two unknown constants which determine the position
of the line. The constants are called parameters of the line. If the value of either or both
of them is changed, another line is determined. The parameter ‘a’ determines the
level of the fitted line (the distance of the line directly above or below the origin). The
parameter ‘b’ determines the slope of the line (the change in Y for a unit change in X).
If the values of the constants a and b are obtained, the line is completely determined.
However, the challenge is the process of obtaining these parameters. The method of
least squares is used to determine them. The least squares method aims at
minimizing the sum of squares of the vertical deviations of the actual Y values from
the estimated Y values. By doing so, the fitted line through the regression points will
be the best possible.
With the help of algebra and differential calculus, it can be shown that the following
two equations, if solved simultaneously, will yield values of the parameters a and b
such that the least squares requirement is fulfilled.
ΣY = Na + b ΣX
ΣXY = a ΣX + b ΣX²
These equations are usually called the normal equations. In the equations, ΣX, ΣY,
ΣXY and ΣX² indicate totals which are computed from the observed pairs of values of
the two variables X and Y to which the least squares estimating line is to be fitted, and N
is the total number of observed pairs of values.
Regression Equation of X on Y
The regression equation of X on Y is expressed as follows:
Xe = a + bY
To determine the values of a and b the following two normal equations are to be
solved simultaneously.
ΣX = Na + b ΣY
ΣXY = a ΣY + b ΣY²
Illustration
Calculate the regression equations of X on Y and Y on X from the following data:
X: 1 2 3 4 5
Y: 2 5 3 8 7
Solution
Calculation of regression equations
X Y X² Y² XY
1 2 1 4 2
2 5 4 25 10
3 3 9 9 9
4 8 16 64 32
5 7 25 49 35
Σ 15 25 55 151 88
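Regression of X on Y:
Xe = a + bY
ΣX = Na + b ΣY
ΣXY = a ΣY + b ΣY²
15 = 5a + 25b
88 = 25a + 151b
Multiplying the first equation by 5 and subtracting it from the second gives 13 = 26b,
so b = 0.5 and a = 0.5.
X = 0.5 + 0.5Y
Regression of Y on X:
Ye = a + bX
ΣY = Na + b ΣX
ΣXY = a ΣX + b ΣX²
25 = 5a + 15b
88 = 15a + 55b
Multiplying the first equation by 3 and subtracting it from the second gives 13 = 10b,
so b = 1.3 and a = 1.1.
Y = 1.1 + 1.3X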
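Solving the normal equations can be sketched in Python as follows. The function name fit_line is our own; it uses the algebraic solution of the two normal equations, and the regression of X on Y is obtained simply by swapping the roles of the two variables.

```python
def fit_line(x, y):
    """Solve the two normal equations for a (intercept) and b (slope) of y on x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(p * q for p, q in zip(x, y))
    sxx = sum(p * p for p in x)
    b = (n * sxy - sx * sy) / (n * sxx - sx**2)
    a = (sy - b * sx) / n
    return a, b

X = [1, 2, 3, 4, 5]
Y = [2, 5, 3, 8, 7]

a_yx, b_yx = fit_line(X, Y)   # regression of Y on X
a_xy, b_xy = fit_line(Y, X)   # regression of X on Y (roles swapped)
print(f"Y = {a_yx:.1f} + {b_yx:.1f}X")
print(f"X = {a_xy:.1f} + {b_xy:.1f}Y")
```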
Illustration 2
After investigations, it has been found that the demand for automobiles in a town depends
mainly upon the number of families residing in that town. Below are given figures for the
sales of automobiles in the five cities for the year 2004, and the number of families
residing in those cities.
Find a linear equation of Y on X by the least squares method and estimate the sales for the
year 2006 for city A, which is estimated to have 300,000 families, assuming the same
relationship holds true.
Solution
Calculation of regression equation
City X Y X² XY
A 70 25.2 4900 1764
B 75 28.6 5625 2145
C 80 30.2 6400 2416
D 60 22.3 3600 1338
E 90 35.4 8100 3186
Σ 375 141.7 28625 10849
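Y = a + bX
ΣY = Na + b ΣX
ΣXY = a ΣX + b ΣX²
141.7 = 5a + 375b
10849 = 375a + 28625b
Solving the two equations simultaneously gives b = 0.443 and a = -4.885, so
Y = -4.885 + 0.443X
With X = 300, Y = -4.885 + 0.443(300) = 128.015.
Hence the expected sales for city A in year 2006 would be Sh128,015 given the
population of 300,000 families.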
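When only the column totals are available, the two normal equations can be solved directly, as in this Python sketch. It uses Cramer's rule on the totals from the table; the variable names are our own, and X = 300 is taken to be in the same units as the X column.

```python
# Column totals from the table: N, sum X, sum Y, sum X^2, sum XY.
N, sum_x, sum_y, sum_x2, sum_xy = 5, 375, 141.7, 28625, 10849

# Solve  sum_y  = N*a     + b*sum_x
#        sum_xy = a*sum_x + b*sum_x2   using Cramer's rule.
det = N * sum_x2 - sum_x**2
a = (sum_y * sum_x2 - sum_x * sum_xy) / det
b = (N * sum_xy - sum_x * sum_y) / det

estimate = a + b * 300        # X = 300, in the same units as the table
print(f"Y = {a:.3f} + {b:.3f}X, estimate at X = 300: {estimate:.3f}")
```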
$$ Y - \bar{Y} = b_{yx}(X - \bar{X}) $$
$$ b_{yx} = \frac{\sum xy}{\sum x^2} $$
Where x = (X - X̄) and y = (Y - Ȳ)
In terms of these deviations the normal equations are
∑y = Na + b∑x…………………………….(i)
∑xy = a∑x + b∑x²…………………………(ii)
Since ∑x = ∑y = 0, equation (i) gives a = 0 and equation (ii) gives b = ∑xy/∑x².
After obtaining the value of byx, the regression equation can be written in
terms of X and Y by substituting (Y - Ȳ) for y and (X - X̄) for x.
Similarly, the regression equation X = a + bY reduces to (X - X̄) = bxy(Y - Ȳ),
and the value of bxy can be obtained by
bxy = ∑xy/∑y²
Illustration
The table below contains recorded data showing the test scores made by salesmen on an
intelligence test and their weekly sales:
Salesmen: 1 2 3 4 5 6 7 8 9 10
Test score: 40 70 50 60 80 50 90 40 60 60
Sales ‘000s 2.5 6.0 4.0 5.0 4.0 2.5 5.5 3.0 4.5 3.0
Calculate the regression line of sales on test scores and estimate the probable weekly
sales volume if a salesman makes a score of 100.
Solution
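Let the sales be denoted by Y and the test scores by X. The regression equation
will be the regression of Y on X, since the logic is that test scores can influence sales and
not the other way round.
Therefore Y - Ȳ = byx(X - X̄)
Here X̄ = 600/10 = 60 and Ȳ = 40/10 = 4. Taking deviations from these means,
∑xy = 130 and ∑x² = 2400, so
byx = ∑xy/∑x² = 130/2400 = 0.054
Y - 4 = 0.054(X - 60)
Y = 0.76 + 0.054X
For a test score of 100,
Y = 0.76 + 0.054(100)
= 6.16
Thus the most probable weekly sales volume if the salesman makes a
score of 100 is Sh6,160.00.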
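The deviations-from-the-means method can be sketched in Python as below. This is an illustrative check with our own variable names. Because the code keeps the slope unrounded (130/2400) instead of rounding it to 0.054, its intercept and prediction differ slightly in the second decimal place from the hand calculation.

```python
scores = [40, 70, 50, 60, 80, 50, 90, 40, 60, 60]
sales  = [2.5, 6.0, 4.0, 5.0, 4.0, 2.5, 5.5, 3.0, 4.5, 3.0]   # in thousands

x_bar = sum(scores) / len(scores)
y_bar = sum(sales) / len(sales)

# Deviations from the means: x = X - x_bar, y = Y - y_bar.
dx = [x - x_bar for x in scores]
dy = [y - y_bar for y in sales]

b_yx = sum(p * q for p, q in zip(dx, dy)) / sum(p * p for p in dx)
a = y_bar - b_yx * x_bar          # intercept when the line is written as Y = a + bX

predicted = a + b_yx * 100        # estimated sales for a test score of 100
print(f"b = {b_yx:.4f}, a = {a:.2f}, sales at score 100 = {predicted:.2f}")
```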
Once the values of bxy and byx are determined in the above formula, the regression
equation can be calculated.
Illustration
A company wants to assess the impact of R&D expenditure on its annual profit. The
following data represents the information of the last eight years:
Year                     2004  2003  2002  2001  2000  1999  1998  1997
R&D Expenditure ‘000s    9     7     5     10    4     5     3     2
Profits                  45    42    41    60    30    34    25    20
Required:
Estimate the regression equation and predict the profits for 2006 if the amount allocated
for R&D is Sh10,000.
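Solution
Let R&D expenditure be denoted by X and profits by Y.
Y - Ȳ = byx(X - X̄)
Here X̄ = 45/8 = 5.625 and Ȳ = 297/8 = 37.125. Taking deviations from assumed means,
byx = (1904 + 3)/(456 - 9) = 1907/447 = 4.266
Y - 37.125 = 4.266(X - 5.625)
Y = 13.129 + 4.266X
For X = 10 (Sh10,000),
Y = 13.129 + 4.266(10)
= 55.789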
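The assumed-mean shortcut can be sketched in Python as follows. The function name slope_from_assumed_means is our own, and the assumed means 6 and 37 are illustrative choices; any pair of assumed means gives the same slope, which the second call (with assumed means of zero) demonstrates.

```python
rd     = [9, 7, 5, 10, 4, 5, 3, 2]          # R&D expenditure, '000s
profit = [45, 42, 41, 60, 30, 34, 25, 20]

def slope_from_assumed_means(x, y, ax, ay):
    """b_yx computed from deviations about assumed means ax and ay."""
    n = len(x)
    dx = [v - ax for v in x]
    dy = [v - ay for v in y]
    num = n * sum(p * q for p, q in zip(dx, dy)) - sum(dx) * sum(dy)
    den = n * sum(p * p for p in dx) - sum(dx) ** 2
    return num / den

# Assumed means 6 and 37 are illustrative; any values give the same slope.
b = slope_from_assumed_means(rd, profit, 6, 37)
b_check = slope_from_assumed_means(rd, profit, 0, 0)

a = sum(profit) / len(profit) - b * sum(rd) / len(rd)   # intercept from the true means
print(f"Y = {a:.3f} + {b:.3f}X, profit at X = 10: {a + b * 10:.3f}")
```

The printed figures may differ from the hand calculation in the third decimal place because the code does not round the slope before computing the intercept.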
Regression coefficient of X on Y
The regression coefficient of X on Y is represented by the symbol bxy or b1. It measures
the change in X corresponding to a unit change in Y. This coefficient is given by:
bxy = r σx/σy
When deviations are taken from the means of X and Y, the regression coefficient is
obtained by
bxy = ∑xy/∑y²
When deviations are taken from assumed means, the value of bxy is obtained as follows:
$$ b_{xy} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_y^2 - (\sum d_y)^2} $$
where dx and dy are the deviations of X and Y from their assumed means.
Regression coefficient of Y on X
The regression coefficient of Y on X is represented by the symbol byx or b2. It measures
the change in Y corresponding to a unit change in X. This coefficient is given by:
byx = r σy/σx
When deviations are taken from the means of X and Y, the regression coefficient is
obtained by
byx = ∑xy/∑x²
When deviations are taken from assumed means, the value of byx is obtained as follows:
$$ b_{yx} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_x^2 - (\sum d_x)^2} $$
$$ r = \sqrt{b_{xy} \cdot b_{yx}} $$
Proof: bxy = r σx/σy ; byx = r σy/σx. Therefore, bxy * byx = r σx/σy * r σy/σx = r²
2. If one of the regression coefficients is greater than unity, the other must be less
than unity. This is due to the fact that the coefficient of correlation cannot exceed one.
For example, if bxy = 2.4 and byx is 1.6, r would be √(2.4 × 1.6) = √3.84 = 1.96, which
is not possible. For r to be less than unity, byx should be less than 0.41667. In other
words, byx must not be greater than the reciprocal of bxy, and vice versa, in all cases.
3. Both the regression coefficients will have the same sign. It is not possible to have
one regression coefficient having a minus sign and the other one having a plus
sign.
4. The coefficient of correlation will have the same sign as that of the regression
coefficient. If the regression coefficients have a negative sign, r will also have a
negative sign and if the regression coefficients have a positive sign, r would also
be positive. For example,
5. The average value of the two regression coefficients would be greater than the value
of the coefficient of correlation in absolute values (i.e. ignoring the negative signs). In
symbols, (bxy + byx)/2 > r. In the example of paragraph (4) above,
(bxy + byx)/2
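The properties listed above can be checked numerically. The Python sketch below uses the five pairs of X and Y values from the earlier regression illustration (X: 1 to 5; Y: 2, 5, 3, 8, 7); the variable names are our own, and the check is an illustration of the properties, not a proof.

```python
import math

X = [1, 2, 3, 4, 5]
Y = [2, 5, 3, 8, 7]
n = len(X)

# Sums of squares and products of deviations from the means.
sxy = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n
sxx = sum(x * x for x in X) - sum(X) ** 2 / n
syy = sum(y * y for y in Y) - sum(Y) ** 2 / n

b_yx = sxy / sxx                      # regression coefficient of Y on X
b_xy = sxy / syy                      # regression coefficient of X on Y
r = sxy / math.sqrt(sxx * syy)        # correlation coefficient

# r is the geometric mean of the two coefficients and carries their common sign.
r_from_b = math.copysign(math.sqrt(b_xy * b_yx), b_yx)
print(f"b_yx = {b_yx:.2f}, b_xy = {b_xy:.2f}, r = {r:.3f}, from coefficients = {r_from_b:.3f}")
```

Here one coefficient (1.3) exceeds unity while the other (0.5) does not, both share the sign of r, and their average (0.9) is larger than r.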
Illustration
On the basis of the data recorded on supply and price for nine years, calculate the
regression coefficients and the value of r:
Solution
Let the price be denoted by Y and supply by X
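= [-4221 + 50]/[2709 - 625]
= -4171/2084 = -2.001
= [-4221 + 50]/[9054 - 4]
= -4171/9050 = -0.461
$$ r = \sqrt{b_{xy} \cdot b_{yx}} = -\sqrt{2.001 \times 0.461} = -0.96 $$
(r takes the negative sign because both regression coefficients are negative.)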
The variation about the line of average relationship can be measured in a
manner similar to the measurement of the variation of items about an average. For this
purpose a measure of variation called the standard error of estimate is used. This
measure is similar to the standard deviation.
Just as the standard deviation measures the scatter of observations in a frequency
distribution around the mean of that distribution, the standard error of estimate
measures the scatter of the observed values of Y around the corresponding computed
values of Y on the regression line. It is computed like a standard deviation, being
the square root of the mean of the squared deviations. The deviations here, however,
are not the deviations of the items from the arithmetic mean; they are the vertical
distances of every dot from the line of average relationship. Each dot indicates an
observed Y value, and the corresponding point on the line indicates the computed
value Ye.
The deviation of each dot from the regression line is symbolised by Y - Ye. Thus
the square root of the mean of the squared deviations is
$$ S_{yx} = \sqrt{\frac{\sum (Y - Y_e)^2}{N - 2}} $$
Another formula, which is more convenient for measuring the mean of squared
deviation, is given as:
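$$ S_{yx} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{N - 2}} $$
Similarly, for the regression of X on Y:
$$ S_{xy} = \sqrt{\frac{\sum (X - X_e)^2}{N - 2}} $$
$$ S_{xy} = \sqrt{\frac{\sum X^2 - a\sum X - b\sum XY}{N - 2}} $$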
The standard error of estimate can be calculated easily with the help of the
following formula:
$$ S_{yx} = \sigma_y\sqrt{1 - r^2} $$
$$ S_{xy} = \sigma_x\sqrt{1 - r^2} $$
The standard error of estimate measures the accuracy of the estimated figures. The
smaller the value of standard error of estimate, the closer will be the dots to the
regression line and the better the estimates based on the equation for this line. If standard
error estimate is zero, then there is no variation about the line and the correlation will be
perfect.
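As an illustration, the standard error of estimate for the line Y = 1.1 + 1.3X fitted in the earlier regression illustration can be computed both from the definition and from the shortcut formula. This Python sketch assumes the N - 2 divisor shown in the formulas above; the variable names are our own.

```python
import math

X = [1, 2, 3, 4, 5]
Y = [2, 5, 3, 8, 7]
a, b = 1.1, 1.3                       # fitted line Y = 1.1 + 1.3X from the earlier illustration
N = len(X)

# Definition: square root of the sum of squared vertical deviations over N - 2.
residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
s_yx = math.sqrt(residual_ss / (N - 2))

# Shortcut: sum Y^2 - a*sum Y - b*sum XY gives the same sum of squares.
shortcut_ss = sum(y * y for y in Y) - a * sum(Y) - b * sum(x * y for x, y in zip(X, Y))
s_yx_shortcut = math.sqrt(shortcut_ss / (N - 2))
print(f"S_yx = {s_yx:.3f} (definition), {s_yx_shortcut:.3f} (shortcut)")
```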
R² = 1 – ESS/TSS
where ESS is the error sum of squares and TSS is the total sum of squares.
Some books explain R² as ESS/TSS. In such a case ESS does not stand for the error sum of
squares but for the explained sum of squares. It is therefore important to know what
the abbreviation ESS stands for in every context.
R² therefore shows the percentage of the variation in the dependent variable that is
explained by the independent variable. The value of R² is always calculated as a proportion
and cannot exceed unity. When interpreted, the proportion is read as a percentage. For
example, if R² is 0.64, the interpretation will be: “64 percent of the variation in the
dependent variable is explained by the independent variable. The remaining 36 percent is
due to other factors.”
Illustration:
Given the following bivariate data, fit the regression line of Y on X and X on Y.
Predict Y if X is 10
Predict X if Y is 2.5
Calculate R²
X Y
-1 -6
5 1
3 0
2 0
1 1
1 2
7 1
3 5
Solution
i) Regression of Y on X
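Y – Ȳ = byx(X – X̄)
With deviations taken from assumed means of 3 for X and 2 for Y
(∑dx = -3, ∑dy = -12, ∑dxdy = 30, ∑dx² = 45, ∑dy² = 84):
byx = [8 * 30 – (-3)(-12)]/[8 * 45 – (-3)²]
= [240 – 36]/[360 – 9]
= 204/351 = 0.581
Y – 0.5 = 0.581(X – 2.625)
Y = 0.581X – 1.025
When X = 10, Y = 0.581(10) – 1.025 = 4.785
ii) Regression of X on Y
X – X̄ = bxy(Y – Ȳ)
bxy = [8 * 30 – (-3)(-12)]/[8 * 84 – (-12)²]
= 204/528 = 0.386
X – 2.625 = 0.386(Y – 0.5)
X = 0.386Y + 2.432
When Y = 2.5, X = 0.386(2.5) + 2.432 = 3.397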
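$$R^2 = \frac{(n\sum xy - \sum x \sum y)^2}{[n\sum x^2 - (\sum x)^2]\,[n\sum y^2 - (\sum y)^2]} \quad\text{or}\quad R^2 = \frac{a\sum y + b\sum xy - n(\bar{y})^2}{\sum y^2 - n(\bar{y})^2}$$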
X Y xy x² y²
-1 -6 6 1 36
5 1 5 25 1
3 0 0 9 0
2 0 0 4 0
1 1 1 1 1
1 2 2 1 4
7 1 7 49 1
3 5 15 9 25
∑ 21 4 36 99 68
R² = (288 – 84)² / [(792 – 441) * (544 – 16)] = 204²/(351 * 528) = 41616/185328 = 0.225
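The regression illustration above can be checked with a short script. This is only a sketch working from the raw sums rather than from deviations about assumed means; the function name fit_line and the variable names are illustrative and do not come from the notes. Small differences from the hand-worked figures are due to rounding.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line of ys on xs."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return slope, intercept


# Bivariate data from the illustration
X = [-1, 5, 3, 2, 1, 1, 7, 3]
Y = [-6, 1, 0, 0, 1, 2, 1, 5]

b_yx, a_yx = fit_line(X, Y)   # regression of Y on X
b_xy, a_xy = fit_line(Y, X)   # regression of X on Y

y_when_x_is_10 = a_yx + b_yx * 10
x_when_y_is_2_5 = a_xy + b_xy * 2.5

# For simple linear regression, R squared equals the product of the two slopes.
r_squared = b_yx * b_xy

print(round(b_yx, 3), round(b_xy, 3), round(r_squared, 3))
print(round(y_when_x_is_10, 3), round(x_when_y_is_2_5, 3))
```

Calling fit_line with the arguments swapped gives the regression of X on Y, which is why one helper serves both lines.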
Exercises
1. Explain the concept of regression and point out its usefulness in dealing with
business problems.
4. The following data give the hardness (X) and the tensile strength (Y) of 7 samples
of metal in certain units. Find the linear regression equation of Y on X.
Sample 1 2 3 4 5 6 7
X 146 152 158 164 170 176 182
Y 75 78 77 89 82 85 86
5. The average daily wage for working-class individuals in Industrial Area is Sh 150
and for those in Westlands is Sh 180; the respective standard deviations are Sh 20
and Sh 30. The coefficient of correlation is 0.67. Find the most likely
wage in Westlands corresponding to a wage of Sh 200 in Industrial Area.
6. There are two series of index numbers: D for disposable personal income and S for
a salary of the company. The mean and standard deviation of the D series are
120 and 15 respectively, and of the S series 115 and 10 respectively. The
coefficient of correlation between the two series is 0.75. From the given
information, obtain a linear equation for estimating the values of S for different values
of D. Can the same equation be used for estimating
values of D given S?
1 2 3 4 5 6 7 8 9 10
Paper I 80 45 55 56 58 60 65 68 70 75
Paper II 82 56 50 48 60 62 64 65 70 75
Required:
Regress paper II on paper I
8. What are the precautionary measures that an analyst should consider before using
regression to solve business problems?
Operators 1 2 3 4 5 6 7 8
Experience (yrs) 16 12 18 4 3 10 5 12
Performance 87 88 89 68 58 80 70 85
Year 1 2 3 4 5 6 7 8
Return on A 10 15 18 14 16 16 18 4
Return on M 12 14 13 10 9 13 14 7
LESSON SIX
Lesson Objectives
At the end of this lesson you should be able to:
6.0 Introduction
Analysis of variance is a technique employed by a researcher to find whether three or
more sample means are statistically significantly different from each other or instead can
be regarded as derived from the same population. ANOVA (analysis of variance) is a test
of the null hypothesis that population means are equal. The following are cases where
ANOVA can be used:
• Testing the possibility that plumbers, electricians, and carpenters all have about the
same average income.
• Testing the effectiveness of different promotional devices used to improve sales.
• Testing production volumes produced in different shifts in a factory.
Being a test of difference among 3 or more means, ANOVA is an extension of the t-test,
which is used to test for difference between 2 means. If we reject the null hypothesis, we
still must determine which sample means differ from the others.
ANOVA is therefore a method for testing a hypothesis that the sample means of several
groups are derived from the same population. In many studies there are several sources
of variation. For example, when studying the different crime rates in different regions of
a country, ANOVA would allow us to separate the effects of region
from the effects of community size on crime rates. If only one of these effects is
examined, the process is known as "one-way ANOVA".
F - Test
The F test statistic is computed by dividing MSbetween by MSwithin (MSbetween /
MSwithin). It is a ratio of two estimates of variance. The F-test can be used to test the null
hypothesis that none of the variance in the dependent variable is due to group effects. In
order to do this, two assumptions are made:
1. The groups are independently drawn from normally distributed populations.
2. The population variance is the same for every group (this
assumption is termed homoscedasticity; when population variances differ,
they are termed heteroscedastic).
Research hypotheses often involve inferences from sample data about the equality of
means of two populations in which case the t or z distributions are appropriate to test for
significant differences. If comparisons involve assessment of sameness vs. difference in
three or more means, the F distribution and ANOVA are instead used. Using the term
"analysis of variance" for a test of differences of means may seem a little confusing. This
seemingly misleading term is explained by the fact that the goal of ANOVA is to determine
whether there is a difference among a set of means but, because there are more than two
means under consideration, the way to make this judgment is to evaluate the variance
among those means compared with the variance within each sub sample. To make these
comparisons, it is necessary to adjust for differences in the number of cases underlying
the variances that are compared. The Between SS is divided by its degrees of freedom
(k - 1); similarly, the Within SS is divided by its number of degrees of freedom (N -
k). If the F ratio is large so as to warrant rejecting the null hypothesis, then the
differences among sub sample means are large relative to the average variance within sub
samples.
Use of the F distribution to test for differences among three or more means requires
the assumption that random, independent samples are drawn from normal
populations that have the same variance. In actual practice, however, the F-test has been
found to work well even when these assumptions are not met unless the departures from
those assumptions are very large. The F-ratio distribution is nonsymmetric. The F-
distribution's shape depends upon the degrees of freedom associated with the numerator
and denominator. If the computed F - ratio is larger than the critical value (this critical
value is found in an F-distribution table in the back of most statistics books) associated
with a particular alpha level (e.g., 0.05, 0.01, 0.001), then we can reject the null
hypothesis and conclude that there are group effects.
In order to use an F distribution table you must calculate the degrees of freedom for the
mean sum of squares (both the MS "between" and "within"). After calculating these
values, go to the F distribution table. The degrees of freedom for the numerator
(MSbetween) are located across the top of the table. The degrees of freedom for the
denominator (MSwithin) are located down, along the left-hand side of the F distribution
table. Find the critical value associated with the degrees of freedom for the numerator
and denominator by finding the intersection of the two in the F distribution table. If the
computed F-ratio is larger than the critical value associated with a particular alpha level,
then we can reject the null hypothesis and conclude that there are group effects.
the effects of the independent classification variable under study. However, cases may
differ within a specific group because of random factors such as sampling variation, or
effects of unobserved causal variables. The "within-group sum of squares" reflects
unmeasured factors.
After computing the sums of squares (SSbetween, SSwithin, and SStotal) and the degrees of
freedom, the next step in ANOVA is to compute the mean squares corresponding to
SSbetween and SSwithin. The two mean squares each estimate a variance. MSbetween
estimates the variance due to group effects. MSwithin estimates the variance due to
error. If no group effects occur, then the two estimates should be about the same. However, if
a significant group effect exists, the "between-group" variance, called the mean square
between (MSbetween), will be larger than the "within-group" variance, called the mean
square within (MSwithin). Finally, MSbetween and MSwithin are used to calculate the
F-statistic, which is computed by dividing MSbetween by MSwithin
(MSbetween / MSwithin).
Illustration
To test the significance of variation in retail prices of potatoes in three Kenyan towns
(Nairobi, Nyeri and Eldoret), four vendors were chosen at random in each town and the
prices recorded in Shillings.
Vendor 1 2 3 4
Nairobi 16 8 12 14
Nyeri 14 10 10 6
Eldoret 4 10 8 8
Do the data indicate that the prices in the three towns are significantly different?
Solution
Let us assume the null hypothesis that there is no significant difference in the prices of
potatoes in the three towns.
Correction factor (CF) = 120²/12 = 1200
SST = ∑x² – CF = 1336 – 1200 = 136
SSB = (50² + 40² + 30²)/4 – CF = 5000/4 – 1200 = 50
SSW = SST – SSB = 136 – 50 = 86
Degrees of Freedom
v1 = k – 1 = 3 – 1 =2
v2 = n – k = 12– 3 =9
MSB = SSB/v1 = 50/2 = 25
MSW = SSW/v2 = 86/9 = 9.55
F – Test
F = MSB/MSW = 25/9.55 = 2.617
Source of variation    SS     df    MS      F
Between towns          50      2    25      2.617
Within samples         86      9    9.55
Total                  136    11
The table value for v1 = 2, v2 = 9 at the 0.05 level of significance is 4.26. Since the calculated
value of F is less than the critical (table) value, the null hypothesis is not rejected. Hence we
conclude that there is no significant difference in the prices of potatoes in Nairobi, Nyeri
and Eldoret.
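The one-way ANOVA arithmetic for the potato prices can be reproduced as follows. This is a minimal sketch using the correction-factor method shown above; the function name one_way_anova is illustrative, and the critical value 4.26 is simply copied from the F table quoted in the notes rather than computed.

```python
def one_way_anova(groups):
    """Return (SS between, SS within, F) using the correction-factor method."""
    values = [v for group in groups for v in group]
    n, k = len(values), len(groups)
    cf = sum(values) ** 2 / n                       # correction factor
    ss_total = sum(v * v for v in values) - cf
    ss_between = sum(sum(g) ** 2 / len(g) for g in groups) - cf
    ss_within = ss_total - ss_between
    ms_between = ss_between / (k - 1)               # v1 = k - 1
    ms_within = ss_within / (n - k)                 # v2 = n - k
    return ss_between, ss_within, ms_between / ms_within


prices = [
    [16, 8, 12, 14],   # Nairobi
    [14, 10, 10, 6],   # Nyeri
    [4, 10, 8, 8],     # Eldoret
]

ss_between, ss_within, f_ratio = one_way_anova(prices)
critical_value = 4.26  # table value for v1 = 2, v2 = 9, alpha = 0.05

print(ss_between, ss_within, round(f_ratio, 3))
print("reject H0" if f_ratio > critical_value else "do not reject H0")
```

The F ratio agrees with the hand calculation up to rounding of MSW.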
ANOVA is used when a classification results in three or more groups, but this does not
mean that it cannot be utilized where there are only two groups. When two groups are
compared, both ANOVA and the t-test give equivalent results, but for reporting purposes
researchers use the t-test.
The one-way ANOVA model is very useful for instances when there is a single variable
that classifies observations into groups. However, for two or more variables that classify
observations into groups we must turn to the two-way ANOVA.
In a two-way analysis of variance we have three estimates of the variance and an estimate
based on the total sum of squares. These estimates will be used to make two separate F
tests. The numerators for both F tests will contain estimates of the between-columns and
between-rows sum of squares respectively. The error term will be in each of the
denominators for each of the F tests. These F tests are testing for a relationship between
the interval-scale variable and one of the nominal-scale variables. We could perform an
N-way analysis of variance by controlling for more variables in a similar manner.
Illustration
To study the performance of three detergents and three different water temperatures, the
following readings were obtained with specially designed equipment:
Hot water 54 46 58
Solution
Let the null hypothesis be that there is no significant difference in the performance of
the three detergents due to water temperature. To simplify the arithmetic, we will code the
data by subtracting 50 from each observation.
Hot water 4 16 -4 16 8 64 8
∑ 10 66 3 45 43 677 56
= (10²/3 + 3²/3 + 43²/3) – CF
= ∑xi² – CF
= 2.380
Total 439.56 8
ii. Since the calculated value of F2 = 2.380 at df1 = 2, df2 = 4 and α = 0.05 is less
than the table value of F = 6.94, the null hypothesis is not rejected. Hence we
conclude that water temperature does not make a significant difference in the
performance of the three detergents.
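The two-way calculation can be sketched in code as below. Only the hot-water row (54, 46, 58) survives in these notes. The other two rows used here are an assumption: they were chosen to be consistent with the surviving column totals, sums of squares, total sum of squares (439.56) and F value (2.380), and the labels "cold" and "warm" are hypothetical. The function name two_way_anova is illustrative.

```python
def two_way_anova(table):
    """Two-way ANOVA without replication (rows and columns are the two factors)."""
    r, c = len(table), len(table[0])
    cf = sum(sum(row) for row in table) ** 2 / (r * c)      # correction factor
    ss_total = sum(v * v for row in table for v in row) - cf
    ss_rows = sum(sum(row) ** 2 for row in table) / c - cf
    ss_cols = sum(sum(col) ** 2 for col in zip(*table)) / r - cf
    ss_error = ss_total - ss_rows - ss_cols
    ms_error = ss_error / ((r - 1) * (c - 1))
    f_rows = (ss_rows / (r - 1)) / ms_error
    f_cols = (ss_cols / (c - 1)) / ms_error
    return ss_total, ss_rows, ss_cols, ss_error, f_rows, f_cols


# Columns: the three detergents. Rows: water temperatures.
readings = [
    [57, 55, 67],   # ASSUMED row (hypothetical label: cold water)
    [49, 52, 68],   # ASSUMED row (hypothetical label: warm water)
    [54, 46, 58],   # hot water, as given in the notes
]

ss_total, ss_rows, ss_cols, ss_error, f_rows, f_cols = two_way_anova(readings)
print(round(ss_total, 2), round(f_rows, 3), round(f_cols, 3))
```

Coding the data by subtracting 50 from every reading would leave all the sums of squares, and therefore both F ratios, unchanged.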
Exercises
1. What is analysis of variance (ANOVA) and when is it used?
2. How is ANOVA used to solve business problems?
3. Briefly describe the procedure followed in ANOVA.
4. Explain the difference between one-way and two-way ANOVA.
5. Discuss the application of F-test in ANOVA.
Make
X Y Z
5 8 7
6 10 3
8 11 5
9 12 4
7 4 1
Is there any significant difference in the durability of the three types of bicycles?
7. A car rental agency is deciding which of five different brands of tyre to
purchase. Standard equipment used to record the distances covered showed
the following data on the kilometres endured by each brand
of tyre.
Tyre Brand
A B C D E
36000 46000 35000 45000 41000
37000 39000 42000 36000 39000
42000 35000 37000 39000 37000
48000 37000 43000 35000 35000
47000 48000 38000 32000 38000
8. A study was conducted to test soil fertility. Each of three blocks of land was
subdivided into four equal parcels. An equal number of Sukuma-Wiki
seedlings was grown on each parcel, and the yields were as follows:
Block
Parcel yield (kgs) A B C
1 50 40 30
2 90 70 50
3 110 80 80
4 100 100 90
Test whether the productivity of the parcels of land is significantly different.
LESSON SEVEN
Lesson Objectives
At the end of this lesson you should be able to:
7.0 Introduction
Statistical data that are described over time are known as time series. What is required of
such data is an understanding of the context within which the data originate and the nature
of their variation in the short and the long term. For example, one may want to ask:
Why are sales varying from one month to the other?
Why are purchases fluctuating from time to time?
The answers to the above questions could be investigated with the help of time series
knowledge. Time series enables the structure of data to be understood, trends to be
identified and forecasts made.
A time series is the name given to the values of some statistical variable measured over a
uniform set of time points. Any business, large or small, will need to keep records of
items like sales, purchases, stock values, and VAT. When these records accumulate over
time, they form a time series.
The framework within which time series are analysed is called a time series model. Time
series is a wide and complex statistical area and so are the models used to analyze time
series data. In this lesson, we shall confine ourselves to the basics of time series and basic
time series models. For this reason, two basic models will be considered:
The additive model.
The multiplicative model.
under the assumption that the university operates two normal semesters, and one
Trimester.
Y=f+t
Seasonal variation: short-term cyclic fluctuations in the data about the trend.
The season can range from as short as one day to a long period like one year or a
decade.
Residual variation: other factors not explained by either the trend or
the seasonal factors. They consist of random factors or disturbances (such as
weather conditions, breakdowns, etc.) and long-term cyclic factors (such as
standard trade cycles, economic recessions, etc.).
The major difference between seasonal factors and random factors is that the
former can be anticipated while the latter cannot. For example, a bar operator on
Nairobi’s Biashara Street can know the average quantity of drinks normally
consumed on a Friday (‘members day’) but cannot anticipate a riot in the city centre
that forces customers to rush home early.
[Figure: line chart of share price (vertical axis, 0 to 350) against month (horizontal axis, 1 to 12)]
From the above chart can you notice any of the following:
A Trend
A Cycle
If you were a trader and knew the trend of this share price in advance, what would you do
assuming that you are:
A Buyer
A Seller
y = t + s + r
y = t × S × R
The small t in both models shows that the trend component is the same no matter
which of the two models is used. The seasonal and residual components will depend on
which model is being used.
The trend is the core component of the additive time series model about which the two
other components, seasonal (s) and residual (r) variation, fluctuate. This component is
found by identifying separate trend (t) values, each corresponding to a time point. There
are several ways of obtaining trend values for a given time series.
Illustration
The following sales (in ‘000s) were recorded for a firm. From these data, obtain a
semi-average trend.
Week 1 Sales
Mon 250
Tue 320
Wed 340
Thu 520
Fri 410
Week 2
Mon 260
Tue 380
Wed 410
Thu 670
Fri 420
Note that the data is ordered according to time of occurrence, which is common for a
time series.
Step 1:
Split the data into an upper and a lower group
For the data given:
The lower group is week 1 (250, 320, 340, 520, and 410)
The upper group is week 2 (260, 380, 410, 670, and 420)
Step 2:
Find the mean value of each group
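The mean of the lower group (L) is:
(250 + 320 + 340 + 520 + 410)/5 = 1840/5 = 368.
The mean of the upper group (U) is:
(260 + 380 + 410 + 670 + 420)/5 = 2140/5 = 428.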
Step 3:
Plot on a graph, each mean against appropriate time point.
An appropriate time point can always be taken as the median time point of the respective
group. Thus (L) would be plotted against Wednesday of week 1 and (U) against
Wednesday of week 2.
Step 4:
The line joining the two plotted points is the required trend.
Note: it is important that the two groups in question have an equal number of data values.
If the given data, however, contains an odd number of data values, the middle value can
be ignored (for purposes of obtaining the trend line)
Once a trend line has been obtained, the trend values corresponding to each time point
can be read off from the graph.
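Instead of reading trend values off a graph, the line through the two semi-average points can be computed directly. The sketch below assumes the time points are numbered 1, 2, 3, ... (so Monday of week 1 is 1 and Friday of week 2 is 10); the function name semi_average_trend is illustrative.

```python
def semi_average_trend(values):
    """Trend values from the straight line joining the two semi-averages."""
    n = len(values)
    half = n // 2
    lower = values[:half]
    upper = values[n - half:]          # the middle value is skipped when n is odd
    mean_lower = sum(lower) / half
    mean_upper = sum(upper) / half
    # Median time point of each group, with time points numbered from 1
    t_lower = (1 + half) / 2
    t_upper = ((n - half + 1) + n) / 2
    slope = (mean_upper - mean_lower) / (t_upper - t_lower)
    return [mean_lower + slope * (t - t_lower) for t in range(1, n + 1)]


sales = [250, 320, 340, 520, 410,    # week 1, Mon to Fri
         260, 380, 410, 670, 420]    # week 2, Mon to Fri

trend = semi_average_trend(sales)
print(trend)
```

The trend passes through 368 on Wednesday of week 1 and 428 on Wednesday of week 2, matching the two group means.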
Required:
Using the data given above:
i. Use the method of semi average to obtain and plot a trend line.
ii. Draw up the table showing the original data (y) values against the trend (t)
values obtained from the graph.
Solution
i. Split the data into lower (L) and upper (U) groups in equal proportion.
The items in italic (year 1: Q3 and Q4) and (year 3: Q1 and Q2) are the hypothetical
points where the lower points and upper points must be plotted respectively.
[Figure: chart of passengers ('000s; vertical axis 0 to 700) against quarter (horizontal axis Y1Q1 to Y3Q4)]
ii. The trend values when read from the graph and compared with the original
values will be as follows:
Passengers in ‘000s
Year Quarter Original Values Trend Values
1 1 220 390
1 2 500 410
1 3 790 430
1 4 320 450
2 1 290 470
2 2 520 490
2 3 820 520
2 4 380 540
3 1 320 560
3 2 580 580
3 3 910 600
3 4 410 620
Step 1
Take the physical time points as values of the independent variable x.
Step 2
Take the data values themselves as values of the dependent variable y.
Step 3
Calculate the least squares regression line of y on x, y = a + bx
Step 4
Translate the regression line y = a + bx to t = a + bx, where any given value of
time point x will yield a corresponding value of the trend, t.
x y xy x²
1 220 220 1
2 500 1000 4
3 790 2370 9
4 320 1280 16
5 290 1450 25
6 520 3120 36
7 820 5740 49
8 380 3040 64
9 320 2880 81
10 580 5800 100
11 910 10010 121
12 410 4920 144
∑ 78 6060 41830 650
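$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{12(41830) - 78(6060)}{12(650) - 78^2} = \frac{29280}{1716} = 17.06$$
$$a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} = \frac{6060}{12} - \frac{17.06 \times 78}{12} = 394.11$$
t = a + bx = 394.11 + 17.06x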
Applying the trend equation, the trend values for each quarter will be:
x: 1 2 3 4 5 6 7 8 9 10 11 12
t: 411 428 445 462 479 496 513 530 547 564 581 598
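The least squares trend can be checked with a few lines of code. This is a sketch using the passenger figures from the table above; the variable names are illustrative. Because the script does not round b before computing a, its values may differ from the hand-worked ones by a unit or so in the last figure.

```python
passengers = [220, 500, 790, 320, 290, 520, 820, 380, 320, 580, 910, 410]
x = list(range(1, len(passengers) + 1))   # time points 1 to 12

n = len(x)
sum_x, sum_y = sum(x), sum(passengers)
sum_xy = sum(xi * yi for xi, yi in zip(x, passengers))
sum_xx = sum(xi * xi for xi in x)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Trend value t = a + bx for each time point
trend = [a + b * xi for xi in x]

print(round(a, 2), round(b, 2))
print([round(t) for t in trend])
```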
To demonstrate the technique, consider a five-period moving average of the
following values:
Original values: 12 10 11 11 9 11 10 10 11 10
Moving totals: 53 52 52 51 51 52
The table below shows how these averages are calculated based on a five period moving
average:
Values in the period     Total    Average (trend)
12 10 11 11 9            53       10.6
10 11 11 9 11            52       10.4
11 11 9 11 10            52       10.4
11 9 11 10 10            51       10.2
9 11 10 10 11            51       10.2
11 10 10 11 10           52       10.4
The values in italic are calculated averages which are shown in bold
Note that the trend value is positioned on the middle of the values forming that average.
For example the first trend value (10.6) is the average of 12, 10, 11, 11, and 9 and
therefore 10.6 is placed under the first 11 (middle value).
Note:
Trend values corresponding to the first two and last two in the table are missing.
This is one of the major disadvantages of moving averages.
The period of the moving average must coincide with the length of the natural cycle of
the series. For example:
- Moving averages for a trend based on annual quarters must be based on
a four-period moving average.
- Total monthly sales of a business for a number of years would be
described by a moving average trend of period 12.
- The daily sales of a supermarket would be characterised by a seven-period
moving average.
Each moving average trend value must correspond with an appropriate time point
(the median of the corresponding periods). When moving averages have an
even-numbered period, the values should be centred.
Time point 1 2 3 4 5 6 7 8 9 10
Averages (4p) 13.0 13.3 13.3 13.8 14.5 14.5 15.0
Centering 13.2 13.3 13.6 14.2 14.5 14.8
Note: centering has reduced the number of trend values from seven to six. As explained
earlier, this is a common disadvantage of moving average.
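Moving averages and centring are simple to compute. The sketch below uses the two sets of figures quoted above; the function names moving_averages and centre are illustrative.

```python
def moving_averages(values, period):
    """Average of every run of `period` consecutive values."""
    return [sum(values[i:i + period]) / period
            for i in range(len(values) - period + 1)]


def centre(averages):
    """Centre even-period moving averages by averaging adjacent pairs."""
    return [(first + second) / 2
            for first, second in zip(averages, averages[1:])]


series = [12, 10, 11, 11, 9, 11, 10, 10, 11, 10]
five_period = moving_averages(series, 5)

# Four-period averages quoted in the centring table above
four_period = [13.0, 13.3, 13.3, 13.8, 14.5, 14.5, 15.0]
centred = centre(four_period)

print(five_period)
print(centred)
```

Ten values give six five-period averages, and seven four-period averages give six centred values, which shows how trend values are lost at the ends of the series.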
Seasonal values are normally expressed as deviations from the trend values. They show
by how much a particular season will tend to increase or decrease the underlying trend.
The seasonal component can be calculated using either the additive model or the
multiplicative model.
Step 1
Calculate, for each point, the value of y – t (the difference between the original value and
trend).
Step 2
For each season in turn, find the average (arithmetic mean) of the y-t values.
Step 3
If the averages do not total zero, adjust one or more of them so that their total is zero.
The values so obtained are the appropriate seasonal variation values; i.e. the ‘s’ figures in
the additive model y = t + s + r.
Step 1
Calculate, for each point, the value of [1 + (y – t)/t] (the difference between the original
value and trend expressed as a proportion of the trend).
Step 2
For each season in turn, find the average (arithmetic mean) of the above proportional
changes.
Step 3
If the averages do not total the number of seasons (4 for quarterly data), adjust one or
more of them so that they do.
Note: the process of adjusting the mean to zero is known as differencing. In practice, this
process is complex and requires the use of computer programs to difference the data.
Econometrics books deal with differencing in a more advanced way.
Step 1
y t y-t
Yr 1 Q1 20 23 -3
2 15 29 -14
3 60 34 26
4 30 39 -9
Yr 2 Q1 35 45 -10
2 25 50 -25
3 100 55 45
4 50 61 -11
Step 2
Deviations (y – t )
Q1 Q2 Q3 Q4 Sum
Year 1 -3 -14 26 -9
Year 2 -10 -25 45 -11
Totals -13 -39 71 -20
Averages -6.5 -19.5 35.5 -10.0 -0.5
Step 3
Since the averages sum to -0.5 (and not zero), it is necessary to adjust one or more of
them accordingly. In this case, since the difference is so small, only one will be adjusted.
In order to make the smallest percentage error, the largest value (35.5) is changed to 36.0.
This adjustment is shown in the following table:
Q1 Q2 Q3 Q4 Sum
Initial s values -6.5 -19.5 35.5 -10.0
Adjustment 0 0 +0.5 0
Adjusted values -6.5 -19.5 36.0 -10.0 0
The interpretation of the figures is that the average seasonal effect for quarter one, for
instance, is to deflate the trend by 6.5 (‘000s) and that of quarter three is to inflate the
trend by 36 (‘000s).
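The three additive steps can be sketched in code as follows, using the y and t values from the table in Step 1. The variable names are illustrative, and the adjustment rule (change only the largest average) simply mirrors the choice made in the worked example.

```python
y = [20, 15, 60, 30, 35, 25, 100, 50]      # original values, Yr 1 Q1 to Yr 2 Q4
t = [23, 29, 34, 39, 45, 50, 55, 61]       # trend values
seasons = 4

# Step 1: deviation of each value from its trend
deviations = [yi - ti for yi, ti in zip(y, t)]

# Step 2: average deviation for each quarter
averages = []
for q in range(seasons):
    quarter_values = deviations[q::seasons]
    averages.append(sum(quarter_values) / len(quarter_values))

# Step 3: adjust the largest average so that the four values total zero
excess = sum(averages)
largest = max(range(seasons), key=lambda q: abs(averages[q]))
adjusted = list(averages)
adjusted[largest] -= excess

print(averages)
print(adjusted)
```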
Step 1
y t (y-t)/t s = 1 + (y-t)/t
Yr 1 Q1 20 23 -0.13 0.87
2 15 29 -0.48 0.52
3 60 34 0.76 1.76
4 30 39 -0.23 0.77
Yr 2 Q1 35 45 -0.22 0.78
2 25 50 -0.50 0.50
3 100 55 0.82 1.82
4 50 61 -0.18 0.82
Step 2
Deviations [ 1 + (y-t)/t ]
Q1 Q2 Q3 Q4 Sum
Year 1 0.87 0.52 1.76 0.77
Year 2 0.78 0.50 1.82 0.82
Totals 1.65 1.02 3.58 1.59
Averages 0.82 0.51 1.79 0.79 3.91
Step 3
Since the averages sum to 3.91 (and not 4), it is necessary to add 0.09 to one or more of
them accordingly. In this case, since the difference is so small, only one will be adjusted.
In order to make the smallest percentage error, the largest value (1.79) is changed to 1.88.
This adjustment is shown in the following table:
Q1 Q2 Q3 Q4 Sum
Initial s values 0.82 0.51 1.79 0.79
Adjustment 0 0 +0.09 0
Adjusted values 0.82 0.51 1.88 0.79 4
The interpretation of the figures is that the average seasonal effect for quarter one, for
instance, is to deflate the trend by 18% (1 – 0.82) and that of quarter three is to inflate the
trend by 88% (1.88 – 1).
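The multiplicative steps follow the same pattern, with ratios in place of differences. In this sketch the variable names are illustrative. The script works with unrounded ratios, so its averages total about 3.92 rather than the 3.91 obtained from figures rounded to two decimal places, and the adjusted quarter-three value differs slightly from 1.88.

```python
y = [20, 15, 60, 30, 35, 25, 100, 50]      # original values, Yr 1 Q1 to Yr 2 Q4
t = [23, 29, 34, 39, 45, 50, 55, 61]       # trend values
seasons = 4

# Step 1: 1 + (y - t)/t, which is the same as y/t
ratios = [yi / ti for yi, ti in zip(y, t)]

# Step 2: average ratio for each quarter
averages = []
for q in range(seasons):
    quarter_values = ratios[q::seasons]
    averages.append(sum(quarter_values) / len(quarter_values))

# Step 3: adjust the largest average so that the four values total 4
shortfall = seasons - sum(averages)
largest = max(range(seasons), key=lambda q: averages[q])
adjusted = list(averages)
adjusted[largest] += shortfall

print([round(v, 2) for v in averages])
print([round(v, 2) for v in adjusted])
```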
Note:
The seasonally adjusted time series for an additive model is obtained by
subtracting the seasonality ( SAV = y – s )
The seasonally adjusted time series for a multiplicative model is obtained by
dividing the values by the seasonality proportion ( SAV = y/sp )
The majority of economic time series data published by the Central Bureau of
Statistics (CBS) are presented in terms of both ‘actual’ and ‘seasonally adjusted’
figures.
Any forecasts made using structured methods should be treated with caution, since there
is no guarantee that patterns based on past data will recur.
Note:
Because of the complexity of residual variation, which is beyond the scope of
this lesson, the residual component has been deliberately excluded. The
magnitude of the residual component shows how reliable the forecast
values are. The larger the residual component, the lower the reliability of the forecast
values, and vice versa.
Exercises
1. What is a time series?
2. Describe the difference between additive and multiplicative time series.
3. What is residual variation in time series?
4. What is a historigram?
5. What are the three most common techniques for obtaining a time series trend?
6. Describe the stages involved in obtaining a time series trend using the method of
semi-averages.
7. What is a seasonal variation?
8. What precautions should be taken when using time series data?
10. Calculate the trend values using the method of semi-averages for the following
data:
13, 12, 16, 14, 18, 12, 14, 13, 18, and 13.
11. The data below relates to rate receipts (in millions) for a local authority with
corresponding trend values in brackets.
12. The following data describes personal savings as a percentage of earned income
for a particular region of Kenya.
Use the additive and multiplicative models to seasonally adjust the above
percentages and forecast the percentage savings for quarter one of 2005.
LESSON EIGHT
NON-PARAMETRIC METHODS
Lesson Objectives
At the end of this lesson you should be able to:
8.0 Introduction
The primary purpose of statistical analysis is to draw conclusions about
population parameters based on samples selected from the populations. Certain
assumptions are made when using the sampled data, such as the assumptions that
populations are normally distributed, that the samples are random and that observations
are independent of each other. Additionally, the t-test requires that, in testing for
differences between two sample means, the samples must be drawn from populations
that are normally distributed with equal variances. Similarly, the F-test, which tests for
significant differences among several population means, is based on the same
assumption.
Nonparametric tests, on the other hand, do not depend on the shape of the distribution of
the population and hence are known as distribution-free tests. They are applicable to
ordinal-level data. For example, responses to questions can be ranked from high to
low and vice versa. Individuals or other items can also be ranked using the ordinal
scale, such as a three-star hotel, a two-star army officer, etc. It is however important to
note that nonparametric tests should only be applied where it is not possible to use the
parametric tests. In some areas of sensitive disciplines such as Medicine and Military,
nonparametric statistics are not allowed. Nonparametric tests can be carried out by
applying several techniques. Some of these techniques are:
i. Runs test for randomness
ii. The sign test
iii. Mann-Whitney U-test
iv. The Wilcoxon signed-rank test
v. The Wilcoxon rank-sum test
vi. Kruskal-Wallis analysis of variance by ranks
vii. Spearman coefficient of rank correlation
Assume that twenty tosses of a coin give the following results:
H H T T T H H H T H H T T H H T H H T T
Grouped into runs, the sequence is:
HH TTT HHH T HH TT HH T HH TT
which gives 10 runs.
Too few runs or too many runs indicate the absence of randomness. For example, in
a sequence of 10 tosses, ten runs (HTHTHTHTHT) indicate non-randomness;
likewise, two runs (HHHHHTTTTT) indicate non-randomness.
The runs test is always a two-tailed test (too few or too many runs can lead to rejection of
the null hypothesis). Using the example of tossing the coin 20 times:
From the tables, r1 = 6 and r2 = 16. Since the experiment produced 10 runs, which lies
between these critical values, we cannot reject the null hypothesis of randomness.
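Counting runs by hand is error-prone for long sequences, so a short sketch of the count is given here. The function name count_runs is illustrative, and the critical values 6 and 16 are copied from the tables quoted above rather than computed.

```python
def count_runs(sequence):
    """Number of runs: a new run starts whenever the outcome changes."""
    if not sequence:
        return 0
    runs = 1
    for previous, current in zip(sequence, sequence[1:]):
        if current != previous:
            runs += 1
    return runs


tosses = "HHTTTHHHTHHTTHHTHHTT"   # the twenty tosses listed above
runs = count_runs(tosses)
heads, tails = tosses.count("H"), tosses.count("T")

lower_critical, upper_critical = 6, 16   # r1 and r2 from the tables
looks_random = lower_critical < runs < upper_critical

print(runs, heads, tails, looks_random)
```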
The sign test has many applications. To illustrate this, suppose a taster’s choice markets
two kinds of coffee in two jars (one premium and the other regular). The market
researcher wants to know whether the customers prefer the premium or the regular.
Premium coffee cups can be coded (+) while the regular ones are coded (-). If the
population of customers does not have a preference, the number of (+) signs is expected
to equal the number of (-) signs.
This is a one-tailed test because the alternative hypothesis points in one direction. The
test statistic follows the binomial distribution. The sign test meets all the binomial
assumptions, namely:
1. There are only two outcomes: a success and a failure
2. For each trial, the probability of success is assumed to be 0.5
3. The total number of trials is fixed
4. Each trial is independent
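Under these assumptions, the one-tailed probability for the sign test comes straight from the binomial distribution with p = 0.5. The sketch below uses hypothetical counts (9 customers preferring premium and 3 preferring regular); these figures are not from the notes and only illustrate the calculation. The function name sign_test_p_value is also illustrative.

```python
from math import comb


def sign_test_p_value(plus, minus):
    """One-tailed P(X >= plus) when X is binomial with n = plus + minus, p = 0.5."""
    n = plus + minus
    return sum(comb(n, k) for k in range(plus, n + 1)) / 2 ** n


# Hypothetical counts, for illustration only
plus_signs, minus_signs = 9, 3
p_value = sign_test_p_value(plus_signs, minus_signs)

alpha = 0.05
print(round(p_value, 4))
print("reject H0" if p_value < alpha else "do not reject H0")
```

Ties (customers with no preference) would be dropped before counting, so n is the number of (+) and (-) signs only.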
Illustration
It is required to test whether or not the quality of education in Statistics in private schools
is similar to that in public schools. Samples of 15 and 12 statistics students are picked
at random from private schools and public schools respectively. The students were then
given the same exam and the following results were obtained:
Student 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Private scores 73 75 83 77 72 69 56 80 68 60 84 61 64 71 86
Public scores 70 78 79 81 65 63 74 83 67 76 88 48 - - -
The U-test can be used to test the null hypothesis that there is no significant difference
between the performance in statistics of private and public schools at the 5%
level of significance.
Once the U values are calculated, the lower value is selected for testing purposes. From
the data:
n1 = 15
n2 = 12
R1 = 202
R2 = 176
U1 = n1n2 + n1(n1 + 1)/2 – R1
= 180 + 120 – 202
= 98
U2 = n1n2 + n2(n2 + 1)/2 – R2
= 180 + 78 – 176
= 82
Under the null hypothesis that the two sets of scores come from the same
population, the sampling distribution of the statistic U has the following mean and
standard deviation:
E(U) = n1n2/2
= (12 * 15)/2
= 90
σu = √[n1n2(n1 + n2 + 1)/12]
= √(180 * 28/12)
= 20.5
If n1 ≥ 8 and n2 ≥ 8, then the sampling distribution of U can be approximated closely with a
normal curve so that a z-test can be performed.
Z = [U – E(U)]/ σu
= (82 -90)/20.5
= -0.39
Since the absolute value of z, 0.39, is less than the critical value of 1.96 at α = 0.05,
we cannot reject the null hypothesis. In other words, the evidence does not
suggest that there is any significant difference in the quality of education in Statistics
between the private and public schools.
Note:
In order to assume that the sampling distribution of statistic U is normally
distributed, the values of both n1 and n2 should be equal to or more than eight.
The calculation above treats all scores as different in value, so that there are no ties
in ranks (strictly, the score 83 occurs once in each sample). If
there are ties between observations, the identical scores are each
assigned the mean of their tied ranks.
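The U statistic and its normal approximation can be reproduced from the rank sums. This sketch takes n1, n2, R1 and R2 exactly as given in the worked example; the function name mann_whitney_z is illustrative.

```python
from math import sqrt


def mann_whitney_z(n1, n2, r1, r2):
    """Return (U, E(U), sigma_U, z) from the sample sizes and rank sums."""
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    u = min(u1, u2)                      # the lower U value is used for testing
    mean_u = n1 * n2 / 2
    sigma_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, mean_u, sigma_u, (u - mean_u) / sigma_u


u, mean_u, sigma_u, z = mann_whitney_z(n1=15, n2=12, r1=202, r2=176)

print(u, mean_u, round(sigma_u, 1), round(z, 2))
print("reject H0" if abs(z) > 1.96 else "do not reject H0")
```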
Jasho café restaurant in Nairobi offers a full dinner menu, but its specialty is chicken.
Recently, Barigei Mos, the owner of the restaurant, developed a new spicy flavour to
improve the chicken meal. Before replacing the current flavour, he wants to conduct some
tests to be sure that the patrons will like the spicy flavour better.
To begin, Barigei selects a random sample of 15 people. The sampled patrons are given
two pieces of cooked chicken: one without spice and one spiced. Participants are
then asked to rate the overall taste of the two pieces on a scale of 0 to 20.
A value near 20 indicates the participant liked the flavour, whereas a score near 0 indicates
that the participant did not like the flavour. The results are recorded below:
Solution
The samples are dependent or related because each participant rates both pieces.
Thus, if we compute the difference between the scores for the two pieces of chicken, the
result will show the amount by which the participant favours one flavour over the other.
If we choose to subtract the current flavour score from the spicy flavour score, a positive
difference will show that the participant favours the spicy chicken.
Hypotheses:
H0: there is no difference in the ratings of the two flavours.
H1: the spicy flavour is rated higher.
This is a one-tailed test because Barigei will change to the spicy flavour only if the
participants prefer it. The significance level is 0.05.
Steps
1. Compute the difference between the spicy and the current flavour scores.
2. If the difference is zero, drop that participant.
3. Determine the absolute difference (ignore the minus signs).
4. Rank the absolute differences from the smallest to the largest. Tied differences are
each assigned the average of the ranks they would otherwise occupy.
5. Separate the ranks based on their original signs (positives (R+) on one side and
negatives (R-) on the other).
6. Total the R+ and R- ranks.
7. Use the smaller of the two rank sums to test the significance. If the table value is more
than the smaller rank sum, reject the null hypothesis; otherwise do not reject it.
From the table, the critical value for α = 0.05 and n = 14 is 25, which is less than the
smaller rank sum of 30; hence the null hypothesis cannot be rejected. The conclusion is
that there is no significant difference between the ratings of the spicy and current
chicken flavours.
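Steps 1 to 7 can be sketched in Python as follows. The ratings below are hypothetical (they are not the restaurant's results), and the helper names average_ranks and signed_rank_sums are assumptions made for illustration.

```python
def average_ranks(values):
    """Rank from smallest to largest; tied values share the mean of their ranks."""
    return [sum(x < v for x in values) + (sum(x == v for x in values) + 1) / 2
            for v in values]

def signed_rank_sums(new_scores, old_scores):
    """Return (R+, R-, n) following steps 1 to 6 of the signed-rank procedure."""
    diffs = [a - b for a, b in zip(new_scores, old_scores)]   # step 1
    diffs = [d for d in diffs if d != 0]                      # step 2: drop zeros
    ranks = average_ranks([abs(d) for d in diffs])            # steps 3 and 4
    r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)    # steps 5 and 6
    r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return r_plus, r_minus, len(diffs)

# Hypothetical ratings for six participants (illustration only)
spicy = [14, 8, 6, 18, 20, 16]
current = [12, 16, 2, 4, 12, 16]

r_plus, r_minus, n = signed_rank_sums(spicy, current)
t_stat = min(r_plus, r_minus)   # step 7: compare with the table value for this n
print(r_plus, r_minus, n, t_stat)
```

A useful check is that R+ and R- together always add up to n(n + 1)/2, where n is the number of non-zero differences.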
If each of the samples contains at least eight observations, the standard normal
distribution is used as the test statistic. The formula is:
$$Z = \frac{W - n_1(n_1 + n_2 + 1)/2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}}$$
Illustration
The CEO of Makati Ltd recently noted an increase in the amount of a competitor’s exports
to the USA. He is interested in determining whether there are more exports to the USA than
to the UK. After conducting market intelligence, he obtains the following monthly
results:
Month    1    2    3    4    5    6    7    8    9
USA     11   15   10   18   11   20   24   22   25
UK      13   14   10    8   16    9   17   21
Can you conclude that there are more exports to the USA than to the UK?
Solution
        USA                  UK
Export    Rank      Export    Rank
  11       5.5        13       7
  15       9          14       8
  10       3.5        10       3.5
  18      12           8       1
  11       5.5        16      10
  20      13           9       2
  24      16          17      11
  22      15          21      14
  25      17
Total     96.5                56.5
$$z = \frac{96.5 - 9(9 + 8 + 1)/2}{\sqrt{9 \times 8(9 + 8 + 1)/12}} = 1.49$$
The computed z value (1.49) is less than the critical value of 1.65; hence the null
hypothesis is not rejected. Therefore, we cannot conclude that there are more exports to
the USA than to the UK.
Note: the z value is read as an absolute figure. Where the computed z score is negative,
ignore the minus sign.
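The ranking and the z value in this illustration can be reproduced with the Python sketch below. It pools the two samples, assigns average ranks to ties, and takes W as the rank sum of the first sample (USA). The helper name average_ranks is an assumption made for illustration.

```python
import math

def average_ranks(values):
    """Rank from smallest to largest; tied values share the mean of their ranks."""
    return [sum(x < v for x in values) + (sum(x == v for x in values) + 1) / 2
            for v in values]

usa = [11, 15, 10, 18, 11, 20, 24, 22, 25]
uk = [13, 14, 10, 8, 16, 9, 17, 21]

ranks = average_ranks(usa + uk)        # rank the pooled observations
w_usa = sum(ranks[:len(usa)])          # rank sum of the first sample (USA)
w_uk = sum(ranks[len(usa):])           # rank sum of the second sample (UK)

n1, n2 = len(usa), len(uk)
z = (w_usa - n1 * (n1 + n2 + 1) / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
print(w_usa, w_uk, round(z, 2))
```

The two rank sums should match the totals in the solution table (96.5 and 56.5), and z should come out at about 1.49.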
For example, if samples from three groups – executives, staff, and supervisors – are to be
selected and interviewed, the response of one group (say, executives) must in no way
affect the responses of others.
Illustration
A management seminar consisting of a large number of executives from manufacturing,
finance, and marketing is to be conducted. Before scheduling the seminar sessions the
seminar leader is interested in whether the three groups are equally knowledgeable about
management principles. A sample of seven manufacturing managers, eight from finance
and six from marketing were tested and their scores recorded as below.
Considering the scores as a single population, the marketing executive with a score of
35 is the lowest, so it is ranked 1. There are two scores of 38. To resolve this tie, each
score is given a rank of 2.5, [(2+3)/2]. This process is continued for all scores. The
highest score is 107, and the finance executive with that score is given a rank of 21.
The scores, the ranks and the sum of the ranks for each of the three samples are given
in the table below:
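Solving for H, where R_j is the sum of the ranks of sample j, n_j is its size and n = 21:
$$H = \frac{12}{n(n + 1)}\left[\frac{R_1^2}{n_1} + \frac{R_2^2}{n_2} + \frac{R_3^2}{n_3}\right] - 3(n + 1)$$
= 5.736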
Because the computed H (5.736) is less than the critical value of 5.991, the null
hypothesis is not rejected. There is no significant difference among the manufacturing,
finance, and marketing executives’ knowledge of management principles.
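The H statistic can be sketched in Python as shown below, using the standard formula H = 12/(n(n + 1)) × Σ(R_j²/n_j) – 3(n + 1) without a correction for ties. The scores are hypothetical because the seminar table is not reproduced here, and the function names are assumptions made for illustration.

```python
def average_ranks(values):
    """Rank from smallest to largest; tied values share the mean of their ranks."""
    return [sum(x < v for x in values) + (sum(x == v for x in values) + 1) / 2
            for v in values]

def kruskal_wallis_h(*samples):
    """Kruskal-Wallis H from raw scores (no tie correction applied)."""
    pooled = [x for s in samples for x in s]
    ranks = average_ranks(pooled)            # rank all scores as one group
    n = len(pooled)
    total, start = 0.0, 0
    for s in samples:
        r_j = sum(ranks[start:start + len(s)])   # rank sum of this sample
        total += r_j ** 2 / len(s)
        start += len(s)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

# Hypothetical scores for three independent groups (illustration only)
group_1 = [12, 15, 18]
group_2 = [21, 24, 27]
group_3 = [30, 33, 36]

h = kruskal_wallis_h(group_1, group_2, group_3)
critical = 5.991    # chi-square table value for 2 degrees of freedom, alpha = 0.05
print(round(h, 3), h > critical)
```

In this made-up case the groups do not overlap at all, so H is large and exceeds the critical value; in the seminar example the computed H of 5.736 falls just short of it.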
$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
Like the coefficient of correlation, the coefficient of rank correlation can assume any
value from –1.00 up to +1.00. A value of –1.00 indicates a perfect negative correlation;
+1.00 indicates a perfect positive correlation; and 0 indicates no correlation.
Illustration
The following table shows the ranks assigned by an executive (E) and a supervisor (S) on
the future performance of selected college graduates.
               Rating            Rank          Difference
Graduate    E (x)   S (y)     E       S        d        d²
    1         8       4       3.5     3        0.5      0.25
    2        10       4       6.5     3        3.5     12.25
    3         9       4       5       3        2.0      4.00
    4         4       3       1       1        0.0      0.00
    5        12       6      10.5     7        3.5     12.25
    6        11       9       8.5    10.5     -2.0      4.00
    7        11       9       8.5    10.5     -2.0      4.00
    8         7       6       2       7       -5.0     25.00
    9         8       6       3.5     7       -3.5     12.25
   10        13       9      12      10.5      1.5      2.25
   11        10       5       6.5     5        1.5      2.25
   12        12       9      10.5    10.5      0.0      0.00
Total                                          0.00    78.50
$$t = r_s \sqrt{\frac{n - 2}{1 - r_s^2}}$$
$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 78.5}{12(144 - 1)} = 0.726$$
The value of 0.726 indicates a strong positive association between the ratings of the
executive and the supervisor. Graduates who received high ratings from the
executive also tended to receive high ratings from the supervisor.
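The ranks, the differences and the coefficient can be reproduced with the Python sketch below. It follows the d² formula exactly as used in these notes; the helper name average_ranks is an assumption made for illustration.

```python
def average_ranks(values):
    """Rank from smallest to largest; tied values share the mean of their ranks."""
    return [sum(x < v for x in values) + (sum(x == v for x in values) + 1) / 2
            for v in values]

executive = [8, 10, 9, 4, 12, 11, 11, 7, 8, 13, 10, 12]
supervisor = [4, 4, 4, 3, 6, 9, 9, 6, 6, 9, 5, 9]

rank_e = average_ranks(executive)
rank_s = average_ranks(supervisor)

d = [e - s for e, s in zip(rank_e, rank_s)]   # rank differences
sum_d2 = sum(x ** 2 for x in d)               # sum of squared differences

n = len(executive)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum(d), sum_d2, round(r_s, 3))
```

The differences should sum to zero and the squared differences to 78.5, which gives a coefficient of about 0.726.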
Hypotheses:
H0: the rank correlation in the population is zero
H1: the rank correlation in the population is greater than zero.
$$t = 0.726\sqrt{\frac{12 - 2}{1 - 0.726^2}} = 3.338$$
The computed t value (3.338) is greater than the table value of 1.812; hence we reject the
null hypothesis and conclude that there is a positive correlation between the executive’s
ratings and the supervisor’s ratings.
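The significance test can be checked with a few lines of Python. The function name rank_correlation_t is an assumption made for illustration; the critical value 1.812 is the table value quoted above.

```python
import math

def rank_correlation_t(r_s, n):
    """t statistic for H0: the population rank correlation is zero (n - 2 d.f.)."""
    return r_s * math.sqrt((n - 2) / (1 - r_s ** 2))

t = rank_correlation_t(r_s=0.726, n=12)
critical_t = 1.812      # table value quoted in the text
print(round(t, 3), t > critical_t)
```

Because the computed t exceeds the table value, the sketch leads to the same decision as the worked solution: reject the null hypothesis.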
Exercises
1. What is a nonparametric test?
2. Explain what a sign test entails.
3. Why does a sign test involve a one-tailed test?
4. Discuss the various types of non-parametric tests.
5. A random sample of seven young professional couples who own homes is
selected. The size of each couple’s home is then compared with that of their parents
(average of the parents’ homes). At the 0.05 significance level, can you conclude
that the couples live in larger homes?
Couple                           1      2      3      4      5       6      7
Couple’s home (sq ft.)         1725   1310   1670   1520   12890   1880   1530
Avg. Parent’s home (sq ft.)    1175   1120   1420   1640   1360    1750   1440
6. The following observations were selected from populations that are not
necessarily normally distributed. Use the 0.05 significance level, a two-tailed test, and
the Wilcoxon rank-sum test to determine whether there is a difference between
the two populations.
1 2 3 4 5 6 7 8
Population A 38 45 56 57 61 69 70 79
Population B 26 31 35 42 51 52 57 62
8. The following data were obtained from three populations that were not
necessarily normal.
1 2 3 4 5 6
Sample A 50 54 59 59 65
Sample B 48 49 49 52 56 57
Sample C 39 41 44 47 51