
4. Inferential Statistics


Introduction to Statistical Methods

Inferential Statistics
Revision-1.0

Prof Vineet Garg

BITS Pilani Work Integrated Learning Programmes (WILP)


Bangalore Professional Development Center
Introduction
• We have reviewed Descriptive Statistics, Probability and Probability Distributions in the first few modules.
• We have an idea that Inferential Statistics involves taking a sample from a population, computing a statistic on the sample,
and then inferring from this statistic the value of the corresponding parameter of the population.
• Many times the sample statistics are going to be used to verify the claim of population parameters (Hypothesis Testing).
• Sampling from the population is done for several reasons; primarily because of feasibility and to save time and resources.
• Below is the proposed path to study inferential statistics:
➢ Sampling and Sampling Distribution:
o Sampling Techniques: ways to select a smaller part of the
population.
o Sampling Distribution: when several samples are picked up, they
might provide different statistics, e.g. different values of mean and
standard deviation. What is the distribution of these statistics?
➢ Point and Interval Estimation of the Population Parameter:
o Point Estimation: single-value estimation of a population parameter
from the sample statistic.
o Interval Estimation: estimation of an interval (range) where the
population parameter is likely to fall.
➢ Hypothesis Testing: accept or reject a claim about a population
parameter by gathering evidence from the sample statistic.
➢ This module focuses on a few of the above topics.
Sampling Techniques
Random Sampling
The sampling technique where every unit of the population has the same probability of
being selected into the sample is called Random Sampling. A few methods:
• Simple Random Sampling: a list of 30 companies is provided, out of which 6 are to
be selected randomly. The output of a single-digit random number generator is used to
perform the sampling. The random numbers are arranged in groups of 5 for ease of
reading; otherwise they can be treated as a sequence of individual digits read horizontally.
o Since the companies are numbered with two digits from 01 to 30, only the pairs of digits
from 01 to 30 are of interest in the random numbers. Other pairs are ignored.
o The first, second and third pairs are 91, 56 and 74 respectively. They are ignored. The next pair is
25, which is acceptable, so the company Occidental Petroleum is picked for the sample.
Continuing in the same way, the other companies selected for the sample are:
Procter & Gamble, Alaska Airlines, Bank of America, Alcoa and Sears.
• Stratified Random Sampling: the population is divided into non-overlapping
sub-populations called strata, and then a random sample is drawn from each stratum. It is
advantageous when every sub-population must be represented in the
sample. The size of the random sample from each stratum can be proportionate to the size of the
stratum in the population, or disproportionate.
• Systematic Sampling: every kth item is selected into the sample (size = n) from the
population (size = N), where k = N/n. It may not give a good sample if the population is
arranged in some order and the value of k is in sync with that order. If the population is
randomly ordered, it can be a good approach.
• Cluster Sampling: the population is divided into clusters, some clusters are selected
randomly, and the elements of the selected clusters are taken into the sample. It is a
cost-effective approach but does not guarantee true representation of the population in the sample.
(Figures: List of 30 Companies; Random Numbers)
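The four random sampling techniques above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the company names, the strata ("north", "south") and the clusters ("c1" to "c3") are hypothetical, and the fixed seed is there only to make the run repeatable.

```python
import random

def simple_random_sample(population, n, rng):
    # Every unit has the same chance of being selected.
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    # Pick every k-th unit after a random start, with k = N // n.
    k = len(population) // n
    start = rng.randrange(k)
    return population[start::k][:n]

def stratified_sample(strata, n, rng):
    # Proportionate allocation: each stratum contributes in proportion to its size.
    # Rounding may make the total differ slightly from n for other inputs.
    total = sum(len(units) for units in strata.values())
    return {name: rng.sample(units, round(n * len(units) / total))
            for name, units in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    # Select whole clusters at random, then take every unit in them.
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(42)  # hypothetical seed, for repeatability only
companies = [f"company_{i:02d}" for i in range(1, 31)]  # hypothetical list of 30

srs = simple_random_sample(companies, 6, rng)
systematic = systematic_sample(companies, 6, rng)
stratified = stratified_sample(
    {"north": companies[:20], "south": companies[20:]}, 6, rng)
clustered = cluster_sample(
    {"c1": companies[:10], "c2": companies[10:20], "c3": companies[20:]}, 1, rng)

print(srs, systematic, stratified, clustered, sep="\n")
```

Note how the systematic sample always picks units exactly k positions apart, which is why it can go wrong when the list has a periodic or sorted structure.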
Non-Random Sampling
• The sampling technique where not every unit of the population has the same
probability of being selected into the sample is called Non-Random
Sampling. It is also called non-probability sampling.
• Non-Random Sampling is NOT RECOMMENDED for inferential statistics. A few of
its variants are described here so that, as data scientists, we are alert to such
situations:
• Convenience Sampling: typically those elements are sampled that are readily
available, nearby or willing to participate.
• Judgmental Sampling: the researcher selects the
elements to be sampled based on their existing knowledge or professional
judgment. For example, the researcher decides that males of age 50 and above are
good candidates for the survey and samples only them.
• Quota Sampling: it sounds similar to stratified random sampling, but here the
researcher selects the elements from each stratum non-randomly. For example, instead
of randomly collecting data about Bengalis in New Delhi, the researcher
visits Chitranjan Park in New Delhi itself (a populated place for Bengalis in New
Delhi) and surveys people there.
• Snowball Sampling: the researcher identifies a person who fits the profile for the
sampling. The researcher then asks this person for the names and locations of
others who would also fit the profile of subjects.
Exercise

1. Unknowingly, the 50 best-performing stocks are kept at the top of the list of 150 companies of a stock
exchange, and lists for 3 such stock exchanges are prepared. If a random sample of 3 companies is to be
created from the population of these lists, explain why systematic sampling may not be a good
technique.
2. An advertising company interested in determining the impact of TV advertisements conducts a survey
in a geographical area. The area has three towns A, B and C. Town A is built around a factory, and mostly
factory workers with young kids live there. Town B contains mainly old retired people, and town
C has mainly farmers. Assume all households have at least one TV set. There are 155 households in
town A, 62 in town B and 93 in town C. The company wants to pick a sample of 40 households.
Suggest a suitable random sampling technique.
Sampling Distribution
Central Limit Theorem (CLT)
• There is a small finite population of N = 8 numbers: 54, 55, 59, 63, 64, 68, 69
and 70, with µ as mean and σ as standard deviation.
• The histogram is drawn, where [ ] indicates inclusive and ( ) indicates exclusive
values in the x-axis bins. The right-end inclusion method is followed here.
• The population size is small, so its distribution cannot be ascertained.
• Different samples of size n = 2 (as x and y) are drawn from this population with
replacement (8^2 = 64 possibilities).
• For these samples, means are also calculated. The mean of a sample (x̄) will
be different every time. Maybe the variations in the mean are not high.
• The question arises here: if X̄ is considered as a random variable, what kind of
distribution would it exhibit?
• A histogram of the means of the different samples is also drawn. It is normal in
distribution.
• Central Limit Theorem (CLT): if there is a population with mean μ and standard
deviation σ and if several random samples from the population with
replacement are taken, then the distribution of the sample means will be
approximately normally distributed. This will be true regardless of the distribution of
the population.
• The mean of the sample means ($\mu_{\bar{X}}$) and the standard deviation ($\sigma_{\bar{X}}$) are given by:
$\mu_{\bar{X}} = \mu$
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
• The mean of the sample means = 62.75 and standard deviation = 4.12.
(Figures: Population Histogram; Histogram of Sample Means)

Samples with Means (each cell is the mean of the sample (x, y); rows are x, columns are y)
x \ y     54     55     59     63     64     68     69     70
54     54.00  54.50  56.50  58.50  59.00  61.00  61.50  62.00
55     54.50  55.00  57.00  59.00  59.50  61.50  62.00  62.50
59     56.50  57.00  59.00  61.00  61.50  63.50  64.00  64.50
63     58.50  59.00  61.00  63.00  63.50  65.50  66.00  66.50
64     59.00  59.50  61.50  63.50  64.00  66.00  66.50  67.00
68     61.00  61.50  63.50  65.50  66.00  68.00  68.50  69.00
69     61.50  62.00  64.00  66.00  66.50  68.50  69.00  69.50
70     62.00  62.50  64.50  66.50  67.00  69.00  69.50  70.00
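The figures quoted for this small population can be checked by enumerating all 64 samples. The sketch below uses only the Python standard library; the variable names are illustrative and not part of the original slides.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

population = [54, 55, 59, 63, 64, 68, 69, 70]
n = 2  # sample size

# All 8^2 = 64 ordered samples drawn with replacement, and their means.
sample_means = [sum(sample) / n for sample in product(population, repeat=n)]

mu = mean(population)       # population mean
sigma = pstdev(population)  # population standard deviation (divides by N)

mean_of_means = mean(sample_means)
sd_of_means = pstdev(sample_means)

print(f"number of samples  = {len(sample_means)}")
print(f"mu                 = {mu}")
print(f"mean of the means  = {mean_of_means}")
print(f"sigma / sqrt(n)    = {sigma / sqrt(n):.4f}")
print(f"sd of the means    = {sd_of_means:.4f}")
```

Because every possible sample is enumerated, the mean of the sample means equals µ and their standard deviation equals σ/√n exactly (up to floating-point error), not just approximately.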
Central Limit Theorem: Features
• Central Limit Theorem holds for any type of population distribution
for large sample sizes (n >= 30).
• If the population itself is normally distributed then CLT holds even for
smaller values of n.
• The expression for standard deviation is true for infinite size of
population:
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
• If the population is finite (N) then a correction factor is applied to it
(normally not used because N will still be >> n):
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$
• If $\bar{x}$ (where $\bar{x} \in \bar{X}$) is the mean of a random sample of size n, taken
from a population having the mean (µ) and standard deviation (σ),
then the z-transformation for the sample mean will be:
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
Example
In a large shopping mall the average number of customers/hour is 448, with
a standard deviation of 21 customers. This data is collected over several
years. What is the probability that a random sample of 49 different
shopping hours will yield a sample mean between 441 and 446 customers?
Given that N = ∞, µ = 448, σ = 21, n = 49, x̄1 = 441 and x̄2 = 446
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$, so:
$z_1 = \frac{441 - 448}{21/\sqrt{49}} = \frac{-7}{3} = -2.33$, area between z = 0 and z1 = 0.4901
$z_2 = \frac{446 - 448}{21/\sqrt{49}} = \frac{-2}{3} = -0.67$, area between z = 0 and z2 = 0.2486

So, P(441 <= X̄ <= 446) = 0.4901 – 0.2486 = 0.2415 = 24.15%
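As a cross-check of the table lookup, the same probability can be computed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch rather than part of the original example; the result differs slightly from 0.2415 because the table method rounds z to two decimals.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 448, 21, 49

# Sampling distribution of the mean: normal with mean mu and sd sigma / sqrt(n).
sampling_dist = NormalDist(mu=mu, sigma=sigma / sqrt(n))

# P(441 <= sample mean <= 446) as a difference of cumulative probabilities.
p = sampling_dist.cdf(446) - sampling_dist.cdf(441)
print(f"P(441 <= sample mean <= 446) = {p:.4f}")
```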

z-Distribution Table
The t-distribution
• In the Central Limit Theorem (CLT), we reviewed that the means of samples follow a normal distribution for large sample sizes and
the theory of standard normal distribution can be applied for some analysis.
$\mu_{\bar{X}} = \mu$, $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$, $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
• The issue arises when:
o Population is known to be normally distributed.
o It is not feasible to draw several random samples.
o It is not feasible to draw a large-sized sample.
o The population standard deviation (σ) and mean (µ) are not known.
o The estimation of population mean (µ) is to be done or the claim about population mean (µ) is to be ascertained.

• We have reviewed that even for small sample sizes, the mean of samples follow the normal distribution if these samples are
drawn from normally distributed population.
• In such situations, Student's t-distribution can be used. Student's t-distribution was developed by William S Gosset, an
English statistician who published it in 1908 under the pen name Student. The t-statistic can be calculated using the sample mean ($\bar{x}$) and sample standard
deviation (s) as:
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$
(Question to consider: µ is not known, so how will the t-statistic be calculated?)
• The assumption made while calculating the t-statistic, that the population is normally distributed, is not very strict. The results are fairly
close for non-normally distributed populations too, as long as they are not too skewed.
t-distribution: Characteristics
• Like the normal distribution, the t-distribution is also bell shaped and symmetrical. But it is flatter in
the middle and has more area in the tails.
• The mean of the t-distribution is also zero (like the standard normal distribution) but its variance depends on
(n-1), which is called the Degree of Freedom (DF) and is denoted by the Greek letter v (nu).
• The variance of the t-distribution approaches 1 as n→∞; in other words, the t-distribution
approaches the standard normal distribution when n→∞.
• The term DF (v) refers to the number of independent observations for a source of variation minus
the number of independent parameters estimated in computing the variation. Here there are n
observations in the sample and one parameter (the mean of population) is being estimated. So v =
n-1.
• The value tα;v on the x-axis for which the area (tail area) under the t-curve with v DF to right is α is
called the t-critical value.
• In other words, the probability that a t value with v DF will be greater than tα;v is α. The Student's
t-distribution table provides the values of tα;v for a few values of α and v.
• In the table for a given v, when α decreases, the value of tα;v increases. That means moving away
from the mean towards right.
• In the table for a given α, when v increases, the value of tα;v gradually decreases.
• For v = ∞, the tα;v values are the corresponding z values for (1 - α) in the standard normal
distribution table. E.g. tα;v = 1.960 for v = ∞ and α = 0.025. The z-distribution table value for z =
1.960 is 0.9750 that is (1 - α).
Student’s t-distribution Table
Example-1
(No inference is made; only table calculations are done in this example!)

An electric equipment manufacturer claims that on 20% overload its circuit


breakers can sustain for 12.40 minutes on an average.
To test the claim a QA engineer selected a random sample of 20 circuit breakers
and put them under 20% overload. The mean time of the sustenance was 10.63
minutes and standard deviation was 2.48 minutes.
Calculate the t-statistic. What is the range of α expected from this data?
Given that:
µ = 12.40
n = 20, v = 20 - 1 = 19, 𝑥 = 10.63 and s = 2.48
So
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{10.63 - 12.40}{2.48/\sqrt{20}} = -3.19$
From the Student's t-distribution table, we can observe that the t value -3.19 will
occur in the range of 0.005 > α > 0.001 for v = 19 for the left half of the t-curve.
Example-2
(No inference is made; only table calculations are done in this example!)

The population mean of the heights of five-year old boys is claimed to be 100
cm. A teacher measures the height of her 25 students, obtaining a mean height
of 105 cm and standard deviation 18 cm.
Calculate the t-statistic. What is the range of α expected from this data?
Given that:
µ = 100
n = 25, v = 25 - 1 = 24, 𝑥 = 105 and s = 18
So
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{105 - 100}{18/\sqrt{25}} = 1.39$
From the Student’s t-distribution table, we can observe that the t value 1.39
will occur in the range of 0.1 > α > 0.05 for v = 24 for the right half of the t-
curve.
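The two examples above can be reproduced without a printed table. The sketch below computes the t-statistic and then approximates the one-tail area by numerically integrating the t density with Simpson's rule. The function names are illustrative, and the integration is a stand-in for the table lookup, not the method used in the slides.

```python
import math

def t_statistic(sample_mean, mu, s, n):
    # t = (x_bar - mu) / (s / sqrt(n))
    return (sample_mean - mu) / (s / math.sqrt(n))

def t_pdf(x, v):
    # Density of Student's t-distribution with v degrees of freedom.
    coeff = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return coeff * (1 + x * x / v) ** (-(v + 1) / 2)

def t_tail_area(t, v, steps=2000):
    # One-tail area beyond |t|: 0.5 minus the area between 0 and |t|,
    # with that area approximated by Simpson's rule (steps must be even).
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, v) + t_pdf(b, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    return 0.5 - total * h / 3

t1 = t_statistic(10.63, 12.40, 2.48, 20)   # circuit breakers, v = 19
t2 = t_statistic(105, 100, 18, 25)         # heights of boys, v = 24
area1 = t_tail_area(t1, 19)
area2 = t_tail_area(t2, 24)

print(f"Example-1: t = {t1:.2f}, tail area = {area1:.4f}")
print(f"Example-2: t = {t2:.2f}, tail area = {area2:.4f}")
```

The computed tail areas should fall inside the α ranges read from the table: between 0.001 and 0.005 for Example-1 and between 0.05 and 0.1 for Example-2.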

Student’s t-distribution Table


Exercise
1. The tensile strength of a new composite material can be modelled as a normal distribution. A random
sample of size 25 has mean of 45.3 and standard deviation as 7.9. In case the population mean is 40.5,
calculate the t-value and the range of α.
(Answer t = 3.038, 0.005 > α > 0.001)
2. The following are the times between 6 calls for an ambulance of a specific hospital and the patient’s arrival
at that hospital: 27, 15, 20, 32, 18 and 26 minutes. The hospital’s claim is average 20 minutes. Assume
duration is normally distributed, find out the value of t and the range of α.
(Answer t = 1.15, α > 0.1)
Sampling Distribution of the Variance
• To win a new contract, a manufacturer who supplies a few parts needs to show a continuous reduction in the
variation of the supplied parts.
• To measure the altitude, the airplanes are equipped with altimeter. The variation among the altimeter readings
should be minimum for a given altitude.
• In the above situations, the distribution of the mean will not be helpful. The distribution of variance is required.
• The relationship of the sample variance to the population variance is captured by the chi-square distribution.
The ratio of the sample variance (s2) multiplied by (n – 1) to the population variance (σ2) is approximately chi-
square distributed. The samples are assumed to be drawn from the normal population.
• s2 cannot be negative and this sampling distribution is related to Gamma Distribution with α = v/2 and β = 2.
Where v is the Degree of Freedom (DF) and is equal to (n - 1) where n is the sample size.
• Chi-square is denoted by the square of the Greek letter chi (χ) and is defined as:
$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sigma^2}$
• The value χ2α;v on the x-axis for which the area under the χ2 curve with v DF to
the right is α is called the χ2-critical value.
Example
(No inference is made; only table calculations are done in this example!)

An optical firm purchases glasses to be ground for lenses, and it is known from
past experience that the variance of the refractive index of this kind of
glass is 1.26 x 10^-4. The firm rejects a shipment if the sample variance of 20
pieces selected at random exceeds 2.0 x 10^-4. Calculate the value of χ2 and the
range of α from it.

Given that:
σ2 = 1.26 x 10^-4
n = 20 and s2 = 2.0 x 10^-4
$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = \frac{(20-1)(2.0 \times 10^{-4})}{1.26 \times 10^{-4}} = 30.16$
For given v = 20 - 1 = 19, the range is 0.05 > α > 0.025.
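The χ2 value and its right-tail area can also be computed directly. The sketch below integrates the chi-square density (a Gamma density with shape v/2 and scale 2) using Simpson's rule, which is one simple way to do it with the standard library alone. The function names are illustrative, and the integration assumes v >= 2 so that the density is finite at zero.

```python
import math

def chi_square_statistic(sample_var, pop_var, n):
    # chi^2 = (n - 1) * s^2 / sigma^2
    return (n - 1) * sample_var / pop_var

def chi_square_pdf(x, v):
    # Chi-square density with v degrees of freedom (assumes v >= 2).
    if x < 0:
        return 0.0
    return x ** (v / 2 - 1) * math.exp(-x / 2) / (2 ** (v / 2) * math.gamma(v / 2))

def chi_square_right_tail(x, v, steps=4000):
    # Area to the right of x: 1 minus the Simpson's-rule area on [0, x].
    h = x / steps
    total = chi_square_pdf(0.0, v) + chi_square_pdf(x, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * chi_square_pdf(i * h, v)
    return 1.0 - total * h / 3

chi2 = chi_square_statistic(2.0e-4, 1.26e-4, 20)
alpha = chi_square_right_tail(chi2, 19)
print(f"chi-square = {chi2:.2f}, right-tail area = {alpha:.4f}")
```

For v = 19 the computed area lies just under 0.05, which agrees with the range 0.05 > α > 0.025 read from the table.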

Chi-Square Distribution Table


Exercise
1. Hard disks of computers must spin evenly, and one departure from the acceptable level is called a pitch.
Samples are regularly taken to ensure quality through the measurement of pitches. Data collected
over a period shows that pitches are normally distributed with variance 0.065. A sample of 10 is
collected each week. The process will be declared out of control if the sample variance exceeds 0.122.
Calculate the value of χ2 and the range of α from it.
(Answer: χ2 = 16.89, 0.10 > α > 0.05)
2. A random sample of 10 observations is taken from a normal population having the variance as 42.5.
Find the approximate probability of obtaining a sample standard deviation between 3.14 and 8.94.
(Answer: 0.95)
3. From the data provided in the last example of Optical Firm, verify the calculations using the Gamma
Distribution function.
4. Using the Gamma Distribution function, find the probability that the variance of a random sample of
size 5 from a normal population with standard deviation 12 will exceed 180.
(Answer = 0.2873)
Parameter Estimation Techniques
Interval Estimation of Population Mean
Using z-statistic
• First we will review the methods to calculate and report an entire interval (range) of reasonable
values for the parameter. This is called an Interval Estimate. It is calculated by setting up a
Confidence Level.
• Let us say it is needed to estimate the interval of the population mean (µ) with 95% Confidence
Level. What does it imply?
• It implies, that 95% of all samples would give an interval that includes µ and only 5% of all
samples would yield an erroneous interval for the population mean.
• In other words, if we have to estimate the interval of the population mean with some
Confidence Level (1 - α)% and the sampling and interval estimate is done Y times, then Y.(1 -
α)% times the calculated intervals will contain the population mean.
• In the Central Limit Theorem, we reviewed that the means of the samples follow a normal
distribution. The area under the normal curve for the calculated interval will be (1 - α)% and
under both the tails it will be (α/2)% each.
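• From the Central Limit Theorem and the corresponding z-statistic for a significantly large
population and sample size n, we know that:
$-z_{\alpha/2} \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}$
• The above expression can be arranged for the population mean as:
$\mu = \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
• Or the interval estimate of the population mean can be written as:
$\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
(Figure: interval estimates from repeated samples; an asterisk (*) shows that µ is not in the interval.)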
Example-1
For a random sample of 36 items and a sample mean of 211, calculate a 95%
confidence interval for population mean if the population standard deviation is 23.
Given that:
n = 36, x̄ = 211
σ = 23
(1-α) = 0.95, so α = 0.05 and α/2 = 0.025
zα/2 = the value of z that gives the normal curve area
after the mean that is (0.5000 - α/2) = (0.5000 - 0.025) = 0.4750
= 1.96
(Tip: explore the Inverse Normal feature in the calculator.)
$\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
$211 - 1.96 \cdot \frac{23}{\sqrt{36}} \le \mu \le 211 + 1.96 \cdot \frac{23}{\sqrt{36}}$
203.49 ≤ μ ≤ 218.51
Example-2
A survey was conducted for US multinational companies that do business with
companies in India. One of the questions in the survey was: around how many years has
your company been working with companies in India? A random sample of 44
responses to this question gave a mean of 10.50 years. Suppose the population standard
deviation is 7.70 years. Construct a 90% confidence interval for the mean number of
years that a company has been working with Indian companies, for the population.
Given that:
n = 44, x̄ = 10.50
σ = 7.70
(1-α) = 0.90, so α = 0.10 and α/2 = 0.05
zα/2 = the value of z that gives the normal curve area
after the mean that is (0.5000 - α/2) = (0.5000 - 0.05) = 0.4500
= (1.64 + 1.65) / 2 = 1.645
(Tip: explore the Inverse Normal feature in the calculator.)
$\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
$10.50 - 1.645 \cdot \frac{7.70}{\sqrt{44}} \le \mu \le 10.50 + 1.645 \cdot \frac{7.70}{\sqrt{44}}$
8.59 ≤ μ ≤ 12.41
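The two z-interval examples can be reproduced with a small helper. This sketch uses `statistics.NormalDist().inv_cdf` from the Python standard library as the "inverse normal" step instead of a table; the function name `z_confidence_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence):
    # Interval x_bar +/- z_(alpha/2) * sigma / sqrt(n), for known sigma.
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # inverse normal lookup
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

low1, high1 = z_confidence_interval(211, 23, 36, 0.95)       # Example-1
low2, high2 = z_confidence_interval(10.50, 7.70, 44, 0.90)   # Example-2

print(f"Example-1: {low1:.2f} <= mu <= {high1:.2f}")
print(f"Example-2: {low2:.2f} <= mu <= {high2:.2f}")
```

No finite population correction is applied here, which matches the two examples since neither states a population size.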
Exercise
1. Construct an 80% confidence interval for the population mean given that: x = 56.7, σ = 12.1, N = 500 and n = 47.
(Answer: 54.4381 ≤ μ ≤ 58.9619)
2. A random sample of size 39 is taken from a population of 200 members. The sample mean is 66 and the
population standard deviation is 11. Construct a 96% confidence interval to estimate the population mean?
(Answer: 62.3825 ≤ μ ≤ 69.6175)
3. A community health association is interested in estimating the average number of maternity days women stay in
the local hospital. A random sample is taken of 36 women who had babies in the hospital during the past year.
The numbers below show the maternity days each woman in the sample was in the hospital. Use this data and a
population standard deviation of 1.17 to construct a 98% confidence interval to estimate the average maternity
stay in the hospital for all women who have babies in this hospital: 3, 3, 4, 3, 2, 5, 3, 1, 4, 3, 4, 2, 3, 5, 3, 2, 4, 3, 2,
4, 1, 6, 3, 4, 3, 3, 5, 2, 3, 2, 3, 5, 4, 3, 5, 4.
(Answer: 2.85 ≤ μ ≤ 3.76)
Interval Estimation of Population Mean
Using t-statistic

• We have reviewed the situations when the t-statistic can be used to observe the distribution of the population mean
(µ) using the sample mean (x̄) and sample standard deviation (s).
• Correspondingly, the t-distribution can be utilized to estimate the interval of population mean for a given
confidence level.
• The method is similar to what is used in estimating the interval of population mean using z-statistic when
population standard deviation (σ) was known, except in this method the sample standard deviation (s) will be
used.
• Using the t-statistic for v Degree of Freedom (DF) and (1-α) confidence level:
$\bar{x} - t_{\alpha/2} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2} \frac{s}{\sqrt{n}}$

t-distribution with 90% Confidence Level


Example
If a random sample of 41 items produces x̄ = 128.4 and s = 20.6, what is the
98% confidence interval for µ? Assume x is normally distributed for the
population.
Given that:
n = 41 so v = (n – 1) = 40
x̄ = 128.4 and s = 20.6
(1 – α) = 0.98, so α = 0.02 and α/2 = 0.01
From the t-distribution table:
t0.01 = 2.423
$\bar{x} - t_{\alpha/2} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2} \frac{s}{\sqrt{n}}$
$128.4 - 2.423 \cdot \frac{20.6}{\sqrt{41}} \le \mu \le 128.4 + 2.423 \cdot \frac{20.6}{\sqrt{41}}$
120.60 ≤ μ ≤ 136.20
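The arithmetic of this example can be checked with a short helper. Because the Python standard library has no t-distribution quantile function, this sketch takes the critical value read from the table (2.423 for v = 40 and α/2 = 0.01) as an input. The function name is illustrative.

```python
from math import sqrt

def t_confidence_interval(sample_mean, s, n, t_critical):
    # Interval x_bar +/- t_(alpha/2) * s / sqrt(n); t_critical comes from the table.
    margin = t_critical * s / sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = t_confidence_interval(128.4, 20.6, 41, t_critical=2.423)
print(f"{low:.2f} <= mu <= {high:.2f}")
```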
Interval Estimation of Population Variance
Using χ2 distribution

• The relationship of the sample variance to the population variance is captured by the chi-square distribution.
The ratio of the sample variance (s2) multiplied by (n – 1) to the population variance (σ2) is approximately chi-
square distributed. The samples are assumed to be drawn from the normal population.
• s2 cannot be negative and this sampling distribution is related to Gamma Distribution with α = v/2 and β = 2.
Where v is the Degree of Freedom (DF) and is equal to (n - 1) where n is the sample size.
• Chi-square is denoted by the square of the Greek letter chi (χ2) and is defined as:
$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$
• The above expression can be re-written for the population variance as:
$\sigma^2 = \frac{(n-1)s^2}{\chi^2}$
• Therefore, interval estimate for the population variance for v DF and (1 - α) confidence level can be written as:
$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-(\alpha/2)}}$
• The meaning of population variance interval estimate is discussed in the next slide.
Population Variance: Interval Estimation
• From the chi-square distribution, we reviewed that the variance and chi-square value are
inversely proportional:
$\sigma^2 = \frac{(n-1)s^2}{\chi^2}$
• So as the value of χ2 progresses towards the right from the origin, the corresponding value of σ2
decreases.

• Chi-square distribution table is arranged in such a way that it tells the value of χ2 for v DF and
area under the curve (α) to the right of χ2 .

• When it is required to estimate the population variance interval for the confidence level (1 - α),
the following areas (shaded) are excluded from the distribution curve:
i. α/2 to the right of $\chi^2_{\alpha/2}$ ($\chi^2_{\alpha/2}$ can be read from the table)
ii. α/2 to the left of $\chi^2_{1-(\alpha/2)}$ (how will $\chi^2_{1-(\alpha/2)}$ be read, given that the area α/2 is on the left?)
• Notice that for the point (ii) above the value of $\chi^2_{1-(\alpha/2)}$ will be equal to the χ2 value for which the
area on the right is 1 – (α/2).
• Therefore, interval estimate for the population variance for v DF and (1 - α) confidence level can
be written as:
$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-(\alpha/2)}}$
Example-1
A company manufactures aluminium cylinders and when a random sample of 8
cylinders is taken, the variance of the diameters comes out as 0.0022125.
Estimate the interval for the diameter variance for the cylinders that are
produced by the company for the 90% confidence level.
Given that:
s2 = 0.0022125
n = 8 so v = (n - 1) = 7
(1 – α) = 0.90 so α = 0.10 and α/2 = 0.05, 1 – (α/2) = 0.95
From the table:
$\chi^2_{\alpha/2} = \chi^2_{0.05} = 14.067$
$\chi^2_{1-(\alpha/2)} = \chi^2_{0.95} = 2.167$
$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-(\alpha/2)}}$
$\frac{7 \times 0.0022125}{14.067} \le \sigma^2 \le \frac{7 \times 0.0022125}{2.167}$
0.001101 ≤ σ2 ≤ 0.007147
Example-2
Refractive indices of 20 random pieces of glass selected from a large shipment
that is purchased from an optical firm have a variance of 1.20 x 10^-4. Construct
a 95% confidence interval for the population variance.
Given that:
s2 = 1.20 x 10^-4
n = 20 so v = (20 - 1) = 19
(1 – α) = 0.95 so α = 0.05 and α/2 = 0.025, 1 – (α/2) = 0.975
From the table:
$\chi^2_{\alpha/2} = \chi^2_{0.025} = 32.852$
$\chi^2_{1-(\alpha/2)} = \chi^2_{0.975} = 8.906$
$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-(\alpha/2)}}$
$\frac{19 \times 1.20 \times 10^{-4}}{32.852} \le \sigma^2 \le \frac{19 \times 1.20 \times 10^{-4}}{8.906}$
0.000069402 ≤ σ2 ≤ 0.0002560
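Both variance-interval examples follow the same recipe, sketched below. The chi-square critical values are passed in as read from the table, since the Python standard library has no chi-square quantile function. The function and parameter names are illustrative.

```python
def variance_confidence_interval(sample_var, n, chi2_right, chi2_left):
    # chi2_right: table value with area alpha/2 to its right (the larger value).
    # chi2_left:  table value with area 1 - alpha/2 to its right (the smaller value).
    df = n - 1
    # Dividing by the larger chi-square value gives the lower bound.
    return df * sample_var / chi2_right, df * sample_var / chi2_left

low1, high1 = variance_confidence_interval(0.0022125, 8, 14.067, 2.167)   # Example-1
low2, high2 = variance_confidence_interval(1.20e-4, 20, 32.852, 8.906)    # Example-2

print(f"Example-1: {low1:.6f} <= sigma^2 <= {high1:.6f}")
print(f"Example-2: {low2:.9f} <= sigma^2 <= {high2:.7f}")
```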
Exercise
1. A random sample of 14 families produced the monthly household incomes shown in the table below.
Construct a 95% confidence interval for the population variance, assuming household income is normally
distributed.
Monthly Income for 14 Sample Families (₹)
37,500 44,800 33,500 36,900 42,300 32,400 28,000
41,200 46,600 38,500 40,200 32,000 35,500 36,800

(Answer: 14084380.17 ≤ σ2 ≤ 69550238.25)


2. During the Covid-19 pandemic it was reported that the total weekly work hours are reducing primarily because
of several distractions during work from home. To do some analysis, a random sample of 20 employees was
collected from different IT companies and it was found that the distribution of weekly hours they work has the
standard deviation as 4.3 hours. Assume work hours worked per week are normally distributed in the
population. Use this sample information to develop a 98% confidence interval for the population variance of
the number of work hours worked per week for an employee.
(Answer: 9.71 ≤ σ2 ≤ 46.03)
Hypothesis Testing
Hypothesis Testing (plural: Hypotheses)

There are many problems in which, rather than estimating the value or interval of a parameter, we must
decide whether a statement concerning a parameter is True or False. Statistically speaking, we test a
hypothesis (claim) about a parameter. A few example statements are:
o Average time spent by women on social media is more than men.
o Children who drink health drinks (Horlicks etc.) grow taller.
o Average salary of people with PhD in engineering is more than the salary of people with just the
masters degrees.

Recap Example-1: A paint manufacturer claims that the mean drying time of their new paint is 20
minutes with standard deviation of 2.4 minutes. A researcher decided to reject the claim if a sample of
36 new paint boxes yield a mean that exceeds 20.75 minutes of drying time.
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{20.75 - 20}{2.4/\sqrt{36}} = 1.875$
From the z-distribution table the shaded area (to the right of z = 1.875, the "Reject the Claim" region) = 0.0304, which is the probability of rejecting the claim.

Recap Example-2: In the above example, if the claim were 21 minutes:
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{20.75 - 21}{2.4/\sqrt{36}} = -0.625$
From the z-distribution table the shaded area (to the left of z = -0.625, the "Accept the Claim" region) = 0.2660, which is the probability of accepting the claim.
Is that all? Shouldn't there be a criterion to reject or accept a claim, such as the probability being more than or less than
a certain threshold?
Steps to Set up a Hypothesis Test
1. Describe the Hypothesis (claim) in words in the form of a statement. E.g. average time spent by women on social media is more than men.
2. Based on the claim:
➢ Define Null (H0) and Alternative Hypothesis (HA or H1).
➢ In general null hypothesis means that there is no relationship between the two variables under consideration. E.g. there is no relationship between the gender and
the time spent on social media.
➢ Initially it is believed that the null hypothesis is true.
3. Identify the statistic for testing the validity of the Null Hypothesis. E.g. if it is about the mean time spent on social media, then the statistic is z-
statistic or t-statistic.
4. Decide the criterion to reject or retain the null hypothesis. This is called the significance value and it is denoted by α. α is also the probability of a Type-I
Error, which will be elaborated in the subsequent slides.
5. Calculate the probability value (p-value), which is the conditional probability of observing the statistic given that the null hypothesis is true:
p-value = P(observing the test statistic value | null hypothesis is true)
6. p-value is the probability value for the calculated statistic. E.g. you calculate the probability for the calculated z-statistic.
7. Null hypothesis is rejected when the p-value is less than the significance value (α); otherwise it is retained.
8. α is the threshold conditional probability. The statistic value for α is called the critical value (e.g. corresponding z-statistic for α).
9. Alternative hypothesis is the complement of the null hypothesis; it is what the claimant believes, and the claimant tries to reject the null hypothesis.
10. In summary, we start with the assumption that the null hypothesis is true and retain it unless there is evidence against it.
Examples
Hypothesis                                              Null and Alternative Hypothesis
Average annual salary is different for IT               H0: µIT = µDS
and Data Science (DS) Engineers.                        HA: µIT ≠ µDS
Mean height of the children who drink                   H0: µHD <= µNHD
health drinks (HD) is more than those                   HA: µHD > µNHD
who do not drink health drinks (NHD).
One Tailed and Two Tailed Tests
(Figure note: the area under the tail (α), the significance value, is the rejection region.)

The reject or retain decision will depend on the direction of the deviation of the estimated statistic from
the hypothesis value (population parameter) that is being ascertained.
Scenario-1:
Hypothesis: In US, average annual salary of Machine Learning (ML) experts is at least $100,000.
Null Hypothesis (H0): µML < 100,000
Alternative Hypothesis (HA): µML >= 100,000
In this case, alternative hypothesis is for more than the hypothesis value, so rejection region will be on
the right side of the one tailed test.
Scenario-2:
Hypothesis: Average waiting time (w) at the Bangalore airport security check is less than 30 minutes.
Null Hypothesis (H0): µw >= 30
Alternative Hypothesis (HA): µw < 30
In this case, alternative hypothesis is for less than the hypothesis value, so rejection region will be on the
left side of the one tailed test.

Scenario-3:
Hypothesis: Average annual salaries of Engineering (E) and Business Management (M) graduates are
different.
Null Hypothesis (H0): µE = µM
Alternative Hypothesis (HA): µE ≠ µM
In this case, alternative hypothesis could be less than or more than the hypothesis value, so rejection
region will be on both sides of the two tailed test (observe the α/2 regions).
Consolidated View
Type of Test      Condition                           Probability     Decision
---------------   ---------------------------------   -------------   ---------
Right-Tail Test   z-statistic > Critical Value        p-value < α     Reject H0
                  z-statistic <= Critical Value       p-value >= α    Retain H0
Left-Tail Test    z-statistic < Critical Value        p-value < α     Reject H0
                  z-statistic >= Critical Value       p-value >= α    Retain H0
Two-Tail Test     |z-statistic| > |Critical Value|    p-value < α     Reject H0
                  |z-statistic| <= |Critical Value|   p-value >= α    Retain H0

The same approach is applicable for the t-statistic.
[Figures: for each type of test, the curve with the critical value, the calculated statistic, the α area (α/2 per tail for the two-tail test) and the p-value area marked, for the H0 rejection and H0 retention cases. Only the area under the tail is shown.]
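The decision rules in the consolidated view can be sketched in Python using only the standard library. This is a minimal illustration, not part of the original material: the function name `z_test` and the sample numbers in the call are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alpha=0.05, tail="right"):
    """One-sample z-test. `tail` is 'right', 'left' or 'two'."""
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    z = (xbar - mu0) / (sigma / sqrt(n))
    if tail == "right":
        p_value = 1 - std_normal.cdf(z)             # area to the right of z
    elif tail == "left":
        p_value = std_normal.cdf(z)                 # area to the left of z
    elif tail == "two":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # area in both tails
    else:
        raise ValueError("tail must be 'right', 'left' or 'two'")
    decision = "reject H0" if p_value < alpha else "retain H0"
    return z, p_value, decision


# Hypothetical numbers, chosen only to illustrate the call.
z, p_value, decision = z_test(xbar=52, mu0=50, sigma=10, n=100, tail="right")
print(round(z, 3), round(p_value, 4), decision)
```

The function compares the p-value with α; comparing the statistic with the critical value leads to the same decision.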
Example-1
A study claimed that the average monthly disposable income of families in the IT hubs of India is greater
than ₹42,000 with a standard deviation of ₹32,000. A random sample of 40,000 families provided a mean
of ₹42,500. Conduct an appropriate hypothesis test using a confidence level of 95%.
Given that:
µ = 42000, σ = 32000, n = 40000 and x̄ = 42500
(1-α) = 0.95, so significance value (α) = 0.05
Critical value for α = 1.645
Null Hypothesis (H0): µ <= 42000
Alternative Hypothesis (HA): µ > 42000
The alternative hypothesis is for more than the hypothesized value, so the rejection region is on the right side of
the one-tailed test.
$$z\text{-statistic} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{42500 - 42000}{32000/\sqrt{40000}} = 3.125$$
p-value for z-statistic = 0.0009
Since p-value < significance value, the null hypothesis is rejected.
[Figure: Right-Tail H0 Rejection. The z-statistic lies to the right of the z-critical value, so the p-value area is smaller than α.]
Example-2
The passport office claims that passport applications are processed within 30 days on average when all the
papers are complete, with a standard deviation of 12.5 days. To verify the claim, data of 40 applications are
examined and the sample mean is found to be 27.05 days. Conduct a hypothesis test for significance value
0.05.
Given that:
µ = 30, σ = 12.5, n = 40 and x̄ = 27.05
Significance value (α) = 0.05
Critical value for α = 1.645
Null Hypothesis (H0): µ >= 30
Alternative Hypothesis (HA): µ < 30
The alternative hypothesis is for less than the hypothesized value, so the rejection region is on the left side of the
one-tailed test and the critical value carries the negative sign (-1.645).
$$z\text{-statistic} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{27.05 - 30}{12.5/\sqrt{40}} = -1.49$$
p-value for z-statistic = 0.0681
Since p-value > significance value, the null hypothesis is retained.
Example-3
Based on research conducted, the average IQ is 82 with standard deviation 11.03 for the citizens of a
country. When a random sample of 100 citizens was tested, the average IQ was found to be 84. Using the
significance value of 0.05, conduct a hypothesis test.
Given that:
µ = 82, σ = 11.03, n = 100 and x̄ = 84
Significance value (α) = 0.05
Null Hypothesis (H0): µ = 82
Alternative Hypothesis (HA): µ ≠ 82
The deviation could be on either side, so the rejection region is on both sides of the two-tailed test
with significance value (α/2) = 0.025 in each tail.
Critical value for α/2 = 1.96
$$z\text{-statistic} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{84 - 82}{11.03/\sqrt{100}} = 1.8132$$
p-value for z-statistic = 0.0349 (one side) and total p-value (both sides) = 0.0698
Since p-value > significance value, the null hypothesis is retained.
[Figure: Two-Tail H0 Retention. The z-statistic lies inside the z-critical values, so each p-value/2 area is larger than α/2.]
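Examples 1 to 3 can also be checked through the critical-value route. The sketch below is only an illustration using the standard library; the helper name `z_statistic` is hypothetical, and the critical values come from `NormalDist.inv_cdf` rather than from a printed table, so they carry more decimals than 1.645 and 1.96.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()
alpha = 0.05


def z_statistic(xbar, mu0, sigma, n):
    """z-statistic of a sample mean against the hypothesized mean mu0."""
    return (xbar - mu0) / (sigma / sqrt(n))


# Example-1: right-tail test, reject H0 when z > z-critical.
z1 = z_statistic(42500, 42000, 32000, 40000)
reject1 = z1 > std_normal.inv_cdf(1 - alpha)

# Example-2: left-tail test, reject H0 when z < -z-critical.
z2 = z_statistic(27.05, 30, 12.5, 40)
reject2 = z2 < std_normal.inv_cdf(alpha)

# Example-3: two-tail test, reject H0 when |z| > z-critical for alpha/2.
z3 = z_statistic(84, 82, 11.03, 100)
reject3 = abs(z3) > std_normal.inv_cdf(1 - alpha / 2)

print(round(z1, 3), reject1)
print(round(z2, 3), reject2)
print(round(z3, 4), reject3)
```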
Example-4
An agricultural researcher believes the average farm size has now increased from the 1990 mean figure of 4.71
acres. To test this notion, he randomly sampled 23 farms across India and ascertained the size of each farm from
the county records. The sample has a mean of 5 acres with a standard deviation of 0.47. Use a 5%
level of significance to test the hypothesis.
Given that:
µ = 4.71, s = 0.47, n = 23, v = 23 - 1 = 22 and x̄ = 5
Significance value (α) = 0.05
This problem involves the t-statistic because σ is unknown.
Null Hypothesis (H0): µ <= 4.71
Alternative Hypothesis (HA): µ > 4.71
The alternative hypothesis is for more than the hypothesized value, so the rejection region is on the right side of the
one-tailed test.
Critical value for α = 0.05 and v (= 22) = 1.717
$$t\text{-statistic} = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{5 - 4.71}{0.47/\sqrt{23}} = 2.96$$
Since t-statistic > critical value (or, p-value < significance value), the null hypothesis is rejected.
[Figure: Right-Tail H0 Rejection. The t-statistic lies to the right of the t-critical value.]
Example-5
A heavy machine part needs to be 25 pounds in weight on average. The factory supervisor suspects that the new
parts being built no longer weigh 25 pounds on average and that the manufacturing process has gone out of control. To
verify the doubt, 20 parts are sampled, which are found to have a mean weight of 25.51 pounds with a standard
deviation of 2.19 pounds. Using a significance value of 0.05, conduct a hypothesis test.
Given that:
µ = 25, s = 2.19, n = 20, v = 20 - 1 = 19 and x̄ = 25.51
Significance value (α) = 0.05
This problem involves the t-statistic because σ is unknown.
Null Hypothesis (H0): µ = 25
Alternative Hypothesis (HA): µ ≠ 25
The alternative hypothesis could be less than or more than the hypothesized value, so the rejection region is on both
sides of the two-tailed test with significance value (α/2) = 0.025 in each tail.
Critical value for α/2 = 0.025 and v (= 19) = 2.093
$$t\text{-statistic} = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{25.51 - 25}{2.19/\sqrt{20}} = 1.0415$$
Since |t-statistic| < critical value (or, p-value > significance value), the null hypothesis is retained.
[Figure: Two-Tail H0 Retention. The t-statistic lies inside the t-critical values, so each p-value/2 area is larger than α/2.]
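The two t-statistics above can be reproduced with a few lines of Python. This is a sketch only: the standard library has no t-distribution, so the critical values 1.717 and 2.093 are simply copied from the examples (they come from a t-table), and the helper name `t_statistic` is hypothetical.

```python
from math import sqrt


def t_statistic(xbar, mu0, s, n):
    """t-statistic of a sample mean when the population sigma is unknown."""
    return (xbar - mu0) / (s / sqrt(n))


# Example-4: right-tail test with v = 22, table critical value 1.717.
t4 = t_statistic(xbar=5, mu0=4.71, s=0.47, n=23)
reject4 = t4 > 1.717

# Example-5: two-tail test with v = 19, table critical value 2.093.
t5 = t_statistic(xbar=25.51, mu0=25, s=2.19, n=20)
reject5 = abs(t5) > 2.093

print(round(t4, 2), reject4)
print(round(t5, 4), reject5)
```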
Recap: How Does Hypothesis Formulation Work?

• We have reviewed Hypothesis Testing that involves the z-statistic and the t-statistic.
• It is a good point to take a step back and recap how the formulation of the Null and
Alternative Hypotheses works in the context of the significance value (α).
• Let us say for some problem, we have formed the hypotheses as:
Null Hypothesis (H0): µ <= 30
Alternative Hypothesis (HA): µ > 30 (this is what is being claimed)
Significance value: α
• The Null Hypothesis will be rejected in favour of the Alternative Hypothesis if the evidence from
the sample shows that mean > 30 at the significance value (α).
• We are talking about a "more than" relationship, so it is a right-tail test.
• The significance value (α) is the area under the curve on the right side. Its corresponding x-axis value is
called the critical value.
• From the sample data we calculate the statistic. It is also a point on the x-axis.
• For the calculated statistic, we find the area under the curve on its right side (for this problem).
This area is called the p-value.
• If p-value < significance value (α), we reject the Null Hypothesis in favour of the Alternative
Hypothesis.
• Why? Notice that for the right-tail test, p-value < α also means that statistic > critical value. The
sample then provides sufficient evidence, at significance value α, in support of the alternative hypothesis (the claim).
It does not prove the claim.
• In general, either pair, (p-value, α) or (statistic, critical value), can be used to test the
hypothesis.
[Figure: Right-Tail H0 Rejection. The statistic lies to the right of the critical value, so the p-value area is smaller than α. Only the area under the tail is shown.]
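The equivalence of the two decision rules for a right-tail test can be checked numerically. The sketch below is illustrative only; the grid of candidate z values is an arbitrary assumption.

```python
from statistics import NormalDist

std_normal = NormalDist()
alpha = 0.05
z_critical = std_normal.inv_cdf(1 - alpha)  # x-axis value with right-tail area alpha


def reject_by_p_value(z):
    """Right-tail rule: reject H0 when the p-value is below alpha."""
    return (1 - std_normal.cdf(z)) < alpha


def reject_by_critical_value(z):
    """Right-tail rule: reject H0 when the statistic exceeds the critical value."""
    return z > z_critical


# Arbitrary grid of statistics from -3.0 to 3.0 in steps of 0.1.
candidates = [i / 10 for i in range(-30, 31)]
rules_agree = all(
    reject_by_p_value(z) == reject_by_critical_value(z) for z in candidates
)
print(round(z_critical, 3), rules_agree)
```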
Hypothesis Testing for Variance
Type of Test      Condition                                  Probability     Decision
---------------   ----------------------------------------   -------------   ---------
Right-Tail Test   χ2-statistic > Critical Value (χ2α)        p-value < α     Reject H0
                  χ2-statistic <= Critical Value (χ2α)       p-value >= α    Retain H0
Left-Tail Test    χ2-statistic < Critical Value (χ21-α)      p-value < α     Reject H0
                  χ2-statistic >= Critical Value (χ21-α)     p-value >= α    Retain H0
Two-Tail Test     χ2-statistic < Critical Value (χ21-α/2)    p-value < α     Reject H0
                  OR
                  χ2-statistic > Critical Value (χ2α/2)
                  χ2-statistic >= Critical Value (χ21-α/2)   p-value >= α    Retain H0
                  AND
                  χ2-statistic <= Critical Value (χ2α/2)
Example-1
A company's goal is to minimize the number of machine parts that are piled up at the workstation waiting to be installed. The company
expects that, on average, about 20 parts will be at the station. However, the production superintendent suspects that the variance
has increased and is now greater than 4 at the workstation. On a given day, the number of machine parts piled up at the workstation is
determined at eight different times and the following numbers of parts are recorded: 23, 17, 20, 29, 21, 14, 19 and 24. Using the significance
value of 5%, test the hypothesis.
Given that:
σ2 = 4, n = 8, v = 8 - 1 = 7
From the given data, the sample variance (s2) can be calculated as 20.9821
Significance value (α) = 0.05
This problem involves the χ2-statistic.
Null Hypothesis (H0): σ2 <= 4
Alternative Hypothesis (HA): σ2 > 4
The alternative hypothesis is for more than the hypothesized value, so the rejection region is on the right side of the one-tailed test.
Critical value for α = 0.05 and v (= 7) = 14.067
$$\chi^2\text{-statistic} = \frac{(n - 1)\,s^2}{\sigma^2} = \frac{7 \times 20.9821}{4} = 36.719$$
Since χ2-statistic > critical value (or, p-value < significance value), the null hypothesis is rejected.
[Figure: Right-Tail H0 Rejection. The χ2-statistic lies to the right of the χ2-critical value.]
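The sample variance and the χ2-statistic of this example can be recomputed from the raw counts. The sketch uses only the standard library, so the critical value 14.067 is copied from the example (a table value) rather than computed; the variable names are illustrative.

```python
from statistics import variance

parts = [23, 17, 20, 29, 21, 14, 19, 24]  # counts recorded in Example-1
sigma_sq_0 = 4                            # hypothesized population variance
n = len(parts)

s_sq = variance(parts)                    # sample variance with divisor n - 1
chi_sq = (n - 1) * s_sq / sigma_sq_0

chi_sq_critical = 14.067                  # table value for alpha = 0.05, v = 7
reject = chi_sq > chi_sq_critical         # right-tail rule

print(round(s_sq, 4), round(chi_sq, 3), reject)
```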
Example-2
A small business has few employees. Because of the uncertain demand for its product, the company usually pays overtime in any given
week. The company assumed that about 50 total hours of overtime per week is required and that the variance of this figure is about 25.
Company officials want to know whether the variance of overtime hours has changed. A sample of 16 weeks of overtime data (in hours
per week) gave a variance of 28.06. Use this data to test the null hypothesis that the variance of overtime data is 25. The significance value is
0.10.
Given that:
σ2 = 25, n = 16, v = 16 - 1 = 15
From the given data, sample variance (s2) = 28.06
Significance value (α) = 0.10
This problem involves the χ2-statistic.
Null Hypothesis (H0): σ2 = 25
Alternative Hypothesis (HA): σ2 ≠ 25
The alternative hypothesis could be less than or more than the hypothesized value, so the rejection region is on both sides of the two-tailed
test with significance value (α/2) = 0.05 in each tail.
Critical value for α/2 = 0.05 and v (= 15) = 24.996 (right end tail)
Critical value for 1-(α/2) = 0.95 and v (= 15) = 7.261 (left end tail)
$$\chi^2\text{-statistic} = \frac{(n - 1)\,s^2}{\sigma^2} = \frac{15 \times 28.06}{25} = 16.836$$
Since Critical Value (χ21-α/2) < χ2-statistic < Critical Value (χ2α/2), i.e. 7.261 < 16.836 < 24.996, the statistic is not in the
rejection region, so the null hypothesis is retained.
Remember: the χ2 distribution starts from the origin.
Exercise
1. Calculate the p-value to reach a statistical conclusion: Null Hypothesis (H0): µ = 25, Alternative
Hypothesis (HA): µ ≠ 25, x̄ = 28.1, n = 57, σ = 8.46 and α = 0.01.
(Answer: p-value = 2 x 0.0028 = 0.0056; since 0.0056 < 0.01, the null hypothesis is rejected)

2. A random sample of size 20 is taken, resulting in a sample mean of 16.45 and a sample standard
deviation of 3.59. Assume x is normally distributed and α = 0.05. Test the following hypotheses: Null
Hypothesis (H0): µ = 16, Alternative Hypothesis (HA): µ ≠ 16.
(Answer t−statistic = 0.56, Retained)

3. Test the hypothesis when Null Hypothesis (H0): σ2 = 20, Alternative Hypothesis (HA): σ2 > 20 given that α
= 0.05, n = 15, s2 = 32.
(Answer χ2−statistic = 22.4, Retained)
Type-I and Type-II Errors
• Type-I Error: Rejecting a null hypothesis when it is true is called a Type-I Error. Its conditional probability is the same as the
significance value and is denoted by α.
• Type-II Error: Retaining a null hypothesis when it is false is called a Type-II Error. Its conditional probability is
denoted by β.
• Power of Hypothesis Test: The conditional probability of rejecting a null hypothesis when it is false is called the power of the
hypothesis test. It is equal to 1 – β.

               Retain H0            Reject H0
------------   ------------------   ------------------
H0 is True     Correct Decision     Type-I Error (α)
H0 is False    Type-II Error (β)    Correct Decision

Challenges in Finding Out the Type-II Error (Example): As per the data, the average IQ is 82 with standard deviation 11.03 for
the citizens of a country. The Ministry does not agree with this data and believes that the IQ is now more than 82.
Null Hypothesis (H0): µ <= 82
Alternative Hypothesis (HA): µ > 82
• A Type-II error occurs when the null hypothesis is false (the alternative hypothesis is true) and it is retained.
• If the null hypothesis is false, the mean is > 82, but what is its actual value? It could be 82.5, 83,
83.5 or anything else.
• So, the Type-II Error is calculated w.r.t. an assumed true mean value.
Example-1
A research firm claimed that the average IQ is 82 with standard deviation 11.03 for the citizens of a country.
The Ministry does not completely agree with the research firm and believes that the IQ is more than 82, at around
86. If the significance value is 0.05 and the sample size is 100, calculate the Type-II error.
Null Hypothesis (H0): µ <= 82
Alternative Hypothesis (HA): µ > 82
The Ministry believes that the IQ is more than 82, so it is a right-tail test.
For α = 0.05, the z critical value = 1.645
$$\bar{x}\text{-critical value} = \mu + z \cdot \frac{\sigma}{\sqrt{n}} = 82 + 1.645 \times \frac{11.03}{\sqrt{100}} = 83.81$$
The null hypothesis is retained when the sample mean is below 83.81. The z-value for 83.81 w.r.t. the assumed true mean:
$$z = \frac{83.81 - 86}{11.03/\sqrt{100}} = -1.98549$$
The probability P(z <= -1.98549) = 0.02354
So the Type-II error is 0.02354 w.r.t. the assumed mean 86: this is the probability that the
null hypothesis is retained although it is false.
Example-2
Suppose a null hypothesis is that the population mean is greater than or equal to 100. Suppose further that
a random sample of 48 items is taken and the population standard deviation is 14. Compute the probability
of committing a Type-II error if the population mean actually is 96, with significance value 0.05.
Null Hypothesis (H0): µ >= 100
Alternative Hypothesis (HA): µ < 100
The alternative hypothesis is for less than the hypothesized value, so it is a left-tail test.
For α = 0.05, the z critical value = 1.645
$$\bar{x}\text{-critical value} = \mu - z \cdot \frac{\sigma}{\sqrt{n}} = 100 - 1.645 \times \frac{14}{\sqrt{48}} = 96.676$$
The null hypothesis is retained when the sample mean is above 96.676. The z-value for 96.676 w.r.t. the actual mean:
$$z = \frac{96.676 - 96}{14/\sqrt{48}} = 0.3345$$
The probability P(z <= 0.3345) = 0.6310
So the Type-II error is P(z > 0.3345) = 1 - 0.6310 = 0.3690 w.r.t. the actual mean 96: this is the probability that the
null hypothesis is retained although it is false.
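The two Type-II error calculations follow the same steps, which can be sketched as a small function. This is an illustration under stated assumptions: the name `type_ii_error` is hypothetical, and because the critical value is taken from `NormalDist.inv_cdf` without intermediate rounding, the results differ slightly in the later decimals from the hand calculations above.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(mu0, mu_true, sigma, n, alpha, tail):
    """Probability of retaining H0 when the true mean is mu_true.

    `tail` is 'right' (HA: mu > mu0) or 'left' (HA: mu < mu0).
    """
    std_normal = NormalDist()
    se = sigma / sqrt(n)                       # standard error of the mean
    z_critical = std_normal.inv_cdf(1 - alpha)
    if tail == "right":
        x_critical = mu0 + z_critical * se
        # H0 is retained when the sample mean falls below x_critical.
        return std_normal.cdf((x_critical - mu_true) / se)
    if tail == "left":
        x_critical = mu0 - z_critical * se
        # H0 is retained when the sample mean falls above x_critical.
        return 1 - std_normal.cdf((x_critical - mu_true) / se)
    raise ValueError("tail must be 'right' or 'left'")


beta_1 = type_ii_error(82, 86, 11.03, 100, 0.05, "right")  # Example-1
beta_2 = type_ii_error(100, 96, 14, 48, 0.05, "left")      # Example-2
print(round(beta_1, 3), round(beta_2, 3))
print("power:", round(1 - beta_1, 3), round(1 - beta_2, 3))
```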
Exercise
The test is conducted on the following hypothesis:
H0: µ >= 12
HA: µ < 12
The test uses a sample of size 60. Taking the significance value as 0.05 and the population standard
deviation as 0.10, calculate the probability of committing a Type-II error if the assumed true mean is 11.96.
(Answer = 0.073)
Appendix
(Dealing with Two or More Populations/Samples)
For Self Study
Will Not be Included in the Evaluation Components
Comparing Two Populations
Introduction
• It is expensive for a mobile operator when one of its towers breaks down. Depending on
the availability, sometimes the operator might need to send a novice to fix the problem.
A mobile operator wants to conduct an experiment to compare the average time for an
expert to fix a problem and the average time for a novice to fix the problem.
• Petroleum companies advertise the advantages of using premium fuel. A consumer group
wants to determine the difference in mileage of cars using regular fuel and cars using
premium fuel.
• A consumer group wants to test whether two different brands of toothpaste are rated the same by
consumers, or whether two different brands of tyres wear differently.
• What is the interval estimate of the average saving per month on groceries from the
samples of families who buy only online and who buy only by visiting the stores?
• Two machines produce the same mechanical part in time shifts. Does the accuracy
measurement vary between these two lots of parts?
• Can the stock price for a set of companies be compared for two successive years?
• Do employees perform better after a training program?
• In the above examples, the objective is to determine the amount of difference between
two populations, or within one population before and after some event. This is done by drawing
random samples from these two sets.
• In this topic, we will review the different techniques to compare these two sets.
Some Background Math
• Let us say there is a function g(x) = a.x + b where a and b are constants.
• For this function, f(x) is the probability distribution function of x.
• We have reviewed that the first order moment is the mean. So the first order moment of g(x):
$$E[a x + b] = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx = \int_{-\infty}^{\infty} (a x + b)\,f(x)\,dx$$
$$= a \int_{-\infty}^{\infty} x\,f(x)\,dx + b \int_{-\infty}^{\infty} f(x)\,dx = a\,E[x] + b = a\,\mu_x + b$$
• We have also reviewed that the second order moment about the mean is the variance. So the variance of g(x):
$$Var(a x + b) = \int_{-\infty}^{\infty} \left[(a x + b) - (a\,\mu_x + b)\right]^2 f(x)\,dx = a^2 \int_{-\infty}^{\infty} (x - \mu_x)^2 f(x)\,dx = a^2\,Var(x)$$
• Let us say Y is a combination of two independent random variables x1 and x2 and can be represented with
two constants a1 and a2 in the following form:
$$Y = a_1 x_1 + a_2 x_2$$
• The mean of Y can be expressed as:
$$E[a_1 x_1 + a_2 x_2] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (a_1 x_1 + a_2 x_2)\,f_1(x_1)\,f_2(x_2)\,dx_1\,dx_2$$
$$= a_1 \int_{-\infty}^{\infty} x_1 f_1(x_1)\,dx_1 \cdot \int_{-\infty}^{\infty} f_2(x_2)\,dx_2 + a_2 \int_{-\infty}^{\infty} f_1(x_1)\,dx_1 \cdot \int_{-\infty}^{\infty} x_2 f_2(x_2)\,dx_2$$
$$= a_1\,E(x_1) + a_2\,E(x_2) = a_1 \mu_1 + a_2 \mu_2$$
• The variance of Y can be expressed as:
$$Var(Y) = E\left[(Y - \mu_Y)^2\right] = E\left[(a_1 x_1 + a_2 x_2 - a_1 \mu_1 - a_2 \mu_2)^2\right] = E\left[\left(a_1 (x_1 - \mu_1) + a_2 (x_2 - \mu_2)\right)^2\right]$$
$$= E\left[a_1^2 (x_1 - \mu_1)^2 + a_2^2 (x_2 - \mu_2)^2 + 2 a_1 a_2 (x_1 - \mu_1)(x_2 - \mu_2)\right]$$
$$= a_1^2\,E\left[(x_1 - \mu_1)^2\right] + a_2^2\,E\left[(x_2 - \mu_2)^2\right] + 2 a_1 a_2\,E\left[(x_1 - \mu_1)(x_2 - \mu_2)\right]$$
$$= a_1^2\,Var(x_1) + a_2^2\,Var(x_2) + 2 a_1 a_2\,E\left[(x_1 - \mu_1)(x_2 - \mu_2)\right]$$
• If x1 and x2 are independent their covariance will be 0:
$$E\left[(x_1 - \mu_1)(x_2 - \mu_2)\right] = 0$$
• So:
$$Var(Y) = a_1^2\,Var(x_1) + a_2^2\,Var(x_2)$$
Example
Example-1: Given that x1 and x2 are independent, let x1 have mean 4 and variance 9 and let x2 have mean -2 and variance 6.
Find the mean and variance of: (2x1 + x2 - 5).
We have reviewed that:
Mean(ax + b) = a.μx + b
Var(ax + b) = a².Var(x)
and, if Y = a1.x1 + a2.x2 where x1 and x2 are independent and a1 and a2 are two constants:
Mean(Y) = a1.μ1 + a2.μ2
Var(Y) = a1².Var(x1) + a2².Var(x2)

Mean(2x1 + x2 - 5) = 2.Mean(x1) + 1.Mean(x2) − 5 = 2 x 4 − 2 − 5 = 1
Var(2x1 + x2 - 5) = 2².Var(x1) + 1².Var(x2) = 4 x 9 + 1 x 6 = 42

Example-2: Given that x1 and x2 are independent, let x1 have mean μ1 and variance σ1² and let x2 have mean μ2 and variance
σ2². Find the mean and variance of: (x1 - x2) and (x1 + x2).
Comparing (x1 - x2) with a1.x1 + a2.x2: a1 = 1 and a2 = −1. So:
Mean(x1 - x2) = μ1 − μ2
Var(x1 - x2) = Var(x1) + Var(x2) = σ1² + σ2²
Comparing (x1 + x2) with a1.x1 + a2.x2: a1 = 1 and a2 = 1. So:
Mean(x1 + x2) = μ1 + μ2
Var(x1 + x2) = Var(x1) + Var(x2) = σ1² + σ2²
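The two rules used in these examples can be written as one small helper. The sketch below assumes independent variables (zero covariance), as in the text; the function name and the numeric values used for the x1 − x2 case are hypothetical.

```python
def linear_combination_mean_var(coeffs, means, variances, constant=0):
    """Mean and variance of constant + sum(a_i * x_i) for independent x_i."""
    mean = constant + sum(a * m for a, m in zip(coeffs, means))
    # The constant shifts the mean only; coefficients enter the variance squared.
    variance = sum(a ** 2 * v for a, v in zip(coeffs, variances))
    return mean, variance


# Example-1: 2*x1 + x2 - 5 with means (4, -2) and variances (9, 6).
mean_1, var_1 = linear_combination_mean_var([2, 1], [4, -2], [9, 6], constant=-5)
print(mean_1, var_1)

# Example-2 with hypothetical numbers: x1 - x2 keeps the sum of the variances.
mean_2, var_2 = linear_combination_mean_var([1, -1], [10, 3], [4, 5])
print(mean_2, var_2)
```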
Interval Estimation & Hypothesis Testing
Using z-Statistic

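• Let us say two samples of sizes n1 and n2 (with means x̄1 and x̄2) are drawn from two populations (with means μ1 and μ2
and standard deviations σ1 and σ2).
• The central limit theorem states that the difference in two sample means is approximately normally distributed for large
sample sizes (both n1 and n2 >= 30) regardless of the shape of the populations, with the following
properties:
$$\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$$
$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
Why? For independent samples, Var(x̄1 − x̄2) = Var(x̄1) + Var(x̄2), as reviewed in the background math.
• The z-statistic is calculated by:
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
• Confidence interval for the difference in the population means:
$$(\bar{x}_1 - \bar{x}_2) - z \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le (\mu_1 - \mu_2) \le (\bar{x}_1 - \bar{x}_2) + z \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$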
Example
The standard deviation of the mileage on regular fuel is 3.46 miles per gallon (mpg) and on premium
fuel is 2.99 mpg. Two sets of 50 cars each are tested on regular and premium fuels, giving
mean mileages of 21.45 mpg and 24.60 mpg respectively. Construct a 95% confidence interval for the
difference of mean mileage of regular and premium fuel based on this sample experiment.
Given that:
σ1 = 3.46, σ2 = 2.99, n1 = n2 = 50, x̄1 = 21.45 and x̄2 = 24.60
The z-value for 95% confidence interval = 1.96
$$(\bar{x}_1 - \bar{x}_2) - z \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le (\mu_1 - \mu_2) \le (\bar{x}_1 - \bar{x}_2) + z \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
$$(21.45 - 24.60) - 1.96 \sqrt{\frac{3.46^2}{50} + \frac{2.99^2}{50}} \le (\mu_1 - \mu_2) \le (21.45 - 24.60) + 1.96 \sqrt{\frac{3.46^2}{50} + \frac{2.99^2}{50}}$$
$$-4.42 \le (\mu_1 - \mu_2) \le -1.88$$
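The interval above can be reproduced with the standard library. This is a sketch only; the function name `two_sample_z_interval` is hypothetical, and the z-value is obtained from `NormalDist.inv_cdf` instead of the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_interval(xbar1, xbar2, sigma1, sigma2, n1, n2, confidence=0.95):
    """Confidence interval for mu1 - mu2 when both sigmas are known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)  # std. dev. of xbar1 - xbar2
    diff = xbar1 - xbar2
    return diff - z * se, diff + z * se


# Regular fuel (sample 1) versus premium fuel (sample 2).
lower, upper = two_sample_z_interval(21.45, 24.60, 3.46, 2.99, 50, 50)
print(round(lower, 2), round(upper, 2))
```

Since the whole interval lies below zero, the sample suggests that the mean mileage on regular fuel is lower than on premium fuel.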


Example
A DTH company sends technicians to customers' houses for dish and set top box installations. In 2009, the standard deviation of the overhead
expenditure for these trips per technician per day was ₹18.50 and in 2019 it was ₹15.60. The company's auditor believes that the daily
overhead expenditure rose significantly between 2009 and 2019. To test this belief, the auditor samples 51 trips from the company's records
for 2009. The sample average was ₹190 per day. The auditor selects a second random sample of 47 business trips from the company's
records for 2019. The sample average was ₹198 per day. If he uses a risk of committing a Type-I error of 0.01, does the auditor find that the
average expenditure has gone up significantly?
Given that:
σ1 = 18.50, σ2 = 15.60, n1 = 51, n2 = 47, x̄1 = 190 and x̄2 = 198
Type-I Error (α) = 0.01
Null Hypothesis (H0): μ1 = μ2 or μ1 − μ2 = 0
Alternative Hypothesis (HA): μ1 − μ2 < 0
z-critical value for the given α = -2.3263 for the left-end tail test
z-statistic from the given data:
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{(190 - 198) - 0}{\sqrt{\dfrac{18.50^2}{51} + \dfrac{15.60^2}{47}}} = -2.3202$$
Since z-statistic >= z-critical value for the left-end tail test, the null hypothesis is retained. The samples do not support the auditor's belief at the 1% significance value.
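The same test can be run through the p-value route. The sketch below uses only the standard library and the numbers of this example; the variable names are illustrative. The statistic falls very close to the critical value, which is why the p-value comes out only marginally above α.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

sigma1, sigma2 = 18.50, 15.60   # known population standard deviations
n1, n2 = 51, 47
xbar1, xbar2 = 190, 198         # 2009 and 2019 sample means
alpha = 0.01

se = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
z = ((xbar1 - xbar2) - 0) / se            # hypothesized difference is 0
p_value = std_normal.cdf(z)               # left-tail area
z_critical = std_normal.inv_cdf(alpha)    # negative for a left-tail test

reject = p_value < alpha
print(round(z, 4), round(z_critical, 4), round(p_value, 4), reject)
```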
Interval Estimation & Hypothesis Testing
Using t-Statistic

• The pooled variance of the two samples (x1 and x2) can be given as:
$$S_p^2 = \frac{\sum_{i=1}^{n_1} (x_{i1} - \bar{x}_1)^2 + \sum_{i=1}^{n_2} (x_{i2} - \bar{x}_2)^2}{(n_1 - 1) + (n_2 - 1)}$$
• If the individual variances of the samples are S1² and S2² respectively, the above expression can be re-written as:
$$S_p^2 = \frac{(n_1 - 1)\,S_1^2 + (n_2 - 1)\,S_2^2}{n_1 + n_2 - 2}$$
• From the Central Limit Theorem for two populations, the standard deviation of the difference of sample means:
$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
• If the standard deviations of the two populations are not known but can be assumed equal, the pooled standard deviation Sp replaces
both σ1 and σ2 in the above expression to estimate the standard deviation of the difference of the two sample means:
$$S_{\bar{x}_1 - \bar{x}_2} = S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
• The t-statistic, with degrees of freedom v = n1 + n2 − 2, is calculated by:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
Example
Waste industrial material is crushed and used for constructing roadways. This is not only
environmentally friendly but also provides better strength and durability of the roads. Six
samples each from two locations of waste collection centres are picked up. Their resilience
modulus of strength values are listed below. Use the 0.05 significance value and test the null
hypothesis that the mean strength of the waste is the same at the two locations.

Resilience Modulus
Location-1   707  632  604  652  669  674
Location-2   552  554  484  630  648  610

Given that, n1 = n2 = 6
Mean (x̄1) of location-1 = 656.33, variance (S1²) = 1277.87
Mean (x̄2) of location-2 = 579.67, variance (S2²) = 3739.87
Significance value (α) = 0.05, Degree of Freedom (v) = 6+6-2 = 10
Null Hypothesis (H0): μ1 = μ2 or μ1 − μ2 = 0
Alternative Hypothesis (HA): μ1 ≠ μ2
The t-critical value for α/2 = 0.025 for two-tailed test for the given v = 2.228
$$S_p^2 = \frac{(n_1 - 1)\,S_1^2 + (n_2 - 1)\,S_2^2}{n_1 + n_2 - 2} = \frac{(6 - 1) \times 1277.87 + (6 - 1) \times 3739.87}{6 + 6 - 2} = 2508.87, \text{ so } S_p = 50.09$$
$$t\text{-statistic} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{(656.33 - 579.67) - 0}{50.09 \sqrt{\dfrac{1}{6} + \dfrac{1}{6}}} = 2.65$$
Since |t-statistic| > t-critical value (the statistic falls in the right-end tail of the two-tailed test), the null hypothesis is rejected. It
also means that the strength of the waste is
different at the two locations.
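The pooled variance and the t-statistic of this example can be recomputed from the raw data. This is a minimal sketch with the standard library; the critical value 2.228 is copied from the example (a t-table value), and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

location_1 = [707, 632, 604, 652, 669, 674]
location_2 = [552, 554, 484, 630, 648, 610]

n1, n2 = len(location_1), len(location_2)
s1_sq, s2_sq = variance(location_1), variance(location_2)  # divisor n - 1

# Pooled variance: sample variances weighted by their degrees of freedom.
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

t = (mean(location_1) - mean(location_2)) / (sqrt(sp_sq) * sqrt(1 / n1 + 1 / n2))

t_critical = 2.228              # table value for alpha/2 = 0.025, v = 10
reject = abs(t) > t_critical    # two-tail rule

print(round(sp_sq, 2), round(t, 2), reject)
```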
Example
A coffee manufacturer is interested in estimating the difference in the average daily coffee consumption of
regular-coffee drinkers and decaffeinated-coffee drinkers. Its researchers sampled 13 and 15 coffee drinkers in each category
respectively and collected the data for the number of cups of coffee they drink. The average for the regular-coffee
drinkers is 4.35 cups, with a standard deviation of 1.20 cups. The average for the decaffeinated-coffee drinkers is 6.84
cups, with a standard deviation of 1.42 cups. Assuming that the daily consumption is normally distributed, construct a
95% confidence interval to estimate the difference in the averages of the two populations.

Given that:
S1 = 1.20, S2 = 1.42, n1 = 13, n2 = 15, x̄1 = 4.35 and x̄2 = 6.84
The Degree of Freedom (v) = 13+15-2 = 26
The t-value for α/2 = (1-0.95)/2 = 0.025 and the given degree of freedom = 2.056
$$S_p^2 = \frac{(n_1 - 1)\,S_1^2 + (n_2 - 1)\,S_2^2}{n_1 + n_2 - 2} = \frac{(13 - 1) \times 1.20^2 + (15 - 1) \times 1.42^2}{13 + 15 - 2} = 1.75, \text{ so } S_p = 1.32$$
$$(\bar{x}_1 - \bar{x}_2) - t \cdot S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le (\mu_1 - \mu_2) \le (\bar{x}_1 - \bar{x}_2) + t \cdot S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
$$(4.35 - 6.84) - 2.056 \times 1.32 \sqrt{\frac{1}{13} + \frac{1}{15}} \le (\mu_1 - \mu_2) \le (4.35 - 6.84) + 2.056 \times 1.32 \sqrt{\frac{1}{13} + \frac{1}{15}}$$
$$-3.52 \le (\mu_1 - \mu_2) \le -1.46$$
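The interval can be checked from the summary statistics alone. The sketch below is illustrative: the function name `pooled_t_interval` is hypothetical and the t-value 2.056 is passed in from the example (a t-table value), because the standard library has no t-distribution.

```python
from math import sqrt


def pooled_t_interval(xbar1, xbar2, s1, s2, n1, n2, t_critical):
    """Confidence interval for mu1 - mu2 using the pooled standard deviation."""
    sp = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    margin = t_critical * sp * sqrt(1 / n1 + 1 / n2)
    diff = xbar1 - xbar2
    return diff - margin, diff + margin


# Regular-coffee drinkers (sample 1) versus decaffeinated-coffee drinkers (sample 2).
lower, upper = pooled_t_interval(4.35, 6.84, 1.20, 1.42, 13, 15, t_critical=2.056)
print(round(lower, 2), round(upper, 2))
```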
Matched Pair Comparison

• There are many situations where before-and-after condition of the data is to be compared. E.g.:
o Ceramic components of a mechanical part lose weight after the baking process.
o Price to Earning Ratio (P/E) of selected 10 companies in the two successive years.
o Market price of the flats in an apartment complex before and during Covid-19 pandemic.
o Weekly loss of working hours in a factory before and after taking the stringent safety measures.
o Body Mass Index (BMI) before and after following a strict diet and exercise regime for 3 months.
• This comparison of two populations is popularly known as Matched Pair Comparison or correlated t-tests.
• The t-statistic for this test is given by:
$$t\text{-statistic} = \frac{\bar{d} - D}{S_d/\sqrt{n}}$$
Where:
d̄ = mean sample difference
D = mean population difference
S_d = standard deviation of the sample differences
n = number of pairs, with degrees of freedom = n-1
• Null hypothesis is tested for mean population difference D = 0.
• The hypothesis D = 0 indicates that the means of the two responses are the same.
Example
A stock market investor thinks that there is a significant difference in the P/E (price to earnings) ratio for 9
companies from one year to the next. Test the hypothesis taking the significance value as 0.01 and the data
below.

Company   Year-1 P/E Ratio   Year-2 P/E Ratio   d
-------   ----------------   ----------------   ------
1         8.9                12.7               -3.80
2         38.1               45.4               -7.30
3         43                 10                 33.00
4         34                 27.2               6.80
5         34.5               22.8               11.70
6         15.2               24.1               -8.90
7         20.3               32.3               -12.00
8         19.9               40.1               -20.20
9         61.9               106.5              -44.60
Sum                                             -45.30
Mean (d̄)                                        -5.03
Standard Deviation (S_d)                        21.60

Given that:
Significance value (α) = 0.01, Degree of Freedom (v) = 9-1 = 8
Null Hypothesis (H0): D = 0
Alternative Hypothesis (HA): D ≠ 0
The t-critical value for α/2 = 0.005 for two-tailed test for the given v = ±3.355
$$t\text{-statistic} = \frac{\bar{d} - D}{S_d/\sqrt{n}} = \frac{-5.03 - 0}{21.60/\sqrt{9}} = -0.6986$$
Since the t-statistic (-0.6986) is greater than the t-critical value in the left-end tail (-3.355), i.e. |t-statistic| < 3.355, the null hypothesis is retained. It means the
experiment suggests there is no significant difference in the P/E ratio between the two years.
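The matched-pair calculation can be reproduced from the table. This is a sketch with the standard library; the critical value 3.355 is copied from the example (a t-table value). Because the code does not round the mean difference to -5.03 first, the statistic differs from -0.6986 in the later decimals.

```python
from math import sqrt
from statistics import mean, stdev

year_1 = [8.9, 38.1, 43, 34, 34.5, 15.2, 20.3, 19.9, 61.9]
year_2 = [12.7, 45.4, 10, 27.2, 22.8, 24.1, 32.3, 40.1, 106.5]

d = [a - b for a, b in zip(year_1, year_2)]  # paired differences
n = len(d)
d_bar = mean(d)
s_d = stdev(d)                               # sample standard deviation

t = (d_bar - 0) / (s_d / sqrt(n))            # hypothesized difference D = 0

t_critical = 3.355                           # table value for alpha/2 = 0.005, v = 8
reject = abs(t) > t_critical                 # two-tail rule

print(round(d_bar, 2), round(s_d, 2), round(t, 3), reject)
```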
Exercise
1. The variances of the two populations are 22.74 and 26.65. Given the significance value as 0.02, test the null
hypothesis that there is no difference in the population means. The sample data from the two populations are
shown below:
Sample-1              Sample-2
90 88 80 88 83 94     78 85 82 81 75 76
88 87 91 81 83 88     90 80 76 83 88 77
81 84 84 87 87 93     77 75 79 86 90 75
88 90 91 88 84 83     82 83 88 80 80 74
89 95 97 95 93 97     80 90 74 89 84 79

(Answer: z-statistic = 5.4813, null hypothesis is rejected)

2. Construct a 98% confidence interval to estimate the difference in population means using the above data.
(Answer: 4.04 ≤ (μ1 − μ2 ) ≤ 10.02)

3. A sample-1 of size 8 has mean 24.56 and variance 12.40. Another sample-2 of size 11 has mean 26.42
and variance 15.80. Using 1% significance value, test the alternative hypothesis that µ1 - µ2 < 0.
(Answer: t-statistic = −1.05, null hypothesis is retained or alternative hypothesis is rejected)

4. The following are the average weekly losses of worker-hours due to accidents in 10 industrial plants before and
after a certain safety program was put into operation. Using the significance value as 0.05, test the null
hypothesis that safety program is not effective.
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11
(Answer: t − statistic = 4.03, null hypothesis is rejected)
F-distribution
• If S1² and S2² are the variances of independent random samples
of size n1 and n2 respectively, taken from two normally
distributed populations having the same variance, then the ratio
$$F = \frac{S_1^2}{S_2^2}$$
follows the F-distribution.
• If two samples come from the same population (or
populations with equal variances), the ratio of the sample
variances (F) should be close to 1. However, because of sampling errors
this ratio varies.
• The F-distribution is related to the Beta distribution. F itself
takes non-negative values with no upper limit; it is not restricted
to the interval between 0 and 1.
• F follows a continuous distribution with degrees of freedom
(df) for numerator (v1) = n1-1 and for denominator (v2) = n2-1.
• Statistics books capture the critical values of the F ratio in a
table for a few values of v1, v2 and α (the area under the tail
on the right hand side). This is denoted by Fα, v1, v2.
• The left hand side F value for the probability (1-α) is the inverse of the
corresponding F value for α with the degrees of freedom interchanged. So:
$$F_{(1-\alpha),\,v_1,\,v_2} = \frac{1}{F_{\alpha,\,v_2,\,v_1}}$$
• The F-test is sensitive to the assumption that the populations
are normally distributed.
Hypothesis Testing
Concerning Two Variances

Two machines produce metal sheets that are specified to be 22 mm thick. There is variability in the thickness because of several factors.
Operators are concerned about the consistency of the two machines. To test the consistency, a few sheets are randomly sampled and their
thickness is measured. The details are captured in the table below. Assume that the sheet thickness is normally distributed. Test the null
hypothesis that the sheets produced by these two machines have the same variability, with significance value 0.05.

Machine-1        Machine-2
22.30  21.90     22.00  21.70  22.00
21.80  22.40     22.10  21.90  22.10
22.30  22.50     21.80  22.00
21.60  22.20     21.90  22.10
21.80  21.60     22.20  21.90

Given that: α = 0.05
n1 = 10, n2 = 12, so v1 = 9 and v2 = 11
Null Hypothesis (H0): σ1² = σ2²
Alternative Hypothesis (HA): σ1² ≠ σ2²
For the two tailed test, α/2 = 0.025
F-critical value for the right-end tail (F0.025, 9, 11) = 3.59
F-critical value for the left-end tail = 1/F0.025, 11, 9 = 1/3.91 (approximately) = 0.26
From the provided data, the sample variances:
S1² = 0.11378 and S2² = 0.02023
$$F\text{-statistic} = \frac{S_1^2}{S_2^2} = \frac{0.11378}{0.02023} = 5.62$$
Since the F-statistic is in the rejection region on the right-end tail, the null hypothesis is rejected.
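The sample variances and the F-statistic can be recomputed from the table. The sketch uses only the standard library, so the right-end critical value 3.59 is copied from the example (an F-table value); the variable names are illustrative.

```python
from statistics import variance

machine_1 = [22.30, 21.90, 21.80, 22.40, 22.30,
             22.50, 21.60, 22.20, 21.80, 21.60]
machine_2 = [22.00, 21.70, 22.00, 22.10, 21.90, 22.10,
             21.80, 22.00, 21.90, 22.10, 22.20, 21.90]

s1_sq = variance(machine_1)   # sample variance with divisor n1 - 1
s2_sq = variance(machine_2)   # sample variance with divisor n2 - 1
f_statistic = s1_sq / s2_sq

v1, v2 = len(machine_1) - 1, len(machine_2) - 1  # numerator and denominator df
f_right_critical = 3.59       # table value F(0.025; 9, 11)
reject = f_statistic > f_right_critical          # right-end tail of the two-tailed test

print(v1, v2, round(s1_sq, 5), round(s2_sq, 5), round(f_statistic, 2), reject)
```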
Exercise
General Brite specializes in silver plating on metal surfaces. 12 samples each from two of its workers
are randomly collected, giving standard deviations of 0.035 mil and 0.062 mil for the thickness of
their plating (1 mil = 1/1000 inch). Using 0.05 significance value, test the null hypothesis σ1² = σ2²
against the alternative hypothesis σ1² < σ2². Given that F0.05, 11, 11 = 2.82.
(Answer: F-statistic = 0.3187 and F0.95, 11, 11 = 0.3546, Null Hypothesis rejected)
Analysis of Variance (ANOVA)
Introduction
• We have reviewed hypothesis testing in the cases of 1-population and 2-populations so far. There could be situations
where statisticians want to perform the hypothesis testing for 3 or more populations. Examples:
o Different qualities of tyres in the experiment, such as low-quality, medium-quality and high-quality, need to be compared.
o Sales of garments under different discounts.
o Returns of mutual funds under different categories like large, mid and small cap funds.
• The experiment begins with finalizing a few variables:
• Independent Variables (Factors): they could be treatment or classification variables. Examples:
o 0%, 10% or 20% discount on garments is a treatment.
o Low, medium or high quality tyres is a classification.
o Mutual fund categories.
• Dependent Variables: they are the response to the independent variables in an experiment. Examples:
o Amount of sale with different discounts.
o Wear and tear of different quality tyres.
o Returns from the mutual funds.
• The details of One-Way ANOVA will be reviewed in this module.
• The assumptions are that the response variable is normally distributed and that the population variances are the same.
[Diagram: Hypothesis Testing branches into 1-Population, 2-Populations and >= 3-Populations. The last branch covers One-Way ANOVA, Randomized Block Design and Two-Way ANOVA.]
Measures of Partitioning
[Figure: conceptual diagram of groups of points drawn as circles, each with its centre marked, and a red dot X outside the circles]
In the conceptual diagram, assume each circle (group) consists of several random points (not shown), with the
marked centre as the group average.
The red dot (X) outside the circles is the average of all the random points in these groups.
Sum of Squares of Total Variation (SST) = ∑ (Point − X)²
Sum of Squares of Between Group Variation (SSB) = ∑ (Count of group points) × (Group Average − X)²
Sum of Squares of Within Group Variation (SSW) = ∑ (Point − Its Group Average)²
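A short derivation shows why the total variation splits exactly into the between-group and within-group parts. The symbols are shorthand assumed for this sketch: $Y_{ij}$ is point $j$ in group $i$, $n_i$ is the number of points in group $i$, $\mu_i$ is the average of group $i$, and $\mu$ is the overall average (the red dot X).

```latex
\begin{aligned}
\mathrm{SST} = \sum_{i}\sum_{j} (Y_{ij}-\mu)^2
 &= \sum_{i}\sum_{j} \bigl[(Y_{ij}-\mu_i) + (\mu_i-\mu)\bigr]^2 \\
 &= \sum_{i}\sum_{j} (Y_{ij}-\mu_i)^2
  + \sum_{i} n_i(\mu_i-\mu)^2
  + 2\sum_{i} (\mu_i-\mu)\sum_{j}(Y_{ij}-\mu_i) \\
 &= \mathrm{SSW} + \mathrm{SSB} + 0
\end{aligned}
```

The cross term vanishes because the deviations of a group's points from their own average always sum to zero, $\sum_{j}(Y_{ij}-\mu_i) = 0$.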
Measures of ANOVA
$k$ = Number of groups
$n_i$ = Number of observations in group $i$
$n$ = Total number of observations
$Y_{ij}$ = Observation $j$ in group $i$
$\mu_i$ = Mean of group $i$
$\mu$ = Overall mean
Sum of Squares of Total Variation: $\mathrm{SST} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \mu)^2$
Degrees of Freedom for SST = $(n-1)$
Mean of Squares for Total Variation: $\mathrm{MST} = \dfrac{\mathrm{SST}}{n-1}$
Sum of Squares of Between Group Variation: $\mathrm{SSB} = \sum_{i=1}^{k} n_i (\mu_i - \mu)^2$
Degrees of Freedom for SSB = $(k-1)$
Mean of Squares for Between Variation: $\mathrm{MSB} = \dfrac{\mathrm{SSB}}{k-1}$
Sum of Squares of Within Group Variation: $\mathrm{SSW} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \mu_i)^2$
Degrees of Freedom for SSW = $(n-k)$
Mean of Squares for Within Variation: $\mathrm{MSW} = \dfrac{\mathrm{SSW}}{n-k}$
Worked example (each group holds 30 observations, laid out as three columns of ten rows):
 #    Group-1       Group-2       Group-3
 1    39  37  34    34  47  38    42  41  40
 2    32  34  25    41  44  35    43  47  50
 3    25  28  33    45  46  34    44  55  52
 4    25  36  26    39  38  34    46  55  43
 5    37  38  33    38  42  37    41  47  47
 6    28  38  26    33  33  39    52  48  55
 7    26  34  26    35  37  34    43  41  49
 8    26  31  27    41  45  34    42  42  46
 9    40  39  32    47  38  36    50  45  55
10    29  36  40    34  44  41    41  48  42
$k = 3$, $n = 90$, $\mu = 39.06$
$n_1 = 30$, $\mu_1 = 32$
$n_2 = 30$, $\mu_2 = 38.77$
$n_3 = 30$, $\mu_3 = 46.4$
$\mathrm{SST} = (39 - 39.06)^2 + \cdots + (42 - 39.06)^2 = 5170.72$
Degrees of Freedom for SST = $(n-1) = 89$
$\mathrm{MST} = \dfrac{5170.72}{89} = 58.10$
$\mathrm{SSB} = 30(32 - 39.06)^2 + 30(38.77 - 39.06)^2 + 30(46.4 - 39.06)^2 = 3114.16$
Degrees of Freedom for SSB = $(k-1) = 2$
$\mathrm{MSB} = \dfrac{3114.16}{3-1} = 1557.08$
$\mathrm{SSW} = (39 - 32)^2 + \cdots + (34 - 38.77)^2 + \cdots + (42 - 46.4)^2 = 2056.56$
Degrees of Freedom for SSW = $(n-k) = 90 - 3 = 87$
$\mathrm{MSW} = \dfrac{2056.56}{87} = 23.64$
$\mathrm{SST} = \mathrm{SSB} + \mathrm{SSW}$, here $5170.72 = 3114.16 + 2056.56$
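The figures on this slide can be reproduced directly from the table. The following is a minimal sketch in plain Python, with hypothetical variable names; it uses unrounded means, so the results should agree with the slide only to within rounding in the second decimal place.

```python
# One-way ANOVA sums of squares for the three-group data on this slide.
# Each list reads one group's three columns row by row (30 values per group).
group1 = [39, 37, 34, 32, 34, 25, 25, 28, 33, 25, 36, 26, 37, 38, 33,
          28, 38, 26, 26, 34, 26, 26, 31, 27, 40, 39, 32, 29, 36, 40]
group2 = [34, 47, 38, 41, 44, 35, 45, 46, 34, 39, 38, 34, 38, 42, 37,
          33, 33, 39, 35, 37, 34, 41, 45, 34, 47, 38, 36, 34, 44, 41]
group3 = [42, 41, 40, 43, 47, 50, 44, 55, 52, 46, 55, 43, 41, 47, 47,
          52, 48, 55, 43, 41, 49, 42, 42, 46, 50, 45, 55, 41, 48, 42]
groups = [group1, group2, group3]

k = len(groups)                                  # number of groups
n = sum(len(g) for g in groups)                  # total observations
grand_mean = sum(sum(g) for g in groups) / n     # overall mean
group_means = [sum(g) / len(g) for g in groups]  # mean of each group

# Total, between-group and within-group sums of squares
sst = sum((y - grand_mean) ** 2 for g in groups for y in g)
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)

# Mean squares: each sum of squares divided by its degrees of freedom
mst = sst / (n - 1)
msb = ssb / (k - 1)
msw = ssw / (n - k)

print(f"SST = {sst:.2f}, SSB = {ssb:.2f}, SSW = {ssw:.2f}")
print(f"MST = {mst:.2f}, MSB = {msb:.2f}, MSW = {msw:.2f}")
```

Computing SSW independently, rather than as SST − SSB, gives a built-in check on the partition identity.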
Cochran’s Theorem and F-Test
• From the previous slide, SST is partitioned into SSB and SSW.
• Cochran’s Theorem states that if:
o Y1, Y2, …, Yn are drawn independently from a normal distribution with mean µ and standard deviation σ, and
o the sum of squares of total variation (SST) is decomposed into m different sums of squares (SS), with DFi as
the degrees of freedom of the ith sum of squares SSi,
then the ratios SSi/σ² are independent chi-square (χ²) variables with DFi degrees of freedom, provided that
DF1 + DF2 + … + DFm = n − 1.
• In one-way ANOVA the decomposition is SST = SSB + SSW, and the degrees of freedom add up as required:
(k − 1) + (n − k) = n − 1, where k is the number of groups.
• In this context, the F-statistic is defined as:
F = MSB / MSW
• The F-critical value is read from the tables using the degrees of freedom of MSB (numerator) and MSW (denominator).
• The null hypothesis in ANOVA is that the means of the different groups are equal. ANOVA is a right-tail
hypothesis test, because the question is whether the variation between groups is greater than the variation
within groups.
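The link between Cochran’s Theorem and the F-tables can be sketched as follows, assuming the null hypothesis of equal group means holds and writing $k$ for the number of groups and $n$ for the total number of observations:

```latex
\begin{gathered}
\frac{\mathrm{SSB}}{\sigma^2} \sim \chi^2_{k-1}, \qquad
\frac{\mathrm{SSW}}{\sigma^2} \sim \chi^2_{n-k}
\quad \text{(independent)} \\[6pt]
F = \frac{(\mathrm{SSB}/\sigma^2)/(k-1)}{(\mathrm{SSW}/\sigma^2)/(n-k)}
  = \frac{\mathrm{MSB}}{\mathrm{MSW}} \sim F_{k-1,\; n-k}
\end{gathered}
```

The unknown $\sigma^2$ cancels in the ratio, so the F-statistic can be computed from the sample alone.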
Example
The three groups represent the product sale on randomly selected days under 0%, 10% and 20% discounts
respectively. Using a significance level of 0.05, test the hypothesis that the discount has a significant
impact on the average sale quantity.
 #    Group-1       Group-2       Group-3
 1    39  37  34    34  47  38    42  41  40
 2    32  34  25    41  44  35    43  47  50
 3    25  28  33    45  46  34    44  55  52
 4    25  36  26    39  38  34    46  55  43
 5    37  38  33    38  42  37    41  47  47
 6    28  38  26    33  33  39    52  48  55
 7    26  34  26    35  37  34    43  41  49
 8    26  31  27    41  45  34    42  42  46
 9    40  39  32    47  38  36    50  45  55
10    29  36  40    34  44  41    41  48  42
Null Hypothesis (H0): µ1 = µ2 = µ3
Alternative Hypothesis (HA): at least one group mean differs from the others (µ1, µ2 and µ3 are not all equal)
From the previous slide:
MSB = 1557.08, degrees of freedom = 2
MSW = 23.64, degrees of freedom = 87
F-statistic = MSB/MSW = 1557.08/23.64 = 65.87
F-critical value for 2 and 87 degrees of freedom at α = 0.05 is 3.101
Since F-statistic > F-critical value for the right-tail test, the null hypothesis is rejected. This means that the
discount level has a significant impact on the average sale quantity.
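The decision step can be written out as a small sketch that starts from the sums of squares on the Measures of ANOVA slide. The variable names are assumptions made for illustration, and the critical value is the table value quoted above rather than something computed here.

```python
# Decision step of the one-way ANOVA example, starting from summary values.
ssb, df_between = 3114.16, 2     # between-group sum of squares and its d.o.f.
ssw, df_within = 2056.56, 87     # within-group sum of squares and its d.o.f.
f_critical = 3.101               # table value for alpha = 0.05 and (2, 87) d.o.f.

msb = ssb / df_between           # mean square between groups
msw = ssw / df_within            # mean square within groups
f_stat = msb / msw

reject_h0 = f_stat > f_critical  # right-tail test
print(f"F = {f_stat:.2f}; reject H0: {reject_h0}")
```

If SciPy is available, `scipy.stats.f.ppf(0.95, 2, 87)` is one way to obtain the critical value without tables, and `scipy.stats.f_oneway` computes the F-statistic directly from the raw groups.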
Exercise
A company has three manufacturing plants. The following data shows the ages of five randomly selected workers
at each plant. Using a significance level of 0.01, perform a one-way ANOVA to determine whether there is a
significant difference in the mean ages of the worker populations at the three plants. Given that F0.01, 2, 12 = 6.93.
Plant-1 Plant-2 Plant-3
29 32 25
27 33 24
30 31 24
27 34 25
28 30 26
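(Answer: F-statistic ≈ 39.71, or 39.80 if MSW is rounded to 1.63 before dividing; since this exceeds 6.93, the null hypothesis is rejected and the mean ages differ)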
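One way to check this exercise is with a small general-purpose function. The sketch below is an illustration in plain Python; `one_way_anova` is a hypothetical helper name, not something defined in these slides. It works with unrounded values, so small differences from a hand calculation can arise from rounding the intermediate mean squares.

```python
def one_way_anova(groups):
    """Return (F-statistic, df_between, df_within) for a list of samples."""
    k = len(groups)                               # number of groups
    n = sum(len(g) for g in groups)               # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    msb = ssb / (k - 1)
    msw = ssw / (n - k)
    return msb / msw, k - 1, n - k


# Ages of five workers at each plant, taken from the exercise table
plants = [
    [29, 27, 30, 27, 28],   # Plant-1
    [32, 33, 31, 34, 30],   # Plant-2
    [25, 24, 24, 25, 26],   # Plant-3
]
f_critical = 6.93           # F0.01, 2, 12 as given in the exercise

f_stat, df1, df2 = one_way_anova(plants)
print(f"F = {f_stat:.2f} with ({df1}, {df2}) degrees of freedom")
print("Reject H0" if f_stat > f_critical else "Fail to reject H0")
```

The same function accepts groups of unequal size, since each group's own count is used when weighting the between-group term.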
Thank You
