4. Inferential Statistics
Revision-1.0
1. Unknowingly, the 50 best-performing stocks are kept at the top of the list of 150 companies of a stock
exchange, and 3 such stock exchanges’ lists are prepared. If a random sample of 3 companies is to be
created from the population of these lists, explain why systematic sampling may not be a good
technique.
2. An advertising company interested in determining the impact of TV advertisements conducts a survey
in a geographical area. The area has three towns A, B and C. Town A is built around a factory, and mostly
factory workers with young kids live there. Town B contains mainly old retired people, and town
C has mainly farmers. Assume all households have at least one TV set. There are 155 households in
town A, 62 in town B and 93 in town C. The company wants to pick a sample of 40 households.
Suggest a suitable random sampling technique for them.
Sampling Distribution
• There is a small finite population of N = 8 numbers as: 54, 55, 59, 63, 64, 68, 69 and 70 with µ as mean and σ as standard deviation.
[Figure: Population Histogram]
• For these samples, means are also calculated. The mean of these samples ($\bar{X}$) will be different every time. However, the variation in the means may not be high.
• Central Limit Theorem (CLT): if there is a population with mean μ and standard deviation σ and if several random samples from the population with replacement are taken, then the distribution of the sample means will be approximately normally distributed. This will be true regardless of the distribution of the population.
• The mean of the sample means ($\mu_{\bar{X}}$) and the standard deviation ($\sigma_{\bar{X}}$) are given by:
$\mu_{\bar{X}} = \mu$
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
• The mean of the sample means = 62.75 and standard deviation = 4.12.
[Figure: Histogram of Sample Means]
[Table: Samples with Means. Columns: x, y, Mean. It lists the samples of size 2 (x, y) drawn with replacement from the population together with their means, e.g. (54, 54) → 54.00, (54, 55) → 54.50, (55, 54) → 54.50, (55, 59) → 57.00, (70, 54) → 62.00, (70, 70) → 70.00.]
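The figures quoted above (62.75 and 4.12) can be checked by brute force. The sketch below assumes, based on the two-column x, y table, that the samples are all ordered pairs (n = 2) drawn with replacement; the variable names are illustrative only.

```python
from itertools import product
from statistics import mean, pstdev

population = [54, 55, 59, 63, 64, 68, 69, 70]
n = 2  # sample size, assumed from the x, y columns of the table

# Every ordered sample of size n drawn with replacement (8**2 = 64 samples)
sample_means = [mean(sample) for sample in product(population, repeat=n)]

mu = mean(population)        # population mean
sigma = pstdev(population)   # population standard deviation

mean_of_means = mean(sample_means)
sd_of_means = pstdev(sample_means)

print(len(sample_means))
print(mu, mean_of_means)                      # both equal the population mean
print(round(sd_of_means, 2), round(sigma / n ** 0.5, 2))  # both equal sigma / sqrt(n)
```

Because every possible sample is enumerated, the agreement with σ/√n is exact here rather than approximate.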
Central Limit Theorem: Features
• Central Limit Theorem holds for any type of population distribution
for large sample sizes (n >= 30).
• If the population itself is normally distributed then CLT holds even for
smaller values of n.
• The expression for standard deviation is true for infinite size of
population.
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
• If the population is finite (N) then a correction factor is applied to it
(normally not used because N will still be >> n):
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$
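A minimal sketch of the two expressions above. The function name `standard_error` and the numbers (σ = 10, n = 25, N = 100) are hypothetical and chosen only to show how the finite population correction shrinks the standard deviation of the sample mean.

```python
from math import sqrt

def standard_error(sigma, n, N=None):
    """Standard deviation of the sample mean.

    If a finite population size N is given, the finite population
    correction sqrt((N - n) / (N - 1)) is applied.
    """
    se = sigma / sqrt(n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))
    return se

se_infinite = standard_error(10, 25)       # no correction
se_finite = standard_error(10, 25, N=100)  # with correction

print(se_infinite, round(se_finite, 4))
```

When N is much larger than n the correction factor is close to 1, which is why it is normally ignored.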
z-Distribution Table
The t-distribution
• In the Central Limit Theorem (CLT), we reviewed that the means of samples follow a normal distribution for large sample sizes and
the theory of standard normal distribution can be applied for some analysis.
$\mu_{\bar{X}} = \mu, \quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}, \quad z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
• The issue arises when:
o Population is known to be normally distributed.
o It is not feasible to draw several random samples.
o It is not feasible to draw a large-sized sample.
o The population standard deviation (σ) and mean (µ) are not known.
o The estimation of population mean (µ) is to be done or the claim about population mean (µ) is to be ascertained.
• We have reviewed that even for small sample sizes, the mean of samples follow the normal distribution if these samples are
drawn from normally distributed population.
• In such situations, Student’s t-distribution can be used. Student’s t-distribution was developed by William S Gosset, an
English statistician who published it in 1908 under the pen name Student. The t-statistic can be calculated using sample mean ($\bar{x}$) and standard
deviation (s) as:
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$
(Question: µ is not known, so how will the t-statistic be calculated?)
• The assumption while calculating the t-statistic that the population is normally distributed is not so severe. The results are fairly
close for non-normally distributed populations also, provided they are not too skewed.
t-distribution: Characteristics
• Like normal distribution, t-distribution is also bell shaped and symmetrical. But they are flatter in
the middle and have more area in the tails.
• The mean of t-distribution is also zero (like standard normal distribution) but variance depends on
(n-1) which is called the Degree of Freedom (DF) and denoted by the Greek letter v (nu).
• The variance of t-distribution approaches 1 as n→∞ or, in other words, the t-distribution
approaches the standard normal distribution when n→∞.
• The term DF (v) refers to the number of independent observations for a source of variation minus
the number of independent parameters estimated in computing the variation. Here there are n
observations in the sample and one parameter (the mean of population) is being estimated. So v =
n-1.
• The value tα;v on the x-axis for which the area (tail area) under the t-curve with v DF to right is α is
called the t-critical value.
• In other words, the probability that some tv will be greater than tα;v is α for v DF. The Student’s t-
distribution table provides the values of tα;v for few α and v.
• In the table for a given v, when α decreases, the value of tα;v increases. That means moving away
from the mean towards right.
• In the table for a given α, when v increases, the value of tα;v gradually decreases.
• For v = ∞, the tα;v values are the corresponding z values for (1 - α) in the standard normal
distribution table. E.g. tα;v = 1.960 for v = ∞ and α = 0.025. The z-distribution table value for z =
1.960 is 0.9750 that is (1 - α).
Student’s t-distribution Table
Example-1
(Note: no inference is made; only table calculations are done in this example!)
t = -3.19
From the Student’s t-distribution table, we can observe that the t value -3.19 will
occur in the range of 0.005 > α > 0.001 for v = 19 for the left half of the t-curve.
Example-2
(Note: no inference is made; only table calculations are done in this example!)
The population mean of the heights of five-year old boys is claimed to be 100
cm. A teacher measures the height of her 25 students, obtaining a mean height
of 105 cm and standard deviation 18 cm.
Calculate the t-statistic. What is the range of α expected from this data?
Given that:
µ = 100
n = 25, v = 25 - 1 = 24, 𝑥 = 105 and s = 18
So
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{105 - 100}{18/\sqrt{25}} = 1.39$
From the Student’s t-distribution table, we can observe that the t value 1.39
will occur in the range of 0.1 > α > 0.05 for v = 24 for the right half of the t-
curve.
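The t-statistic of Example-2 can be reproduced with a few lines. This is only a sketch; the helper name `t_statistic` is illustrative, and the range of α still has to be read from the Student's t-distribution table.

```python
from math import sqrt

def t_statistic(x_bar, mu, s, n):
    """t = (sample mean - claimed population mean) / (s / sqrt(n))."""
    return (x_bar - mu) / (s / sqrt(n))

n = 25
t = t_statistic(x_bar=105, mu=100, s=18, n=n)
dof = n - 1  # degrees of freedom v

print(round(t, 2), dof)
```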
An optical firm purchases glasses to be ground for lenses and it is known from
past experience that the variance of the refractive index of this kind of
glass is 1.26 x 10^-4. The firm rejects a shipment if the sample variance of 20
pieces selected at random exceeds 2.0 x 10^-4. Calculate the value of χ² and the
range of α from it.
Given that:
σ² = 1.26 x 10^-4
n = 20 and s² = 2.0 x 10^-4
$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = \frac{(20-1) \times 2.0 \times 10^{-4}}{1.26 \times 10^{-4}} = 30.16$
For given v = 20 - 1 = 19, the range is 0.05 > α > 0.025.
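As a quick check of the arithmetic above, the χ² value can be computed directly. The function name is illustrative; the range of α is still taken from the chi-square table.

```python
def chi_square_statistic(s2, sigma2, n):
    """chi-square = (n - 1) * sample variance / population variance."""
    return (n - 1) * s2 / sigma2

chi2 = chi_square_statistic(s2=2.0e-4, sigma2=1.26e-4, n=20)
dof = 20 - 1

print(round(chi2, 2), dof)
```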
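n = 44, $\bar{x}$ = 10.50
σ = 7.70
after the mean that is (0.5000 - α/2) = (0.5000 - 0.05) = 0.4500
$z_{\alpha/2}$ = (1.64 + 1.65) / 2 = 1.645
(Tip: explore the Inverse Normal feature in the calculator.)
$\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
$10.50 - 1.645 \times \frac{7.70}{\sqrt{44}} \le \mu \le 10.50 + 1.645 \times \frac{7.70}{\sqrt{44}}$
$8.59 \le \mu \le 12.41$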
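The interval above can be reproduced without a z-table by using the inverse normal function of Python's standard library. The sketch assumes a 90% confidence level, inferred from α/2 = 0.05 in the working; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence):
    """Interval for the population mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z for upper-tail area alpha/2
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Confidence level of 0.90 is an assumption inferred from alpha/2 = 0.05
low, high = z_confidence_interval(x_bar=10.50, sigma=7.70, n=44, confidence=0.90)

print(round(low, 2), round(high, 2))
```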
Exercise
1. Construct an 80% confidence interval for the population mean given that: x = 56.7, σ = 12.1, N = 500 and n = 47.
(Answer: 54.4381 ≤ μ ≤ 58.9619)
2. A random sample of size 39 is taken from a population of 200 members. The sample mean is 66 and the
population standard deviation is 11. Construct a 96% confidence interval to estimate the population mean.
(Answer: 62.3825 ≤ μ ≤ 69.6175)
3. A community health association is interested in estimating the average number of maternity days women stay in
the local hospital. A random sample is taken of 36 women who had babies in the hospital during the past year.
The numbers below show the maternity days each woman in the sample was in the hospital. Use this data and a
population standard deviation of 1.17 to construct a 98% confidence interval to estimate the average maternity
stay in the hospital for all women who have babies in this hospital: 3, 3, 4, 3, 2, 5, 3, 1, 4, 3, 4, 2, 3, 5, 3, 2, 4, 3, 2,
4, 1, 6, 3, 4, 3, 3, 5, 2, 3, 2, 3, 5, 4, 3, 5, 4.
(Answer: 2.85 ≤ μ ≤ 3.76)
Interval Estimation of Population Mean
Using t-statistic
• We have reviewed the situations when t-statistic can be used to observe the distribution of the population mean
(μ) using sample mean ($\bar{x}$) and sample standard deviation (s).
• Correspondingly, the t-distribution can be utilized to estimate the interval of population mean for a given
confidence level.
• The method is similar to what is used in estimating the interval of population mean using z-statistic when
population standard deviation (σ) was known, except in this method the sample standard deviation (s) will be
used.
• Using the t-statistic for v Degree of Freedom (DF) and (1-α) confidence interval:
$\bar{x} - t_{\alpha/2} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2} \frac{s}{\sqrt{n}}$
• The relationship of the sample variance to the population variance is captured by the chi-square distribution.
The ratio of the sample variance (s²) multiplied by (n – 1) to the population variance (σ²) is chi-
square distributed when the samples are drawn from a normal population.
• s² cannot be negative and this sampling distribution is related to the Gamma Distribution with α = v/2 and β = 2,
where v is the Degree of Freedom (DF) and is equal to (n - 1), and n is the sample size.
• Chi-square is denoted by the square of the Greek letter chi (χ²) and is defined as:
$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$
• The above expression can be re-written for the population variance as:
$\sigma^2 = \frac{(n-1)s^2}{\chi^2}$
• Therefore, interval estimate for the population variance for v DF and (1 - α) confidence level can be written as:
$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-(\alpha/2)}}$
• The meaning of population variance interval estimate is discussed in the next slide.
Population Variance: Interval Estimation
• From the chi-square distribution, we reviewed that the variance and chi-square value are
inversely proportional.
$\sigma^2 = \frac{(n-1)s^2}{\chi^2}$
• So as the value of χ² progresses towards the right from the origin, the corresponding value of σ²
decreases.
• Chi-square distribution table is arranged in such a way that it tells the value of χ² for v DF and
area under the curve (α) to the right of χ².
• When it is required to estimate the population variance interval for the confidence level (1 - α),
the following areas (shaded) are excluded from the distribution curve:
i. α/2 to the right of $\chi^2_{\alpha/2}$ ($\chi^2_{\alpha/2}$ can be read from the table)
ii. α/2 to the left of $\chi^2_{1-(\alpha/2)}$ (how will $\chi^2_{1-(\alpha/2)}$ be read when the area α/2 is on the left?)
• Notice that for the point (ii) above the value of $\chi^2_{1-(\alpha/2)}$ will be equal to the χ² value for which the
area on the right is 1 – (α/2).
• Therefore, interval estimate for the population variance for v DF and (1 - α) confidence level can
be written as:
$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-(\alpha/2)}}$
Example-1
A company manufactures aluminium cylinders and when a random sample of 8
cylinders is taken, the variance of the diameters comes out as 0.0022125.
Estimate the interval for the diameter variance for the cylinders that are
produced by the company for the 90% confidence level.
Given that:
s² = 0.0022125
n = 8 so v = (n - 1) = 7
(1 – α) = 0.90 so α = 0.10 and α/2 = 0.05, 1 – (α/2) = 0.95
From the table:
$\chi^2_{\alpha/2} = \chi^2_{0.05} = 14.067$
$\chi^2_{1-(\alpha/2)} = \chi^2_{0.95} = 2.167$
$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-(\alpha/2)}}$
$\frac{7 \times 0.0022125}{14.067} \le \sigma^2 \le \frac{7 \times 0.0022125}{2.167}$
$0.001101 \le \sigma^2 \le 0.007147$
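The interval of Example-1 can be reproduced as below. The sketch takes the two chi-square critical values as inputs because they are read from the table; the function name `variance_interval` is illustrative.

```python
def variance_interval(s2, n, chi2_right, chi2_left):
    """Interval estimate for the population variance.

    chi2_right: table value with area alpha/2 to its right
    chi2_left:  table value with area 1 - alpha/2 to its right
    """
    lower = (n - 1) * s2 / chi2_right
    upper = (n - 1) * s2 / chi2_left
    return lower, upper

# Critical values for v = 7 taken from the example above
low, high = variance_interval(s2=0.0022125, n=8, chi2_right=14.067, chi2_left=2.167)

print(round(low, 6), round(high, 6))
```

Note that the larger critical value gives the lower limit, because the variance and the chi-square value are inversely proportional.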
Example-2
Refractive indices of 20 random pieces of glass selected from a large shipment
that is purchased from an optical firm have a variance of 1.20 x 10^-4. Construct
a 95% confidence interval for the population variance.
Given that:
s² = 1.20 x 10^-4
n = 20 so v = (20 - 1) = 19
(1 – α) = 0.95 so α = 0.05 and α/2 = 0.025, 1 – (α/2) = 0.975
From the table:
$\chi^2_{\alpha/2} = \chi^2_{0.025} = 32.852$
$\chi^2_{1-(\alpha/2)} = \chi^2_{0.975} = 8.906$
$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-(\alpha/2)}}$
$\frac{19 \times 1.20 \times 10^{-4}}{32.852} \le \sigma^2 \le \frac{19 \times 1.20 \times 10^{-4}}{8.906}$
$0.000069402 \le \sigma^2 \le 0.0002560$
Exercise
1. A random sample of 14 families produced the monthly household incomes that are shown below in the table.
Construct a 95% confidence interval for the population variance assuming household income is normally
distributed.
Monthly Income for 14 Sample Families (₹)
37,500 44,800 33,500 36,900 42,300 32,400 28,000
41,200 46,600 38,500 40,200 32,000 35,500 36,800
There are many problems in which, rather than estimate the value or interval of a parameter, we must
decide whether a statement concerning a parameter is True or False. Statistically speaking, we test a
hypothesis (claim) about a parameter. A few example statements are:
o Average time spent by women on social media is more than that spent by men.
o Children who drink health drinks (Horlicks etc.) grow taller.
o Average salary of people with a PhD in engineering is more than the salary of people with just a
master's degree.
Recap Example-1: A paint manufacturer claims that the mean drying time of their new paint is 20
minutes with standard deviation of 2.4 minutes. A researcher decided to reject the claim if a sample of
36 new paint boxes yield a mean that exceeds 20.75 minutes of drying time.
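$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{20.75 - 20}{2.4/\sqrt{36}} = 1.875$
From the z-distribution table the shaded area = 0.0304, which is the probability of rejecting the claim.
[Figure: standard normal curve with the "Reject the Claim" area shaded to the right of z = 1.875]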
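The shaded area in the recap example can be obtained without the table. A minimal sketch using the standard library's normal distribution; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 20, 2.4, 36   # claimed mean, standard deviation, sample size
cutoff = 20.75               # sample mean above which the claim is rejected

z = (cutoff - mu) / (sigma / sqrt(n))
# Area to the right of z under the standard normal curve
p_reject = 1 - NormalDist().cdf(z)

print(round(z, 3), round(p_reject, 4))
```

This area is the chance of rejecting the claim even though the claimed mean is correct.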
The reject or retain decision will depend on the direction of the deviation of the estimated statistic from
the hypothesis value (population parameter) that is being ascertained.
Scenario-1:
Hypothesis: In the US, the average annual salary of Machine Learning (ML) experts is more than $100,000.
Null Hypothesis (H0): µML <= 100,000
Alternative Hypothesis (HA): µML > 100,000
In this case, the alternative hypothesis is for more than the hypothesis value, so the rejection region will be on
the right side of the one tailed test.
[Figure: right-tail rejection region, Area = α]
Scenario-2:
Hypothesis: Average waiting time (w) at the Bangalore airport security check is less than 30 minutes.
Null Hypothesis (H0): µw >= 30
Alternative Hypothesis (HA): µw < 30
In this case, alternative hypothesis is for less than the hypothesis value, so rejection region will be on the
left side of the one tailed test.
[Figure: left-tail rejection region, Area = α]
Scenario-3:
Hypothesis: Average annual salaries of Engineering (E) and Business Management (M) graduates are
different.
Null Hypothesis (H0): µE = µM
Alternative Hypothesis (HA): µE ≠ µM
In this case, alternative hypothesis could be less than or more than the hypothesis value, so rejection
region will be on both sides of the two tailed test (observe the α/2 regions).
[Figure: two-tail rejection regions, Area = α/2 on each side]
Consolidated View
[Figure: areas α (or α/2) and the p-value shown relative to the z-critical value and the z-statistic value. Area only under the tail is shown.]
$z\text{-statistic} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{84 - 82}{11.03/\sqrt{100}} = 1.8132$
p-value for z-statistic = 0.0349 and total p-value (both sides) = 0.0698
Since p-value > significance value, the null hypothesis is retained.
[Figure: Two-Tail H0 Retention]
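The p-values quoted above can be reproduced as follows. The significance value is not visible in this fragment, so α = 0.05 is an assumption made only for this sketch.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu, sigma, n = 84, 82, 11.03, 100
alpha = 0.05  # assumed; the fragment above does not state it

z = (x_bar - mu) / (sigma / sqrt(n))
p_one_side = 1 - NormalDist().cdf(abs(z))  # area beyond |z| in one tail
p_two_sides = 2 * p_one_side               # total p-value for a two-tail test

decision = "retain H0" if p_two_sides > alpha else "reject H0"
print(round(z, 4), round(p_one_side, 4), round(p_two_sides, 4), decision)
```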
Example-4
An agricultural researcher believes the average farm size has now increased from the 1990 mean figure of 4.71
acres. To test this notion, he randomly sampled 23 farms across India and ascertained the size of each farm from
the county records. The data he gathered follows a mean of 5 acres with standard deviation of 0.47. Use a 5%
level of significance to test the hypothesis.
Given that:
µ = 4.71, s = 0.47, n = 23, v = 23 - 1 = 22 and $\bar{x}$ = 5
Significance value (α) = 0.05
This problem involves t-statistic because σ is unknown.
Null Hypothesis (H0): µ <= 4.71
Alternative Hypothesis (HA): µ > 4.71
Alternative hypothesis is more than the hypothesis value, so rejection region will be on the right side of the one-
tailed test.
Critical value for α = 0.05 and v (= 22) = 1.717
$t\text{-statistic} = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{5 - 4.71}{0.47/\sqrt{23}} = 2.96$
Since t-statistic > critical value (or, p-value < significance value), the null hypothesis is rejected.
[Figure: Right-Tail H0 Rejection, showing α and the p-value relative to the t-critical value and the t-statistic value]
Example-5
A heavy machine part needs to be 25 pounds in weight on average. The factory supervisor suspects that the new
parts being built no longer weigh 25 pounds on average and that the manufacturing process has gone out of control. To
verify this, 20 parts are sampled, which are found to have a mean weight of 25.51 pounds with a standard
deviation of 2.19 pounds. Using a significance value of 0.05, conduct a hypothesis test.
Given that:
µ = 25, s = 2.19, n = 20, v = 20 - 1 = 19 and $\bar{x}$ = 25.51
Significance value (α) = 0.05
This problem involves t-statistic because σ is unknown.
Null Hypothesis (H0): µ = 25
Alternative Hypothesis (HA): µ ≠ 25
Alternative hypothesis could be less than or more than the hypothesis value, so rejection region will be on both
the sides of the two tailed test with significance value (α/2) = 0.025
Critical value for α/2 = 0.025 and v (= 19) = 2.093
$t\text{-statistic} = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{25.51 - 25}{2.19/\sqrt{20}} = 1.0415$
Since t-statistic < critical value (or, p-value > significance value), the null hypothesis is retained.
[Figure: Two-Tail H0 Retention, showing α/2 and p-value/2 in each tail relative to the t-critical values and the t-statistic values]
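The decision rule of Example-5 can be sketched as below. The critical value is passed in because it comes from the t-table; the function name `two_tailed_t_test` is illustrative.

```python
from math import sqrt

def two_tailed_t_test(x_bar, mu0, s, n, t_critical):
    """Return the t-statistic and whether H0 is rejected (two-tail test).

    t_critical is the table value for alpha/2 and v = n - 1.
    """
    t = (x_bar - mu0) / (s / sqrt(n))
    reject_h0 = abs(t) > t_critical
    return t, reject_h0

t, reject_h0 = two_tailed_t_test(x_bar=25.51, mu0=25, s=2.19, n=20, t_critical=2.093)

print(round(t, 4), "reject H0" if reject_h0 else "retain H0")
```

Comparing |t| with the critical value covers both tails at once, since the t-distribution is symmetrical.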
Recap: How Hypothesis Formulation Works?
(Area only under the tail is shown.)
Two-Tail Test
χ²-statistic >= Critical Value ($\chi^2_{1-\alpha/2}$) AND χ²-statistic <= Critical Value ($\chi^2_{\alpha/2}$)    p-value >= α    Retain H0
Example-1
A company’s goal is to minimize the number of machine parts that are piled up at the workstation waiting to be installed. The company
expects that, on average, about 20 parts will be at the station. However, the production superintendent suspects that the variance
has increased and is now greater than 4 at the workstation. On a given day, the number of machine parts piled up at the workstation is
determined eight different times and the following numbers of parts are recorded: 23, 17, 20, 29, 21, 14, 19 and 24. Using a significance
value of 5%, test the hypothesis.
Given that:
σ² = 4, n = 8, v = 8 - 1 = 7
From the given data, the sample variance (s²) can be calculated as 20.9821
Significance value (α) = 0.05
This problem involves χ²-statistic.
Null Hypothesis (H0): σ² <= 4
Alternative Hypothesis (HA): σ² > 4
Alternative hypothesis is more than the hypothesis value, so rejection region will be on the right side of the one-tailed test.
Critical value for α = 0.05 and v (= 7) = 14.067
$\chi^2\text{-statistic} = \frac{(n-1)s^2}{\sigma^2} = \frac{7 \times 20.9821}{4} = 36.719$
Since χ²-statistic > critical value (or, p-value < significance value), the null hypothesis is rejected.
[Figure: Right-Tail H0 Rejection, showing α and the p-value relative to the χ²-critical value and the χ²-statistic value]
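The sample variance and the χ²-statistic of this example can be computed from the raw counts. A minimal sketch; the critical value is the table value quoted above, and the variable names are illustrative.

```python
from statistics import variance

parts = [23, 17, 20, 29, 21, 14, 19, 24]
sigma2_h0 = 4          # variance under the null hypothesis
chi2_critical = 14.067 # table value for alpha = 0.05, v = 7

n = len(parts)
s2 = variance(parts)                 # sample variance (divides by n - 1)
chi2 = (n - 1) * s2 / sigma2_h0

reject_h0 = chi2 > chi2_critical     # right-tail test
print(round(s2, 4), round(chi2, 2), reject_h0)
```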
Example-2
A small business has few employees. Because of the uncertain demand for its product, the company usually pays overtime on any given
week. The company assumed that about 50 total hours of overtime per week is required and that the variance on this figure is about 25.
Company officials want to know whether the variance of overtime hours has changed. A sample of 16 weeks of overtime data (in hours
per week) gave a variance of 28.06. Use this data to test the null hypothesis that the variance of overtime data is 25. Significance value is
0.10.
Given that:
σ² = 25, n = 16, v = 16 - 1 = 15
From the given data sample variance (s²) = 28.06
Significance value (α) = 0.10
This problem involves χ²-statistic.
Null Hypothesis (H0): σ² = 25
Alternative Hypothesis (HA): σ² ≠ 25
Alternative hypothesis could be less than or more than the hypothesis value, so rejection region will be on both sides of the two tailed
test with significance value (α/2) = 0.05
Critical value for α/2 = 0.05 and v (= 15) = 24.996 (right end tail)
Critical value for 1-(α/2) = 0.95 and v (= 15) = 7.261 (left end tail)
$\chi^2\text{-statistic} = \frac{(n-1)s^2}{\sigma^2} = \frac{15 \times 28.06}{25} = 16.836$
(Remember: the χ² distribution starts from the origin.)
Since Critical Value ($\chi^2_{1-\alpha/2}$) < χ²-statistic < Critical Value ($\chi^2_{\alpha/2}$), the statistic is not in the rejection region,
so the null hypothesis is retained.
Exercise
1. Calculate the p-value to reach a statistical conclusion: Null Hypothesis (H0): µ = 25, Alternative
Hypothesis (HA): µ ≠ 25, $\bar{x}$ = 28.1, n = 57, σ = 8.46 and α = 0.01.
(Answer: p-value = 0.0028 x 2)
2. A random sample of size 20 is taken, resulting in a sample mean of 16.45 and a sample standard
deviation of 3.59. Assume x is normally distributed and α = 0.05. Test the following hypotheses: Null
Hypothesis (H0): µ = 16, Alternative Hypothesis (HA): µ ≠ 16.
(Answer t−statistic = 0.56, Retained)
3. Test the hypothesis when Null Hypothesis (H0): σ² = 20, Alternative Hypothesis (HA): σ² > 20 given that α
= 0.05, n = 15, s² = 32.
(Answer: χ²-statistic = 22.4, Retained)
Type-I and Type-II Errors
• Type-I Error: Conditional probability of rejecting a null hypothesis when it is true is called Type-I Error. It is same as
significance value and denoted by α.
• Type-II Error: Conditional probability of retaining a null hypothesis when null hypothesis is false is called Type-II Error. It is
denoted by β.
• Power of Hypothesis Test: Conditional probability of rejecting a null hypothesis when it is false is called the power of the
hypothesis test. It is equal to 1 – β.
              Retain               Reject
H0 is True    Correct Decision     Type-I Error (α)
H0 is False   Type-II Error (β)    Correct Decision
Challenges in Finding Out the Type-II Error (Example): As per the data, the average IQ is 82 with standard deviation 11.03 for
the citizens of a country. The Ministry does not agree with this data and believes that IQ is now more than 82.
Null Hypothesis (H0): µ <= 82
Alternative Hypothesis (HA): µ > 82
• Type-II error will occur when null hypothesis is false (alternative hypothesis is true) and it is retained.
• So if the null hypothesis is false, what is the actual value of the mean? It is > 82, but it could be 82.5, 83,
83.5 or anything else.
• So, Type-II Error is calculated w.r.t. an assumed true mean value.
Example-1
A research firm claimed that the average IQ is 82 with standard deviation 11.03 for the citizens of a country.
The Ministry does not completely agree with the research firm and believes that IQ is more than 82, at around
86. If the significance value is 0.05 and the sample count is 100, calculate the Type-II error.
Null Hypothesis (H0): µ <= 82
Alternative Hypothesis (HA): µ > 82
Ministry believes that IQ is more than 82, so it is a right tail test.
For α = 0.05, the z critical value = 1.645
$x\text{-critical value} = \mu + z \frac{\sigma}{\sqrt{n}} = 82 + 1.645 \times \frac{11.03}{\sqrt{100}} = 83.81$
The z-value for 83.81 w.r.t. the assumed true mean $= \frac{83.81 - 86}{11.03/\sqrt{100}} = -1.98549$
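The working above stops at the z-value. The sketch below carries the calculation one step further as an illustration: under the assumed true mean of 86, the Type-II error β is the area to the left of the x-critical value. This continuation and the variable names are not taken from the source.

```python
from math import sqrt
from statistics import NormalDist

mu0, mu_true = 82, 86        # hypothesised mean and assumed true mean
sigma, n, alpha = 11.03, 100, 0.05

se = sigma / sqrt(n)                       # standard deviation of sample mean
z_critical = NormalDist().inv_cdf(1 - alpha)
x_critical = mu0 + z_critical * se         # boundary of the right-tail rejection region

# Type-II error: H0 is retained (sample mean below x_critical)
# although the true mean is mu_true.
beta = NormalDist(mu_true, se).cdf(x_critical)
power = 1 - beta

print(round(x_critical, 2), round(beta, 4), round(power, 4))
```

Changing `mu_true` shows why β can only be stated with respect to an assumed true mean: the closer it is to 82, the larger β becomes.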
$E(a x + b) = \int_{-\infty}^{\infty} (a x + b)\, f(x)\, dx$
$= a \int_{-\infty}^{\infty} x\, f(x)\, dx + b \int_{-\infty}^{\infty} f(x)\, dx$
$= a\, E(x) + b$
$= a \mu_x + b$
• We have also reviewed that the second order moment about the mean is the
variance. So the variance of g(x):
$Var(a x + b) = \int_{-\infty}^{\infty} \left[(a x + b) - (a \mu_x + b)\right]^2 f(x)\, dx$
$= a^2 \int_{-\infty}^{\infty} (x - \mu_x)^2 f(x)\, dx$
$= a^2\, Var(x)$
For Y = a1 x1 + a2 x2:
$E(Y) = a_1 E(x_1) + a_2 E(x_2) = a_1 \mu_1 + a_2 \mu_2$
• The variance of Y can be expressed as:
$Var(Y) = E\left[(Y - \mu_Y)^2\right]$
$= E\left[(a_1 x_1 + a_2 x_2 - a_1 \mu_1 - a_2 \mu_2)^2\right]$
$= E\left[\left(a_1 (x_1 - \mu_1) + a_2 (x_2 - \mu_2)\right)^2\right]$
$= E\left[a_1^2 (x_1 - \mu_1)^2 + a_2^2 (x_2 - \mu_2)^2 + 2 a_1 a_2 (x_1 - \mu_1)(x_2 - \mu_2)\right]$
$= a_1^2 E\left[(x_1 - \mu_1)^2\right] + a_2^2 E\left[(x_2 - \mu_2)^2\right] + 2 a_1 a_2 E\left[(x_1 - \mu_1)(x_2 - \mu_2)\right]$
$= a_1^2 Var(x_1) + a_2^2 Var(x_2) + 2 a_1 a_2 E\left[(x_1 - \mu_1)(x_2 - \mu_2)\right]$
• If x1 and x2 are independent their covariance will be 0.
$E\left[(x_1 - \mu_1)(x_2 - \mu_2)\right] = 0$
• So:
$Var(Y) = a_1^2 Var(x_1) + a_2^2 Var(x_2)$
Example
Example-1: Given that x1 and x2 are independent, let x1 have mean 4 and variance 9 and let x2 have mean -2 and variance 6.
Find the mean and variance of (2x1 + x2 - 5).
We have reviewed that:
$Mean(a x + b) = a \mu_x + b$
$Var(a x + b) = a^2\, Var(x)$
If Y = a1 x1 + a2 x2, where x1 and x2 are independent and a1 and a2 are two constants, then:
$Mean(Y) = a_1 \mu_1 + a_2 \mu_2$
$Var(Y) = a_1^2 Var(x_1) + a_2^2 Var(x_2)$
Example-2: Given that x1 and x2 are independent, let x1 have mean μ1 and variance σ1² and let x2 have mean μ2 and variance
σ2². Find the mean and variance of (x1 - x2) and (x1 + x2).
Comparing (x1 - x2) with a1 x1 + a2 x2: a1 = 1 and a2 = −1. So:
$Mean(x_1 - x_2) = \mu_1 - \mu_2$
$Var(x_1 - x_2) = Var(x_1) + Var(x_2) = \sigma_1^2 + \sigma_2^2$
Comparing (x1 + x2) with a1 x1 + a2 x2: a1 = 1 and a2 = 1. So:
$Mean(x_1 + x_2) = \mu_1 + \mu_2$
$Var(x_1 + x_2) = Var(x_1) + Var(x_2) = \sigma_1^2 + \sigma_2^2$
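The two rules can be combined into one small helper and applied to Example-1. This is a sketch that assumes the variables are independent; the function name `linear_combination` is illustrative.

```python
def linear_combination(coeffs, means, variances, constant=0):
    """Mean and variance of a1*x1 + a2*x2 + ... + constant
    for independent variables x1, x2, ...
    """
    mean = sum(a * m for a, m in zip(coeffs, means)) + constant
    # The constant shifts the mean but does not affect the variance
    var = sum(a * a * v for a, v in zip(coeffs, variances))
    return mean, var

# Example-1: 2*x1 + x2 - 5 with mean(x1) = 4, var(x1) = 9, mean(x2) = -2, var(x2) = 6
mean_y, var_y = linear_combination([2, 1], [4, -2], [9, 6], constant=-5)

# Example-2 pattern: the variance is the same for a difference and a sum
_, var_diff = linear_combination([1, -1], [4, -2], [9, 6])
_, var_sum = linear_combination([1, 1], [4, -2], [9, 6])

print(mean_y, var_y, var_diff, var_sum)
```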
Interval Estimation & Hypothesis Testing
Using z-Statistic
• Let us say two samples (with means $\bar{x}_1$ and $\bar{x}_2$) of sizes n1 and n2 are drawn from two populations (with means μ1 and μ2
and standard deviations σ1 and σ2).
• The central limit theorem states that the difference in two sample means is normally distributed for large
sample sizes (both n1 and n2 >= 30) regardless of the shape of the populations with the following
properties:
$\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$
$z = \frac{(190 - 198) - 0}{\sqrt{\frac{18.50^2}{51} + \frac{15.60^2}{47}}} = -2.3202$
Since z-statistic >= z-critical value for the left-end tail test, the null hypothesis is retained and the auditor’s belief is tested as incorrect.
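The z-statistic for the difference of two sample means can be checked as below. The pairing of each standard deviation with its sample size (18.50 with 51, 15.60 with 47) is read from the flattened formula above and is an assumption; it reproduces the quoted value.

```python
from math import sqrt

def two_sample_z(x1, x2, sd1, sd2, n1, n2, hypothesised_diff=0):
    """z-statistic for the difference of two sample means (large samples)."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return ((x1 - x2) - hypothesised_diff) / standard_error

z = two_sample_z(x1=190, x2=198, sd1=18.50, sd2=15.60, n1=51, n2=47)

print(round(z, 4))
```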
Interval Estimation & Hypothesis Testing
Using t-Statistic
• The pooled variance of the two samples (x1 and x2) can be given as:
$S_p^2 = \frac{\sum_{i=1}^{n_1} (x_{i1} - \bar{x}_1)^2 + \sum_{i=1}^{n_2} (x_{i2} - \bar{x}_2)^2}{(n_1 - 1) + (n_2 - 1)}$
• If the individual variances of the samples are $S_1^2$ and $S_2^2$ respectively, the above expression can be re-written as:
$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$
• From the Central Limit Theorem of two populations, the standard deviation for the difference of sample means:
$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
• If the standard deviations of two populations are not known, the above expression can be used to estimate the
standard deviation of the difference of the two sample means:
$S_{\bar{x}_1 - \bar{x}_2} = S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
• The t-statistic is calculated by:
$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
Example
Waste industrial material is crushed and used for constructing roadways. This is not only
environmentally friendly but also provides better strength and durability of the roads. Six
samples each from two locations of waste collection centres are picked up. Their resilience
modulus of strength values are listed below. Use the 0.05 significance value and test the null
hypothesis that the mean strength of the waste is the same at the two locations.
Resilience Modulus
Location-1 707 632 604 652 669 674
Location-2 552 554 484 630 648 610
Given that, n1 = n2 = 6
Mean ($\bar{x}_1$) of location-1 = 656.33, variance ($S_1^2$) = 1277.87
Mean ($\bar{x}_2$) of location-2 = 579.67, variance ($S_2^2$) = 3739.87
Significance value (α) = 0.05, Degree of Freedom (v) = 6+6-2 = 10
Null Hypothesis (H0): μ1 = μ2 or μ1 − μ2 = 0
Alternative Hypothesis (HA): μ1 ≠ μ2
The t-critical value for α/2 = 0.025 for two-tailed test for the given v = 2.228
$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} = \frac{(6 - 1) \times 1277.87 + (6 - 1) \times 3739.87}{6 + 6 - 2} = 2508.87$, so $S_p = 50.09$
$t\text{-statistic} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{(656.33 - 579.67) - 0}{50.09 \sqrt{\frac{1}{6} + \frac{1}{6}}} = 2.65$
Since t-statistic > t-critical value for the right-end tailed test, the null hypothesis is rejected. It
also means that the strength of the waste is different at the two locations.
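The pooled variance and t-statistic of this example can be computed directly from the two rows of data. A minimal sketch; the function name `pooled_t` is illustrative and the critical value is the table value quoted above.

```python
from math import sqrt
from statistics import mean, variance

location_1 = [707, 632, 604, 652, 669, 674]
location_2 = [552, 554, 484, 630, 648, 610]

def pooled_t(a, b, hypothesised_diff=0):
    """Pooled variance and t-statistic for two independent samples."""
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    se = sqrt(sp2) * sqrt(1 / n1 + 1 / n2)
    t = ((mean(a) - mean(b)) - hypothesised_diff) / se
    return sp2, t

sp2, t = pooled_t(location_1, location_2)
t_critical = 2.228  # table value for alpha/2 = 0.025, v = 10

print(round(sp2, 2), round(t, 2), abs(t) > t_critical)
```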
Example
A coffee manufacturer is interested in estimating the difference in the average daily coffee consumption of regular-
coffee drinkers and decaffeinated-coffee drinkers. Its researchers sampled 13 and 15 coffee drinkers in each category
respectively and collected the data for the number of cups of coffee they drink. The average for the regular-coffee
drinkers is 4.35 cups, with a standard deviation of 1.20 cups. The average for the decaffeinated-coffee drinkers is 6.84
cups, with a standard deviation of 1.42 cups. Assuming that the daily consumption is normally distributed, construct a
95% confidence interval to estimate the difference in the averages of the two populations.
Given that:
S1 = 1.20, S2 = 1.42, n1 = 13, n2 = 15, $\bar{x}_1$ = 4.35 and $\bar{x}_2$ = 6.84
The Degree of Freedom (v) = 13+15-2 = 26
The t-value for α/2 = (1-0.95)/2 = 0.025 and the given degree of freedom = 2.056
$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} = \frac{(13 - 1) \times 1.20^2 + (15 - 1) \times 1.42^2}{13 + 15 - 2} = 1.75$, so $S_p = 1.32$
$(\bar{x}_1 - \bar{x}_2) - t\, S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le (\mu_1 - \mu_2) \le (\bar{x}_1 - \bar{x}_2) + t\, S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
$(4.35 - 6.84) - 2.056 \times 1.32 \sqrt{\frac{1}{13} + \frac{1}{15}} \le (\mu_1 - \mu_2) \le (4.35 - 6.84) + 2.056 \times 1.32 \sqrt{\frac{1}{13} + \frac{1}{15}}$
$-3.52 \le (\mu_1 - \mu_2) \le -1.46$
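The interval for the coffee example can be reproduced from the summary statistics. This sketch takes the t-value from the table as an input; the function name `pooled_diff_interval` is illustrative.

```python
from math import sqrt

def pooled_diff_interval(x1, x2, s1, s2, n1, n2, t_value):
    """Confidence interval for (mu1 - mu2) using the pooled standard deviation."""
    sp = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    margin = t_value * sp * sqrt(1 / n1 + 1 / n2)
    diff = x1 - x2
    return diff - margin, diff + margin

# t_value = 2.056 is the table value for alpha/2 = 0.025 and v = 26
low, high = pooled_diff_interval(4.35, 6.84, 1.20, 1.42, 13, 15, t_value=2.056)

print(round(low, 2), round(high, 2))
```

Both limits are negative, so zero is not inside the interval.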
Matched Pair Comparison
• There are many situations where before-and-after condition of the data is to be compared. E.g.:
o Ceramic components of a mechanical part lose weight after the baking process.
o Price to Earning Ratio (P/E) of selected 10 companies in the two successive years.
o Market price of the flats in an apartment complex before and during Covid-19 pandemic.
o Weekly loss of working hours in a factory before and after taking the stringent safety measures.
o Body Mass Index (BMI) before and after following a strict diet and exercise regime for 3 months.
• This comparison of two populations is popularly known as Matched Pair Comparison or correlated t-tests.
• t-statistic for this test is given by:
$t\text{-statistic} = \frac{\bar{d} - D}{S_d/\sqrt{n}}$
Where:
$\bar{d}$ = mean sample difference
D = mean population difference
$S_d$ = standard deviation of the sample difference
n = number of pairs with degree of freedom = n-1
• Null hypothesis is tested for mean population difference D = 0.
• The hypothesis D = 0 indicates that the means of the two responses are the same.
Example
A stock market investor thinks that there is a significant difference in the P/E (price to earnings) ratio for 9
companies from one year to the next. Test the hypothesis taking the significance value as 0.01 and the data
below.
$t\text{-statistic} = \frac{\bar{d} - D}{S_d/\sqrt{n}}$
Given that:
Significance value (α) = 0.01, Degree of Freedom (v) = 9-1 = 8
Company   Year-1 P/E Ratio   Year-2 P/E Ratio   d
1         8.9                12.7               -3.80
2         38.1               45.4               -7.30
3         43                 10                 33.00
4         34                 27.2               6.80
5         34.5               22.8               11.70
6         15.2               24.1               -8.90
7         20.3               32.3               -12.00
8         19.9               40.1               -20.20
9         61.9               106.5              -44.60
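The worked solution for this example is not visible above, so the following is only a sketch of how the matched pair t-statistic could be computed from the table, with D = 0 under the null hypothesis. The conclusion still requires the t-critical value for α/2 = 0.005 and v = 8 from the table.

```python
from math import sqrt
from statistics import mean, stdev

year_1 = [8.9, 38.1, 43, 34, 34.5, 15.2, 20.3, 19.9, 61.9]
year_2 = [12.7, 45.4, 10, 27.2, 22.8, 24.1, 32.3, 40.1, 106.5]

# Differences d = Year-1 minus Year-2, as in the last column of the table
d = [a - b for a, b in zip(year_1, year_2)]

n = len(d)
d_bar = mean(d)    # mean sample difference
s_d = stdev(d)     # standard deviation of the sample differences
D = 0              # mean population difference under H0

t = (d_bar - D) / (s_d / sqrt(n))

print(round(d_bar, 2), round(s_d, 2), round(t, 2))
```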
2. Construct a 98% confidence interval to estimate the difference in population means using the above data.
(Answer: 4.04 ≤ (μ1 − μ2 ) ≤ 10.02)
3. A sample-1 of size 8 has mean 24.56 and standard deviation 12.40. Another sample-2 of size 11 has mean 26.42
and standard deviation 15.80. Using 1% significance value, test the alternative hypothesis that µ1- µ2 < 0.
(Answer: t − statistic = −1.05, null hypothesis is retained or alternative hypothesis is rejected)
4. The following are the average weekly losses of worker-hours due to accidents in 10 industrial plants before and
after a certain safety program was put into operation. Using the significance value as 0.05, test the null
hypothesis that safety program is not effective.
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11
(Answer: t − statistic = 4.03, null hypothesis is rejected)
F-distribution
• If $S_1^2$ and $S_2^2$ are the variances of independent random samples
of size n1 and n2 respectively taken from two normally
distributed populations having the same variance then:
$F = \frac{S_1^2}{S_2^2}$
• If two samples come from the same population (or
populations with equal variances), the ratio of the sample
variances (F) should be close to 1. However, because of sampling errors
this ratio varies.
• The distribution of F is related to the Beta-distribution: F itself
can take any non-negative value, while the transformed variable
v1 F / (v1 F + v2) follows a Beta-distribution and takes on a value
between 0 and 1 (v1 and v2 are the degrees of freedom defined below).
• F follows a continuous distribution with degrees of freedom
(df) for numerator (v1) = n1-1 and for denominator (v2) = n2-1.
• Statistics books capture the critical values of the F ratio in a
table for a few values of v1 and v2 and α (the area under the tail
on the right hand side). This is denoted by $F_{\alpha, v_1, v_2}$.
• Left hand side F value for the probability (1-α) is the inverse of the
corresponding F value for α with the degrees of freedom interchanged. So:
$F_{(1-\alpha), v_1, v_2} = \frac{1}{F_{\alpha, v_2, v_1}}$
• F-distribution is sensitive to the assumption that populations
are normally distributed.
Hypothesis Testing
Concerning Two Variances
Two machines produce metal sheets that are specified to be 22 mm thick. There is variability in the thickness because of several factors.
Operators are concerned about the consistency of the two machines. To test the consistency, a few sheets are randomly sampled and their
thickness is measured. The details are captured in the table below. Assume that the sheet thickness is normally distributed. Test the null
hypothesis that the sheets that are produced by these two machines have the same variability with significance value as 0.05.
Given that: α = 0.05
n1 = 10, n2 = 12, so v1 = 9 and v2 = 11
Null Hypothesis (H0): σ1² = σ2²
Alternative Hypothesis (HA): σ1² ≠ σ2²
Machine-1        Machine-2
22.30  21.90     22.00  21.70  22.00
21.80  22.40     22.10  21.90  22.10
22.30  22.50     21.80  22.00
21.60  22.20     21.90  22.10
21.80  21.60     22.20  21.90
Since F-statistic is in the rejection region on the right-end tail, the null hypothesis is rejected.
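The F-statistic itself is not visible in the example above, so the sketch below shows one way it could be computed. The split of the flattened table into 10 values for Machine-1 and 12 values for Machine-2 is an assumption based on n1 = 10 and n2 = 12; the critical value must still be read from an F table.

```python
from statistics import variance

# Assumed split of the table: first two columns are Machine-1 (10 values),
# the remaining columns are Machine-2 (12 values).
machine_1 = [22.30, 21.80, 22.30, 21.60, 21.80,
             21.90, 22.40, 22.50, 22.20, 21.60]
machine_2 = [22.00, 22.10, 21.80, 21.90, 22.20,
             21.70, 21.90, 22.00, 22.10, 21.90,
             22.00, 22.10]

s1_sq = variance(machine_1)  # sample variance of Machine-1
s2_sq = variance(machine_2)  # sample variance of Machine-2

f_statistic = s1_sq / s2_sq
v1, v2 = len(machine_1) - 1, len(machine_2) - 1  # numerator and denominator DF

print(round(s1_sq, 5), round(s2_sq, 5), round(f_statistic, 2), v1, v2)
```

A ratio this far above 1 is in line with the rejection on the right-end tail stated in the example.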
Exercise
General Brite specializes in silver plating on metal surfaces. 12 samples each from two of its workers
are randomly collected, which gave standard deviations of 0.035 mil and 0.062 mil for the thickness of
their plating (1 mil = 1/1000 inch). Using 0.05 significance value, test the null hypothesis σ1² = σ2²
against the alternative hypothesis σ1² < σ2². Given that F0.05, 11, 11 = 2.82.
(Answer: F-statistic = 0.3187 and F0.95, 11, 11 = 0.3546, Null Hypothesis rejected)
Analysis of Variance (ANOVA)
Introduction
• We have reviewed hypothesis testing in the cases of 1-population and 2-populations so far. There could be situations
where statisticians want to perform the hypothesis testing for 3 or more populations. Examples:
o Different quality of tyres in the experiment such as low-quality, medium-quality, and high-quality need to be compared.
o Sales of garments under different discount.
o Returns of mutual funds under different categories like large, mid and small cap funds.
• The experiment begins with finalizing a few variables:
• Independent Variables (Factors): they could be treatment or classification variables. Examples:
o 0%, 10% or 20% discount on garments is treatment.
o Low, medium or high quality tyres is classification.
o Mutual fund categories.
• Dependent Variables: they are the response to the independent variables in an experiment. Examples:
o Amount of sale with different discounts.
o Wear and tear of different quality tyres.
o Returns from the mutual funds.
• The details of One-Way ANOVA will be reviewed in this module.
• The assumptions are that the response variable will be normally distributed and the population variance
will be the same.
[Figure: Hypothesis Testing branches into 1-Population, 2-Populations and >= 3-Populations; the last branch covers One-Way ANOVA, Randomized Block Design and Two-Way ANOVA]
Measures of Partitioning
[Figure: conceptual diagram of several circle groups, with X marking the overall average]
In the conceptual diagram above, assume each circle group consists of several random points (not shown), with
the marked centre as their group average.
The red dot (X) outside the circles is the average of all random points in these groups.