Lecture 04

This lecture covers the estimation of population variance and proportion in data science. It explains point estimation methods for population variance and standard deviation, including confidence intervals for both parameters. Additionally, it discusses the estimation of population proportion and provides examples to illustrate these concepts.

Uploaded by

kht07144

SEHH2311

FOUNDATIONS OF DATA SCIENCE


LECTURE 5
Estimation of Population Variance
Estimation of Population Proportion
Topics
Estimation of population parameter
• Population variance
• Population proportion

SEHH2311 Foundations of Data Science Page 2


Point Estimation of Population Variance $\sigma^2$ and $\sigma$

Suppose $x_1, x_2, \dots, x_n$ are the observed values of an iid random sample from a population with unknown mean $\mu$ and variance $\sigma^2$. The population variance $\sigma^2$ can be estimated by the sample variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$

The sample variance is an unbiased estimator of the population variance. Note that the unbiasedness of the sample variance does not require the assumption of normality of the data.

Similarly, the population standard deviation can be estimated by

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}}$$

However, this estimator of $\sigma$ is biased!
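
The two forms of the sample variance, and the claims about bias, can be checked numerically. The sketch below is illustrative only: the data list, the function names, and the choice of an exponential population (which has $\sigma^2 = 1$ and is clearly not normal) are assumptions made for this example, not part of the lecture.

```python
import math
import random

def sample_variance(xs):
    """Sample variance in its definition form (divisor n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Same quantity via the computational form: (sum x_i^2 - n * mean^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)

def simulate_bias(n=5, reps=20000, seed=0):
    """Average s^2 and s over many samples from an exponential population (sigma^2 = 1)."""
    rng = random.Random(seed)
    s2_values = []
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        s2_values.append(sample_variance(sample))
    mean_s2 = sum(s2_values) / reps
    mean_s = sum(math.sqrt(v) for v in s2_values) / reps
    return mean_s2, mean_s

data = [2.0, 4.0, 4.0, 5.0, 7.0]  # hypothetical data, for illustration only
print(sample_variance(data), sample_variance_shortcut(data))

mean_s2, mean_s = simulate_bias()
print(f"average s^2 = {mean_s2:.3f} (true sigma^2 = 1)")
print(f"average s   = {mean_s:.3f} (true sigma   = 1)")
```

In such a simulation the average of $s^2$ should land close to 1 even though the population is not normal, while the average of $s$ should fall noticeably below 1, which is the bias mentioned above.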



CI of $\sigma^2$ and $\sigma$
Suppose $x_1, x_2, \dots, x_n$ are the observed values of an iid random sample from a normal population with unknown mean $\mu$ and variance $\sigma^2$.

The $(1-\alpha)100\%$ confidence interval for $\sigma^2$ is given by

$$\frac{(n-1)s^2}{\chi^2_{\alpha/2;(n-1)}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2;(n-1)}}$$

The $(1-\alpha)100\%$ confidence interval for $\sigma$ is given by

$$\sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2;(n-1)}}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_{1-\alpha/2;(n-1)}}}$$



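Calculation of $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$
The following graph illustrates the meaning of $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$ used in the confidence interval formula.

[Figure: $\chi^2$ distribution with $(n-1)$ degrees of freedom. The area to the left of $\chi^2_{1-\alpha/2}$ is $\alpha/2$, and the area to the right of $\chi^2_{\alpha/2}$ is $\alpha/2$.]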
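
These critical values are normally read from a table or obtained from a statistics library (for example `scipy.stats.chi2.ppf`). As a rough sketch of what such a lookup does, the standard-library-only code below evaluates the $\chi^2$ CDF with a series expansion and inverts it by bisection. The function names are made up for this illustration, and the subscript is treated as the area to the right, matching the graph described above.

```python
import math

def chi2_cdf(x, df):
    """P(X <= x) for X ~ chi-square(df), via the series for the
    regularized lower incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-15 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))

def chi2_upper_critical(tail_area, df):
    """Return c such that P(X > c) = tail_area for X ~ chi-square(df)."""
    target = 1.0 - tail_area
    lo, hi = 0.0, float(df) + 10.0
    while chi2_cdf(hi, df) < target:  # widen the bracket if needed
        hi *= 2.0
    for _ in range(200):              # plain bisection on the CDF
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

alpha, df = 0.05, 4
print(round(chi2_upper_critical(alpha / 2, df), 3))      # chi^2_{alpha/2}
print(round(chi2_upper_critical(1 - alpha / 2, df), 3))  # chi^2_{1-alpha/2}
```

For $\alpha = 0.05$ and 4 degrees of freedom the two printed values should agree, up to rounding, with the usual table entries of about 11.14 and 0.48.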



Example 7
The weights (in grams) of a random sample of 5 apples from a farm are given below.
80 85 90 88 82
Suppose the weights of apples from the farm are normally distributed with a mean of $\mu$ g and a standard deviation of $\sigma$ g. Find a point estimate and a 95% confidence interval for $\sigma^2$.
Solution:
The point estimate of $\sigma^2$ is

$$s^2 = \frac{\sum_i x_i^2 - n\bar{x}^2}{n-1} = \frac{36193 - 5(85^2)}{5-1} = 17$$

For 95% confidence and 4 d.f., we have $\chi^2_{0.025} = 11.14$ and $\chi^2_{0.975} = 0.48$.

$$\frac{(n-1)s^2}{\chi^2_{\alpha/2;(n-1)}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2;(n-1)}}$$

$$\frac{4(17)}{11.14} < \sigma^2 < \frac{4(17)}{0.48}$$

$$6.104 < \sigma^2 < 141.667$$

Interpretation: With 95% confidence, the variance of the weights of apples from the farm is between 6.104 g² and 141.667 g².
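
The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch: the two critical values are the rounded table values quoted in the solution, typed in by hand, and the variable names are chosen for this illustration.

```python
import statistics

weights = [80, 85, 90, 88, 82]      # apple weights in grams (Example 7)
n = len(weights)

s2 = statistics.variance(weights)   # sample variance, divisor n - 1

chi2_upper = 11.14                  # chi^2_{0.025} with 4 d.f. (table value)
chi2_lower = 0.48                   # chi^2_{0.975} with 4 d.f. (table value)

ci_low = (n - 1) * s2 / chi2_upper
ci_high = (n - 1) * s2 / chi2_lower

print(f"s^2 = {s2}")
print(f"95% CI for sigma^2: ({ci_low:.3f}, {ci_high:.3f})")
```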

Example 8
The weights (in grams) of a random sample of 6 oranges from a farm are given below.
72 76 70 70 78 80
Suppose the weights of oranges from the farm are normally distributed with a mean of $\mu$ g and a standard deviation of $\sigma$ g. Find a point estimate and a 90% confidence interval for $\sigma$.
Solution:
The point estimate of $\sigma$ is

$$s = \sqrt{\frac{\sum_i x_i^2 - n\bar{x}^2}{n-1}} = \sqrt{\frac{33244 - 6(74.333^2)}{6-1}} = \sqrt{18.26667} = 4.274$$

For 90% confidence and 5 d.f., we have $\chi^2_{0.05} = 11.07$ and $\chi^2_{0.95} = 1.15$.

$$\sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2;(n-1)}}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_{1-\alpha/2;(n-1)}}}$$

$$\sqrt{\frac{5(18.26667)}{11.07}} < \sigma < \sqrt{\frac{5(18.26667)}{1.15}}$$

$$2.872 < \sigma < 8.912$$

Interpretation: With 90% confidence, the standard deviation of the weights of oranges from the farm is between 2.872 g and 8.912 g.
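
As a quick check of this example, the interval for $\sigma$ is just the square root of each endpoint of the interval for $\sigma^2$. The sketch below assumes the rounded table values quoted in the solution; the variable names are illustrative.

```python
import math
import statistics

weights = [72, 76, 70, 70, 78, 80]  # orange weights in grams (Example 8)
n = len(weights)

s = statistics.stdev(weights)       # sample standard deviation, divisor n - 1
s2 = s ** 2

chi2_upper = 11.07                  # chi^2_{0.05} with 5 d.f. (table value)
chi2_lower = 1.15                   # chi^2_{0.95} with 5 d.f. (table value)

# Take square roots of the variance-interval endpoints to get the interval for sigma.
ci_low = math.sqrt((n - 1) * s2 / chi2_upper)
ci_high = math.sqrt((n - 1) * s2 / chi2_lower)

print(f"s = {s:.3f}")
print(f"90% CI for sigma: ({ci_low:.3f}, {ci_high:.3f})")
```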



Proof of CI Formula for $\sigma^2$
The problem is equivalent to finding $L$ and $U$ such that $P(L < \sigma^2 < U) = 1 - \alpha$.
This is slightly different from the derivation of the CI for the mean because the distribution of the sample variance is not symmetric.
Proof:
Suppose $U = d_u S^2$ for some constant $d_u$ and $P(\sigma^2 > U) = \alpha/2$.

$$P(\sigma^2 > d_u S^2) = \alpha/2$$

$$P\left(\frac{n-1}{d_u} > \frac{(n-1)S^2}{\sigma^2}\right) = \alpha/2$$

$$P\left(\frac{n-1}{d_u} > X\right) = \alpha/2, \quad \text{where } X \sim \chi^2(n-1)$$

$$\frac{n-1}{d_u} = \chi^2_{1-\alpha/2}$$

$$d_u = \frac{n-1}{\chi^2_{1-\alpha/2}}$$

Therefore $U = \dfrac{(n-1)S^2}{\chi^2_{1-\alpha/2}}$. Similarly, $L = \dfrac{(n-1)S^2}{\chi^2_{\alpha/2}}$.
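
One way to build intuition for this result is a small simulation: draw many normal samples, compute $L$ and $U$ for each, and count how often the interval contains the true $\sigma^2$. The sketch below is an illustration under assumed settings ($n = 5$, $\mu = 10$, $\sigma = 2$, and the rounded critical values 11.14 and 0.48 for 4 d.f.); none of these settings come from the proof itself.

```python
import random

def coverage(n=5, mu=10.0, sigma=2.0, chi2_upper=11.14, chi2_lower=0.48,
             reps=20000, seed=1):
    """Fraction of simulated 95% intervals (L, U) that contain the true sigma^2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(sample) / n
        s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)
        lower = (n - 1) * s2 / chi2_upper   # L
        upper = (n - 1) * s2 / chi2_lower   # U
        if lower < sigma ** 2 < upper:
            hits += 1
    return hits / reps

print(f"empirical coverage: {coverage():.3f}")
```

The printed proportion should be close to 0.95, with small deviations due to simulation noise and the rounding of the critical values.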



Point Estimation of Population Proportion
Suppose $p$ is the probability of success (such as answering "yes" in a survey) in the population. It is also known as the population proportion. Let $x_1, x_2, \dots, x_n$ be the observed values of a random sample of size $n$ from the population, where $x_i$ can only take a value of 1 (success) or 0 (failure).
The population proportion $p$ can be estimated by the sample proportion

$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The sample proportion is an unbiased estimator of the population proportion $p$.

The standard error of $\hat{p}$ is given by

$$SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
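
Because each $x_i$ is coded as 0 or 1, the sample proportion is simply the mean of the coded responses. The following sketch illustrates this with a made-up list of ten responses; the data and function names are hypothetical.

```python
import math

def sample_proportion(xs):
    """Sample proportion: the mean of 0/1 coded observations."""
    return sum(xs) / len(xs)

def standard_error(p_hat, n):
    """Estimated standard error of the sample proportion."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical survey responses: 1 = "yes" (success), 0 = "no" (failure)
responses = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]

p_hat = sample_proportion(responses)
se = standard_error(p_hat, len(responses))
print(f"p_hat = {p_hat:.2f}, SE = {se:.4f}")
```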



CI for Population Proportion $p$
Suppose $p$ is the probability of success (such as answering "yes" in a survey) in the population. It is also known as the population proportion. Let $x_1, x_2, \dots, x_n$ be the observed values of a random sample of size $n$ from the population, where $x_i$ can only take a value of 1 (success) or 0 (failure).
The $(1-\alpha)100\%$ confidence interval for $p$ is given by

$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

This confidence interval is based on the asymptotic (i.e. large sample) distribution of $\hat{p}$. It is only valid if $n > 30$, $n\hat{p} > 5$ and $n(1-\hat{p}) > 5$.

The calculation of $z_{\alpha/2}$ is the same as before in the CI for $\mu$.
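
As a sketch of how $z_{\alpha/2}$ and the validity conditions might be checked in code, the example below uses `statistics.NormalDist` from the Python standard library. The helper names `z_critical` and `large_sample_ok` are made up for this illustration.

```python
from statistics import NormalDist

def z_critical(confidence):
    """z_{alpha/2}: the value with area alpha/2 to its right under N(0, 1)."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def large_sample_ok(n, p_hat):
    """Check the conditions n > 30, n * p_hat > 5 and n * (1 - p_hat) > 5."""
    return n > 30 and n * p_hat > 5 and n * (1 - p_hat) > 5

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")

print(large_sample_ok(50, 0.3))   # conditions met
print(large_sample_ok(20, 0.1))   # sample too small for the approximation
```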



Example 9
A survey was conducted to find the proportion of students in ABC College who have a driver's license. Out of a random sample of 50 students, 15 have a driver's license. Estimate the proportion of students in ABC College who have a driver's license and find the 92% confidence interval of the population proportion.
Solution:
An estimate of the proportion is $\hat{p} = \frac{15}{50} = 0.3$. The corresponding standard error is

$$SE_{\hat{p}} = \sqrt{\frac{0.3(1-0.3)}{50}} = 0.065$$

For a 92% confidence interval, $z_{0.04} \approx 1.75$.

$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

$$0.3 - 1.75\sqrt{\frac{0.3 \times 0.7}{50}} < p < 0.3 + 1.75\sqrt{\frac{0.3 \times 0.7}{50}}$$

$$0.187 < p < 0.413$$

Interpretation: With 92% confidence, the proportion of students in ABC College who have a driver's license is between 0.187 and 0.413.
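
The whole calculation can be wrapped in one small function. This is a sketch rather than a reference implementation: the name `proportion_ci` is made up here, and it uses the unrounded $z_{\alpha/2}$ from `statistics.NormalDist`, so the endpoints may differ from the hand calculation in the last decimal places.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence):
    """Return (p_hat, SE, lower, upper) for the large-sample CI of a proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    return p_hat, se, p_hat - z * se, p_hat + z * se

# Example 9: 15 of 50 students have a driver's license, 92% confidence
p_hat, se, lower, upper = proportion_ci(15, 50, 0.92)
print(f"p_hat = {p_hat:.2f}, SE = {se:.3f}")
print(f"92% CI: ({lower:.3f}, {upper:.3f})")
```

The same function can be applied to other survey counts by changing the three arguments, for instance `proportion_ci(18, 50, 0.97)` for 18 successes out of 50 at 97% confidence.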



Example 10
In a survey, a mobile phone company interviewed a random sample
of 50 customers and asked them if they are satisfied with their
services. Among the 50 customers, 18 of them are satisfied with the
company’s services. Suppose 𝑝𝑝 is the true proportion of customers
who are satisfied with the company’s services. Give an estimate of 𝑝𝑝
and find its standard error. In addition, find the 97% confidence
interval of 𝑝𝑝.

Solution:
The point estimate for $p$ is $\hat{p} = \frac{18}{50} = 0.36$.
The standard error is

$$SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.36(1-0.36)}{50}} = 0.0679$$

For a 97% confidence interval, $z_{\alpha/2} = z_{0.015} = 2.17$.
The 97% confidence interval is given by

$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

$$0.36 - 2.17\sqrt{\frac{0.36 \times 0.64}{50}} < p < 0.36 + 2.17\sqrt{\frac{0.36 \times 0.64}{50}}$$

$$0.213 < p < 0.507$$

Interpretation: With 97% confidence, the proportion of customers of the mobile phone company who are satisfied with the services is between 21.3% and 50.7%.



Final Remarks About Estimation
• Estimating an unknown parameter by a point estimate alone is not sufficient; we also want to quantify the error of our estimate.
• Two important aspects of estimation
– Accuracy: the expected value of the estimator is equal to the true population value. E.g. $E(\bar{X}) = \mu$ means that the sample mean has high accuracy in estimating the population mean.
– Precision: the error of the estimation. If an estimator is unbiased, precision can be measured by the standard error (SE) of the estimator. If an estimator is biased, we need to take the bias into account as well.
• Standard error (SE) is the standard deviation of the point estimate. A smaller SE means higher precision.
• A confidence interval is another way to present the precision of estimation. High confidence together with a narrow width means high precision.

• An illustration of accuracy and precision

[Figure: illustration of accuracy versus precision. Source: https://sites.google.com/a/apaches.k12.in.us/mr-evans-science-website/accuracy-vs-precision]

• How can we construct better confidence intervals?
– Increasing only the confidence level of a confidence interval
will not improve the precision as the confidence interval will
just become WIDER. Therefore, most people will stick with
the usual 95% confidence interval.
– Increasing sample size is a much more effective method in
improving the precision. It will REDUCE the width of the
confidence interval without changing the confidence level
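
Both remarks can be seen numerically using the width of the large-sample interval for a proportion, $2 z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$. The sketch below assumes an illustrative $\hat{p} = 0.3$ and a handful of confidence levels and sample sizes chosen for this example; `ci_width` is a made-up helper name.

```python
import math
from statistics import NormalDist

def ci_width(p_hat, n, confidence):
    """Width of the large-sample confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 2 * z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.3  # illustrative sample proportion

# Higher confidence at a fixed sample size gives a wider interval.
for level in (0.90, 0.95, 0.99):
    print(f"n = 50, {level:.0%} confidence: width = {ci_width(p_hat, 50, level):.3f}")

# Larger samples at a fixed confidence level give a narrower interval.
for n in (50, 200, 800):
    print(f"n = {n}, 95% confidence: width = {ci_width(p_hat, n, 0.95):.3f}")
```

Since the width is proportional to $1/\sqrt{n}$, quadrupling the sample size halves the width at the same confidence level.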



Next Lecture
• In Lectures 4 and 5, we focused on the estimation of the population mean, population proportion and population variance. In the next lecture, we will look at comparisons of these parameters between two populations.
