Inferential PDF
The objective of inferential statistics is to use sample data to obtain results about
the whole population.
In a first step the goal is to describe an underlying population and estimate the parameters
with the help of statistics. There are two different approaches for estimating: Point Estimation
and Interval Estimation.
• For Point Estimation one value is given as an estimate of a parameter, which is hopefully
close to the true unknown value. We cannot expect to find the precise value describing
the population when only using data from a sample.
• For Interval Estimation you give an interval of likely values, where the width of the
interval will depend on the confidence you require to have in this interval.
Example 1
Assume we were interested in the mean income of all Canadians. In order to calculate the
value of this parameter we would have to conduct a census, and get this information from each
individual. Doable???
How much can we learn from a sample? Assume we draw a simple random sample of 500
Canadians and obtain their incomes. The sample mean of these numbers should provide
some insight into the mean income of all Canadians. But we cannot expect that the sample
mean equals the population mean. Also, if we take another random sample, we cannot expect
that the two sample means are the same. This describes the sampling variability of a statistic.
In order to learn how sample data can be used to learn about parameters we need to apply
what we learned about the sampling distributions in the previous section.
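Sampling variability can be illustrated with a small simulation. The sketch below is only an illustration: the population of incomes is made up (drawn from a log-normal distribution with arbitrarily chosen parameters), and the helper name `sample_mean` is hypothetical. It draws two simple random samples of 500 and compares their means with the population mean.

```python
import random
import statistics

def sample_mean(population, n, rng):
    """Mean of a simple random sample of size n drawn without replacement."""
    return statistics.mean(rng.sample(population, n))

# Hypothetical population of 100,000 incomes (made-up, right-skewed data).
rng = random.Random(1)
population = [rng.lognormvariate(10.5, 0.6) for _ in range(100_000)]

mu = statistics.mean(population)            # population mean (the parameter)
mean_a = sample_mean(population, 500, rng)  # first sample of 500
mean_b = sample_mean(population, 500, rng)  # second, independent sample of 500

print(f"population mean: {mu:.0f}")
print(f"sample mean A:   {mean_a:.0f}")
print(f"sample mean B:   {mean_b:.0f}")
```

Both sample means should land near the population mean, but neither equals it and they differ from each other.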
2 Estimation of a population mean µ
2.1 Point Estimator for µ
Definition:
• An estimator is a statistic (based on sample data) for obtaining estimates for the parameter.
Example:
• The sample mean x̄ is a point estimator for the parameter population mean µ.
• The sample standard deviation s is a point estimator for the population standard deviation
σ.
Example:
• To estimate the average height µ of students in this class, we take a sample of size 10 and
calculate a sample mean (estimator) of x̄ = 172.9 cm (estimate). We estimate that the mean
height in this class is 172.9 cm.
• The sample standard deviation s (estimator) in the sample of 10 students from this class
is s = 9.3 cm. We estimate that the population standard deviation σ of the height in this
class is 9.3 cm (estimate).
A point estimator gives a single value (estimate) that is supposed to be close to the true value
in the population but it doesn’t tell how close the estimate is.
One desirable property of an estimator is that the mean of its distribution equals the parameter
it is supposed to estimate.
Definition:
An estimator is said to be an unbiased estimator for a parameter if the mean of its distribution is
equal to the true value of the parameter. Otherwise it is said to be biased.
The sample mean x̄ of n observations is an unbiased estimator for the population mean µ, since
we saw that µ_x̄ = µ.
Remark:
Given a choice between several unbiased statistics for a given population characteristic, the best
statistic to choose is the one with the smallest standard deviation of its distribution, because
a smaller standard deviation means that on average the estimates from all the different samples
will be closer to the true value.
Since the standard deviation of the mean is σ_x̄ = σ/√n and the standard deviation of a
single observation is σ, this remark leads us to choose the sample mean as the better estimator
(unbiased statistic), and to prefer larger samples, since the larger the sample the smaller σ_x̄. The
intuition was right!
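As a quick numerical illustration of σ/√n, the sketch below tabulates the standard deviation of the sample mean for a few sample sizes. The value of σ and the function name `standard_error_of_mean` are assumptions chosen for the example.

```python
import math

def standard_error_of_mean(sigma, n):
    """Standard deviation of the sample mean of n independent observations."""
    return sigma / math.sqrt(n)

sigma = 9.3  # hypothetical population standard deviation, in cm
for n in (1, 10, 40, 160):
    print(n, round(standard_error_of_mean(sigma, n), 2))
```

Each fourfold increase in n halves the standard deviation of the sample mean.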
Remark:
The sample variance
$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} $$
is an unbiased estimator for the population variance σ². In fact, the denominator
has to be n − 1 in order for this statistic to be unbiased. (This statement does not imply that
s = √(s²) is an unbiased estimator for σ; in fact it usually underestimates the true value of σ.)
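The role of the denominator n − 1 can be checked by simulation. The following is a minimal sketch under assumed settings (normal population with σ = 3, samples of size 5, a fixed seed); the function name `average_variance_estimates` is hypothetical. It averages both versions of the variance estimate over many samples.

```python
import random
import statistics

def average_variance_estimates(n, trials, sigma, rng):
    """Average the two variance estimators over many samples of size n."""
    total_n_minus_1 = 0.0
    total_n = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        total_n_minus_1 += statistics.variance(sample)  # divides by n - 1
        total_n += statistics.pvariance(sample)         # divides by n
    return total_n_minus_1 / trials, total_n / trials

rng = random.Random(7)
avg_unbiased, avg_biased = average_variance_estimates(
    n=5, trials=20_000, sigma=3.0, rng=rng
)
print(f"true variance:        {3.0 ** 2:.2f}")
print(f"average with n - 1:   {avg_unbiased:.2f}")
print(f"average with n:       {avg_biased:.2f}")
```

The average of the n − 1 version should sit close to σ² = 9, while the version dividing by n should come out too small by roughly the factor (n − 1)/n.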
Definition:
The distance between an estimate and the true parameter is called the error of estimation.
Definition:
The standard error of a statistic is the standard deviation of the statistic.
Remark: For unbiased estimators, the error of estimation will most likely (with probability
0.95 for normal distributions) be less than 1.96 standard errors (SE) (compare the Empirical Rule).
On the other hand, we find that for large populations P(x̄ = µ) = 0, a frustrating result,
because we are 100% certain that the value we give is wrong. We only know that x̄ should be
close to µ, but again we do not know how close.
To deal with this dilemma we give an interval for estimating µ instead of just one value.
Summary:
Estimation of a population mean µ
To estimate the population mean µ, the point estimator x̄ is unbiased, with the standard error
estimated as
$$ SE = \frac{\sigma}{\sqrt{n}} $$
As an alternative to point estimation we can report not just a single value for the population
characteristic, but an entire interval of reasonable values based on sample data. A measure of
confidence will be connected to such an interval.
For example we could give
$$ \bar{x} \pm 2 \frac{\sigma}{\sqrt{n}} $$
to estimate µ.
Then the chance that we capture µ with such an interval is about 95% (it is actually 0.9544).
(This means that for 95% of samples the resulting interval (calculated using this formula)
captures the true population value).
If we use 3 instead of 2 as the factor in the interval, this chance will increase.
If we make the factor smaller, the probability of capturing µ will decrease.
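The effect of the factor on the capture probability can be computed directly from the standard normal distribution. This sketch assumes that x̄ is (approximately) normally distributed; the function name `coverage` is hypothetical.

```python
from statistics import NormalDist

def coverage(k):
    """P(|Z| < k) for a standard normal Z, i.e. the chance that an
    interval of the form x̄ ± k·σ/√n captures µ when x̄ is normal."""
    return 2 * NormalDist().cdf(k) - 1

for k in (1, 2, 3):
    print(k, round(coverage(k), 4))
```

For k = 2 the exact value is slightly above the table-based 0.9544, because the table entry for the normal probability is itself rounded to four decimals.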
In general:
Remark:
• The confidence level provides information on how much confidence we can have in the
method (formula) used to construct the interval estimate.
If we were to use the method for many different samples, C gives the proportion of the
calculated intervals that capture the true value.
• You also can give the confidence level in percent.
• Usual choices for the confidence level are 90%, 95%, or 99%.
• Most confidence intervals are of the form: point estimate ± margin of error.
2.2.1 Large-Sample z -Confidence Interval for a Population Mean µ
If we use the statistic x̄ for estimating the population mean µ, we can use the following information
from the Central Limit Theorem in order to obtain a confidence interval for µ.
• µ_x̄ = µ
• σ_x̄ = σ/√n, the standard error of x̄.
This leads to the following confidence interval for the population mean µ:
$$ \bar{x} \pm z^* \frac{\sigma}{\sqrt{n}} $$
where z* is the (1 − C)/2 percentile of the standard normal distribution (Table A), taken in absolute value.
Usually σ is unknown. In that case, it can be approximated by the sample
standard deviation s when the sample size is large (n ≥ 30), and the approximate confidence
interval is
$$ \bar{x} \pm z^* \frac{s}{\sqrt{n}} $$
Example: Bardwell, Ensign & Mills (2005) assessed the moods of 60 male U.S. Marines
following a month-long training exercise in the Arctic. The mean mood score was compared to
the population norm for college men, which is 8.9. The Marine mean is 13.33 pts. and the sample sd is
2.0 pts. (which we will use instead of σ, since we do not have the population value).
Do the data indicate that the mood of the Marines was higher after the exercise?
Find a 95% confidence interval for the mean mood of U.S. Marines after an arctic exercise.
$$ \bar{x} \pm z_{\alpha/2} \left( \frac{s}{\sqrt{n}} \right) $$
$$ 13.33 \pm 1.96 \left( \frac{2.0}{\sqrt{60}} \right) $$
This results in the interval [12.824, 13.836]. We can be 95% confident that the mean mood of
U.S. Marines after arctic exercises falls between 12.8 and 13.8. Since the entire interval falls
above 8.9, we can also be 95% confident that the mean mood of U.S. Marines after such
an exercise is higher than that of college men.
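The calculation for the Marine data can be reproduced in a few lines. This is a sketch, not a prescribed procedure: the function name `z_confidence_interval` is hypothetical, and z* is computed from the standard normal distribution instead of being read from Table A.

```python
import math
from statistics import NormalDist

def z_confidence_interval(mean, sd, n, confidence=0.95):
    """Large-sample z interval: mean ± z* · sd / sqrt(n)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sd / math.sqrt(n)
    return mean - margin, mean + margin

# Values from the Marine example: x̄ = 13.33, s = 2.0, n = 60.
lower, upper = z_confidence_interval(mean=13.33, sd=2.0, n=60, confidence=0.95)
print(f"[{lower:.3f}, {upper:.3f}]")
print("entire interval above 8.9:", lower > 8.9)
```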
If we had wished to calculate a 93% confidence interval, we would have needed to find the appropriate
z*:
C = 0.93, then (1 − C)/2 = 0.035; use Table A to find z* = −1.81, or just use the positive
z* = 1.81.
To find z ∗ from the table remember to locate the value closest to 0.035 inside the table, and
find -1.81 on the margin.
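The table lookup can be cross-checked numerically. The sketch below assumes Python's `statistics.NormalDist` as a stand-in for Table A; the function name `z_star` is hypothetical.

```python
from statistics import NormalDist

def z_star(confidence):
    """Positive critical value z* for a given confidence level C."""
    tail_area = (1 - confidence) / 2           # e.g. 0.035 for C = 0.93
    return -NormalDist().inv_cdf(tail_area)    # percentile is negative, flip the sign

for c in (0.90, 0.93, 0.95, 0.99):
    print(c, round(z_star(c), 2))
```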
756 ± 9.70
Hence, the 95% confidence interval for µ is from 746.30 to 765.70 grams per day.
With confidence 0.95, the true mean daily intake of dairy products for men lies in the interval
from 746.30 to 765.70 grams per day.
Remember:
Being "95% confident" means that if you were to construct 100 95% confidence intervals from 100
different random samples, you would expect 95 of the 100 intervals to capture the true mean and 5
not to capture the mean.
In conclusion, you cannot be sure that a specific confidence interval captures the true mean µ.
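This long-run interpretation can be checked by simulation. The sketch below uses made-up settings (normal population with µ = 50 and σ = 10, samples of size 30, a fixed seed); the function name `fraction_capturing` is hypothetical. It counts how often the z interval with known σ captures µ.

```python
import math
import random
from statistics import NormalDist, mean

def fraction_capturing(mu, sigma, n, trials, confidence, rng):
    """Fraction of simulated z intervals (known sigma) that contain mu."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        x_bar = mean([rng.gauss(mu, sigma) for _ in range(n)])
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

rng = random.Random(3)
fraction = fraction_capturing(mu=50.0, sigma=10.0, n=30, trials=5_000,
                              confidence=0.95, rng=rng)
print(f"fraction of intervals capturing mu: {fraction:.3f}")
```

The fraction should be close to 0.95, while any single interval either contains µ or does not.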
The margin of error determines the precision in the estimation of µ. For a fixed confidence level, increasing the
sample size decreases the margin of error and improves the precision of estimation.
Argument: Suppose you want to estimate the average daily yield µ of a chemical process and
you want to ensure with a high level of confidence that the estimate is not more than 4 tons from
the true mean yield µ.
In this situation you would require that the sampling error of x̄ in a C·100% confidence interval
is less than 4 tons.
This will ensure that, if you were to take 100 samples, the distance between the true mean and
the sample mean would be at most 4 tons for about C·100 of the samples.
In general, for a given confidence level one chooses the required precision of the estimation by
determining the largest value for the margin of error that seems acceptable.
From this the necessary sample size can be determined by solving E = z_{α/2} σ/√n for n. We require
that the margin of error in a C confidence interval is less than or equal to E:
$$ z^* \frac{\sigma}{\sqrt{n}} \le E \iff \left( \frac{z^* \sigma}{E} \right)^2 \le n $$
Go back to the example. Plan to do a 95% confidence interval for µ, where we allow a margin
of error not greater than E = 4.
At this point we still do not know σ, the standard deviation of the daily yield of this chemical
process.
If σ is unknown, which is the realistic case, you can use the best approximation available:
• A range estimate based on knowledge of the largest and smallest possible measurement:
σ ≈ Range/4.
In this example assume a previous sample had shown a sample standard deviation of
s = 21 tons. Then
$$ n \ge \left( \frac{z^* \sigma}{E} \right)^2 = \left( \frac{1.96 \cdot 21}{4} \right)^2 \approx 105.88 $$
We obtain that the sample size has to be at least 106 in order to estimate µ with a 95%
confidence interval, with a margin of error smaller than 4.
Note that this result is only approximate since we had to use an approximation for σ, but this
is still better than just choosing any number.
Example:
The financial aid office wishes to estimate the mean cost of textbooks per quarter for students
at a particular college. For the estimate to be useful, it should be within $20 of the
true population mean. How large a sample should be used to be 95% confident of achieving
this level of accuracy?
The financial aid office knows that the amount spent varies between $50 and $450.
A reasonable estimate of σ is then
$$ \frac{\text{range}}{4} = \frac{450 - 50}{4} = 100 $$
The required sample size is
$$ n \ge \left( \frac{1.96 \sigma}{E} \right)^2 = \left( \frac{1.96 \cdot 100}{20} \right)^2 = 9.8^2 = 96.04 $$
So in this case a sample size of at least 97 is required.
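Both sample-size calculations follow the same recipe, which can be sketched as a small helper. The function name `required_sample_size` is hypothetical, and z* = 1.96 is hard-coded as the default for 95% confidence; σ has to be supplied from a prior estimate such as a previous sample or range/4.

```python
import math

def required_sample_size(sigma, margin_of_error, z_star=1.96):
    """Smallest integer n with z* · sigma / sqrt(n) <= margin_of_error."""
    return math.ceil((z_star * sigma / margin_of_error) ** 2)

# Chemical yield example: s = 21 tons from a previous sample, E = 4 tons.
n_yield = required_sample_size(sigma=21, margin_of_error=4)

# Textbook cost example: sigma approximated by range / 4, E = 20 dollars.
sigma_books = (450 - 50) / 4
n_books = required_sample_size(sigma=sigma_books, margin_of_error=20)

print(n_yield, n_books)
```

Rounding up (not to the nearest integer) matters here, since rounding down would give a margin of error slightly larger than E.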
2.2.3 t-confidence interval for a mean µ
The problem with the large sample confidence interval for µ is that it requires us to know σ,
the population standard deviation. This assumption is strong and rarely met in practice.
For that reason we should replace the large sample confidence interval with an alternative that
does not require σ.
Student’s t distribution
Consider the t-score
$$ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} $$
The distribution of the t-score only depends on one parameter, which is called the degrees of
freedom (df). "Student" showed that, for samples from a normal population, the t-score is t distributed
with n − 1 degrees of freedom (df = n − 1). The appendix provides a table (Table IV) with values from
this distribution for different choices of df.
The table gives upper-tail areas.
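Why the t distribution is needed in place of the normal can be seen in a simulation. The sketch below uses assumed settings (standard normal population, a fixed seed, hypothetical function names `t_score` and `fraction_beyond`). It estimates how often the t-score falls outside ±1.96 for a small and for a larger sample size; under a standard normal distribution that fraction would be 0.05.

```python
import math
import random
import statistics

def t_score(sample, mu):
    """t = (x̄ − µ) / (s / √n) for one sample."""
    n = len(sample)
    return (statistics.mean(sample) - mu) / (statistics.stdev(sample) / math.sqrt(n))

def fraction_beyond(cutoff, n, trials, rng):
    """Fraction of simulated t-scores with |t| > cutoff (normal population)."""
    count = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(t_score(sample, 0.0)) > cutoff:
            count += 1
    return count / trials

rng = random.Random(11)
small_n = fraction_beyond(1.96, n=5, trials=5_000, rng=rng)
large_n = fraction_beyond(1.96, n=100, trials=5_000, rng=rng)
print(f"n = 5:   {small_n:.3f}")
print(f"n = 100: {large_n:.3f}")
```

With n = 5 the t-score lands beyond ±1.96 far more often than 5% of the time, which is why the critical value must grow as df shrinks; with n = 100 the fraction is already close to 0.05.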
Example: In the 1994 General Social Survey in the U.S. respondents were asked to rate their
political views on a seven point scale, where 1 = extremely liberal, 4 = moderate, and 7 =
extremely conservative. A report gives the following results
-------------------------------------------
N Mean Std Dev Std Err
2879 4.171 1.390 0.0259
-------------------------------------------
Since the sample size is so large, there is no concern about the data not coming from a normal
distribution when finding a confidence interval for the mean political view (on a 7 point scale).
Find a 99% confidence interval for the mean political view using a t-CI:
$$ \bar{x} \pm t^*_{n-1} \frac{s}{\sqrt{n}} $$
x̄ = 4.171, s/√n = 0.0259, α = 0.01, α/2 = 0.005
Since df = n − 1 = 2878 is greater than any value in the table, we use the largest df and find
t*_{n−1} ≈ 2.578, giving for the 99% CI
$$ 4.171 \pm 2.578 \cdot 0.0259 \to 4.171 \pm 0.067 \to [4.104, 4.238] $$
Since the interval falls entirely above 4 (moderate), we find that in 1994 on average Americans
were more conservative than liberal.
Example: A scientist is interested in monitoring the daily intake of dairy products in a population.
A sample of n = 50 people led to a sample mean of x̄ = 756 g with a standard deviation of
s = 35 g.
We will find a 95% confidence interval for µ = the mean daily intake of dairy products in this
population.
α = 0.05, so α/2 = 0.025 (upper tail area needed for finding the percentile in Table IV),
df = n − 1 = 49; from Table IV find t*_40 = 2.021 (use df = 40, the largest df in the table that is
smaller than the true df).
$$ 756 \pm 2.021 \left( \frac{35}{\sqrt{50}} \right) \to 756 \pm 10.003 \to [745.997, 766.003] $$
We are 95% confident that the mean daily intake of dairy products in this population falls
between 746g and 766g.
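The arithmetic of the t interval can be sketched as follows. The function name `t_confidence_interval` is hypothetical, and the critical value is passed in by hand from the table (here 2.021 for df = 40), since Python's standard library does not provide t percentiles.

```python
import math

def t_confidence_interval(mean, sd, n, t_star):
    """t interval: mean ± t* · sd / sqrt(n), with t* taken from a t table."""
    margin = t_star * sd / math.sqrt(n)
    return mean - margin, mean + margin

# Dairy intake example: x̄ = 756 g, s = 35 g, n = 50, t* = 2.021 (df = 40).
lower, upper = t_confidence_interval(mean=756, sd=35, n=50, t_star=2.021)
print(f"[{lower:.3f}, {upper:.3f}]")
```

Rounded to whole grams, this gives the interval from 746 g to 766 g stated above. Using df = 40 in place of the true df = 49 makes the interval slightly wider, which errs on the cautious side.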