
Chapter 1

Comparing normal populations



1. Comparing normal populations

1.1 The Central Limit Theorem

Suppose that we know the mean and the
standard deviation of a hypothetical population
(μ, σ), and that we take samples of a given size n
from this population. For example, we know the
mean and standard deviation of a population of
fossils of a given species from worldwide data.
Suppose that we do the following experiment:
a) We take 10 fossils in a random way from
the population and find the mean and standard
deviation for the 10 fossils.
b) We repeat the experiment many times, so that
we have many estimates of the mean and
standard deviation for possible combinations of
10 fossils extracted from the large population.
c) We can plot a histogram of all the calculated
means of the 10-fossil samples; see the Figure on the next page.
d) We will see that the distribution of the means is
also a normal function, but with a smaller
standard deviation given by

σx̄ = σ/√n

and with a value of the sample mean close to the
mean of the population.

Important Fact:
If the population is normally distributed,
then the sampling distribution of x̄ is
normally distributed for any sample size n.

What happens to the shape of the sampling distribution of x̄
when the population from which the sample is selected is not
normal?

Example: non-normal distribution
[Figure: density f(x) of a non-normal population, e.g. the age of a rock, measured from time 0]

The Central Limit Theorem
(for the sample mean x̄)
If a random sample of n observations is selected from a
population (any population), with n sufficiently large, the
sampling distribution of x̄ will be approximately normal.
(The larger the sample size, the better the normal
approximation to the sampling distribution of x̄.)
Why is the Central Limit Theorem so important?
When we select simple random samples of size n, the
sample means will vary from sample to sample. However, we
can model the distribution of these sample means with a
probability model where:
The mean of the means = μ
The standard deviation of the means = σ/√n
How large should the sample size n be?
For the purpose of applying the central limit theorem, we
will consider a sample size to be large when n > 30.
In summary:

Population: mean μ; standard deviation σ; the shape of the population
distribution is unknown; the value of μ is unknown; we select a
random sample of size n.
Sampling distribution of x̄:
mean μx̄ = μ; standard deviation σx̄ = σ/√n;
always true!

The shape of the sampling distribution of the means is
approximately normal.
The standard deviation of the means is known as the
standard error of estimate of the mean, or standard error:
se = σ/√n
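The σ/√n behaviour can be checked with a quick simulation. The sketch below is not from the slides; it uses an arbitrary skewed (exponential) population purely for illustration:

```python
import numpy as np

# Minimal sketch (illustrative, not from the slides): draw many samples of
# size n from a skewed population and look at the distribution of the means.
rng = np.random.default_rng(0)

mu, n, repeats = 15.0, 64, 10_000            # exponential: mean = sigma = 15
sample_means = rng.exponential(scale=mu, size=(repeats, n)).mean(axis=1)

print(sample_means.mean())   # close to the population mean, 15
print(sample_means.std())    # close to sigma / sqrt(n) = 15 / 8 = 1.875
```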
1.2 Comparing means
Example
Observation = fossil length.

A random sample of n = 64 observations is drawn from a
population with mean μ = 15 and standard deviation σ = 4.

a. μx̄ = μ = 15; se(x̄) = σ/√n = 4/√64 = 0.5
b. The shape of the sampling distribution of x̄ is approximately
normal (by the CLT), with mean 15 and standard error 0.5.
How good the approximation is depends on the sample size.
c. What is the probability that the mean of the random
sample of 64 observations is less than or equal to 15.5?

z = (x̄ − μ)/se(x̄) = (15.5 − 15)/0.5 = 1

This means that x̄ = 15.5 is one standard error above the mean of 15.
Probability = NORMSDIST(1) = 0.84
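As a cross-check, the same calculation can be done in a short scipy sketch (the numbers are the ones from the example above):

```python
from scipy import stats

# Part (c) of the example: P(xbar <= 15.5) when mu = 15, sigma = 4, n = 64.
mu, sigma, n = 15.0, 4.0, 64
se = sigma / n ** 0.5          # 0.5
z = (15.5 - mu) / se           # 1.0
print(stats.norm.cdf(z))       # ~0.841, matching NORMSDIST(1)
```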
Example 2:
The concentration of mercury in a soil has a mean of 20 ppb
and a standard deviation of 10 ppb. Find the probability that
a random sampling of 24 sites gives a mean that exceeds 16
ppb.

μx̄ = μ = 20; se(x̄) = σ/√n = 10/√24 ≈ 2.04
z = (16 − 20)/2.04 ≈ −1.96
P(x̄ > 16) = P(z > −1.96) ≈ 0.975
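A hedged sketch of the same computation (the numbers are taken from the example; the final probability matches the value worked out above):

```python
from scipy import stats

# Example 2: mu = 20 ppb, sigma = 10 ppb, n = 24; P(sample mean > 16 ppb).
mu, sigma, n = 20.0, 10.0, 24
se = sigma / n ** 0.5                       # ~2.04
print(1 - stats.norm.cdf((16 - mu) / se))   # ~0.975
```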
1.3 P-values and levels of significance

Hypotheses:
Null hypothesis: the hypothesis of no difference,
for example H0: μ1 = μ2
Alternative hypothesis: an appropriate alternative to
the null hypothesis, for example H1: μ1 ≠ μ2

p-value: the smallest level of significance at
which the null hypothesis should be rejected for
a specific test. 1 − p will be the probability that
the null hypothesis is true.

1.4 Confidence limits
A 95% confidence interval looks like this (see Figure); from it we
produce the general equation.
A confidence interval:
provides a range of values
is based on observations from one sample
gives information about closeness to the unknown population parameter
is stated in terms of probability
is never 100% sure
The calculation gives us two limits within which the population
mean lies. Further, the probability that the mean lies within
these limits is 95%, i.e. a probability of 0.95. The limits can be
summarised as follows:

Confidence interval: from the lower confidence limit to the
upper confidence limit, with a probability that the population
parameter falls somewhere within the interval.
Level of confidence: the probability that the unknown population
parameter is in the confidence interval in 100 trials.
Denoted (1 − α), e.g. 90%, 95%, 99%.
α is the probability that the parameter is not
within the interval in 100 trials.
Factors that affect the size of the confidence interval:
Data variation, measured by σ
Sample size n
Level of confidence (1 − α)

Intervals extend from x̄ − z·σx̄ to x̄ + z·σx̄.
Confidence Interval Estimate
Assumptions:
Population standard deviation σ is known
Population is normally distributed
If not normal, use large samples
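A minimal sketch of the interval when σ is known (the sample mean below is a made-up number used only to show the arithmetic):

```python
from scipy import stats

# z-based confidence interval for the mean, sigma known (hypothetical xbar).
xbar, sigma, n, conf = 15.2, 4.0, 64, 0.95
z = stats.norm.ppf(1 - (1 - conf) / 2)        # ~1.96
half_width = z * sigma / n ** 0.5
print(xbar - half_width, xbar + half_width)   # lower and upper confidence limits
```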
Exercise 2.3 (Davis)

The West Lyons Oil field was discovered in 1963 in
Rice County, Kansas, and was originally estimated
to contain 22 million barrels of oil. The reservoir is a
sandstone of Pennsylvanian (Upper Carboniferous)
age and has been cored in 94 wells in the field. File
WLYONS.TXT contains core measurements of
porosity and water saturation and the thickness of
the reservoir for each of these wells. Assume that
the porosity is normally distributed and the
parameters of the population can be estimated from
the sample statistics. Then, answer the following
questions:
What is the probability of measuring
1. A core porosity that is exactly 15%?
2. A core porosity less than 6%?
3. A core porosity greater than 15%?
4. A core porosity between 15% and 16%?
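A hedged sketch of how questions 1 through 4 could be answered once the mean and standard deviation have been estimated from WLYONS.TXT; the numbers below are placeholders, not the values from the file:

```python
from scipy import stats

# Hypothetical sample estimates of the porosity mean and standard deviation (%).
mu_hat, sd_hat = 15.0, 5.0
por = stats.norm(mu_hat, sd_hat)

print(0.0)                          # 1. P(exactly 15%) is zero for a continuous variable
print(por.cdf(6))                   # 2. P(porosity < 6%)
print(1 - por.cdf(15))              # 3. P(porosity > 15%)
print(por.cdf(16) - por.cdf(15))    # 4. P(15% < porosity < 16%)
```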

Production from the West Lyons oil field is now
declining and operators are considering the
application of a proprietary enhanced recovery
procedure (called PERP) to stimulate recovery. The
PERP method works best on sandstone reservoirs
whose average porosity is 15% or greater. Assume
the population of suitable reservoirs is normally
distributed with a mean porosity of 15% and a
standard deviation of 5%. Could the random taking of
94 cores from such a population result in the
distribution of porosity measurements that we observe in
the West Lyons field?

See solution in Excel file WLYONS exercise.
1.6 The t-test and confidence intervals

Confidence Intervals (σ Unknown): Confidence Interval Estimate
Assumptions:
Population standard deviation is unknown
Sample size must be large enough for the central limit
theorem, or the population must be normally distributed
Use Student's t distribution
Student's t Distribution
t-distribution: small-sample statistics
z-distribution (normal): large-sample statistics

t = (x̄ − μ) / (s/√n)
Degrees of Freedom (df or ν)
The number of observations that are free to vary after the sample
mean has been calculated.
Example
If the mean of 3 numbers is 2 and
X1 = 1 (or any number)
X2 = 2 (or any number)
X3 = 3 (cannot vary, because the mean is fixed at 2)
then degrees of freedom = n − 1 = 3 − 1 = 2.
Another definition is: the number of observations in a
sample minus the number of parameters estimated
from the sample, or the number of observations in
excess of those necessary to estimate the parameters
of the distribution.
Important Properties of the
Student t Distribution
1. The Student t distribution is different for different
sample sizes.
2. The Student t distribution has the same general
bell shape as the normal distribution; its wider shape
reflects the greater variability that is expected when s is
used to estimate σ.
3. The Student t distribution has a mean of t = 0
(just as the standard normal distribution has a mean of z
= 0).
4. The standard deviation of the Student t
distribution varies with the sample size and is greater
than 1 (unlike the standard normal distribution, which has
σ = 1).
5. As the sample size n gets larger, the Student t
distribution gets closer to the standard normal
distribution.
*****NOTE: The TINV function in EXCEL is already a
two-tail function. You do not need to divide α by 2 to get
the t value; just input TINV(0.05, df).
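A small sketch of a t-based confidence interval (the data vector is hypothetical; scipy's stats.t.ppf plays the role of Excel's TINV):

```python
import numpy as np
from scipy import stats

# 95% confidence interval for the mean when sigma is unknown (hypothetical data).
x = np.array([14.2, 15.1, 16.3, 14.8, 15.9, 15.4, 14.6, 16.0])
n, xbar, s = len(x), x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)        # same value Excel's TINV(0.05, n-1) returns
half_width = t_crit * s / np.sqrt(n)
print(xbar - half_width, xbar + half_width)
```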
Example (silver content): find the confidence interval of the
log of the mean silver content in the rocks of Exercise 2,
Chapter 2.
1.7 t-test for equality of two sample means

We want to compare the Hg
concentrations (in ppb) of sediments from
two rivers.
Ten samples were randomly
taken at each river. The results are
as follows:


Sample   River A   River B
1        1114      1032
2         996      1148
3         979      1074
4        1125      1076
5         910       959
6        1056      1094
7        1091      1091
8        1053      1096
9         996      1032
10        894      1012
For the two rivers, Hg descriptive stat.:

Variable N Mean StDev
River A 10 1021.4 80.3
River B 10 1061.4 53.4

Do you think these data provide statistically
sufficient evidence to support that the two Hg
populations are the same?
Hypotheses:

H0: μ1 = μ2  vs  H1: μ1 ≠ μ2
P-value of the two-sample t-test
Sample from Population 1: n1, x̄1, s1
Sample from Population 2: n2, x̄2, s2

p-value = P(T ≥ t) if H1: μ1 > μ2
p-value = P(T ≤ t) if H1: μ1 < μ2
p-value = 2·P(T ≥ |t|) if H1: μ1 ≠ μ2

To find the p-value we need:
the calculated t value
the degrees of freedom of the T variable

We are interested in the pooled t-test, where the
two variances are the same.

If σ1 = σ2, the two-sample t-test carried out under this
assumption is called the pooled t-test.
The degrees of freedom is (n1 + n2 − 2),
and the calculated t value is given by:

t = (x̄1 − x̄2) / (Sp·√(1/n1 + 1/n2))

where Sp is the pooled estimate of the standard deviation,

Sp² = [(n1 − 1)·s1² + (n2 − 1)·s2²] / (n1 + n2 − 2)
Solution to the Hg in river problem:
Variable N Mean StDev
River A 10 1021.4 80.3
River B 10 1061.4 53.4

Sp= 68.19
Se= 30.49
T calculated= -1.31
t critical (0.05,18)= 2.1
p value = 0.21
So at the 95% confidence level we cannot reject the hypothesis
that the two populations are the same. They are judged the same
at any confidence level of about 79% or higher, but at confidence
levels below about 79% the populations would be declared different.
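The same comparison can be reproduced with a pooled two-sample t-test in scipy, using the river data tabulated above (a sketch, assuming equal variances as in the slides):

```python
import numpy as np
from scipy import stats

# Pooled (equal-variance) two-sample t-test for the river Hg data.
river_a = np.array([1114, 996, 979, 1125, 910, 1056, 1091, 1053, 996, 894])
river_b = np.array([1032, 1148, 1074, 1076, 959, 1094, 1091, 1096, 1032, 1012])

t_stat, p_value = stats.ttest_ind(river_a, river_b, equal_var=True)
print(t_stat, p_value)   # ~ -1.31 and ~0.21, as in the solution above
```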


Continuation of the West Lyons Oil field exercise:

The production geologist in charge of the West Lyons oil
field has speculated that the characteristics of the field
vary with reservoir thickness, perhaps reflecting
differences in the depositional environment and hence
grain sorting and packing. The distribution of reservoir
thickness does appear bimodal, with a distinction
between wells containing more than 30 feet of reservoir
sandstone and those that contain less. Are core porosities
measured where the reservoir sand is over 30 ft thick
significantly different from those measured where the
sandstone is thinner?

See Excel exercise.
1.8 F-test and the equality of two
variances
Fisher was a great geneticist and
evolutionary biologist; he developed the F-test
while doing genetic research on an
agricultural project.

The F-test is a variance test, or a test for equal variances.
It was given its name in honor of R. A. Fisher, who
conceived of this test methodology.

The F-distribution is a family of curves based on the
values that would be expected by randomly sampling from
a normal population and calculating, for all possible pairs
of sample variances, the ratios

F = s1²/s2²

A few of the curves that result are shown in the next
Figure.


F-distribution probability density
function for varying df.
In the F-test, the variances of the two samples are set as a
ratio with the larger value in the numerator. This
value is compared to a critical value found from the F-
distribution having the same two degrees of freedom as the
variances used in the ratio.

**The larger the value of the ratio, the more the
variances differ and the less likely the null hypothesis will
be upheld.

**If the F test value (the ratio of variances) is greater than
the critical F statistic, the null hypothesis is rejected.

**The F-test is highly sensitive to departures from a
normal distribution. It cannot be used if the samples being
tested are not normally distributed.

The null hypothesis is:
H0: σ1² = σ2²

against: H1: σ1² > σ2²
Degrees of freedom:
ν1 = n1 − 1 and ν2 = n2 − 1
Important note for the t test and the F test from
EXCEL:
For the t-test for equal means
"P(T <= t) two-tail" gives the probability that a
value of the t-Statistic would be observed that is
larger in absolute value than t. "P Critical two-tail"
gives the cutoff value so that the probability of an
observed t-Statistic larger in absolute value than
"P Critical two-tail" is Alpha.

For the F test for equal variances
The tool calculates the value f of an F-statistic (or
F-ratio). A value of f close to 1 provides evidence
that the underlying population variances are
equal. In the output table, if f < 1, then "P(F <= f)
one-tail" gives the probability of observing a value
of the F-statistic less than f when population
variances are equal, and "F Critical one-tail" gives
the critical value less than 1 for the chosen
significance level, Alpha. If f > 1, then "P(F <= f)
one-tail" gives the probability of observing a value
of the F-statistic greater than f when population
variances are equal, and "F Critical one-tail" gives
the critical value greater than 1 for Alpha.


1.9 The χ² test
The test was invented in 1900 by
Karl Pearson

The test is used when there are
more than two categories of data
Example: a gambler is accused of cheating, but he
pleads innocent. A record has been kept of the last 60
throws.

4 3 3 1 2 3 4 6 5 6
2 4 1 3 3 5 3 4 3 4
3 3 4 5 4 5 6 4 5 1
6 4 4 2 3 3 2 4 4 5
6 3 6 2 4 6 4 6 3 2
5 4 6 3 3 3 5 3 1 4

If the gambler is innocent, the numbers from the
table should be like 60 random drawings with
replacement from a box with {1,2,3,4,5,6}. Each
number should show up about 10 times.
The expected frequency is 10.
Value   Observed Freq   Expected Freq
1        4              10
2        6              10
3       17              10
4       16              10
5        8              10
6        9              10
Sum     60              60
The χ² statistic:

χ² = sum of (observed frequency − expected frequency)² / expected frequency

When the observed frequency is far from the expected
frequency, the corresponding term in the sum is large;
when the two are close, the term is small.

Large values of χ² indicate that the observed and
expected frequencies are far apart. Small values of χ²
mean the opposite: observed frequencies are close to expected ones.
So chi-square is a measure of the distance between
observed and expected frequencies.
The P-value: the observed significance level
We need to know the chance that, when a fair die is rolled 60
times and χ² is computed from the observed frequencies, its
value turns out to be 14.2 or more.
Degrees of freedom = 6 − 1 = 5
Using CHIDIST(14.2, 5) = 1.4% = P

Using CHIINV(0.05, 5) = 11.07

The answer is P = 1.4%. That is, if the die is fair there is a 1.4%
chance for the χ² statistic to be as big as or bigger than the
observed one.
In addition, at the 95% confidence level the value of 14.2 is
higher than the critical value 11.07, so
Conclusion: the gambler is in trouble!
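The gambler example can be reproduced with a short sketch (observed frequencies taken from the table above; scipy's chi2 functions stand in for CHIDIST and CHIINV):

```python
import numpy as np
from scipy import stats

# Chi-square goodness-of-fit test for the 60 die throws.
observed = np.array([4, 6, 17, 16, 8, 9])
expected = np.full(6, 10.0)

chi2 = ((observed - expected) ** 2 / expected).sum()   # 14.2
p = stats.chi2.sf(chi2, df=len(observed) - 1)          # ~0.014, matches CHIDIST(14.2, 5)
crit = stats.chi2.ppf(0.95, df=5)                      # ~11.07, matches CHIINV(0.05, 5)
print(chi2, p, crit)
```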
For the χ²-test, the P-value is approximately
equal to the area to the right of the observed
value of the χ² statistic, under the χ² curve
with the appropriate number of degrees of
freedom.
[Figure: χ² curve with 5 degrees of freedom; P is the shaded area to the right of 14.2]
IMPORTANT***
The approximation given by the χ² curve can be
trusted when the expected frequency in each line of the
table is 5 or more.

Summary for the χ²-test
The basic data: N observations of a random process
The frequency table is computed
The χ²-statistic formula is used to sum things up
The degrees of freedom are determined
The observed significance level is found using the χ² curve
Example
The file SEDIM has data for the grain sizes of three
different sediments. We want to know if they belong to
the same population, comparing the values and also the
log values of the grain sizes.
a) Apply the F-test and t-test to the different populations.
b) Apply the χ² test to the first data set.
