This document discusses comparing normal populations and the central limit theorem. It explains that if samples are taken from a normal population, the sampling distribution of the sample means will also be normal. Even if the original population is not normal, as the sample size increases, the sampling distribution of the sample means will approximate a normal distribution. It provides an example of taking a random sample of 64 observations from a population with a mean of 15 and standard deviation of 4 and calculating the probability that the sample mean is less than or equal to 15.5.
Chapter 1
Comparing normal populations
1. Comparing normal populations
1.1 Central Limit theorem
Suppose that we know the mean and the standard deviation of a hypothetical population (μ, σ), and that we take samples of a given size n from this population. For example, we know the mean and standard deviation of a population of fossils of a given species from worldwide data. Suppose that we do the following experiment:
a) We take 10 fossil samples in a random way from the population and find the mean and standard deviation for the 10 fossils.
b) We repeat the experiment many times, so that we have many estimates of the mean and standard deviation for possible combinations of 10 fossils extracted from the large population.
c) We can plot a histogram of all the calculated means of the 10-fossil samples (see Figure on next page).
d) We will see that the distribution of the means is also a normal function, but with a smaller standard deviation given by σ(x̄) = σ/√n, and with a value of the sample mean close to the mean of the population.
Important fact: If the population is normally distributed, then the sampling distribution of x̄ is normally distributed for any sample size n.
What happens to the shape of the sampling distribution of x̄ when the population from which the sample is selected is not normal?

[Figure: example of a non-normal population distribution, f(x) vs. x; e.g. the age of a rock.]

The Central Limit Theorem (for the sample mean x̄): if a random sample of n observations is selected from a population (any population), with n sufficiently large, the sampling distribution of x̄ will be approximately normal. (The larger the sample size, the better the normal approximation to the sampling distribution of x̄.)

Why is the Central Limit Theorem so important? When we select simple random samples of size n, the sample means will vary from sample to sample. However, we can model the distribution of these sample means with a probability model where:
The mean of the means = μ
The standard deviation of the means = σ/√n
How large should the sample size n be? For the purpose of applying the central limit theorem, we will consider a sample size to be large when n > 30. In summary:
Population: mean μ; standard deviation σ; shape of the population distribution unknown; value of μ unknown. If we select a random sample of size n, the sampling distribution of x̄ has mean μ and standard deviation σ/√n; this is always true!
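The behaviour described above can be illustrated with a short simulation. This is a sketch, not part of the original slides: the population is an invented non-normal (exponential) distribution, and the seed, sample size, and number of repeats are arbitrary choices.

```python
# Sketch of the Central Limit Theorem: repeatedly sample from a
# non-normal (exponential) population and look at the sample means.
import random
import statistics

random.seed(42)
MU, N, TRIALS = 15.0, 64, 5000   # population mean, sample size, repeats

# Exponential population with mean MU (its std dev is also MU)
sample_means = [
    statistics.fmean(random.expovariate(1 / MU) for _ in range(N))
    for _ in range(TRIALS)
]

# The means cluster around MU, with spread close to sigma / sqrt(n)
print(statistics.fmean(sample_means))   # close to 15
print(statistics.stdev(sample_means))   # close to 15 / sqrt(64) = 1.875
```

A histogram of `sample_means` would look approximately normal even though the underlying population is strongly skewed.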
The shape of the sampling distribution of the means is approximately normal. The standard deviation of the means is known as the standard error of estimate of the mean, or standard error: se = σ/√n.

1.2 Comparing means

Example (observation = fossil length): a random sample of n = 64 observations is drawn from a population with mean μ = 15 and standard deviation σ = 4.
a. The mean of the sampling distribution of x̄ is μ = 15; the standard error is se(x̄) = σ/√n = 4/√64 = 0.5.
b. The shape of the sampling distribution of x̄ is approximately normal (by the CLT), with mean 15 and se(x̄) = 0.5. The quality of the approximation depends on the sample size.
c. What is the probability that the mean of the random sample of 64 is equal to or smaller than 15.5?
z = (x̄ - μ)/se(x̄) = (15.5 - 15)/0.5 = 1. This means that x̄ = 15.5 is one standard deviation above the mean of 15.
Probability = NORMSDIST(1) = 0.84

Example 2: The concentration of mercury in a soil has a mean of 20 ppb and a standard deviation of 10 ppb. Find the probability that a random sample of 24 sites gives a mean that exceeds 16 ppb.
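Both worked examples (the n = 64 fossil lengths and the mercury concentrations) can be checked numerically. A minimal sketch using only the Python standard library, in place of the Excel NORMSDIST calls:

```python
# Checking the two sampling-distribution examples.
from math import sqrt
from statistics import NormalDist

# Example 1: n = 64, mu = 15, sigma = 4; P(xbar <= 15.5)?
se1 = 4 / sqrt(64)                    # standard error = 0.5
z = (15.5 - 15) / se1                 # z = 1: one SE above the mean
p1 = NormalDist(15, se1).cdf(15.5)    # same as NORMSDIST(1)
print(round(z, 1), round(p1, 2))      # 1.0 0.84

# Example 2: mu = 20 ppb, sigma = 10 ppb, n = 24; P(xbar > 16)?
se2 = 10 / sqrt(24)                   # ~ 2.04 ppb
p2 = 1 - NormalDist(20, se2).cdf(16)
print(round(se2, 2), round(p2, 3))    # 2.04 0.975
```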
Solution: μ(x̄) = μ = 20; se(x̄) = σ/√n = 10/√24 = 2.04 ppb.

Hypotheses:
Null hypothesis: the hypothesis of no difference, for example H0: μ1 = μ2
Alternative hypothesis: an appropriate alternative to the null hypothesis, for example H1: μ1 ≠ μ2
1.3 P-values and levels of significance

p-value: the smallest level of significance at which the null hypothesis would be rejected for a specific test. The smaller the p-value, the stronger the evidence the data provide against the null hypothesis.
1.4 Confidence limits

A confidence interval provides a range of values based on observations from one sample and gives information about closeness to the unknown population parameter. It is stated in terms of probability: we are never 100% sure.

A 95% confidence interval gives two limits within which the population mean lies; the probability that the mean lies within these limits is 0.95. The limits can be summarised as follows:
Confidence interval: the probability that the population parameter falls somewhere within the interval.
Confidence limit (lower) = x̄ - error; confidence limit (upper) = x̄ + error.
Level of confidence: the probability that the unknown population parameter is in the confidence interval in 100 trials; denoted (1 - α), e.g. 90%, 95%, 99%.
α is the probability that the parameter is not within the interval in 100 trials.

Factors that affect the size of the confidence interval: data variation (measured by σ), sample size, and the level of confidence (1 - α). Intervals extend from
x̄ - z σ(x̄) to x̄ + z σ(x̄)

Assumptions: the population standard deviation is known and the population is normally distributed; if not normal, use large samples.

Confidence interval estimate: Exercise 2.3 (Davis)
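The interval above can be sketched in a few lines. Note that the sample values and the "known" σ below are invented for illustration; they are not from the exercise data.

```python
# Sketch of a 95% confidence interval for the mean with KNOWN sigma.
# The porosity values and sigma are hypothetical, for illustration only.
from math import sqrt
from statistics import NormalDist, fmean

porosity = [14.2, 15.8, 16.1, 13.9, 15.0, 14.7, 16.3, 15.5]  # invented
sigma = 1.0                          # assumed known population std dev
n = len(porosity)

xbar = fmean(porosity)
z = NormalDist().inv_cdf(0.975)      # ~ 1.96 for 95% confidence
half_width = z * sigma / sqrt(n)

print(round(xbar - half_width, 2), round(xbar + half_width, 2))
```

Widening the confidence level (e.g. 0.995 in `inv_cdf` for 99%) or shrinking n makes the interval wider, as the factor list above predicts.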
The West Lyons Oil field was discovered in 1963 in Rice County, Kansas, and was originally estimated to contain 22 million barrels of oil. The reservoir is a sandstone of Pennsylvanian (Upper Carboniferous) age and has been cored in 94 wells in the field. File WLYONS.TXT contains core measurements of porosity and water saturation and the thickness of the reservoir for each of these wells. Assume that the porosity is normally distributed and that the parameters of the population can be estimated from the sample statistics. Then answer the following questions. What is the probability of measuring:
1. A core porosity that is exactly 15%?
2. A core porosity less than 6%?
3. A core porosity greater than 15%?
4. A core porosity between 15% and 16%?
Production from the West Lyons oil field is now declining, and operators are considering the application of a proprietary enhanced recovery procedure (called PERP) to stimulate recovery. The PERP method works best on sandstone reservoirs whose average porosity is 15% or greater. Assume the population of suitable reservoirs is normally distributed with a mean porosity of 15% and a standard deviation of 5%. Could the random taking of 94 cores from such a population result in the distribution of porosity measurements we observe in the West Lyons field?
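One way to frame the question is as a test on the sample mean. The sketch below uses a placeholder observed mean (`xbar_obs`), because the actual value must come from WLYONS.TXT; only μ = 15, σ = 5, and n = 94 are taken from the exercise statement.

```python
# Sketch of a z-test on the field mean under H0: mu = 15, sigma = 5.
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 15.0, 5.0, 94
se = sigma / sqrt(n)                  # standard error of the mean

xbar_obs = 14.0                       # HYPOTHETICAL observed mean porosity
z = (xbar_obs - mu) / se
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(se, 3), round(z, 2), round(p_two_sided, 3))
```

With the real field mean substituted for `xbar_obs`, a small p-value would suggest the West Lyons cores are unlikely to have come from the assumed population.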
See solution in Excel file WLYONS exercise.

1.6 The t-test and confidence intervals

Confidence intervals (σ unknown). Assumptions: the population standard deviation is unknown, and either the sample size must be large enough for the central limit theorem or the population must be normally distributed. Use the Student's t distribution: the t-distribution applies to small-sample statistics, the z-distribution (normal) to large-sample statistics.

t = (x̄ - μ) / (s / √n)

Degrees of freedom (df or v): the number of observations that are free to vary after the sample mean has been calculated. Example: if the mean of 3 numbers is 2, then X1 = 1 (or any number), X2 = 2 (or any number), but X3 = 3 (cannot vary); mean = 2, degrees of freedom = n - 1 = 3 - 1 = 2. Another definition: the number of observations in a sample minus the number of parameters estimated from the sample, or the number of observations in excess of those necessary to estimate the parameters of the distribution.

Important properties of the Student t distribution:
1. The Student t distribution is different for different sample sizes.
2. The Student t distribution has the same general bell shape as the normal distribution; its wider shape reflects the greater variability that is expected when s is used to estimate σ.
3. The Student t distribution has a mean of t = 0 (just as the standard normal distribution has a mean of z = 0).
4. The standard deviation of the Student t distribution varies with the sample size and is greater than 1 (unlike the standard normal distribution, which has σ = 1).
5. As the sample size n gets larger, the Student t distribution gets closer to the standard normal distribution.

*****NOTE: The TINV function in Excel is already a two-tail function. You do not need to divide α by 2 to get the t value; just input TINV(0.05, df).

Example, silver content: find the confidence interval of the log of the mean for the content in the rocks of exercise 2, chapter 2.

1.7 t-test for equality of two sample means
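As an illustration of the one-sample t statistic t = (x̄ - μ)/(s/√n), here is a sketch with invented data. The critical value 2.262 is the standard two-tailed table entry for df = 9 (TINV(0.05, 9) in Excel); everything else is hypothetical.

```python
# Sketch of a one-sample t-test with invented measurements.
from math import sqrt
from statistics import fmean, stdev

data = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9, 12.4, 12.1]
mu0 = 12.0                                # hypothesized population mean
n = len(data)                             # df = n - 1 = 9

t = (fmean(data) - mu0) / (stdev(data) / sqrt(n))
t_crit = 2.262                            # two-tailed t(0.05, df = 9)

print(round(t, 3), abs(t) > t_crit)       # reject H0 only if |t| > t_crit
```

Here |t| falls below the critical value, so these invented data would not reject H0 at the 5% level.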
We want to compare the Hg concentrations (in ppb) of sediments of two rivers. Ten samples were randomly taken at each river. The results are as follows:
Variable  N   Mean    StDev
River A   10  1021.4  80.3
River B   10  1061.4  53.4
Do you think these data provide statistically sufficient evidence that the two Hg populations differ? Hypotheses:
H0: μ1 = μ2 vs. H1: μ1 ≠ μ2

P-value of the two-sample t-test:
Sample from population 1: n1, x̄1, s1
Sample from population 2: n2, x̄2, s2
p-value = P(T ≥ t) if H1: μ1 > μ2
p-value = P(T ≤ t) if H1: μ1 < μ2
p-value = 2 P(T ≥ |t|) if H1: μ1 ≠ μ2
To find the p-value we need: the calculated t value, and the degrees of freedom of the T variable.
We are interested in the pooled t-test, where the two variances are the same.
If σ1 = σ2, the two-sample t-test carried out under this assumption is called the pooled t-test. The degrees of freedom are (n1 + n2 - 2), and the calculated t value is given by:

t = (x̄1 - x̄2) / (Sp √(1/n1 + 1/n2)),  with  Sp² = [(n1 - 1)s1² + (n2 - 1)s2²] / (n1 + n2 - 2)
where Sp is the pooled estimate of the standard deviation.

Solution to the Hg in rivers problem:

Variable  N   Mean    StDev
River A   10  1021.4  80.3
River B   10  1061.4  53.4
Sp = 68.19; se = 30.49; t calculated = -1.31; t critical (0.05, 18) = 2.10; p-value = 0.21. Since |t| < t critical (p = 0.21 > 0.05), at the 95% confidence level we cannot reject the hypothesis that the two populations are the same. We would reject the null hypothesis, and call the populations different, only at confidence levels of 79% or lower.
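The pooled t-test numbers above can be reproduced directly from the summary statistics of the two rivers:

```python
# Reproducing the pooled t-test for the two river samples.
from math import sqrt

n1, x1, s1 = 10, 1021.4, 80.3    # River A: n, mean, std dev
n2, x2, s2 = 10, 1061.4, 53.4    # River B

df = n1 + n2 - 2                                        # 18
sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)   # pooled std dev
se = sp * sqrt(1 / n1 + 1 / n2)                         # standard error
t = (x1 - x2) / se

print(round(sp, 2), round(se, 2), round(t, 2))
```

The computed Sp and t match the slide's values (68.19 and -1.31); the standard error comes out as 30.50, the slide's 30.49 being a truncation of the same number.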
Continuation of the West Lyons Oil field exercise:
The production geologist in charge of the West Lyons oil field has speculated that the characteristics of the field vary with reservoir thickness, perhaps reflecting differences in the depositional environment and hence grain sorting and packing. The distribution of reservoir thickness does appear bimodal, with a distinction between wells containing more than 30 feet of reservoir sandstone and those that contain less. Are core porosities measured where the reservoir sand is over 30 ft thick significantly different from those measured where the sandstone is thinner?
See Excel exercise.

1.8 F-test and the equality of two variances

Fisher was a great geneticist and evolutionary biologist; he developed the F-test while doing genetic research on an agricultural project.
The F-test is a variance test, or a test for equal variance. It was named in honor of R. A. Fisher, who conceived of this test methodology.
The F-distribution is a family of curves based on the values that would be expected by randomly sampling from a normal population and calculating, for all possible pairs of sample variances, the ratios
F = s1² / s2²
A few of the curves that result are shown in the next Figure.
[Figure: F-distribution probability density functions for varying df.]

In the F-test, the variances of the two samples are set as a ratio with the larger value in the numerator. This value is compared to a critical value found from the F-distribution having the same two df as the variances in the ratio.
**The larger the value of the ratio, the more the variances differ and the less likely the null hypothesis will be upheld.
**If the F test value (the ratio of variances) is greater than the critical F statistic, the null hypothesis is rejected.
**The F-test is highly sensitive to departures from a normal distribution. It cannot be used if the samples being tested are not normally distributed.
The null hypothesis is: H0: σ1² = σ2²
Against: H1: σ1² > σ2². Degrees of freedom: v1 = n1 - 1 and v2 = n2 - 1.
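As a sketch, the F-test can be applied to the two river samples from section 1.7 (their summary statistics are repeated in the code). The critical value 3.18 is the standard F(0.05; 9, 9) table entry (FINV(0.05, 9, 9) in Excel).

```python
# Sketch of the F-test for equal variances on the river data.
s1, s2 = 80.3, 53.4                      # sample standard deviations

f = max(s1, s2)**2 / min(s1, s2)**2      # larger variance in numerator
f_crit = 3.18                            # F(0.05; 9, 9) table value

print(round(f, 2), f > f_crit)           # reject H0 only if f > f_crit
```

Here f ≈ 2.26 is below the critical value, so the equal-variance assumption behind the pooled t-test is not contradicted, consistent with using the pooled test earlier.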
Important note for the t-test and the F-test in Excel: for the t-test for equal means, "P(T <= t) two-tail" gives the probability that a value of the t-statistic would be observed that is larger in absolute value than t. "t Critical two-tail" gives the cutoff value, so that the probability of an observed t-statistic larger in absolute value than "t Critical two-tail" is alpha.
For the F test for equal variances The tool calculates the value f of an F-statistic (or F-ratio). A value of f close to 1 provides evidence that the underlying population variances are equal. In the output table, if f < 1, then "P(F <= f) one-tail" gives the probability of observing a value of the F-statistic less than f when population variances are equal, and "F Critical one-tail" gives the critical value less than 1 for the chosen significance level, Alpha. If f > 1, then "P(F <= f) one-tail" gives the probability of observing a value of the F-statistic greater than f when population variances are equal, and "F Critical one-tail" gives the critical value greater than 1 for Alpha.
1.9 The χ² test

The test was invented in 1900 by Karl Pearson.
The test is used when there are more than two categories of data. Example: a gambler is accused of cheating, but he pleads innocent. A record has been kept of the last 60 throws.
If the gambler is innocent, the numbers from the table should be like 60 random drawings with replacement from a box with {1, 2, 3, 4, 5, 6}. Each number should show up about 10 times; the expected frequency is 10.

Value  Observed Freq  Expected Freq
1      4              10
2      6              10
3      17             10
4      16             10
5      8              10
6      9              10
Sum    60             60

The χ² statistic = sum of (observed frequency - expected frequency)² / expected frequency. When the observed frequency is far from the expected frequency, the corresponding term in the sum is large; when the two are close, the term is small.
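The statistic for the die-throw table above can be computed directly; a minimal sketch:

```python
# Chi-square statistic for the die-throw frequency table.
observed = [4, 6, 17, 16, 8, 9]
expected = [10] * 6                  # 60 throws / 6 faces

chi2 = sum((o - e)**2 / e for o, e in zip(observed, expected))
print(round(chi2, 1))                # 14.2
```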
Large values of χ² indicate that the observed and expected frequencies are far apart. Small values of χ² mean the opposite: the observed frequencies are close to the expected ones. So chi-square is a measure of the distance between observed and expected.

The P-value: the observed significance level. We need to know the chance that, when a fair die is rolled 60 times and χ² is computed from the observed frequencies, its value turns out to be 14.2 or more. Degrees of freedom = 6 - 1 = 5.
Using CHIDIST(14.2, 5) = 1.4% = P
Using CHIINV(0.05, 5)= 11.07
The answer is P = 1.4%. That is, if the die is fair there is a 1.4% chance for the χ² statistic to be as big as or bigger than the observed one. In addition, at the 95% confidence level the value of 14.2 is higher than 11.07. Conclusion: the gambler is in trouble!!!

For the χ²-test, the P-value is approximately equal to the area to the right of the observed value of the statistic, under the χ² curve with the appropriate number of degrees of freedom.

[Figure: χ² curve with 5 degrees of freedom; P = the shaded area to the right of 14.2.]

IMPORTANT*** The approximation given by the curve can be trusted when the expected frequency in each line of the table is 5 or more.

Summary for the χ²-test:
The basic data: N observations of a random process
The frequency table is computed
The χ²-statistic formula is used to sum things up
The degrees of freedom
The observed significance level, using the χ² curve

Example: the file SEDIM has data for the grain sizes of three different sediments. We want to know if they belong to the same population, comparing the values and also the log values of the grain sizes.
a) Apply the F and t tests to the different populations.
b) Apply the χ² test to the first data set.