Statistical Analysis 2023
A population is a collection of all possible individuals, objects,
or measurements of interest.
A sample is a portion, or part of the population of interest.
Summary of the characteristics of the levels of measurement
Example
Classify the following variables.
Variable               Categorical   Quantitative (discrete)   Quantitative (continuous)
1. Number of books                   √
2. Brands of cars      √
3. Temperatures                                                √
4. Heights of trees                                            √
5. Number of cars                    √
6. Amount of water                                             √
Statistic
Any function of the random variables constituting a random
sample is called a statistic.
Frequency Tables, Frequency Distributions
and Graphic Presentation
Frequency Distributions
Frequency distribution is a grouping of data into mutually
exclusive categories showing the number of observations in
each class.
Frequency Table
Frequency table is a grouping of qualitative data into mutually
exclusive classes showing the number of observations in each
class.
For example,
Selling prices ($1000)   Frequency   Cumulative frequency    Relative frequency
15 up to 18              8           8                       8/80 = 0.1
18 up to 21              23          8+23 = 31               23/80 = 0.2875
21 up to 24              17          8+23+17 = 48            17/80 = 0.2125
24 up to 27              18          8+23+17+18 = 66         18/80 = 0.225
27 up to 30              8           8+23+17+18+8 = 74       8/80 = 0.1
30 up to 33              4           8+23+17+18+8+4 = 78     4/80 = 0.05
33 up to 36              2           8+23+17+18+8+4+2 = 80   2/80 = 0.025
Total                    80                                  1
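The cumulative and relative frequency columns can be reproduced with a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and the class labels and frequencies are copied from the selling-price table.

```python
from itertools import accumulate

# Class labels and frequencies copied from the selling-price table (in $1000).
classes = ["15 up to 18", "18 up to 21", "21 up to 24", "24 up to 27",
           "27 up to 30", "30 up to 33", "33 up to 36"]
frequencies = [8, 23, 17, 18, 8, 4, 2]

total = sum(frequencies)                      # number of observations
cumulative = list(accumulate(frequencies))    # running totals of the frequencies
relative = [f / total for f in frequencies]   # share of observations per class

for label, f, cf, rf in zip(classes, frequencies, cumulative, relative):
    print(f"{label:<12} {f:>3} {cf:>3} {rf:.4f}")
```

The last cumulative frequency equals the total number of observations, and the relative frequencies sum to 1, which is a quick check on any frequency table.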
Graphic Presentation
The three commonly used graphics forms are:
Histograms
Frequency polygons
Numerical Measures
Measure of central tendency
Mean: the average of all the values in a sample.
Properties
All values are used.
It is quick and easy to compute.
It is unique.
The sum of the deviations from the mean is 0.
Disadvantages
The mean is not defined for qualitative data.
It is affected by extreme values.
Example 1
The lengths of time, in minutes, that 10 patients waited in a
doctor’s office before receiving treatment were recorded as
follows: 5, 11, 9, 5, 9, 15, 6, 10, 5, and 10. Treating the data as a
random sample, find the mean.
Solution
$$\bar{x} = \frac{\sum x_i}{n} = \frac{85}{10} = 8.5 \text{ minutes}$$
Example 2
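Find the mean for the following grouped data
Selling prices ($1000)   Frequency $f$   Midpoint $x$   $f x$
15 up to 18              8               16.5           132
18 up to 21              23              19.5           448.5
21 up to 24              17              22.5           382.5
24 up to 27              18              25.5           459
27 up to 30              8               28.5           228
30 up to 33              4               31.5           126
33 up to 36              2               34.5           69
Total                    80                             1845
$$\bar{x} = \frac{\sum f x}{\sum f} = \frac{1845}{80} \approx 23.06$$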
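The grouped-mean calculation can be checked numerically. The sketch below assumes each class is represented by its midpoint, as in the table; the tuple layout and variable names are illustrative choices.

```python
# (lower limit, upper limit, frequency) for each class, in $1000
classes = [(15, 18, 8), (18, 21, 23), (21, 24, 17), (24, 27, 18),
           (27, 30, 8), (30, 33, 4), (33, 36, 2)]

# Midpoint of each class stands in for every observation in that class.
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]

# Sum of frequency * midpoint, and total number of observations.
weighted_sum = sum(f * m for (_, _, f), m in zip(classes, midpoints))
n = sum(f for _, _, f in classes)

grouped_mean = weighted_sum / n
print(weighted_sum, n, grouped_mean)
```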
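Median: the middle value of the data after they are arranged in increasing order,
$$\text{Median} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}$$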
In Example 1
Find the median of the raw data: 5, 11, 9, 5, 9, 15, 6, 10, 5, 10?
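Solution
Arrange the data in increasing order: 5, 5, 5, 6, 9, 9, 10, 10, 11, 15.
Since $n = 10$ is even, the median is the average of the 5th and 6th values:
Median $= \frac{9 + 9}{2} = 9$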
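For grouped data,
$$\text{Median} = L + \left(\frac{\frac{n}{2} - CF}{\tilde{f}}\right) \times i$$
where $L$ is the lower limit of the median class, $n$ is the total frequency, $CF$ is the cumulative frequency before the median class, $\tilde{f}$ is the frequency of the median class and $i$ is the class width.
Selling prices ($1000)   Frequency   Cumulative frequency
15 up to 18              8           8
18 up to 21              23          31
21 up to 24              17          48    (median class: $n/2 = 40$, $L = 21$, $\tilde{f} = 17$)
24 up to 27              18          66
27 up to 30              8           74
30 up to 33              4           78
33 up to 36              2           80
Total                    80
Solution
$L = 21$, $n = 80$, $CF = 31$, $\tilde{f} = 17$ and $i = 3$
$$\text{Median} = 21 + \left(\frac{40 - 31}{17}\right) \times 3 \approx 22.59$$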
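The interpolation used for the grouped median can be written as a small helper. This is a sketch, not a library routine: the function name `grouped_quantile` and its parameters are hypothetical, and it assumes equal class widths and linear interpolation inside the class that contains the target position.

```python
def grouped_quantile(lower_limits, frequencies, width, fraction):
    """Interpolate the value below which `fraction` of grouped data lies."""
    n = sum(frequencies)
    target = fraction * n        # position to locate, e.g. n/2 for the median
    cum_before = 0               # cumulative frequency before the current class
    for lower, f in zip(lower_limits, frequencies):
        if cum_before + f >= target:
            # L + ((target - CF) / f) * i
            return lower + (target - cum_before) / f * width
        cum_before += f
    raise ValueError("fraction must be between 0 and 1")

lower_limits = [15, 18, 21, 24, 27, 30, 33]
frequencies = [8, 23, 17, 18, 8, 4, 2]

median = grouped_quantile(lower_limits, frequencies, width=3, fraction=0.5)
print(round(median, 2))
```

Passing a fraction of 0.25 or 0.75 instead of 0.5 locates the positions $n/4$ and $3n/4$ with the same interpolation.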
Note that
( *
Mode: the value that occurs most frequently in the data.
Properties
It is quick and easy to compute.
It can be evaluated for both quantitative and qualitative
data.
It is not affected by extreme values.
Disadvantages
Sometimes there is no mode or more than one mode.
In Example 1
Find the mode of the raw data: 5, 11, 9, 5, 9, 15, 6, 10, 5, 10?
Solution Mode is: 5
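In Example 2
Find the mode for the grouped data.
Solution
The modal class is 18 up to 21 (frequency 23), so
$L = 18$, $d_1 = 23 - 8 = 15$, $d_2 = 23 - 17 = 6$ and $i = 3$,
where $d_1$ and $d_2$ are the differences between the modal frequency and the frequencies of the classes before and after it.
$$\text{Mode} = L + \left(\frac{d_1}{d_1 + d_2}\right) \times i = 18 + \left(\frac{15}{15 + 6}\right) \times 3 \approx 20.14$$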
The relationships between the Mean, Median and Mode
Measure of Dispersion
A measure of location, such as the mean or median, only describes
the center of the data. It is valuable from that standpoint, but it
does not tell us anything about the spread of the data.
Range: the difference between the largest and smallest values; it measures how spread apart the values in a data set are.
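For raw data,
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1}$$
$$s = \sqrt{s^2}$$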
In Example 1
Find the range, variance and standard deviation of the raw data:
5, 11, 9, 5, 9, 15, 6, 10, 5, 10?
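Solution
- Range $= 15 - 5 = 10$
- Variance $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{96.5}{9} \approx 10.72$, with $\bar{x} = 8.5$
- Standard deviation $s = \sqrt{10.72} \approx 3.27$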
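These three measures can be checked with Python's standard `statistics` module. The sketch assumes the data are treated as a sample, so the variance divides by $n - 1$; the variable names are illustrative.

```python
import statistics

# Waiting times (minutes) from Example 1.
waits = [5, 11, 9, 5, 9, 15, 6, 10, 5, 10]

data_range = max(waits) - min(waits)
sample_variance = statistics.variance(waits)   # sample variance, divides by n - 1
sample_sd = statistics.stdev(waits)            # square root of the sample variance

print(data_range, round(sample_variance, 2), round(sample_sd, 2))
```

If the ten values were instead regarded as the whole population, `statistics.pvariance` (dividing by $n$) would be the matching call.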
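For grouped data,
Selling prices ($1000)   Frequency $f$   Midpoint $x$   $x - \bar{x}$   $(x - \bar{x})^2$   $f (x - \bar{x})^2$
$$s = \sqrt{\frac{\sum f (x - \bar{x})^2}{n - 1}}$$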
Quartiles
The standard deviation is the most widely used measure of
dispersion. Alternative ways of describing the spread of data
include determining the location of values that divide a set of
observations into equal parts, such as quartiles.
The third quartile ($Q_3$)
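In Example 2
Find the quartiles for the following grouped data
Selling prices ($1000)   Frequency   Cumulative frequency
15 up to 18              8           8
18 up to 21              23          31    ($Q_1$ class: $n/4 = 20$)
21 up to 24              17          48
24 up to 27              18          66    ($Q_3$ class: $3n/4 = 60$)
27 up to 30              8           74
30 up to 33              4           78
33 up to 36              2           80
Total                    80
For $Q_1$: $L = 18$, $n = 80$, $CF = 8$, $f = 23$ and $i = 3$
$$Q_1 = L + \left(\frac{\frac{n}{4} - CF}{f}\right) \times i = 18 + \left(\frac{20 - 8}{23}\right) \times 3 \approx 19.57$$
For $Q_3$: $L = 24$, $n = 80$, $CF = 48$, $f = 18$ and $i = 3$
$$Q_3 = L + \left(\frac{\frac{3n}{4} - CF}{f}\right) \times i = 24 + \left(\frac{60 - 48}{18}\right) \times 3 = 26$$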
Chapter 2 Sampling Distribution
Sampling
The process of selecting the sample from a given population is
called sampling.
Sampling distribution
The probability distribution of a statistic is called a sampling
distribution.
For Example:
Sampling distribution of Mean
Sampling Distribution of difference between two means
Sampling distribution of Proportion
Sampling distribution of Variance
Sampling Distribution of Means and the Central Limit
Theorem
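Suppose that a random sample of $n$ observations is taken from a
normal population with mean $\mu$ and variance $\sigma^2$. Hence,
$\bar{X} = \frac{1}{n}(X_1 + X_2 + \dots + X_n)$ has a normal distribution with mean
$$\mu_{\bar{X}} = \frac{1}{n}(\underbrace{\mu + \mu + \dots + \mu}_{n \text{ times}}) = \mu$$
and variance
$$\sigma^2_{\bar{X}} = \frac{1}{n^2}(\sigma^2 + \sigma^2 + \dots + \sigma^2) = \frac{\sigma^2}{n}$$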
Central Limit Theorem
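If $\bar{X}$ is the mean of a random sample of size $n$ taken from a
population with mean $\mu$ and finite variance $\sigma^2$, then for large $n$ the
sampling distribution of the mean $\bar{X}$ is approximately normal, and the standardized variable
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
is approximately standard normal, $n(z; 0, 1)$.
If $\sigma$ is unknown and $n \ge 30$:
$$Z = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
If $\sigma$ is unknown and $n < 30$:
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}, \quad \text{with } \nu = n - 1 \text{ degrees of freedom}$$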
Example 3
An electrical firm manufactures light bulbs that have a length of
life that is approximately normally distributed, with mean equal
to 800 hours and a standard deviation of 40 hours. Find the
probability that a random sample of 16 bulbs will have an
average life of less than 775 hours.
Solution:
$n = 16$, $\mu = 800$, $\sigma = 40$
$$P(\bar{X} < 775) = P\left(Z < \frac{775 - 800}{40/\sqrt{16}}\right) = P(Z < -2.5) = 0.0062$$
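The table lookup can be cross-checked with the standard library's `statistics.NormalDist`. The sketch below standardizes the sample mean and evaluates the standard normal CDF; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 800, 40, 16      # population mean, standard deviation, sample size
threshold = 775                 # average life of interest (hours)

# Standardize the sample mean: Z = (x_bar - mu) / (sigma / sqrt(n))
z = (threshold - mu) / (sigma / sqrt(n))

# P(X_bar < 775) = P(Z < z) under the standard normal distribution
probability = NormalDist().cdf(z)
print(z, round(probability, 4))
```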
Example 4
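A tobacco company claims that the amount of nicotine in its
cigarettes is a random variable with mean 2.2 mg and standard
deviation 0.5 mg. What is the probability that the total
nicotine content in a sample of 100 cigarettes exceeds 230 mg?
Solution
$n = 100$, $\mu = 2.2$, $\sigma = 0.5$
$$P\left(\sum X_i > 230\right) = P\left(\bar{X} > \frac{230}{100}\right) = P(\bar{X} > 2.3)$$
$$= P\left(Z > \frac{2.3 - 2.2}{0.5/\sqrt{100}}\right) = P(Z > 2) = 1 - 0.9772 = 0.0228$$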
Example 5
A random sample of size 169 is taken from normal population
with variance 400. What is the probability that the absolute
difference between the sample mean and the population mean is
greater than 4?
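Solution
$n = 169$, $\sigma^2 = 400$, $\sigma = 20$
$$P(|\bar{X} - \mu| > 4) = 1 - P(|\bar{X} - \mu| \le 4)$$
$$= 1 - P\left(\frac{-4}{20/\sqrt{169}} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le \frac{4}{20/\sqrt{169}}\right)$$
$$= 1 - P(-2.6 \le Z \le 2.6)$$
$$= 1 - [P(Z \le 2.6) - P(Z \le -2.6)]$$
$$= 1 - [0.9953 - 0.0047] = 1 - 0.9906 = 0.0094$$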
Example 6
A large restaurant reports that its outstanding bills to suppliers are
approximately normally distributed with a mean of $1200. The
standard deviation is unknown. A random sample of 9 accounts
is taken with standard deviation 210. What is the probability that
the sample mean will be greater than $1300?
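Solution:
$n = 9$, $\mu = 1200$, $s = 210$
$$P(\bar{X} > 1300) = P\left(T > \frac{1300 - 1200}{210/\sqrt{9}}\right) = P(T > 1.43)$$
From the t table with $\nu = n - 1 = 8$ degrees of freedom, $t_{0.10} = 1.397$ and $t_{0.05} = 1.860$, so the probability lies between 0.05 and 0.10.
For two independent samples of sizes $n_1$ and $n_2$, the difference $\bar{X}_1 - \bar{X}_2$ has mean and variance
$$\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2, \qquad \sigma^2_{\bar{X}_1 - \bar{X}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$
Hence,
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$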
Example 7
The television picture tubes of manufacturer A have a mean
lifetime of 6.5 years and a standard deviation of 0.9 year, while
those of manufacturer B have a mean lifetime of 6.0 years and a
standard deviation of 0.8 year. What is the probability that a
random sample of 36 tubes from manufacturer A will have a
mean lifetime that is at least 1 year more than the mean lifetime
of a sample of 49 tubes from manufacturer B?
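Solution:
Manufacturer A (Population 1)   Manufacturer B (Population 2)
$n_1 = 36$                      $n_2 = 49$
$\mu_1 = 6.5$                   $\mu_2 = 6.0$
$\sigma_1 = 0.9$                $\sigma_2 = 0.8$
$$P(\bar{X}_1 - \bar{X}_2 \ge 1.0) = P\left(Z \ge \frac{1.0 - (6.5 - 6.0)}{\sqrt{\frac{0.81}{36} + \frac{0.64}{49}}}\right)$$
$$= P(Z \ge 2.65) = 1 - P(Z < 2.65)$$
$$= 1 - 0.9960 = 0.0040$$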
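A quick numerical check of this two-sample probability, as a sketch using only the standard library. The variable names are illustrative, and the normal approximation for the difference of sample means is assumed, as in the solution.

```python
from math import sqrt
from statistics import NormalDist

n1, mu1, sigma1 = 36, 6.5, 0.9   # manufacturer A
n2, mu2, sigma2 = 49, 6.0, 0.8   # manufacturer B
difference = 1.0                 # required lead of sample mean A over sample mean B

# Standard error of X1_bar - X2_bar
standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)

# Standardize the difference and take the upper-tail probability
z = (difference - (mu1 - mu2)) / standard_error
probability = 1 - NormalDist().cdf(z)
print(round(z, 2), round(probability, 4))
```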
Example 8
Two independent experiments are run in which two different
types of paint are compared. Eighteen specimens are painted
using type A, and the drying time, in hours, is recorded for each.
The same is done with type B. The population standard
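deviations are both known to be 1.0. Assuming that the mean
drying time is equal for the two types of paint,
find $P(\bar{X}_A - \bar{X}_B > 1.0)$, where $\bar{X}_A$ and $\bar{X}_B$ are the average drying
times.
Solution
$n_A = n_B = 18$, $\sigma_A = \sigma_B = 1.0$, $\mu_A - \mu_B = 0$
$$P(\bar{X}_A - \bar{X}_B > 1.0) = P\left(Z > \frac{1.0 - 0}{\sqrt{\frac{1}{18} + \frac{1}{18}}}\right) = P(Z > 3.0) = 1 - 0.9987 = 0.0013$$
For the sample proportion $\hat{P} = X/n$, where $X$ is the number of successes in $n$ trials, the mean is
$$\mu_{\hat{P}} = E(\hat{P}) = E\left(\frac{X}{n}\right) = \frac{np}{n} = p$$
and variance
$$\sigma^2_{\hat{P}} = \frac{pq}{n}, \quad q = 1 - p$$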
Example 9
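Suppose that in a certain human population a proportion of 0.08 are colorblind.
If a random sample of 150 individuals is selected from this population,
what is the probability that the proportion in the
sample who are colorblind will be greater than 0.15?
Solution
$n = 150$, $p = 0.08$, $q = 0.92$
$$P(\hat{P} > 0.15) = P\left(Z > \frac{0.15 - 0.08}{\sqrt{\frac{(0.08)(0.92)}{150}}}\right) = P(Z > 3.16) = 1 - 0.9992 = 0.0008$$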
Chapter 3
One and Two Sample Estimation Problems
Point Estimate
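A point estimate of some population parameter $\theta$ is a single
value $\hat{\theta}$ of a statistic $\hat{\Theta}$.
For example
Parameter             Point estimate   Statistic
Mean $\mu$            $\bar{x}$        $\bar{X}$
Proportion $p$        $\hat{p}$        $\hat{P}$
Variance $\sigma^2$   $s^2$            $S^2$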
Properties of Estimators
We would like to ensure that the estimator is “close” to
the true unknown parameter, so we need a way
to measure the distance between the two.
Since the estimator is a function of the data, it is a random
quantity, and we must take this into account.
To formalize these notions we need to state a number of properties
that are desirable for estimators to have, such as
unbiasedness, small mean squared error, and low variance.
Properties of good estimator
1- Unbiased
When the expected value of the estimator equals the value of the
parameter being estimated, the estimator is considered
unbiased:
$$E(\hat{\Theta}) = \theta$$
$$\text{Bias}(\hat{\Theta}) = E(\hat{\Theta}) - \theta$$
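2- Efficiency
The most efficient point estimator is the one with the smallest
variance of all the unbiased and consistent estimators.
If $\text{Var}(\hat{\Theta}_1) < \text{Var}(\hat{\Theta}_2)$, then $\hat{\Theta}_1$ is more efficient than $\hat{\Theta}_2$.
Example 1
For a random sample $X_1, X_2, \dots, X_n$ from a population with mean $\mu$,
$$E(\bar{X}) = E\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n}\sum E(X_i) = \frac{1}{n}(n\mu) = \mu$$
$$\text{Bias}(\bar{X}) = E(\bar{X}) - \mu = 0$$
Hence, as an estimator of $\mu$, $\bar{X}$ is unbiased.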
Example 2
Let $X_1, X_2, \dots, X_n$ be a random sample from a population with
mean $\mu$ and variance $\sigma^2$. Find the estimator for
Solution
̅ * ∑ + ∑
(∑ ) =
( ) (̂ ) - 0
̂ ( )
̂ ( ) ( )
( ̂ ) = =0.58
Interval Estimation
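An interval estimate of a population parameter $\theta$ is an interval of
the form $\hat{\theta}_L < \theta < \hat{\theta}_U$, chosen so that
$$P(\hat{\Theta}_L < \theta < \hat{\Theta}_U) = 1 - \alpha$$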
Example 9
The average zinc concentration recovered from a sample of
measurements taken in 36 different locations in a river is found
to be 2.6 grams per milliliter. Find
the 95% confidence intervals for the mean zinc concentration in
the river. Assume that the population standard deviation is 0.3
gram per milliliter.
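Solution:
$n = 36$, $\bar{x} = 2.6$, $\sigma = 0.3$
$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
How to calculate $z_{\alpha/2}$: with $1 - \alpha = 0.95$, $\alpha = 0.05$ and
$z_{\alpha/2} = z_{0.025} = 1.96$ from the table.
$$2.6 - 1.96\frac{0.3}{\sqrt{36}} < \mu < 2.6 + 1.96\frac{0.3}{\sqrt{36}}$$
$$2.50 < \mu < 2.70$$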
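The interval for the zinc example can be reproduced with `statistics.NormalDist`, which also supplies the critical value instead of a table lookup. This is a sketch with illustrative variable names.

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 2.6, 0.3, 36   # sample mean, known population sd, sample size
confidence = 0.95

# z-value leaving an area of alpha/2 to the right
alpha = 1 - confidence
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

margin = z_crit * sigma / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin
print(round(z_crit, 2), round(lower, 2), round(upper, 2))
```

Changing `confidence` to 0.99 or 0.90 shows how the interval widens or narrows for the same data.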
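Confidence Interval on Mean ($\sigma$ Unknown, $n \ge 30$)
$$\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}$$
where $z_{\alpha/2}$ is the z-value leaving an area of α/2 to the right.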
Example 10
An electrical firm manufactures light bulbs that have a length of
life that is approximately normally distributed with a standard
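deviation of 40 hours. If a sample of 30 bulbs has an average life
of 780 hours, find a 95% confidence interval for the population
mean of all bulbs produced by this firm.
Solution:
$n = 30$, $\bar{x} = 780$, $\sigma = 40$, $z_{0.025} = 1.96$
$$780 - 1.96\frac{40}{\sqrt{30}} < \mu < 780 + 1.96\frac{40}{\sqrt{30}}$$
$$765.69 < \mu < 794.31$$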
Example 11
The heights of a random sample of 50 college students showed a
mean of 174.5 cm. and a standard deviation of 6.9 cm. Construct
a 98% confidence interval for the mean height of all college
students.
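Solution:
$n = 50$, $\bar{x} = 174.5$, $s = 6.9$, $z_{0.01} = 2.33$
$$174.5 - 2.33\frac{6.9}{\sqrt{50}} < \mu < 174.5 + 2.33\frac{6.9}{\sqrt{50}}$$
$$172.23 < \mu < 176.77$$
Confidence Interval on Mean ($\sigma$ Unknown, $n < 30$)
$$\bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}$$
where $t_{\alpha/2}$ is the t-value with $n - 1$ degrees of freedom,
leaving an area of α/2 to the right.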
Example 12
The contents of seven similar containers of sulfuric acid are 9.8,
10.2, 10.4, 9.8, 10.0, 10.2 and 9.6 liters. Find a 95% confidence
interval for the mean contents of all such containers, assuming
an approximately normal distribution.
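Solution:
$n = 7$, $\bar{x} = 10.0$, $s = 0.283$
$$\bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}$$
How to calculate $t_{\alpha/2}$: with $1 - \alpha = 0.95$ and $\nu = n - 1 = 6$ degrees of freedom,
$t_{\alpha/2} = t_{0.025} = 2.447$
$$10.0 - 2.447\frac{0.283}{\sqrt{7}} < \mu < 10.0 + 2.447\frac{0.283}{\sqrt{7}}$$
$$9.74 < \mu < 10.26$$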
Example 13
A study was conducted in which two types of engines, A and B,
were compared. Gas mileage, in miles per gallon, was
measured. Fifty experiments were conducted using engine type
A and 75 experiments were done with engine type B. The
gasoline used and other conditions were held constant. The
average gas mileage was 36 miles per gallon for engine A and
42 miles per gallon for engine B. Find a 96% confidence interval
on $\mu_B - \mu_A$, where $\mu_A$ and $\mu_B$ are the population mean gas
mileages for engines A and B, respectively. Assume that the
population standard deviations are 6 and 8 for engines A and B,
respectively.
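Solution
Engine A             Engine B
$n_A = 50$           $n_B = 75$
$\bar{x}_A = 36$     $\bar{x}_B = 42$
$\sigma_A = 6$       $\sigma_B = 8$
With $1 - \alpha = 0.96$, $z_{\alpha/2} = z_{0.02} = 2.05$.
$$(\bar{x}_B - \bar{x}_A) - z_{\alpha/2}\sqrt{\frac{\sigma_B^2}{n_B} + \frac{\sigma_A^2}{n_A}} < \mu_B - \mu_A < (\bar{x}_B - \bar{x}_A) + z_{\alpha/2}\sqrt{\frac{\sigma_B^2}{n_B} + \frac{\sigma_A^2}{n_A}}$$
$$6 - 2.05\sqrt{\frac{64}{75} + \frac{36}{50}} < \mu_B - \mu_A < 6 + 2.05\sqrt{\frac{64}{75} + \frac{36}{50}}$$
$$3.43 < \mu_B - \mu_A < 8.57$$
When the population variances are unknown but assumed equal, the interval is
$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
where $t_{\alpha/2}$ is the t-value with $n_1 + n_2 - 2$ degrees of
freedom, leaving an area of α/2 to the right, and $s_p$ is the pooled
estimate of the population standard deviation,
where
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$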
Example 14
Students may choose between a 3-semester-hour physics course
without labs and a 4-semester-hour course with labs. The final
written examination is the same for each section. If 12 students
in the section with labs made an average grade of 84 with a
standard deviation of 4, and 18 students in the section without
labs made an average grade of 77 with a standard deviation of 6,
find a 99% confidence interval for the difference between the
average grades for the two courses. Assume the populations to
be approximately normally distributed with equal variances.
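Solution
With lab.           Without lab.
$n_1 = 12$          $n_2 = 18$
$\bar{x}_1 = 84$    $\bar{x}_2 = 77$
$s_1 = 4$           $s_2 = 6$
$$s_p = \sqrt{\frac{(12 - 1)(4)^2 + (18 - 1)(6)^2}{12 + 18 - 2}} = 5.30$$
With $\nu = 28$ degrees of freedom, $t_{\alpha/2} = t_{0.005} = 2.763$.
The 99% confidence interval is
$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
$$7 - (2.763)(5.30)\sqrt{\frac{1}{12} + \frac{1}{18}} < \mu_1 - \mu_2 < 7 + (2.763)(5.30)\sqrt{\frac{1}{12} + \frac{1}{18}}$$
$$1.5 < \mu_1 - \mu_2 < 12.5$$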
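The pooled-variance interval for the physics-course example can be checked with a short script. This is a sketch: the standard library has no t distribution, so the critical value 2.763 (28 degrees of freedom, area 0.005 to the right) is entered by hand from a t table, and the variable names are illustrative.

```python
from math import sqrt

n1, mean1, s1 = 12, 84, 4    # section with labs
n2, mean2, s2 = 18, 77, 6    # section without labs

# Assumption: t-value for 28 degrees of freedom and alpha/2 = 0.005, read from a t table.
t_crit = 2.763

# Pooled estimate of the common standard deviation
pooled_variance = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
s_p = sqrt(pooled_variance)

margin = t_crit * s_p * sqrt(1 / n1 + 1 / n2)
lower = (mean1 - mean2) - margin
upper = (mean1 - mean2) + margin
print(round(s_p, 2), round(lower, 2), round(upper, 2))
```

Because the whole interval lies above zero, the data are consistent with a higher average grade in the section with labs, under the equal-variance assumption stated in the example.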
Paired Observation
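A paired t-test is used to compare two population means where
you have two samples in which observations in one sample can
be paired with observations in the other sample (for example,
before-and-after observations on the same subjects).
Let $x$ = score before the module, $y$ = score after the module.
The procedure is as follows:
1. Calculate the difference ($d_i = y_i - x_i$) between the two
observations on each pair, making sure you distinguish between
positive and negative differences.
2. Calculate the mean difference $\bar{d}$ and the standard deviation of the differences $s_d$.
3. The confidence interval for the mean difference $\mu_D$ is
$$\bar{d} - t_{\alpha/2}\frac{s_d}{\sqrt{n}} < \mu_D < \bar{d} + t_{\alpha/2}\frac{s_d}{\sqrt{n}}$$
where $t_{\alpha/2}$ is the t-value with $n - 1$ degrees of freedom,
leaving an area of α/2 to the right.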
Example 15
Suppose a sample of 20 students were given a diagnostic test
before studying a particular module and then again after
completing the module, the following results were obtained as
follows:
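Find a 95% confidence interval for $\mu_D$.
Solution
Compute $\bar{d}$ and
$$s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}}$$
from the $n = 20$ differences. With $\nu = 19$ degrees of freedom, $t_{0.025} = 2.093$, so
$$\bar{d} - 2.093\frac{s_d}{\sqrt{20}} < \mu_D < \bar{d} + 2.093\frac{s_d}{\sqrt{20}}$$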
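Estimating a Proportion
$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$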
Example 16
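In a random sample of 1000 homes in a certain city, it is found
that 228 are heated by oil. Find a 99% confidence interval for the
proportion of homes in this city that are heated by oil.
Solution:
$\hat{p} = \frac{228}{1000} = 0.228$, $\hat{q} = 1 - \hat{p} = 0.772$,
$z_{\alpha/2} = z_{0.005} = 2.57$
$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$
$$0.228 - 2.57\sqrt{\frac{(0.228)(0.772)}{1000}} < p < 0.228 + 2.57\sqrt{\frac{(0.228)(0.772)}{1000}}$$
$$0.194 < p < 0.262$$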
Chapter 4
One and Two Sample Tests of Hypotheses
Statistical hypothesis
A statistical hypothesis is a claim about the value of a parameter or population
characteristic.
Components of Hypothesis Test
1. State the null and alternative hypotheses:
null hypothesis $H_0$ and alternative hypothesis $H_1$.
2. Choose the level of significance $\alpha$.
3. Choose an appropriate test statistic and calculate it using
the sample data.
4. Establish the critical region based on $\alpha$.
5. Compare the test statistic to the critical region to
draw initial conclusions.
6. Decision
The objective of hypothesis testing is to decide, based on
sample information, if the alternative hypothesis is actually
supported by the data.
Single sample: Test of single mean
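Suppose $X_1, X_2, \dots, X_n$ represent a random sample from a
distribution with mean $\mu$ and (known) variance $\sigma^2$. If $\bar{X}$ is the
sample mean,
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
If $\sigma$ is unknown and $n \ge 30$, it is estimated by $S$,
$$Z = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
If $\sigma$ is unknown and $n < 30$,
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}, \quad \nu = n - 1$$
1- $H_0: \mu = \mu_0$ against $H_1: \mu \ne \mu_0$, $H_1: \mu > \mu_0$ or $H_1: \mu < \mu_0$
2- Level of significance $\alpha$
3- Test statistic (Z computation):
If $\sigma$ is known,
$$Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
If $\sigma$ is unknown,
$$Z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \text{ for } n \ge 30, \qquad T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \text{ for } n < 30$$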
4- Critical region
Example 1
A random sample of 100 recorded deaths in the United States
during the past year showed an average life span of 71.8 years.
Assuming a population standard deviation of 8.9 years, does this
seem to indicate that the mean life span today is greater than 70
years? Use a 0.05 level of significance.
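Solution
1- $H_0: \mu = 70$ against $H_1: \mu > 70$
2- $\alpha = 0.05$
3- Computation, $\bar{x} = 71.8$, $\sigma = 8.9$ and $n = 100$
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{71.8 - 70}{8.9/\sqrt{100}} = 2.02$$
4- Critical region
Reject $H_0$ if $z > z_{0.05} = 1.645$
5- Decision
Since $z = 2.02 > 1.645$, reject $H_0$ and accept $H_1$: the mean life span today appears to be greater than 70 years.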
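The one-sided z test for the life-span example can be sketched with the standard library. The p-value is an addition for illustration (the solution itself uses only the critical region); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu_0, sigma, n = 71.8, 70, 8.9, 100
alpha = 0.05

# Test statistic for H0: mu = 70 against H1: mu > 70
z = (x_bar - mu_0) / (sigma / sqrt(n))

# Right-tailed critical value and p-value
z_crit = NormalDist().inv_cdf(1 - alpha)
p_value = 1 - NormalDist().cdf(z)

reject_h0 = z > z_crit
print(round(z, 2), round(z_crit, 3), round(p_value, 4), reject_h0)
```

The two decision rules agree: the statistic falls in the critical region exactly when the p-value is below $\alpha$.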
Example 2
A manufacturer of sports equipment has developed a new
synthetic fishing line that the company claims has a mean
breaking strength of 8 kilograms with a standard deviation of
0.5 kilogram. Test the hypothesis that μ = 8 kilograms against
the alternative that μ ≠ 8 kilograms if a random sample of 50
lines is tested and found to have a mean breaking strength of 7.8
kilograms. Use a 0.01 level of significance.
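Solution
1- $H_0: \mu = 8$ against $H_1: \mu \ne 8$
2- $\alpha = 0.01$
3- Computation, $\bar{x} = 7.8$, $\sigma = 0.5$ and $n = 50$
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{7.8 - 8}{0.5/\sqrt{50}} = -2.83$$
4- Critical region
Reject $H_0$ if $z < -z_{0.005} = -2.57$ or $z > z_{0.005} = 2.57$
5- Decision
Since $z = -2.83 < -2.57$,
reject $H_0$ and accept $H_1$: the mean breaking strength is not equal to 8 kilograms.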
Example 3
The Edison Electric Institute has published figures on the
number of kilowatt hours used annually by various home
appliances. It is claimed that a vacuum cleaner uses an average
of 46 kilowatt hours per year. If a random sample of 12 homes
included in a planned study indicates that vacuum cleaners use
an average of 42 kilowatt hours per year with a standard
deviation of 11.9 kilowatt hours, does this suggest at the 0.05
level of significance that vacuum cleaners use, on average, less
than 46 kilowatt hours annually? Assume the population of
kilowatt hours to be normal.
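Solution
1- $H_0: \mu = 46$ against $H_1: \mu < 46$
2- $\alpha = 0.05$
3- Computation, $\bar{x} = 42$, $s = 11.9$ and $n = 12$
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{42 - 46}{11.9/\sqrt{12}} = -1.16$$
4- Critical region
Reject $H_0$ if $t < -t_{0.05} = -1.796$, with $\nu = 11$ degrees of freedom
5- Decision
Since $t = -1.16 > -1.796$,
do not reject $H_0$ and reject $H_1$: the data do not show that the average is less than 46 kilowatt hours.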
Two samples: Test of two means
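For two independent samples, if $\sigma_1^2$ and $\sigma_2^2$ are known, then we
have
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
1- $H_0: \mu_1 - \mu_2 = d_0$ against $H_1: \mu_1 - \mu_2 \ne d_0$, $H_1: \mu_1 - \mu_2 > d_0$ or $H_1: \mu_1 - \mu_2 < d_0$
2- Level of significance $\alpha$
3- Test statistic (Z computation):
If $\sigma_1, \sigma_2$ are known,
$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
If $\sigma_1, \sigma_2$ are unknown,
$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \text{ for } n_1, n_2 \ge 30$$
$$T = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \text{ for } n_1, n_2 < 30 \text{ with equal variances}, \quad \nu = n_1 + n_2 - 2$$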
4- Critical region
Example 4
Solution
1-
2-
3- Test statistic
̅ ̅
√ √
4- Critical region
5- Decision
Since
and accept
Single sample: Test of single Proportion
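Suppose $p$ is the proportion of the population, $X$ is the number of
successes in a sample of size $n$, and $\hat{p} = X/n$.
1- $H_0: p = p_0$ against $H_1: p \ne p_0$, $H_1: p > p_0$ or $H_1: p < p_0$
2- Level of significance $\alpha$
3- Test statistic (Z computation):
$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 q_0}{n}}}, \quad q_0 = 1 - p_0$$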
4- Critical region
Example 5
A builder claims that heat pumps are installed in 70% of all
homes being constructed today in the city of Richmond,
Virginia. Would you agree with this claim if a random survey of
new homes in this city showed that 8 out of 15 had heat pumps
installed? Use a 0.10 level of significance.
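Solution
1- $H_0: p = 0.7$ against $H_1: p \ne 0.7$
2- $\alpha = 0.10$
3- Computation, $n = 15$ and $\hat{p} = \frac{8}{15} = 0.533$
$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 q_0}{n}}} = \frac{0.533 - 0.7}{\sqrt{\frac{(0.7)(0.3)}{15}}} = -1.41$$
4- Critical region
Reject $H_0$ if $z < -z_{0.05} = -1.645$ or $z > z_{0.05} = 1.645$
5- Decision
Since $-1.645 < z = -1.41 < 1.645$,
do not reject $H_0$ and reject $H_1$: there is no strong reason to doubt the builder's claim.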
Example 6
A commonly prescribed drug for relieving nervous tension is
believed to be only 60% effective. Experimental results with a
new drug administered to a random sample of 100 adults who
were suffering from nervous tension show that 70 received
relief. Is this sufficient evidence to conclude that the new drug is
superior to the one commonly prescribed? Use a 0.05 level of
significance.
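Solution
1- $H_0: p = 0.6$ against $H_1: p > 0.6$
2- $\alpha = 0.05$
3- Computation, $n = 100$ and $\hat{p} = \frac{70}{100} = 0.7$
$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 q_0}{n}}} = \frac{0.7 - 0.6}{\sqrt{\frac{(0.6)(0.4)}{100}}} = 2.04$$
4- Critical region
Reject $H_0$ if $z > z_{0.05} = 1.645$
5- Decision
Since $z = 2.04 > 1.645$,
reject $H_0$ and accept $H_1$: the new drug appears to be superior.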
Goodness-of-fit test
Now we shall consider a test to determine if a population has a
specified theoretical distribution. The test is based on how good
a fit we have between the frequency of occurrence of
observations in an observed sample and the expected
frequencies obtained from the hypothesized distribution.
Example 7
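Suppose that a die is tossed 120 times and each outcome is
recorded. The results are given in the following table.
Face       1    2    3    4    5    6
Observed   20   22   17   18   19   24
Expected   20   20   20   20   20   20
Test whether the die is balanced at significance level $\alpha$.
Solution
1- $H_0$: the die is balanced (each face has probability 1/6) against $H_1$: the die is not balanced
2- Level of significance $\alpha$
3- Computation,
$$\chi^2 = \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i} = \frac{0 + 4 + 9 + 4 + 1 + 16}{20} = 1.7$$
4- Critical region
With $\nu = 6 - 1 = 5$ degrees of freedom, if $\chi^2 > \chi^2_{\alpha}$ reject $H_0$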
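The chi-square statistic for the die data can be computed directly. This is a sketch under a stated assumption: the significance level is taken to be $\alpha = 0.05$ purely for illustration, for which the tabled critical value with 5 degrees of freedom is 11.070. The variable names are illustrative.

```python
observed = [20, 22, 17, 18, 19, 24]          # counts for faces 1..6
n_tosses = sum(observed)                     # 120 tosses
expected = [n_tosses / 6] * 6                # 20 per face if the die is balanced

# Goodness-of-fit statistic: sum of (O - E)^2 / E over all faces
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1

# Assumption: alpha = 0.05, critical value read from a chi-square table.
critical_value = 11.070
reject_h0 = chi_square > critical_value
print(round(chi_square, 2), degrees_of_freedom, reject_h0)
```

Under that assumed level the statistic is far below the critical value, so the data give no evidence against a balanced die.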
Example 8
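The grades in a statistics course for a particular semester were as
follows:
Grade      A    B    C    D    F    Total
Observed   14   18   32   20   16   100
Expected   20   20   20   20   20   100   (each expected count is 100/5 = 20)
Test whether the distribution of the grades is uniform at
level of significance $\alpha$.
Solution
1- $H_0$: the grades are uniformly distributed against $H_1$: the grades are not uniformly distributed
2- Level of significance $\alpha$
3- Computation,
$$\chi^2 = \sum_{i=1}^{5} \frac{(O_i - E_i)^2}{E_i} = \frac{36 + 4 + 144 + 0 + 16}{20} = 10$$
4- Critical region
With $\nu = 5 - 1 = 4$ degrees of freedom, if $\chi^2 > \chi^2_{\alpha}$ reject $H_0$
5- Decision
Compare $\chi^2 = 10$ with $\chi^2_{\alpha}$; for example, $\chi^2_{0.05} = 9.488$ with 4 degrees of freedom, so at $\alpha = 0.05$ $H_0$ is rejected.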
Test of Independence
Example 9
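A random sample of 30 adults is classified according to gender
and the number of hours they watch T.V. during a week.
                 Male   Female
Over 25 hours    5      9
Under 25 hours   9      7
Test whether gender and the number of hours of T.V. watched are independent.
Solution
1- $H_0$: gender and hours of T.V. watched are independent against $H_1$: they are not independent
2- Observed frequencies, with expected frequencies in parentheses:
                 Male       Female     Total
Over 25 hours    5 (6.53)   9 (7.47)   14
Under 25 hours   9 (7.47)   7 (8.53)   16
Total            14         16         N=30
3- Computation,
$$\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \approx 1.27$$
where,
$$E_{ij} = \frac{(\text{row } i \text{ total})(\text{column } j \text{ total})}{N}$$
4- Critical region
with $\nu = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1$ degree of freedom,
if $\chi^2 > \chi^2_{\alpha}$ reject $H_0$
5- Decision
Since $\chi^2 \approx 1.27$ is below the critical value (3.841 at $\alpha = 0.05$), do not reject $H_0$,
i.e. a person's gender and the number of hours they watch T.V. are
independent.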
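The expected counts and the chi-square statistic for the 2 x 2 table can be computed from the row and column totals. This is a sketch with two loudly stated assumptions: the expected counts are kept unrounded, and the critical value 3.841 corresponds to an assumed $\alpha = 0.05$ with 1 degree of freedom. The variable names are illustrative.

```python
# Rows: over 25 hours, under 25 hours; columns: male, female
observed = [[5, 9],
            [9, 7]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: (row total * column total) / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# Assumption: alpha = 0.05, critical value read from a chi-square table.
critical_value = 3.841
reject_h0 = chi_square > critical_value
print(round(chi_square, 2), degrees_of_freedom, reject_h0)
```

Rounding the expected counts to one decimal place before computing the statistic gives a slightly smaller value, but the decision is the same.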