Statistics MMW
Nowadays, people are curious about many things, and chances are that you are interested in
the role of Statistics and how it helps us understand structures in data. Information
developed through the use of statistics has improved our understanding of how life works, helped us
learn about each other, allowed control over some societal issues, and helped individuals make informed
decisions. There is almost no area of knowledge that has not been advanced by statistical studies.
Statistics, defined in its plural sense, is a set of numerical data, while in its singular sense it refers
to the scientific discipline consisting of theory and methods for processing numerical information that
one can use when making decisions in the face of uncertainty. Thus,
Determining the distribution of the number of text messages sent per day of
Mathematics in the Modern World (MMW) students.
Prediction of the number of MMW students for the next school year 2019-2020.
ii. Inferential Statistics – methods concerned with the analysis of a subset of data
leading to predictions or inferences about the entire set of data, that is, to
generalize results beyond the data collected provided that the data collected is a
part (sample) of a large set of items (population).
Key Terms
Universe – is the set of all entities under study, that is, the collection of things or
observational units under study.
Variable – is a characteristic observed or measured on every unit of the universe.
Population - is the set of all possible values of the variable.
Sample – is a subset of the population.
Parameters – are numerical measures that describe the population or universe of interest.
Statistics – are numerical measures of a sample.
Frame – a listing of all the elements in a population.
Census – the process in which information is gathered for all units in the population.
Sample survey or sampling – the process in which information obtained is only a part of the
population.
Qualitative variables – variables that yield observations by which individuals can be categorized
according to some characteristic or quality.
- e.g., gender, marital status and blood type.
associated with points on a continuous scale in such a way that there are no gaps or
interruptions.
Note: Arithmetical operations on quantitative data have some physical interpretation. Some
variables may take numerical values, but that alone does not make the variable quantitative, e.g.,
the sum of two zip codes or the difference between your cellular telephone number and your
seatmate's. The arithmetic operations in these examples do not make sense. The
issue is whether performing arithmetical operations on the data would make any sense.
The figure below illustrates the classification of data collected on particular variables.
- Identity – is the property that enables a person to distinguish one number from the other.
They are recognized by the shapes of the way they are written.
- Order – is the property that numbers are arranged in a sequence. For any integers
A and B, we can determine whether A < B, A = B, or A > B.
- Additivity – is the property that allows numbers to be added. For any real numbers
A, B, C, and D, because of the equality of scale, we can determine whether
A + B = C + D, A + B < C + D, or A + B > C + D.
Nominal scale – the lowest level of measurement and is most often used with
- it possesses only the property of identity. Thus, numbers are only used to classify. For
example, in the variable gender, if 1 is assigned to male and 2 to female, it does not
mean that female is better than male.
Interval scale – possesses the properties of identity, order and additivity but does not have
the absolute zero property.
- Examples: Celsius scale measurement of temperature and intelligence score.
Ratio scale – possesses the properties of identity, order, equality of scale and absolute
zero.
- Examples: weight and height.
The number of observations is represented by $n$. In the symbol $x_i$, the letter $i$, which stands for
any of the numbers 1, 2, 3, …, n, is called a subscript, or index. Any letter other than i, such as
j, k, v, q or r, could have been used as well.
The Summation symbol $\sum$ - it is a more compact way of writing the sum of a set of data
values.
- $\sum_{i=1}^{n} x_i$ is defined as $\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n$
Example 1. Consider the age of a sample of six children as shown in the table below
Table 1.1.1: Ages of Six Children
Child Number Age symbol Age (year)
1 x1 8
2 x2 10
3 x3 7
4 x4 6
5 x5 10
6 x6 12
Find the following: a. Find the sum of their ages in compact form.
b. $\sum_{i=1}^{4} x_i^2$
c. $\left(\sum_{i=1}^{4} x_i\right)^2$.
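The summation notation can be checked numerically. The following is a minimal sketch using the ages in Table 1.1.1; restricting the last two quantities to the first four children is an assumption made for illustration, and the variable names are hypothetical. It shows that the sum of squares and the square of the sum are different quantities.

```python
# Ages from Table 1.1.1 (x1, ..., x6).
ages = [8, 10, 7, 6, 10, 12]

# Sum over all six children: x1 + x2 + ... + x6.
total_age = sum(ages)

# Assumption for illustration: use only the first four children here.
first_four = ages[:4]
sum_of_squares = sum(x ** 2 for x in first_four)   # x1^2 + ... + x4^2
square_of_sum = sum(first_four) ** 2               # (x1 + ... + x4)^2

print(total_age, sum_of_squares, square_of_sum)
```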
Rules of Summation
1. $\sum_{i=1}^{n} (x_i + y_i + z_i) = \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i + \sum_{i=1}^{n} z_i$
2. $\sum_{i=1}^{n} c x_i = c \sum_{i=1}^{n} x_i$, where c is a constant.
3. $\sum_{i=1}^{n} c = nc$, where c is a constant.
6
b. (x
i 1
i
2
yi )
2
4 2
c. (x
i 1
i yi ) .
The Factorial Symbol ! - is a compact way of writing the product of a sequence of positive
integers. The symbol n! is defined as
$n! = 1 \cdot 2 \cdot 3 \cdots n$.
- n! is the product of all positive integers less than or equal to n.
- $0! = 1$.
a) n = 5 b) n = 7 c) n = 8 d) n = 10
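The definition of n! translates directly into a short loop. The sketch below is illustrative only: the function name `factorial` is hypothetical, and the result is cross-checked against Python's standard `math.factorial`.

```python
import math

def factorial(n: int) -> int:
    """Product of all positive integers <= n, with 0! defined as 1."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

# Evaluate n! for the values of n listed above.
values = {n: factorial(n) for n in (5, 7, 8, 10)}

# Cross-check against the standard library implementation.
assert all(values[n] == math.factorial(n) for n in values)
print(values)
```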
Exercises/Problems
4. Investigate the following problems and determine what is more appropriate to use – descriptive or
inferential statistics.
a. Mathematics Department would like to know the number of BS Mathematics students
interested in the newly revised curriculum of the BS Mathematics program.
b. A biology student studies the mercury content of fishes in Pulangi River and found that the
average mercury content is 400 units.
c. Office of Student Affairs would like to predict the number of students who would like to
stay at the University’s dormitories. However, the enrolment period is a week before the
classes start so the said office randomly selected 100 students and the results were used
as an estimate.
d. Do girls learn to walk at an earlier age than boys?
b. A statement made about a sample based on the measurements in that sample. Statistical
inference helps us draw conclusion about the unknown population characteristics based
on the sample.
6. Fill in the missing words to the quote: “Inferential statistics is defined as drawing conclusions about
____________ based on ____________ computed from the _____________.”
7. A random sample of 100 commuter students in CMU was selected and several variables were
recorded for each student. Which of the following is NOT CORRECT?
a. Their average allowance per month is a continuous variable.
b. Socioeconomic status was coded as 1=low income, 2=middle income, 3=high income and
is an interval scaled variable.
8. Identify the following as qualitative or quantitative variable. If quantitative, classify whether it is
discrete or continuous. Also, indicate the appropriate level of measurement required in each.
a. Car ownership (answers the question: Do you own a car?)
b. Citizenship
c. Tuition fees
d. Color of the skin
e. Air temperature of the peak of Mt. Kalayo measured in degree Celsius.
f. Religion
9. The College of Agriculture obtained the following data representing the one-week growth
in centimeters of 33 newly planted tomato plants:
2.3 3.9 3.9 0.8 4.1 1.1 3.1 2.2 2.4 2.4 1.8
2.8 2.4 3.9 1.8 3.9 3.9 4.1 3.9 2.4 4.0 4.2
3.7 1.6 2.3 3.2 2.6 2.6 1.9 2.2 1.7 3.5 1.9
10. Write each of the following as a summation; that is, in the compact notation:
a. $z_1 + z_2 + z_3 + z_4 + z_5 + z_6$ b. $z_2 + z_3 + z_4 + z_5 + z_6$
e. 2 z2 2 z3 2 z4 2 z5 2 z6 f. ( x1 y1 ) ( x2 y2 ) ( x3 y3 )
g. ( x4 3) ( x5 3) ( x6 3) ( x7 3)
s
11. tat
i 1
a) n = 3 b) n = 4 c) n = 1 d) n = 0
a. 19! 19 18 17 16! d. 6! 3! 9!
12! 9!
b. 4! e. 36
3! 7!2!
c. 3! 0! 7 f. 15!2! 17!
SUMMARY MEASURES
Piles of raw data, by themselves, may not be informative, but when data are presented in
summary form, they may be much more interesting and meaningful to us. In most cases, we need to
summarize a given set of data rather than maintain the entire set. Single numbers called summary (or
descriptive) statistics can be calculated for such a purpose. Two kinds of summary statistics are
particularly important to most data users – measures of central tendency and measures of variability.
[Figure: diagram of Summary Measures. Labels: Central Tendency (Mean, Median, Mode); Percentile, Decile, Quartile; Minimum, Maximum; Range, Inter-quartile Range, Variance, Standard Deviation, Coefficient of Variation; Kurtosis]
Measures of location summarize a data set by giving a “typical value” within the range of
the data values that describes its location relative to the entire data set. A measure of
variation is a single value that is used to describe the spread of the distribution. A measure of
central tendency alone does not uniquely describe a distribution. The following are the descriptive
statistics:
Example: In a sample data: 7, 8, 10, 4 and 14 the minimum and maximum are _______ and _______,
respectively.
Answer: MIN = 4 and MAX = 14.
Measures of Central tendency or location are values that are typical, or representative, of a
set of data that tend to lie centrally within a set of data arranged according to magnitude.
Measures of central tendency are also called averages.
Arithmetic mean or simply the mean – is the most popular measure of central tendency. It is the
sum of a set of measurements divided by the number of measurements in the set.
Population mean – if the set of data $x_1, x_2, \dots, x_N$, not necessarily all distinct, represents a
finite population of size $N$, then the population mean is
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Sample mean – if the set of data $x_1, x_2, \dots, x_n$, not necessarily all distinct, represents a finite
sample of size $n$, the sample mean is
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Examples:
1. The examination scores of a sample of 5 students are 58, 49, 52, 62 and 65. Find the mean.
Answer: Since the data pertain to a sample, the notation $\bar{x}$ is used. Thus,
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{58 + 49 + 52 + 62 + 65}{5} = 57.2$
2. Find the mean of these population data: 18, 29, 22, 32 and 15.
Answer: Since the data pertain to a population, the notation $\mu$ is used. Thus,
$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{18 + 29 + 22 + 32 + 15}{5} = 23.2$
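Both examples apply the same arithmetic: add the values and divide by how many there are. A minimal sketch, with the hypothetical helper name `mean`, reproduces the two answers:

```python
def mean(values):
    """Arithmetic mean: sum of the measurements divided by their count."""
    return sum(values) / len(values)

sample_scores = [58, 49, 52, 62, 65]      # Example 1 (sample data)
population_data = [18, 29, 22, 32, 15]    # Example 2 (population data)

x_bar = mean(sample_scores)   # sample mean
mu = mean(population_data)    # population mean
print(x_bar, mu)
```

The formula is identical for a sample and a population; only the notation differs.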
Note: Sometimes we associate with the numbers x1 , x2 ,..., xk certain weighting factors (or
weights) w1 , w2 , w3 ,..., wk , depending on the significance or importance attached to the
numbers. In this case,
$\bar{x} = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_k x_k}{w_1 + w_2 + \dots + w_k}$
is called the weighted arithmetic mean.
Example:
1. A sample of 40 students took University entrance test. 15 students had a mean score of 75. The
other students had a mean score of 90. What is the average score of these 40 students?
Solution:
$\bar{x} = \frac{15(75) + 25(90)}{15 + 25} = \frac{3375}{40} = 84.375$
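The entrance-test example is a weighted mean in which the group sizes act as the weights. A small sketch, with the hypothetical function name `weighted_mean`, makes the computation explicit:

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

group_means = [75, 90]   # mean score of each group of students
group_sizes = [15, 25]   # number of students in each group (the weights)

overall_mean = weighted_mean(group_means, group_sizes)
print(overall_mean)
```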
2.
Median is the middle value of a set of observations arranged in increasing or decreasing order
of magnitude. It is the middle value when the number of observations is odd, or the
arithmetic mean of the two middle values when the number of observations is even, i.e., it is
the value such that half of the observations fall above it and half below it.
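a. Population median: $\tilde{\mu} = x_{\frac{N+1}{2}}$ if $N$ is odd; $\tilde{\mu} = \frac{1}{2}\left(x_{\frac{N}{2}} + x_{\frac{N}{2}+1}\right)$ if $N$ is even
b. Sample median: $\tilde{x} = x_{\frac{n+1}{2}}$ if $n$ is odd; $\tilde{x} = \frac{1}{2}\left(x_{\frac{n}{2}} + x_{\frac{n}{2}+1}\right)$ if $n$ is even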
Properties of Median
1. May not be an actual observation in the data set.
2. Can be applied in at least ordinal level.
3. A positional measure; may not be affected by extreme values.
Mode is the value that appears the most number of times or that value with the greatest
frequency. The mode may not exist, and even if it does exist it may not be unique. A
distribution having only one mode is called unimodal.
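As an illustration of the median and the mode, the sketch below reuses the six ages from Table 1.1.1. The positional rule for the median is coded by hand, and the mode is taken from `statistics.multimode`, which returns every value tied for the highest frequency. The variable names are hypothetical.

```python
import statistics

ages = [8, 10, 7, 6, 10, 12]   # Table 1.1.1

# Median: arrange the data in order, then take the middle value (odd n)
# or the mean of the two middle values (even n).
ordered = sorted(ages)
n = len(ordered)
if n % 2 == 1:
    median = ordered[(n + 1) // 2 - 1]
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Mode: the value(s) with the greatest frequency.
modes = statistics.multimode(ages)

print(median, modes)
```

Because `multimode` returns a list, a data set with several modes yields more than one value.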
If a set of data is arranged in order of magnitude, the middle value (or arithmetic mean of the two
middle values) that divides the set into two equal parts is the median. By extending this idea, we can
think of those values which divide the set into four equal parts, 10 equal parts and 100 equal parts
and these are called quartiles, deciles and percentiles, respectively. Collectively, quartiles, deciles,
percentiles and other values obtained by equal subdivisions of the data are called fractiles.
Percentiles – are values that divide an ordered set of observations into 100 equal parts. These
values, denoted by P1 , P2 ,..., P99 , are such that 1% of the data falls below P1 , 2% falls below
P2 ,..., and 99% falls below P99 .
Deciles – are values that divide an ordered set of observations into 10 equal parts. These
values, denoted by D1 , D2 ,..., D9 , are such that 10% of the data falls below D1 , 20% falls
below D2 ,..., and 90% falls below D9 .
Quartiles – are values that divide an ordered set of observations into 4 equal parts. These
values denoted by Q1 ,Q2 and Q3 , are such that 25% of the data falls below Q1 , 50% falls
below Q2 and 75% falls below Q3 .
observations. If L is fractional, get the next higher integer to find the required location.
The quantile corresponds to the value in that location.
Summary Measure for Variation
Measures of variation determine whether the set of observations tend to be quite similar
(homogeneous) or whether they vary considerably (heterogeneous).
Range (R) – difference between the largest and the smallest values in the set.
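Variance
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$.
$\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2 - N\mu^2}{N}$ or $\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N}$.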
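Sample Variance ($s^2$). Given the random sample $x_1, x_2, \dots, x_n$, the sample variance is:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$.
For computational purposes, use the formula
$s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)}$ or $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$.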
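The definitional and computational forms of the sample variance give the same number. The sketch below checks this on the five examination scores used in the earlier mean example; the function names are hypothetical.

```python
def sample_variance_definition(xs):
    """s^2 = sum((x_i - x_bar)^2) / (n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """s^2 = (n * sum(x_i^2) - (sum(x_i))^2) / (n * (n - 1))."""
    n = len(xs)
    return (n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1))

scores = [58, 49, 52, 62, 65]
var_def = sample_variance_definition(scores)
var_short = sample_variance_shortcut(scores)
print(var_def, var_short)
```

The shortcut form avoids computing each deviation from the mean, which is convenient for hand calculation.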
Properties of the variance
1. The variance is always non-negative.
2. A large variance corresponds to a highly dispersed set of values.
3. The variance is easy to manipulate for further mathematical computation.
4. The variance makes use of all observations.
5. The variance comes in a unit of measure that is the square of the unit of measure of the
given set of values.
Standard deviation is the positive square root of the variance.
Formulas: a. population standard deviation: $\sigma = \sqrt{\sigma^2}$
b. sample standard deviation: $s = \sqrt{s^2}$
Note: 1. The standard deviation has the same properties as the variance except the last one. Its
unit of measure is the same as the original data.
2. If there is a large amount of variation, then on average, the data values will be far from
the mean. Hence, the standard deviation will be large.
3. If there is only a small amount of variation, then on average, the data values will be
close to the mean. Hence, the standard deviation will be small.
Coefficient of Variation (CV) is the ratio of the standard deviation to the absolute value of the mean,
expressed as a percentage. It is unitless and thus can be used to compare the dispersion of two or
more populations measured in the same or different units.
$CV = \frac{s}{|\bar{x}|} \times 100\%$.
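A short sketch of the standard deviation and the coefficient of variation, again using the five examination scores from the mean example. It relies on the standard library's `statistics.variance` (sample variance with divisor n - 1); the variable names are hypothetical.

```python
import math
import statistics

scores = [58, 49, 52, 62, 65]

# Sample standard deviation: positive square root of the sample variance.
s = math.sqrt(statistics.variance(scores))

# Coefficient of variation: standard deviation relative to |mean|, in percent.
cv = 100 * s / abs(statistics.mean(scores))

print(round(s, 4), round(cv, 2))
```

Because the CV is unitless, the same calculation could be used to compare the spread of scores with the spread of, say, weights.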
When data are presented in a frequency distribution, measures for central tendency and
measures of variation can be computed.
Arithmetic mean:
The computational formula is: $\bar{x}_g = \frac{\sum_{i=1}^{k} f_i x_i}{n}$
Where $f_i$ is the class frequency of the $i$th class interval.
$x_i$ is the class mark of the $i$th class interval.
Note: The arithmetic mean cannot be computed from an open-ended frequency distribution.
Median:
The computational formula is: $\tilde{x}_g = L_m + c\left(\frac{\frac{n}{2} - F_{m-1}}{f_m}\right)$
Where Lm is the lower class boundary of the median class. The median
The median of grouped data can be calculated even with open-ended intervals provided the
median class is not open-ended.
Mode:
To locate the modal class, look at the highest number in the frequency column.
$Mode_g = L_{mo} + c\left(\frac{f_{mo} - f_1}{2 f_{mo} - f_1 - f_2}\right)$
Where $L_{mo}$ is the lower class boundary of the modal class. The modal class is the
class interval with the highest frequency.
$f_{mo}$ is the frequency of the modal class.
$f_1$ is the frequency of the class interval immediately preceding the modal
class.
$f_2$ is the frequency of the class interval immediately following the modal
class.
$c$ is the class width.
Variance:
$s_g^2 = \frac{n \sum_{i=1}^{k} f_i x_i^2 - \left(\sum_{i=1}^{k} f_i x_i\right)^2}{n(n - 1)}$
Standard deviation:
The computational formula is: $s_g = \sqrt{s_g^2}$
Where $s_g^2$ is the variance.
Coefficient of Variation:
The computational formula is: $CV_g = \frac{s_g}{|\bar{x}_g|} (100\%)$
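The grouped-data measures can be sketched on a small frequency table. The class marks and frequencies below are hypothetical and serve only as an illustration. The variance uses the computational form with divisor n(n - 1), which is assumed here by analogy with the ungrouped sample variance.

```python
import math

class_marks = [5, 10, 15, 20]   # hypothetical class marks x_i
frequencies = [2, 5, 8, 5]      # hypothetical class frequencies f_i

n = sum(frequencies)                                          # total count
sum_fx = sum(f * x for f, x in zip(frequencies, class_marks))
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, class_marks))

mean_g = sum_fx / n                                    # grouped mean
var_g = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))    # grouped variance (assumed divisor)
sd_g = math.sqrt(var_g)                                # grouped standard deviation
cv_g = 100 * sd_g / abs(mean_g)                        # grouped CV, in percent

print(mean_g, round(var_g, 4), round(sd_g, 4), round(cv_g, 2))
```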
Measure of skewness describes the degree of departures of the distribution of the data from
symmetry. The degree of skewness is measured by the coefficient of skewness, denoted as SK and
computed as,
SK = 3(Mean - Median) / Standard deviation
- if SK < 0 it is negatively skewed; if SK > 0 it is positively skewed.
A distribution is said to be symmetric about the mean, if the distribution of the left of the mean is
the “mirror image” of the distribution to the right of the mean. Likewise, a symmetric distribution has
SK = 0 since its mean is equal to its median and its mode.
$K = \frac{\sum (X_i - \mu)^4}{N \sigma^4} - 3$
- if K > 0 it is leptokurtic, if K < 0 it is platykurtic and if K = 0 it is mesokurtic.
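The two shape measures can be sketched as follows. The data sets and function names are hypothetical. The skewness coefficient uses the sample standard deviation, which is an assumption, since the text only says "standard deviation". The kurtosis follows the population form of the formula for K.

```python
import statistics

def pearson_skewness(xs):
    """SK = 3 * (mean - median) / standard deviation (sample sd assumed)."""
    sd = statistics.stdev(xs)
    return 3 * (statistics.mean(xs) - statistics.median(xs)) / sd

def kurtosis_k(xs):
    """K = sum((x_i - mu)^4) / (N * sigma^4) - 3, treating xs as a population."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    return sum((x - mu) ** 4 for x in xs) / (n * sigma2 ** 2) - 3

symmetric = [1, 2, 3, 4, 5]        # mean equals median, so SK is 0
right_skewed = [1, 2, 2, 3, 10]    # one large value pulls the mean upward

print(pearson_skewness(symmetric), pearson_skewness(right_skewed))
print(kurtosis_k(symmetric))       # negative, i.e. platykurtic
```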
Exercises/Problem sets
1. Find the mean, median, mode, range, variance and standard deviation of the following sample
data:
2 3 5 5 5 5 5 6 7
4. Find the mode of each of the following data, provided, of course, that it exists:
a. 6, 8, 5, 6, 5, 5, 7, 7, 9, 7, 6, 8, 4, and 7;
b. 57, 39, 54, 30, 46, 22, 48, 35, 27, 31, and 23;
c. 11, 15, 13, 14, 13, 12, 10, 11, 12, 13, 11, and 13.
5. The number of hours spent by ten students in studying their lessons per day was recorded as
follows: 2, 2, 2, 3, 3, 4, 4, 4, 4 and 5. Find the mean, median and mode.
6. The University entrance exam scores of a sample of 9 students who joined the varsity team of the
A.Y. 2007-2008 were the following: 74, 87, 85, 80, 84, 84, 75, 79 and 86. Compute mean, median
and mode.
8. If you have one or more extreme scores in a data set, which measure of central tendency is more
likely to be affected?
9. What is the standard deviation of a data set that has a mean of 20 and a variance of 49?
10. If the distribution of the data values is positively skewed, which of the following is true?
a. The median and the mean are equal. b. The median is less than the mean.
c. The median is greater than the mean. d. The median is half the mean.
11. The College of Agriculture obtained the following data representing the one-week growth
in centimeters of 33 newly planted tomato plants:
2.3 3.9 3.9 0.8 4.1 1.1 3.1 2.2 2.4 2.4 1.8
2.8 2.4 3.9 1.8 3.9 3.9 4.1 3.9 2.4 4.0 4.2
3.7 1.6 2.3 3.2 2.6 2.6 1.9 2.2 1.7 3.5 1.9
Obtain the following: mean, median, mode, range, variance, standard deviation, P50, Q2, D5, Inter-
quartile Range and coefficient of variation.
12. The frequency table below provides the yields in grams of 230 evenly-spaced soybean plants and
their corresponding frequencies.
Yield 3 8 13 18 23 28 33 38 43 48 53 58 63 68
Frequency 7 5 7 18 32 41 37 25 22 19 6 6 4 1
13. A Rural Bus bound to Cagayan from Davao advertises the following fares for Air conditioned
category:
Type of Passenger Fare
College student: 400 Pesos
High school student: 350 Pesos
At most elementary graduate: 300 Pesos
Senior citizen 300 Pesos
Regular passenger 450 Pesos
The Rural air-conditioned bus has a capacity of 60 passengers. On average, a trip consists of
15 college students, 5 high school students, 3 children (at most elementary graduate), 9 senior
citizens and 28 regular passengers. What is the expected amount that the Rural Bus air-conditioned
type would receive per trip?
14. The standard deviation of scores in Math 15 and Math 34 pre-tests is 5 and the mean score is 18.
Since the result will be recorded as a long exam the teacher decided to give an automatic bonus
of 10 points, that is, 20% of the total score. What are now the mean and standard deviation of the new
scores?
16. Use the results of question # 11 to calculate the measure of skewness. Discuss the symmetry or
skewness of this distribution.
17. Asked whether CMU senior students want to attend University Acquaintance Party, 40 students in
the College of Arts and Sciences replied as follows: rarely, occasionally, never, occasionally,
occasionally, occasionally, rarely, rarely, never, occasionally, never, rarely, occasionally,
frequently, occasionally, rarely, never, occasionally, occasionally, rarely, rarely, never,
occasionally, occasionally, rarely, frequently, rarely, occasionally, occasionally, never, rarely,
frequently, never, rarely, occasionally, occasionally, rarely, rarely, occasionally and never. What is
their modal and median reply?
18. Consider the organized data of the Systolic Blood pressure of Nonsmokers. Find the following: a.
mean b. median c. mode d. variance e. standard deviation f.
coefficient of variation.
$\sum f_i = 63$, $\sum f_i x_i$
PROBABILITY DISTRIBUTIONS
Probability distributions are used to model the behavior of many variables of interest. A random
variable is a function whose value is a real number determined by each element in the sample space.
It is usually denoted by capital letters like X, Y or Z. Its use provides a convenient way of expressing
elements of a sample space as numbers. The probability that the random variable will take a given value is
equal to the sum of the probabilities of the corresponding outcomes in the sample space.
Continuous Random Variable – a random variable which can assume all values
between two points in a continuous scale. The values of these random variables
are usually called measured data.
Examples: weight, height, age, speed of car
Illustration: Rolling two dice and observing the number of dots on the upturned faces.
S= {(1,1), (1,2), (1,3),...(6,6)}
Random variable can be defined as the total number of dots on the upturned faces.
(1,1) → 2
(1,2), (2,1) → 3
(1,3), (2,2), (3,1) → 4
...
(6,6) → 12
The random variable takes on the values 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12.
Some of the values have more corresponding elements in the sample space than others. For
example, 2 corresponds to only one outcome while 3 corresponds to 2
outcomes.
Example: What is the probability that the random variable Y will take the value 4 of the above
illustration?
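One way to answer this kind of question is to enumerate the sample space and count the outcomes that map to each value of the random variable. The sketch below assumes two fair dice, so that all 36 outcomes are equally likely; the variable names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space S = {(1,1), (1,2), ..., (6,6)} for two dice.
sample_space = list(product(range(1, 7), repeat=2))

# The random variable Y maps each outcome to the total number of dots.
counts = Counter(a + b for a, b in sample_space)

# Assuming fair dice, each outcome has probability 1/36.
pmf = {total: Fraction(c, len(sample_space)) for total, c in counts.items()}

p_y_equals_4 = pmf[4]   # outcomes (1,3), (2,2), (3,1)
print(p_y_equals_4)
```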
Continuous Probability Distribution – the function f(x) is called the probability density
function for a continuous random variable X if the total area under its curve and
above the x-axis is equal to 1 and the area under the curve between the ordinates
X=a and X=b gives the probability that X lies between a and b.
Binomial Distribution
Some statistical problems involve repeated trials, which are independent and dichotomous
(i.e., each involves two possible outcomes, often called success or failure). If all trials have an identical
probability of success, then this type is called a binomial experiment or trial.
$b(x; n, p) = {}_nC_x \, p^x q^{n-x}$ for $x = 0, 1, 2, 3, \dots, n$
$= \frac{n!}{x!(n-x)!} \, p^x q^{n-x}$
where n is the number of trials,
p is the probability of success,
q is the probability of failure and
x is the number of successes.
Example: Find the probability of obtaining exactly three 2’s if a balanced die is tossed 5 times.
Example: Find the probability of getting at least 1 head in tossing a fair coin twice.
Example: A fair coin is tossed 3 times and a head is designated as a success. Find the probability
that:
a. 2 heads occur
b. At least 2 heads
c. No head occurs
Example: A multiple choice quiz has 10 questions, each with four possible answers of which only
one is the correct answer. What is the probability that a complete guesswork would
yield at most 1 correct answer?
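The binomial formula can be sketched in a few lines and applied to the examples above. The function name `binomial_pmf` is hypothetical, and q is taken as 1 - p. Exact fractions are used so that no rounding enters the results.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(x, n, p):
    """b(x; n, p) = nCx * p^x * q^(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

# Exactly three 2's in 5 tosses of a balanced die (p = 1/6).
p_three_twos = binomial_pmf(3, 5, Fraction(1, 6))

# At least 1 head in 2 tosses of a fair coin: 1 - P(no heads).
p_at_least_one_head = 1 - binomial_pmf(0, 2, Fraction(1, 2))

# At most 1 correct answer out of 10 by pure guessing (p = 1/4).
p_at_most_one_correct = sum(binomial_pmf(x, 10, Fraction(1, 4)) for x in (0, 1))

print(p_three_twos, p_at_least_one_head, float(p_at_most_one_correct))
```

"At least" and "at most" questions are handled by summing the formula over the relevant values of x, or by using the complement when that is shorter.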
Normal Distributions
Also known as the Gaussian Distribution in honor of Carl Friedrich Gauss (1777-1855), who
derived its equation from a study of errors in repeated measurements of some quantity.
The graph of a normal distribution is a bell-shaped curve that extends asymptotically to
the horizontal axis in both directions. It is seldom necessary to extend the tails of the
normal distribution very far because the area under that part of the curve lying more than
4 or 5 standard deviations from the mean is negligible.
The mathematical equation of the probability distribution of the normal variable depends
on the parameters $\mu$ and $\sigma$, its mean and standard deviation, respectively. The distribution is
denoted by the notation $N(\mu, \sigma^2)$. The normal distribution function is given by:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$
Where $-\infty < x < \infty$, $\pi = 3.14159\ldots$, and $e = 2.71828\ldots$
Properties of Normal Curve:
1. It is symmetric about the vertical axis through the mean $\mu$.
2. The mean, median and mode are equal.
3. The tails are asymptotic relative to the horizontal line.
4. The total area under the curve and above the horizontal axis is 1 or 100%.
5. The area within one standard deviation of the mean is about 68%.
6. The area within two standard deviations of the mean is about 95%.
7. The area within three standard deviations of the mean is about 99.7%.
Remarks:
It is possible that two or more normal distributions have the same mean but differ in
variance.
It is also possible that two or more normal distributions have equal variances but different
means.
There is an infinite number of normal curves, obtained by varying $\mu$ and $\sigma$.
1. $P(Z = a) = 0$
2. $P(Z \le a)$ can be obtained directly from the Z table
3. $P(Z \ge a) = 1 - P(Z \le a)$
4. $P(Z \ge -a) = P(Z \le a)$
5. $P(Z \le -a) = P(Z \ge a)$
6. $P(a_1 \le Z \le a_2) = P(Z \le a_2) - P(Z \le a_1)$; $a_2 > a_1$
Example: Let Z be a standardized variable. Find the following probabilities using the Normal Table.
a. P(Z 0.40) b. P(Z 0.63) c. P(0.40 Z 0.63)
d. P(Z 0.40) e. P(Z 0.40) e. P(1.96 Z 1.96)
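Instead of a printed Z table, the standard normal probabilities can be looked up with Python's `statistics.NormalDist`. The sketch below is illustrative only. The inequality directions are assumptions, since the signs in the example list did not survive, so it simply shows how "less than", "greater than" and "between" probabilities are obtained from the cumulative distribution function.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1

a = 0.40
p_le_a = Z.cdf(a)              # P(Z <= a), the value a Z table gives directly
p_ge_a = 1 - Z.cdf(a)          # P(Z >= a) = 1 - P(Z <= a)
p_le_minus_a = Z.cdf(-a)       # P(Z <= -a), equal to P(Z >= a) by symmetry

p_between = Z.cdf(0.63) - Z.cdf(0.40)     # P(0.40 <= Z <= 0.63)
p_central = Z.cdf(1.96) - Z.cdf(-1.96)    # P(-1.96 <= Z <= 1.96)

print(round(p_le_a, 4), round(p_ge_a, 4), round(p_between, 4), round(p_central, 4))
```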
Example: In the previous midterm examination of Math 15, a total of 160 students took the said
examination. Their scores are normally distributed with $\mu = 22$ and $\sigma = 5$. Find
the following:
c. If your teacher wishes to give a 1.0 grade to those students who obtain a score in the
90th percentile or higher, what is the minimum score?
Example: The scores on a standardized test for high school students are normally distributed with
mean 500 and standard deviation 100.
a. If you randomly selected a student taking this test, what is the probability that
student would score at least 450?
b. If you randomly selected a student taking this test, what is the probability they
would score between 450 and 600?
c. What score would a student need to get on this test to place him or her in the top
10% of all students?
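The three parts of the standardized-test example can be sketched with `statistics.NormalDist`, which handles the conversion to Z scores internally. The variable names are hypothetical, and the results are approximations of what a Z table would give.

```python
from statistics import NormalDist

scores = NormalDist(mu=500, sigma=100)

# a. P(X >= 450) = 1 - P(X <= 450)
p_at_least_450 = 1 - scores.cdf(450)

# b. P(450 <= X <= 600) = P(X <= 600) - P(X <= 450)
p_between_450_600 = scores.cdf(600) - scores.cdf(450)

# c. Score at the 90th percentile, i.e. the cutoff for the top 10%.
top_10_cutoff = scores.inv_cdf(0.90)

print(round(p_at_least_450, 4), round(p_between_450_600, 4), round(top_10_cutoff, 1))
```

Part c runs the table lookup in reverse: it starts from an area and returns the score.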
Exercises/Problem sets
1. In each case determine whether the given values can serve as the values of the probability
distribution of a random variable X that can take on the values 1, 2, and 3, explain your answers:
a. $P(X = 1) = 0.20$, $P(X = 2) = 0.40$, and $P(X = 3) = 0.40$;
b. $P(X = 1) = 0.50$, $P(X = 2) = 0.45$, and $P(X = 3) = 0.10$;
c. $P(X = 1) = \frac{10}{33}$, $P(X = 2) = \frac{1}{3}$, and $P(X = 3) = \frac{12}{33}$;
d. $P(X = 1) = 0.85$, $P(X = 2) = 0.20$, and $P(X = 3) = 0.05$.
2. For each of the following, determine whether it can serve as the probability distribution of a
random variable X :
a. $P(X = x) = \frac{1}{10}$ for $x = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10$;
b. $P(X = x) = \frac{1}{10}$ for $x = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10$;
x2
c. P( X x) for x 1, 2, 3, 4.
18
3. A doctor knows from experience that 12% of the patients to whom he prescribes a certain blood
pressure medication will have undesirable side effects. Use the formula for the binomial
distribution to calculate the probability that none of the four patients to whom he prescribes the
medication will have undesirable side effects.
4. If 40% of the mice used in an experiment will become very aggressive within two minutes after
having been administered an experimental drug, find the probability that exactly four of nine
mice that have been administered the drug will become very aggressive within two minutes.
5. An agricultural cooperative claims that 96% of the watermelons shipped out are ripe and ready to
eat. Find the probabilities that among 20 watermelons shipped out
a. at least 17 are ripe and ready to eat;
b. at least 5 are ripe and ready to eat;
c. at most 2 are ripe and ready to eat;
d. all of them are ripe and ready to eat.
6. Find the area under the normal curve that lies between the given values of Z.
a) Z = 0 and Z = 2.37
b) Z = 0 and Z = -1.94
c) Z = -1.85 and Z = 1.85
d) Z = -0.76 and Z = 1.13
e) Z = 0 and Z = 3.09
f) Z = -2.77 and Z = -0.96
8. If the weights of 600 students are normally distributed with a mean of 50 kilograms and a variance
of 16 kilograms,
a. Determine the percentage of students with weights lower than 55 kilograms.
b. How many students have weights exceeding 52 kilograms?
9. If a random variable has a normal distribution with $\mu = 77.5$ and $\sigma = 12.4$, find the probabilities
that it will take on a value
a. less than 55.1;
b. greater than 84.3;
c. between 80.0 and 90.0;
d. between 72.4 and 82.6.
10. A random variable has a normal distribution with $\sigma = 10$. If the probability that the random
variable will take on a value less than 82.5 is 0.8212, what is the probability that it will take on a
value greater than 58.3?
11. The LDL cholesterol level of adults follows the normal distribution with a mean of 4.8 and a
standard deviation of 0.6.
a. A person has moderate risk if his/her cholesterol level is more than 1 but less than 2
standard deviations above the mean. What proportion of the population has moderate
risk according to this criterion?
b. A person has high risk if his/her cholesterol level is more than 2 standard deviations above
the mean. What proportion of the population has high risk?
c. A person within 1 standard deviation of the mean has normal cholesterol risk. What
proportion of the population has normal risk?
d. What is the cholesterol level that exceeds 90% of the population?
12. Due to increasing environmental awareness in the Philippines, strict adherence to the size of
Lawaan boards being sold in the local market is imposed. In order to monitor and control the size
of the Lawaan boards, a large number of boards are measured periodically. It was found that the
actual thickness of 95% of narra boards with one-inch average thickness ranges between 28/32
inches and 36/32 inches. The thickness of these boards follows a bell-shaped curve. What is the
standard deviation, $\sigma$, of the thickness of these narra boards?
13. The local cable company is installing cable in the next barangay and proceeds to your own
barangay after completion. You are told that the time required is a normally distributed random
variable with $\mu = 24$ days and $\sigma = 2$ days. You are planning to buy a new TV. You don’t want to buy
the TV until you are 95% sure that the installation is completed. How many days should you wait
before buying the TV?
14. Mathematics in the Modern World removal exam had a mean score of 50 and a standard
deviation of 6. Assume a normal distribution.
a. What is the median?
b. What is the Z score of the mean?
c. In order to get 3.0 grade, your Z score must be 1.5 or above. What is the minimum
score necessary?
d. A Z score below -1.6 will be given 5.0 grade. What is the raw score?
e. If your raw score is 60, what is your Z score?
f. What raw score should be at the 95th percentile?
15. Five thousand students took the University entrance test. The scores were normally distributed.
Your score was in the 97.5th percentile.
a. How many people scored at or below your score?
b. Given that the mean score is 79, what is your raw score?
c. Referring to question b, what is the standard deviation if your score is 2 standard
deviations above the mean?
ESTIMATION
A good estimator must be both accurate and precise.
Accuracy measures the closeness of an estimate to its true value. Precision measures the closeness of
the different possible values of the estimator to each other. Accuracy is measured by the bias, which
is the difference between the expected value of the estimator and the
parameter. It measures how close the estimates are to the parameter. An estimator whose
bias is equal to zero is said to be an unbiased estimator of the parameter. The precision of an estimator
can be measured by its variance or by its standard error, which is the square root of the variance.
Example: Given the two estimators below, which estimator would you rather choose if the
parameter of interest has a value equal to 7?
Estimator A Estimator B
E(A) = 5 E(B) = 7
Bias(A) = -2 Bias(B) = 0
V(A) = 11 V(B) = 18
MSE(A) = ? MSE(B) = ?
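The missing MSE entries can be filled in with the standard identity MSE = variance + bias², which combines precision and accuracy into one number. This identity is not stated in the text above and is brought in here as an assumption; the function name `mse` is hypothetical.

```python
def mse(variance, bias):
    """Mean squared error, assuming MSE = variance + bias**2."""
    return variance + bias ** 2

mse_a = mse(variance=11, bias=-2)   # Estimator A: E(A) = 5, parameter = 7
mse_b = mse(variance=18, bias=0)    # Estimator B: unbiased

print(mse_a, mse_b)
```

Under this criterion the biased estimator A has the smaller MSE, so an unbiased estimator is not automatically the better choice when its variance is much larger.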
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$; n is the sample size.
$\mu$    $\bar{x}$
$P$    $p$
$\sigma^2$    $s^2$
Interval Estimation
- a point estimate with a precision is the concern of interval estimation.
- Interval estimate describes a range of values, constructed from the sample data, within
which a population parameter lies with a predetermined probability or degree of
confidence.
- Confidence interval is the interval estimate.
In estimating the mean, the margin of error or the maximum error is given by
Margin of error $(E) = Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
In the case of the sample mean, the central limit theorem assures that there is approximately
a 68% chance for the sample mean to be within one standard error of its expected value, and
about a 95% chance for the sample mean to be within two standard errors of the population
mean. Such results enable us to attach approximately 68% confidence of covering the population
mean to an interval of the form:
Sample mean $\pm$ SE of the mean = $\bar{x} \pm \frac{\sigma}{\sqrt{n}}$
and about 95% confidence of covering the population mean to an interval of the form
Sample mean $\pm$ 2(SE of the mean) = $\bar{x} \pm 2\,\frac{\sigma}{\sqrt{n}}$
A 95% confidence interval for the population mean is the range of values within about 2 standard
errors of the sample mean. In 19 out of 20 sampling experiments, we expect the resulting
interval estimate to contain the true value of the population mean.
Level of confidence
- denoted by $100(1-\alpha)\%$
- typical levels: 90%, 95%, 99%
- A relative frequency interpretation: in the long run, $100(1-\alpha)\%$ of all the confidence
intervals that can be constructed will contain the unknown parameter.
- It is wrong to say that a specific interval has a 95% probability of containing
the parameter.
- A single confidence interval is either correct or incorrect, but the confidence level
gives us an indication of the proportion of correct intervals that can be expected when
repeating the estimation procedure.
- Once an interval is constructed, we do not find out if it is actually correct.
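The relative frequency interpretation can be illustrated by simulation. The sketch below is only an illustration under assumed values: the population mean, standard deviation, sample size, number of trials and random seed are all hypothetical and do not come from the text. It repeatedly draws samples from a normal population with known $\sigma$, builds the interval $\bar{x} \pm Z_{\alpha/2}\,\sigma/\sqrt{n}$, and counts how often the interval covers the true mean.

```python
import random
from statistics import NormalDist, mean

def coverage(mu=50.0, sigma=10.0, n=30, level=0.95, trials=2000, seed=1):
    """Fraction of simulated confidence intervals that contain the true mean mu.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # Z_{alpha/2}
    margin = z * sigma / n ** 0.5                    # margin of error E
    hits = 0
    for _ in range(trials):
        xbar = mean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / trials

print(coverage())  # a proportion expected to be close to 0.95
```

Each individual interval either covers the mean or does not; only the long-run proportion is close to the stated level of confidence.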
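Case 1: Confidence interval for the difference of two means ($\sigma_1^2$ and $\sigma_2^2$ are known)
Parameter of interest: $\mu_d = \mu_1 - \mu_2$.
Assumptions: known standard deviations and normally distributed populations or large samples.
Confidence interval estimate: $(\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
= $(\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$.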
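Case 2: Confidence interval for the difference of two means ($\sigma_1^2$ and $\sigma_2^2$ are unknown
but $n_1, n_2 \ge 30$)
Parameter of interest: $\mu_d = \mu_1 - \mu_2$.
Assumptions: unknown standard deviations and normally distributed populations or large samples.
Confidence interval estimate: $(\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$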
Case 3: Confidence interval for the difference of two means ($\sigma_1^2 = \sigma_2^2$ but unknown
and $n_1, n_2 < 30$)
Parameter of interest: $\mu_d = \mu_1 - \mu_2$.
Assumptions: unknown but equal standard deviations and normally (or nearly
normally) distributed populations.
Confidence interval estimate: $(\bar{x}_1 - \bar{x}_2) \pm t_{(\alpha/2,\; n_1+n_2-2)}\sqrt{\frac{s^2_{pooled}}{n_1} + \frac{s^2_{pooled}}{n_2}}$,
where $s^2_{pooled} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ is the estimate of the common
population variance.
= $(\bar{x}_1 - \bar{x}_2) - t_{(\alpha/2,\; n_1+n_2-2)}\sqrt{\frac{s^2_{pooled}}{n_1} + \frac{s^2_{pooled}}{n_2}} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{(\alpha/2,\; n_1+n_2-2)}\sqrt{\frac{s^2_{pooled}}{n_1} + \frac{s^2_{pooled}}{n_2}}$.
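$x_1$ = number of successes in the sample
$p_1 = \frac{x_1}{n_1}$ (the sample proportion)
$q_1 = 1 - p_1$
The corresponding meanings are attached to $P_2$, $x_2$, $n_2$, $p_2$ and $q_2$, which come from
population 2.
Confidence interval estimate: $(p_1 - p_2) \pm Z_{\alpha/2}\sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}}$
= $(p_1 - p_2) - Z_{\alpha/2}\sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}} \le P_1 - P_2 \le (p_1 - p_2) + Z_{\alpha/2}\sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}}$
where $p = \frac{x_1 + x_2}{n_1 + n_2}$ ; $q = 1 - p$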
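$\bar{d} = \frac{\sum_{i=1}^{n} d_i}{n}$ and $s_d^2 = \frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}$ ; $s_d = \sqrt{s_d^2}$
3. The confidence interval is then $\bar{d} \pm t_{(\alpha/2,\, v)}\,\frac{s_d}{\sqrt{n}}$
with v = n - 1 degrees of freedom.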
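i. Determining the sample size for Estimating the Population Mean:
$n = \frac{Z_{\alpha/2}^2\,\sigma^2}{\text{Error}^2}$
Example: What sample size is needed to be 95% confident of being correct within $\pm 10$? A pilot
study suggested that the standard deviation is 35.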
ii. Determining the sample size for Estimating the Population Proportion:
$n = \frac{Z_{\alpha/2}^2\; p(1 - p)}{\text{Error}^2}$
Example: A pollster is hired to determine the percentage of voters favoring the opposition
party presidential candidate. If we require 99% confidence that the estimated
value is within two percent of the true percentage, how large should the random
sample be?
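The two sample size formulas can be checked with a short sketch. It assumes the formulas $n = Z^2\sigma^2/\text{Error}^2$ and $n = Z^2 p(1-p)/\text{Error}^2$, rounds the result up to the next whole number, and, for the pollster example, assumes p = 0.5 because no prior estimate of the proportion is given (this is the most conservative choice). The function names are illustrative only.

```python
import math
from statistics import NormalDist

def z_value(confidence: float) -> float:
    """Two-sided critical value Z_{alpha/2} for a given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def n_for_mean(sigma: float, error: float, confidence: float) -> int:
    """Sample size for estimating a mean within +/- error."""
    return math.ceil((z_value(confidence) * sigma / error) ** 2)

def n_for_proportion(error: float, confidence: float, p: float = 0.5) -> int:
    """Sample size for estimating a proportion within +/- error.

    p = 0.5 is an assumption used when no prior estimate is available.
    """
    return math.ceil(z_value(confidence) ** 2 * p * (1 - p) / error ** 2)

print(n_for_mean(sigma=35, error=10, confidence=0.95))   # mean example
print(n_for_proportion(error=0.02, confidence=0.99))     # pollster example
```

With rounded table values such as Z = 2.576 the second result can differ by one unit from the value computed with the unrounded Z.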
Ethical Issues:
i. Confidence interval (reflects sampling error) should always be reported along with
the point estimate.
ii. The level of confidence should always be reported.
iii. The sample size should be reported.
iv. The interpretation of the confidence interval estimate should also be provided.
Exercises/Problem sets
1. Use the given data to find the maximum error of estimate E. Be sure to use the correct expression
for E, depending on whether the normal distribution or Student t distribution applies.
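a. $\alpha = 0.05$, $\sigma = 10$, $n = 50$;
b. $\alpha = 0.05$, $\sigma = 10$, $n = 64$;
c. $\alpha = 0.01$, $\sigma = 10$, $n = 50$;
d. $\alpha = 0.01$, $\sigma = 10$, $n = 64$;
e. $\alpha = 0.05$, $s = 10$, $n = 100$;
f. $\alpha = 0.05$, $\sigma = 10$, $n = 25$;
g. $\alpha = 0.05$, $s = 10$, $n = 20$;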
2. A random sample of the midterm grades of 35 Math students was obtained. The grades are shown
below.
88 74 79 89 93 89 86 79 87 88
85 91 93 71 85 90 86 72 84 88
88 95 87 85 86 79 85 94 93 90
85 88 86 87 91
Answer the following:
a. Estimate the mean midterm grade of Math students;
b. Estimate the standard deviation midterm grade of Math students.
c. Estimate the proportion of students who failed during the midterm if the passing grade is
75.
4. Refer to question 2, construct a 95% confidence interval for the proportion of Math students who
pass the midterm examination.
5. A poll of 121 randomly selected car owners revealed that the mean length of time that they plan to
keep their cars is 7.01 years and the standard deviation is 3.75 years. Construct a 95% confidence
interval for the mean length of time all car owners want to keep their cars, include the
interpretation.
HYPOTHESIS TESTING: FOR ONE POPULATION CASE
AND TWO POPULATION CASE
Inferential statistics has two areas: estimation and hypothesis testing. Hypothesis testing is an
area of statistical inference in which one evaluates a conjecture about some characteristic of the
population based upon the information contained in a random sample. Usually the conjecture
concerns one of the unknown parameters of the population. A hypothesis is a claim or statement about
the population parameter.
Null Hypothesis:
- denoted by $H_o$
- the statement being tested
- it represents what the experimenter doubts to be true
- must contain the condition of equality and must be written with the symbol
$=$, $\le$, or $\ge$.
Alternative Hypothesis:
- denoted by $H_a$
- is the statement that must be true if the null hypothesis is false
- the operational statement or the theory that the experimenter believes to be true
and wishes to prove
- is sometimes referred to as the research hypothesis
Test of Significance:
- A test of significance is a problem of deciding between the null and the alternative
hypotheses on the basis of the information contained in a random sample.
- The goal will be to reject $H_o$ in favor of $H_a$, because the alternative is the
hypothesis that the researcher believes to be true. If we are successful in
rejecting $H_o$, we then declare the results to be “significant”.
1. Type I Error – the mistake of rejecting the null hypothesis when it is true.
It is not a miscalculation or procedural misstep; it is an actual error that can occur.
The probability of rejecting the null hypothesis when it is true is called the significance
level ($\alpha$).
The value of $\alpha$ is predetermined, and very common choices are
0.05 and 0.01.
2. Type II Error – the mistake of failing to reject the null hypothesis when it is false.
The symbol $\beta$ (beta) is used to represent the probability of a type II error.
Test Statistic:
- A statistic computed from the sample data that is especially sensitive to the
differences between $H_o$ and $H_a$.
- The test statistic should tend to take on certain values when $H_o$ is true and different
values when $H_a$ is true.
- The decision to reject $H_o$ depends on the value of the test statistic.
- A decision rule based on the value of the test statistic: Reject $H_o$ if the computed
value of the test statistic falls in the region of rejection.
Critical Value/s:
- The value or values that separate the critical region from the values of the test
statistic that would not lead to rejection of the null hypothesis.
- Depends on the nature of the null hypothesis, the relevant sampling distribution, and
the level of significance.
Types of Tests:
1. Two-tailed Test – if we are primarily concerned with deciding whether the true value of a
population parameter is different from a specified value, then the test should be two-tailed.
For the case of the mean, $H_a: \mu \ne \mu_o$.
2. Left-tailed Test – if we are primarily concerned with deciding whether the true value of a
population parameter is less than a specified value, then the test should be left-tailed. For
the case of the proportion, $H_a: P < P_o$.
3. Right-tailed Test – if we are primarily concerned with deciding whether the true value of a
population parameter is greater than a specified value, then the test should be right-tailed.
For the case of the difference of two population means, $H_a: \mu_1 - \mu_2 > 0$.
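[Figure: rejection regions under the sampling distribution, shown for a one-tailed test (for example $H_a: \mu > 25$, with area $\alpha$ in one tail) and for a two-tailed test (for example $H_a: \mu \ne 25$, with area $\alpha/2$ in each tail).]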
Probability Value or p-value
- the smallest level of significance at which the null hypothesis will be rejected based
on the information contained in the sample.
- the actual or observed value of the probability of a Type I error.
- the smaller the p-value, the stronger the evidence for rejecting $H_o$.
- an alternative form of decision rule: reject $H_o$ if the p-value is less than or equal to
the level of significance ($\alpha$).
- represents the chance of generating a value as extreme as the observed value of the
test statistic, or something more extreme, if the null hypothesis were true.
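Summary of the Tests Concerning the Population Mean
Case 1: $\sigma$ is known. Test statistic: $z_c = \frac{\bar{x} - \mu_o}{\sigma/\sqrt{n}}$
    Ho                 Ha                   Region of Rejection
    $\mu = \mu_o$      $\mu < \mu_o$        $z_c < -z_{\alpha}$
    $\mu = \mu_o$      $\mu > \mu_o$        $z_c > z_{\alpha}$
    $\mu = \mu_o$      $\mu \ne \mu_o$      $z_c < -z_{\alpha/2}$ or $z_c > z_{\alpha/2}$
Case 2: $\sigma$ is unknown and $n < 30$. Test statistic: $t_c = \frac{\bar{x} - \mu_o}{s/\sqrt{n}}$, where $v = n - 1$
    Ho                 Ha                   Region of Rejection
    $\mu = \mu_o$      $\mu < \mu_o$        $t_c < -t_{(\alpha,\, v)}$
    $\mu = \mu_o$      $\mu > \mu_o$        $t_c > t_{(\alpha,\, v)}$
    $\mu = \mu_o$      $\mu \ne \mu_o$      $t_c < -t_{(\alpha/2,\, v)}$ or $t_c > t_{(\alpha/2,\, v)}$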
Remarks: 1. The above tests concerning the population mean are exact level $\alpha$ tests
for samples from a normal distribution. However, they provide good
approximate level $\alpha$ tests when the distribution is not normal, provided that the
sample size is $n \ge 30$.
2. If $\sigma$ is unknown and $n \ge 30$, use the Z-test but replace $\sigma$ by s, that is,
$z_c = \frac{\bar{x} - \mu_o}{s/\sqrt{n}}$
Tabulated z-values for the common choices of $\alpha$ for both one-tailed and two-tailed tests
$\alpha$                               0.01                0.05                0.10
One-tailed test ($Z_{\alpha}$)         2.33 or -2.33       1.645 or -1.645     1.28 or -1.28
Two-tailed test ($Z_{\alpha/2}$)       2.576 and -2.576    1.96 and -1.96      1.645 and -1.645
Test of Hypothesis Concerning One Population Mean ($\sigma$ is known and large sample) Example
Problem: The mean weight of a sample of 100 persons from the Honolulu Heart Study is 63 kg.
If the ideal weight is known to be 60 kg, is the group significantly overweight? Assume $\sigma = 10$ kg
and $\alpha = 0.05$.
Solution: Step 1. $H_o: \mu = 60$ kg
$H_a: \mu > 60$ kg
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: $z_c = \frac{\bar{x} - \mu_o}{\sigma/\sqrt{n}}$, since $\sigma$ is known and
the sample size is large
Step 4. Reject $H_o$ if $z_c > 1.645$
Step 5. $z_c = \frac{63 - 60}{10/\sqrt{100}} = 3$
Step 6. Reject $H_o$ since $z_c > 1.645$.
If the decision is based on the p-value: p-value = $P(\bar{X} \ge 63)$
= $P\left(Z \ge \frac{63 - 60}{10/\sqrt{100}}\right)$
= $P(Z \ge 3)$
= $1 - P(Z < 3)$
= 1 – 0.9987
= 0.0013
Through the p-value result, $H_o$ is rejected since p-value < 0.05.
Step 7. There is sufficient evidence to warrant rejection of the claim that the mean
body weight is 60 kg.
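The computations in Steps 4 to 6 and the p-value can be reproduced with a short sketch. It uses only the Python standard library; the function name `z_test_mean` is illustrative, and the critical value is computed from the normal distribution rather than read from the table.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(xbar: float, mu0: float, sigma: float, n: int):
    """Right-tailed z test for one mean with known sigma: returns (z_c, p-value)."""
    z_c = (xbar - mu0) / (sigma / sqrt(n))
    p_value = 1 - NormalDist().cdf(z_c)      # P(Z >= z_c)
    return z_c, p_value

z_c, p_value = z_test_mean(xbar=63, mu0=60, sigma=10, n=100)
critical = NormalDist().inv_cdf(1 - 0.05)    # about 1.645 for alpha = 0.05

print("z_c =", z_c)
print("p-value =", round(p_value, 4))
print("reject Ho:", z_c > critical)
```

Both decision rules lead to the same conclusion here: the test statistic exceeds the critical value and the p-value is below 0.05.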
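Solution: Step 1. $H_o: \mu = 75$
$H_a: \mu > 75$
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: $t_c = \frac{\bar{x} - \mu_o}{s/\sqrt{n}}$, since $\sigma$ is unknown and
the sample size is less than 30.
Step 4. Reject $H_o$ if $t_c > 1.729$
Step 5. $t_c = \frac{78.0500 - 75}{5.5958/\sqrt{20}} = 2.438$ ; $\bar{x} = \frac{79 + 79 + \ldots + 83}{20} = 78.0500$
$s^2 = \frac{(79^2 + 79^2 + \ldots + 83^2) - \frac{(79 + 79 + \ldots + 83)^2}{20}}{20 - 1} = 31.3132$
$s = \sqrt{s^2} = \sqrt{31.3132} = 5.5958$
Step 6. Reject $H_o$ since $t_c > 1.729$.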
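Step 5 can be verified from the summary values alone. The sketch below is a minimal check that takes the sample mean, the sample standard deviation and the tabulated critical value 1.729 from the solution above; the raw scores are not reproduced because the text abbreviates them.

```python
from math import sqrt

def t_statistic(xbar: float, mu0: float, s: float, n: int) -> float:
    """One-sample t statistic computed from summary values."""
    return (xbar - mu0) / (s / sqrt(n))

T_CRITICAL = 1.729   # t(0.05, v = 19), the table value used in Step 4

t_c = t_statistic(xbar=78.05, mu0=75, s=5.5958, n=20)
print("t_c =", round(t_c, 3))
print("reject Ho:", t_c > T_CRITICAL)
```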
n = number of trials
$p = \frac{x}{n}$ (sample proportion); x is the number of successes out of n trials.
P = population proportion (used in the null hypothesis)
q = 1 - p
Problem: At the 0.10 significance level, test the claim that the proportion of females P at Central
Mindanao University equals 0.60. Sample data consist of $n = 100$, of which 68 are females.
Solution: Step 1. $H_o: P = 0.6$
$H_a: P \ne 0.6$
Step 2. $\alpha = 0.10$
Step 3. Appropriate test statistic: $Z = \frac{x - np}{\sqrt{npq}}$, since it satisfies the
properties of the binomial experiment and np and nq are both
greater than 5. In the test statistic, p and q are evaluated at the proportion
stated in the null hypothesis (p = 0.6, q = 0.4).
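The remaining steps of this problem can be sketched as follows. This is only an illustrative computation, assuming the test statistic $Z = (x - np)/\sqrt{npq}$ is evaluated at the hypothesized proportion 0.6 and that the two-tailed critical value for $\alpha = 0.10$ is used; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_proportion(x: int, n: int, p0: float) -> float:
    """z statistic for one proportion, with p and q taken from the null hypothesis."""
    q0 = 1 - p0
    return (x - n * p0) / sqrt(n * p0 * q0)

alpha = 0.10
z_c = z_test_proportion(x=68, n=100, p0=0.6)
critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.645 (two-tailed)

print("z_c =", round(z_c, 3))
print("reject Ho:", abs(z_c) > critical)
```

Under these assumptions the computed statistic falls just inside the critical values, so the null hypothesis would not be rejected at the 0.10 level.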
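Test statistics for the difference of two population means:
1. $\sigma_1$ and $\sigma_2$ are known: $Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
2. $\sigma_1$ and $\sigma_2$ are unknown: $Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
3. $\sigma_1$ and $\sigma_2$ are unknown, but equal: $T = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s^2_{pooled}}{n_1} + \frac{s^2_{pooled}}{n_2}}}$
Summary of the Tests Concerning Two Population Means for Two Independent Samples
Case 1: $\sigma_1$ and $\sigma_2$ known. Test statistic: $z_c = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
    Ho                       Ha                         Region of Rejection
    $\mu_1 - \mu_2 = 0$      $\mu_1 - \mu_2 < 0$        $z_c < -z_{\alpha}$
    $\mu_1 - \mu_2 = 0$      $\mu_1 - \mu_2 > 0$        $z_c > z_{\alpha}$
    $\mu_1 - \mu_2 = 0$      $\mu_1 - \mu_2 \ne 0$      $z_c < -z_{\alpha/2}$ or $z_c > z_{\alpha/2}$
For $n_1 < 30$ and $n_2 < 30$, the regions of rejection have the same form, with $t_c$ and $t_{(\alpha,\, v)}$
in place of $z_c$ and $z_{\alpha}$.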
Remark: Two samples are independent if the sample selected from one population has no effect on
the sample selected from the other population. If the two samples are not independent, they are
dependent.
Testing the Difference between Two Population Means (independent samples) Example
Problem: A study was conducted to compare the length of time it took male and female students
from the same year level and college to answer a 50-item IQ test. Independent samples of 50
male students and 50 female students were asked to take the test, in which each person was
timed. The results were as follows:
MALE                              FEMALE
$n_1 = 50$                        $n_2 = 50$
$\bar{x}_1 = 42$ minutes          $\bar{x}_2 = 38$ minutes
$s_1^2 = 18$                      $s_2^2 = 14$
Did the data present sufficient evidence to suggest a difference between the true mean
completion times of male and female students at the 5% level of significance?
Solution: Step 1. $H_o: \mu_1 = \mu_2$
$H_a: \mu_1 \ne \mu_2$
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: $z_c = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
Step 4. Reject $H_o$ if $z_c > 1.96$ or $z_c < -1.96$
Step 5. $z_c = \frac{42 - 38}{\sqrt{\frac{18}{50} + \frac{14}{50}}} = 5.0$
Step 6. Since $z_c > 1.96$, we reject the null hypothesis at the 5% level of
significance.
Step 7. There is sufficient evidence that the mean completion times of male
and female students differ significantly.
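The test statistic in Step 5 can be reproduced with a few lines. The sketch below uses the sample values given in the problem and the two-tailed critical value 1.96 from the table; the function name is illustrative.

```python
from math import sqrt

def two_sample_z(xbar1, xbar2, var1, var2, n1, n2):
    """z statistic for the difference of two means with large independent samples."""
    return (xbar1 - xbar2) / sqrt(var1 / n1 + var2 / n2)

z_c = two_sample_z(xbar1=42, xbar2=38, var1=18, var2=14, n1=50, n2=50)
print("z_c =", round(z_c, 4))
print("reject Ho:", abs(z_c) > 1.96)   # two-tailed test at alpha = 0.05
```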
Testing the Difference between Two Population Means on Two Related Samples ($\mu_d = 0$)
Follow the steps presented in constructing the confidence interval for paired samples. The
test statistic is $t_c = \frac{\bar{d}}{s_d/\sqrt{n}}$.
Problem: A study was conducted to investigate the effectiveness of hypnotism in reducing pain.
Results for randomly selected subjects are given below. At the 0.05 significance level, test the
claim that the sensory measurements are lower after hypnotism (The values are before and after
hypnosis. The measurements are in centimeters on the mean visual analog scale, and the data are
based on “An analysis of Factors that Contribute to the Efficacy of Hypnotic Analgesia,” by Price
and Barber, Journal of Abnormal Psychology, Vol. 96, No. 1.)
Solution: Since each pair of scores is matched for one particular person, we can conclude that the
values are dependent. Each difference is the “before” score minus the “after” score. If the
hypnotism is effective, we would expect the after scores to be lower, so that the mean difference
is significantly greater than 0.
Step 1. $H_o: \mu_d = 0$
$H_1: \mu_d > 0$
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: $t_c = \frac{\bar{d}}{s_d/\sqrt{n}}$
Step 4. Reject $H_o$ if $t_c > t_{(\alpha,\; n-1)}$
$t_c > t_{(0.05,\; 8-1)}$
$t_c > 1.895$
Step 5. $t_c = \frac{3.125 - 0}{2.911/\sqrt{8}} = 3.036$
Step 6. Reject $H_o$ at the 0.05 level of significance since $t_c > 1.895$.
Step 7. There is sufficient evidence to support the claim that the after scores are
significantly lower than the before scores.
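Step 5 can be checked from the summary values of the differences. The sketch below takes $\bar{d} = 3.125$, $s_d = 2.911$ and $n = 8$ from the solution above (the individual before and after scores are not reproduced in this text) and uses the tabulated critical value 1.895.

```python
from math import sqrt

def paired_t(d_bar: float, s_d: float, n: int) -> float:
    """Paired-sample t statistic computed from the mean and SD of the differences."""
    return d_bar / (s_d / sqrt(n))

T_CRITICAL = 1.895   # t(0.05, v = 7), the table value used in Step 4

t_c = paired_t(d_bar=3.125, s_d=2.911, n=8)
print("t_c =", round(t_c, 3))
print("reject Ho:", t_c > T_CRITICAL)
```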
Problem: A survey of 100 women and 100 men indicated that 49 of the women and 35 of the men
said they are trying to lose weight. Is there a significant difference between the proportions of
women and men trying to lose weight? Test at the 0.10 level of significance.
Exercises/Problem sets
1. For each of the following, state the null ($H_o$) and alternative ($H_1$) hypotheses.
a. The mean age of CMU Statistics students is at least 17.
b. The mean waistline of UE Statistics students is at most 29 inches.
c. The mean weekly allowance of Senior High School students is 500 pesos.
d. The mean height of the male basketball varsity is 173.00 cm.
e. The mean weight of girls at birth is at most 7 lbs.
f. The mean age of CMU teachers is more than 31 years.
g. The CMU administrative council claims that the mean rental of CMU dormitories is significantly
lower than the 75 pesos of other state university dormitories in the Philippines.
h. A teacher claims that the mean score of her students in the midterm examination exceeds by
10 points the mean of all students who took the exam.
i. Statistics students believe that 95% of them will obtain a grade of at least 2.25.
j. The USEP guidance counselor claims that 60 percent of USEP students have an average IQ.
4. What is the critical value for a test of significance in each of the following situations?
a. right-tailed test, $\alpha = 0.05$, $\sigma$ known, $n = 24$
b. right-tailed test, $\alpha = 0.05$, $\sigma$ unknown, $n = 15$
c. right-tailed test, $\alpha = 0.01$, $\sigma$ known, $n = 24$
d. right-tailed test, $\alpha = 0.01$, $\sigma$ unknown, $n = 15$
e. left-tailed test, $\alpha = 0.05$, $\sigma$ known, $n = 35$
f. left-tailed test, $\alpha = 0.05$, $\sigma$ unknown, $n = 15$
g. left-tailed test, $\alpha = 0.10$, $\sigma$ known, $n = 35$
h. left-tailed test, $\alpha = 0.10$, $\sigma$ unknown, $n = 15$
i. two-tailed test, $\alpha = 0.05$, $\sigma$ known, $n = 35$
j. two-tailed test, $\alpha = 0.05$, $\sigma$ unknown, $n = 15$
k. two-tailed test, $\alpha = 0.05$, $\sigma$ known, $n = 24$
l. two-tailed test, $\alpha = 0.05$, $\sigma$ unknown, $n = 15$
5. For each part of question #4, decide whether you should reject $H_o$ or fail to reject $H_o$ according
to the corresponding computed test statistic.
a. 2.0 b. 2.0 c. 2.0 d. 2.0 e. -1.50 f. -1.50 g. -1.50 h. -1.50
i. 2.0 or -2.0 j. 2.0 or -2.0 k. 1.5 or -1.5 l. 1.5 or -1.5
6. Find the critical Z value(s) for the given conditions. In each case assume that the normal
distribution applies. Also, draw a graph showing the critical value(s) and critical region(s)
a. right-tailed test; $\alpha = 0.05$
b. left-tailed test; $\alpha = 0.05$
c. right-tailed test; $\alpha = 0.10$
d. left-tailed test; $\alpha = 0.10$
e. two-tailed test; $\alpha = 0.05$
f. two-tailed test; $\alpha = 0.10$
7. At $\alpha = 0.05$, decide whether to reject or fail to reject $H_o$ for the following computed
p-values:
a. 0.02524 b. 0.5055 c. 0.1028 d. 0.04987
In each of the following exercises, test the given hypotheses by following the steps in hypothesis
testing.
_
8. Test the claim that 110 given a sample of n 78 for which x 115. Assume that 3, and
test at 0.05 significance level.
_
9. Test the claim that 110 given a sample of n 78 for which x 105 Assume that 3, and
test at 0.05 significance level.
_
10. Test the claim that 110 given a sample of n 20 for which x 115. Assume that 3,
and test at 0.05 significance level.
_
11. Test the claim that 110 given a sample of n 20 for which x 105. Assume that 3,
and test at 0.05 significance level.
_
12. Test the claim that 110 given a sample of n 20 for which x 115. Assume that s 3,
and test at 0.05 significance level.
_
13. Test the claim that 110 given a sample of n 20 for which x 105. Assume that s 3,
and test at 0.05 significance level.
_
14. Test the claim that 110 given a sample of n 20 for which x 115. Assume that s 3,
and test at 0.05 significance level.
_
15. Test the claim that 110 given a sample of n 20 for which x 105. Assume that s 3,
and test at 0.05 significance level.
We have learned procedures for testing the hypothesis that two population means are equal
($H_o: \mu_1 = \mu_2$). In this chapter, we test the hypothesis that differences among three or more sample
means are due to chance. A typical null hypothesis will be $H_o: \mu_1 = \mu_2 = \ldots = \mu_k$, where k is the
number of means being compared.
A statistical test to determine if k population means are equal: The One - Way Analysis of Variance
The analysis of variance is used to test the hypothesis that the means of three or
more populations are the same against the alternative hypothesis that not all
population means are the same.
It is called the analysis of variance because the test is based on the analysis of
variance in the data obtained from different samples.
Only one factor or variable is analyzed in using the one-way Analysis of Variance (ANOVA).
5. Variance between samples (MSB) measures the variability caused by differences among the
sample means that correspond to the different treatments or categories of classification. From
the above test statistic, with all samples of the same size n,
MSB = $n\,s_{\bar{x}}^2$
6. Variation within samples (MSW) is the pooled variance obtained by finding the mean of the sample
variances, which provides a good estimate of the common population variance.
MSW = $s_p^2$
7. Interpreting F
If the two estimates of variance are close, the calculated value of F will be close to 1,
and we conclude that there are no significant differences among the sample means.
If the value of F is excessively large, then reject the claim of equal means.
8. Steps in hypothesis testing for more than two population means follow as presented in unit 8.
9. Testing the Difference for more than two Population Means Example
Problem: The CMU Mathematics Department would like to know if students' Math 11 scores differ when
they are grouped according to college (ABC, ABS, RPN and IBC). A random sample of 10 students per
college was obtained. At the 0.05 level of significance, do the data below provide evidence that
Math 11 scores differ? Assume that Math 11 scores follow the normal distribution with different
colleges having homogeneous variance.
Solution: Step 1. $H_o: \mu_1 = \mu_2 = \mu_3 = \mu_4$
$H_1$: The preceding means are not all equal.
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: $F_c = \frac{MSB}{MSW}$
Step 4. Reject $H_o$ if $F_c > F_{\alpha,\, df}$, where n is the number of observations for each category
and k is the number of classifications:
$F_c > F_{\alpha,\,(k-1,\; k(n-1))}$
$F_c > F_{0.05,\,(4-1,\; 4(10-1))}$
$F_c > 2.84$
Step 5. Computation: $F_c = \frac{10\left[\frac{\sum_{i=1}^{4} (\bar{x}_i - \bar{\bar{x}})^2}{k - 1}\right]}{\frac{s_1^2 + s_2^2 + s_3^2 + s_4^2}{4}} = \frac{10(7.8825)}{\frac{60.5872}{4}} = 5.2040$
Step 6. Reject $H_o$ at the 0.05 level of significance since $F_c = 5.2040 > 2.84$.
Step 7. There is sufficient evidence that the Math 11 scores of students of
CMU colleges statistically differ.
SUMMARY
Groups    Count    Sum    Average    Variance
ABC       10       494    49.4       32.71111
ABS       10       550    55         6.888889
RPN       10       490    49         10.88889
IBC       10       499    49.9       10.1
ANOVA
Source of Variation    SS         df    MS          F           P-value     F crit
Between Groups         236.475    3     78.825      5.203924    0.004341    2.866266
Within Groups          545.3      36    15.14722
Total                  781.775    39
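The MS and F entries of the ANOVA table can be recomputed from the SUMMARY table alone. The sketch below assumes equal group sizes, so that MSB is n times the sample variance of the group means and MSW is the mean of the group variances; the function name is illustrative.

```python
from statistics import mean, variance

def one_way_f(group_means, group_variances, n):
    """MSB, MSW and F for k groups of equal size n, from summary statistics."""
    msb = n * variance(group_means)     # n times the sample variance of the group means
    msw = mean(group_variances)         # pooled variance when group sizes are equal
    return msb, msw, msb / msw

# Averages and variances taken from the SUMMARY table (ABC, ABS, RPN, IBC)
group_means = [49.4, 55, 49, 49.9]
group_variances = [32.71111, 6.888889, 10.88889, 10.1]

msb, msw, f_c = one_way_f(group_means, group_variances, n=10)
print("MSB =", round(msb, 3))
print("MSW =", round(msw, 5))
print("F   =", round(f_c, 4))
```

The computed values agree with the MS and F columns of the ANOVA table up to rounding.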
The statistical tools presented in the preceding topics are for quantitative data. What about
qualitative data? Qualitative data are data obtained from a variable whose values are usually
expressed in categories. For example, we may classify students into categories such as male or
female, university scholar or not a university scholar. For qualitative data the result is frequency
data, since we count the number of observations falling in each category. Frequency data are also
called categorical data. Categorical data are presented in a contingency table. A contingency table
(or two-way contingency table) is a table in which frequencies correspond to two variables (one
variable is used to categorize rows, and a second variable is used to categorize columns). The
chi-square test is used as a test of significance when we have data that are expressed in
frequencies, or data in terms of percentages or proportions that can be readily transformed into
frequencies. This test is used in the following:
i. goodness-of-fit test
ii. test for independence
iii. test of homogeneity
The goodness-of-fit test is used when we would like to know if the data on hand conform to a
theoretical distribution. The test statistic is
$\chi^2 = \sum \frac{(O - E)^2}{E}$
where: O represents the observed frequency of an outcome.
E represents the theoretical or expected frequency of an outcome.
E = np ; n is the sample size and p is the probability that an element
belongs to the category if the null hypothesis is true.
The number of degrees of freedom (df) is equal to the number of possible outcomes minus
one. Note that close agreement between observed and expected values will lead to a small value
of $\chi^2$. A large value of $\chi^2$ will indicate strong disagreement between observed and expected
values. A significantly large value of $\chi^2$ will therefore cause rejection of the null hypothesis of no
difference between observed and expected frequencies.
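The goodness-of-fit statistic can be sketched in a few lines. The observed counts below are hypothetical (120 tosses of a die, invented only for illustration) and the function name is illustrative; the critical value 11.070 is the tabulated chi-square value for df = 5 at the 0.05 level.

```python
def chi_square_gof(observed, probabilities):
    """Chi-square goodness-of-fit statistic and its degrees of freedom."""
    n = sum(observed)
    expected = [n * p for p in probabilities]            # E = np for each category
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

# Hypothetical data: counts of faces 1 to 6 in 120 tosses of a die
observed = [18, 22, 16, 25, 19, 20]
stat, df = chi_square_gof(observed, [1 / 6] * 6)

CRITICAL = 11.070   # chi-square table value for df = 5, alpha = 0.05
print("chi-square =", round(stat, 4), "df =", df)
print("reject Ho:", stat > CRITICAL)
```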
Tests of hypotheses can also be made about experiments with more than two possible outcomes (or
categories). Such experiments, called multinomial experiments, possess four characteristics.
A binomial experiment is a special case of a multinomial experiment.
Multinomial Experiment Assumptions
i. We intend to test a hypothesis that for the k categories of outcomes in a multinomial
experiment, the population proportion for each of the k categories is as claimed.
ii. The sample data consist of frequency counts for the k different categories, and the data
constitute a random sample.
iii. For every one of the k categories, the expected frequency is at least 5.
Goodness-of-fit test example: Consider the case study of 147 industrial accidents that required
medical attention. The sample data are summarized in Table 9.1.3. Using a 0.05 significance
level, test the claim that accidents occur on the five days with equal frequencies.
- The test statistic is based on the chi-square distribution. The chi-square test
statistic is given below:
$\chi^2 = \sum \frac{(O - E)^2}{E}$
where: O are the observed frequencies
E are the expected frequencies, $E = \frac{(\text{row total})(\text{column total})}{\text{grand total}}$
Remarks: 1. The test statistic allows us to measure the degree of disagreement between the
frequencies actually observed and those that we would theoretically expect
when the two variables are independent.
2. Small values of the test statistic result from close agreement between observed
frequencies and frequencies expected with independent row and column
variable.
3. Large values of the test statistic are to the right of the chi-square distribution,
and they reflect significant differences between observed and expected
frequencies.
Problem: In a study of 1,000 randomly selected deaths of males aged 45-64, the causes of death
are listed along with their smoking habits (see Table 9.1.4, which is based on data from
“Chartbook on Smoking, Tobacco, and Health,” USDHEW Publication CDC75-7511)
Table 9.1.4: Deaths by Smoking Status and Cause of Death (numbers enclosed in
parentheses are the expected values)
                   Cause of Death
Smoking Status     Cancer         Heart Disease    Other           Row Total
Smoker             135 (123.50)   310 (302.25)     205 (224.25)    650
Nonsmoker          55 (66.50)     155 (162.75)     140 (120.75)    350
Column Total       190            465              345             1,000
Test at the 0.05 level of significance that smoking status is independent of the cause of death.
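The expected values in the table and the resulting test statistic can be recomputed from the observed counts. The sketch below assumes the usual degrees of freedom for a contingency table, (rows - 1)(columns - 1), which is not stated in the text above, and uses the tabulated chi-square critical value 5.991 for df = 2 at the 0.05 level; the function name is illustrative.

```python
def chi_square_independence(table):
    """Chi-square statistic, degrees of freedom and expected counts for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, expected

# Observed counts from Table 9.1.4 (rows: smoker, nonsmoker)
observed = [[135, 310, 205],
            [55, 155, 140]]
stat, df, expected = chi_square_independence(observed)

CRITICAL = 5.991   # chi-square table value for df = 2, alpha = 0.05
print("expected =", expected)
print("chi-square =", round(stat, 3), "df =", df)
print("reject Ho:", stat > CRITICAL)
```

Under these assumptions the statistic exceeds the critical value, which would lead to rejecting the hypothesis that smoking status and cause of death are independent.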
Test of Homogeneity – tests the claim that different populations have the same proportions of some
characteristics. The Chi-square Test for Homogeneity is performed in the same way as the Test for
Independence.
Table 9.1.5: Observed and Expected Frequencies for the Voter Sentiment Survey
Voter Sentiment Quezon City Manila Row Total
Favor Digong 204 (216) 228 (216) 432
Favor Noynoy 215 (206) 197 (206) 412
Favor Another 81 (78) 75 (78) 156
Column Total 500 500 1,000
For very small frequencies, the test is not a good approximation. The approximation is
usually considered adequate provided that the expected frequencies in all cells are at
least 5. If some frequencies are below 5, this requirement may be met by combining
two rows or columns before computing the test statistic. A corresponding reduction
in the degrees of freedom must then be made to account for the smaller number of
cells.
For one degree of freedom, Yates' correction for continuity should be applied:
$\chi^2 = \sum \frac{(|O - E| - 0.5)^2}{E}$
The observations should be independent of one another, that is, the total number of
observed frequencies should not be more than the total number of individuals in the
sample.
The chi-square test does not apply to data expressed as percentage or proportions,
unless they can be transformed into frequencies.
Problems/Exercises
2. ANOVA Table
SOURCE     SS     DF     MS      F
Between    345    4      ___     ___
Within     260    ___    ___
Total      ___    49
3. ANOVA Table
SOURCE     SS     DF     MS      F
Between    ___    3      50      ___
Within     ___    32     12.5
Total      ___    35
4. Forty freshmen students were randomly assigned to four groups of experiment with four different
methods of teaching College Algebra. At the end of the semester, the same test was given to all
40 students. The table gives the scores of students in the four groups.
Test that the mean scores of all four groups of freshmen students taught by four different
methods are not equal. Assume that all the required assumptions hold true. Use 0.05 level of
significance.
5. The table below provides the data that represent the number of hours of pain relief provided by 5
different brands of headache tablets administered to 25 subjects. The 25 subjects were randomly
selected, divided into 5 groups and each group was treated with a different brand.
6. Stain Less advertises that its detergent will remove all stains, except oil-base paint, in any kind of
water. Consumer Action is evaluating this claim. Batches of were run in 5 randomly chosen homes
having a particular type of water – hard, moderate or soft. Each batch contains an assortment of rags
and cloth scraps stained with food products, grease, and dirt over a 150 square inch area. After
washing the number of square inches that were still stained was determined and the following results
were obtained.
7. If you find 10 lemons, 13 apples, and 27 mangoes, what is the value of $\chi^2 = \sum \frac{(O - E)^2}{E}$ to test
the hypothesis that the relative frequencies of the three fruits are 25%, 25% and 50%,
respectively?
8. Using the value of the $\chi^2$ statistic in question #7, does the ratio of fruits match that of the
25-25-50 relative frequency? Test at the 0.05 level of significance.
9. A bank has an ATM installed inside the bank and it is available to its customers only from 7:00 AM
to 6:00 PM Monday through Friday. The manager of the bank wanted to investigate if the
percentage of transactions made on this ATM is the same for each of the 5 days (Monday through
Friday) of the week. The manager randomly selected one week and counted the number of
transactions made on this ATM from Monday to Friday. The information the manager obtained is
given in Table 9.2.5, where the number of users represents the number of transactions on this
ATM on these 5 days. For convenience, transactions are referred as “people” or “users”.
ATM Transactions: From Monday to Friday
DAY MONDAY TUESDAY WEDNESDAY THURSDAY FRIDAY
No. of Users 253 197 204 279 267
At the 5% level of significance, can we reject the null hypothesis that the proportion of people
who use this ATM each of the five days of the week is the same? Assume that this week is typical
of all the weeks in regard to the use of this ATM.
10. ABC Brewery manufactures and distributes three types of beer: low calorie, regular beer, and dark
beer. In an analysis of the market segments of the three beers, the firm’s market research group
has raised the question of whether or not preferences for the beers differ between male and
female beer drinkers. If beer preference is independent of the drinker’s gender, then one
advertising campaign will be initiated for all ABC beers. However, if beer preference depends on
gender, the firm will tailor its promotions toward different target markets. A survey for this
study resulted as follows:
Is there sufficient evidence at $\alpha = 0.05$ that beer preference is related to the gender of the
drinker?
11. A firm that sells an accessory for new cars has researched where new cars are being sold. The
accompanying table shows new car sales in four areas. At the 0.05 level of significance, test the
claim for car sales, the manufacturer and the area of sale are independent variables.
12. At the 0.05 level of significance, use the data in the following table, that is, Music Preference and
IQ to test the claim that IQ and music preference are independent.
13. In an experiment to study the dependence of hypertension on diet, the following data were taken
on 200 individuals.
Test the hypothesis that the presence or absence of hypertension is independent of diet. Use
a 0.05 level of significance.
14. A survey was conducted in Central Maramag University (CMU) and Valencia State University (VSU)
to determine students’ level of satisfaction with the learning acquired from their teachers. For each
university, three hundred students were randomly selected and the data is given in Table 9.2.10.
At the 0.01 level of significance, test the null hypothesis that proportions of the level of
satisfaction of students are the same for the two universities.
SIMPLE LINEAR REGRESSION AND CORRELATION
Regression analysis is a statistical method which makes use of the relationship between two or
more quantitative variables so that one variable, called the dependent variable or response variable,
can be predicted with the knowledge of the values of the other variable, called the independent
variable or explanatory variable. A mathematical equation that allows us to predict values of one
dependent variable from known values of one or more independent variables is called a regression
equation.
This chapter focuses on the problem of estimating or predicting the value of a dependent variable
Y on the basis of a known measurement of an independent variable X. A scatter diagram is a
graphical presentation of the independent variable (plotted on the horizontal axis) and the dependent
variable (plotted on the vertical axis). Such a diagram is the easiest way to determine whether
a relationship exists between the two variables. A linear relationship between two variables is one in
which the relationship can be most accurately represented by a straight line. Although a graphic solution is
sometimes used for prediction, it is much more common to predict Y from the equation of the
straight line. The general form of the equation is given by
$Y = a + bX$, the linear regression line equation or simple linear regression equation.
For each X, the equation $Y = a + bX$ will predict a value of Y. The estimated regression line
is defined by the equation
$$\hat{Y} = a + bX$$
where: $\hat{Y}$ is the predicted value of the dependent variable
a is the Y intercept (value of Y when X = 0)
b is the slope of the line
a and b are the estimates of the parameters of regression, which are calculated from the
available sample points.
Remark: Through the estimated regression line equation we can now predict any Y value just by
knowing the corresponding X value.
Estimation of Parameters
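Given the sample $\{(x_i, y_i),\ i = 1, 2, 3, \ldots, n\}$, the least squares estimates of the parameters in
the regression line are:
$$b = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}; \qquad a = \bar{y} - b\bar{x}$$
where
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \quad \text{and} \quad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
are the means of the sample values.
a is the estimate of the population Y intercept $\beta_0$ and b is the estimate of the population slope
coefficient $\beta_1$.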
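The summation formulas for the least squares estimates can be sketched in a few lines of Python. This is only an illustration: the function name `least_squares_line` and the four data points are hypothetical and are not part of the course material. The points are chosen to lie exactly on the line Y = 2 + 3X so the result is easy to check by hand.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the fitted line Y = a + bX using the summation formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal, so the slope is undefined")

    b = (n * sum_xy - sum_x * sum_y) / denominator  # slope estimate
    a = sum_y / n - b * (sum_x / n)                 # intercept: y-bar minus b times x-bar
    return a, b


# Hypothetical data lying exactly on Y = 2 + 3X
xs = [1, 2, 3, 4]
ys = [5, 8, 11, 14]
a, b = least_squares_line(xs, ys)
print(a, b)
```

Because the hypothetical points fall exactly on a straight line, the function recovers the intercept 2 and the slope 3.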
Example: Given the data in the following table, find the following:
a. The equation of the regression line.
b. The scatter diagram.
c. The point estimate of Y when X = 113.
Solution:
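$$n = 12, \quad \sum_{i=1}^{12} x_i = 110 + 112 + \cdots + 138 = 1{,}503.00, \quad \sum_{i=1}^{12} x_i^2 = 110^2 + 112^2 + \cdots + 138^2 = 189{,}187.00$$
$$\sum_{i=1}^{12} y_i = 50 + 56 + \cdots + 68 = 706.00, \quad \sum_{i=1}^{12} y_i^2 = 50^2 + 56^2 + \cdots + 68^2 = 41{,}862.00, \quad \bar{x} = 125.25, \quad \bar{y} = 58.833$$
$$\sum_{i=1}^{12} x_i y_i = 110(50) + 112(56) + \cdots + 138(68) = 88{,}857$$
$$b = \frac{12(88{,}857) - (1{,}503)(706)}{12(189{,}187) - (1{,}503)^2} = 0.4598, \qquad a = 58.833 - 0.4598(125.25) \approx 1.2414$$
a. $\hat{Y} = 1.2414 + 0.4598X$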
b. Scatter Plot
[Scatter plot of SCORE (vertical axis, 40 to 70) against IQ (horizontal axis, 100 to 140)]
c. $\hat{Y} = 1.2414 + 0.4598(113) = 53.20$
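As a check on the hand computation, the slope, intercept, and point estimate can be recomputed from the totals reported in the solution (n = 12, Σx = 1,503, Σy = 706, Σxy = 88,857, Σx² = 189,187). The variable names below are illustrative only, and the raw data table itself is not used.

```python
# Totals taken from the worked solution above
n = 12
sum_x, sum_y = 1503, 706
sum_xy = 88857
sum_x2 = 189187

# Least squares estimates from the summation formulas
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)

# Point estimate of Y when X = 113
y_hat = a + b * 113

print(round(b, 4), round(a, 4), round(y_hat, 2))
```

Carrying full precision through the calculation gives an intercept of about 1.2417, slightly different from the hand-rounded 1.2414. The slope still rounds to 0.4598 and the point estimate to 53.20.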
Correlation analysis attempts to measure the strength of the relationship between two random
variables by means of a single number called the correlation coefficient. It is concerned only with the
strength of the relationship, and no causal effect is implied. The Pearson Correlation Coefficient ($\rho$)
measures the strength of the linear relationship between two variables X and Y. The estimated
sample correlation coefficient, denoted by $r$, is given by:
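$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}$$
where n is the sample size.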
[Figure: five scatter diagrams of Y against X illustrating r = -1, r = -0.6, r = 0, r = 0.6, and r = 1]
Features of $\rho$ and $r$
- unit free
- ranges from -1 to 1
- the closer to -1 the stronger the negative linear relationship
- the closer to 1 the stronger the positive linear relationship
- the closer to 0, the weaker the linear relationship
The Sample Coefficient of Determination, $r^2$, is a number that gives the proportion of the total variation in the
values of variable Y that can be accounted for or explained by the linear relationship with the
values of the variable X.
Example: For the example given above, find the sample correlation coefficient and the sample coefficient
of determination, and interpret the results.
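One way to carry out this computation is sketched below, using only the totals from the regression example. It assumes Σy² = 41,862, the value that is consistent with the correlation of 0.7796 used in the hypothesis test that follows. The variable names are illustrative.

```python
import math

# Totals from the regression example
n = 12
sum_x, sum_y = 1503, 706
sum_xy = 88857
sum_x2 = 189187
sum_y2 = 41862  # assumed value, consistent with r = 0.7796

# Sample correlation coefficient from the summation formula
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

# Sample coefficient of determination
r_squared = r ** 2

print(round(r, 4), round(r_squared, 4))
```

Under these assumptions r is about 0.78, which indicates a fairly strong positive linear relationship between IQ and midterm score. The corresponding r² of about 0.61 means that roughly 61% of the variation in the scores is explained by the linear relationship with IQ.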
Using the above example, is there evidence of a linear relationship between the students’ MMW
midterm scores and IQ scores at the 0.05 level of significance?
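Critical region: $t_c < -t_{0.025,\,10}$ or $t_c > t_{0.025,\,10}$, that is,
$t_c < -2.228$ or $t_c > 2.228$
Step 5. Computation: $$t_c = \frac{0.7796 - 0}{\sqrt{\dfrac{1 - 0.7796^2}{12 - 2}}} \approx 3.94$$
Step 6. Reject $H_0$ since $t_c = 3.94 > 2.228$.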
Step 7. There is sufficient sample evidence that there is a significant
linear relationship between the students’ IQ scores and their MMW
midterm scores.
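The test statistic for the correlation coefficient can be checked with a short sketch. The function name `correlation_t_statistic` is hypothetical, and the critical value 2.228 is copied from the text rather than computed. Evaluated with the square root in the denominator, the statistic comes out at about 3.94.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))


r = 0.7796        # sample correlation coefficient from the example
n = 12            # sample size
critical = 2.228  # t(0.025, 10), as given in the text

t_c = correlation_t_statistic(r, n)
reject_h0 = abs(t_c) > critical  # two-tailed test at the 0.05 level

print(round(t_c, 2), reject_h0)
```

Since the computed statistic exceeds 2.228, the sketch reaches the same decision as Step 6: reject the null hypothesis of no linear relationship.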