
Statistics MMW

Uploaded by

biacakesss
Copyright
© © All Rights Reserved

What is Statistics and who should care?

Nowadays, people are curious about many things, and chances are that you are interested in
the role of Statistics, which is useful for understanding the structure in data. Information
developed through the use of statistics has improved our understanding of how life works, helped us
learn about each other, allowed control over some societal issues, and helped individuals make informed
decisions. There is almost no area of knowledge that has not been advanced by statistical studies.

Statistics, in its plural sense, is a set of numerical data, while in its singular sense it refers
to the scientific discipline consisting of the theory and methods for processing numerical information that
one can use when making decisions in the face of uncertainty. Thus,

Some Applications of Statistics


• Determining the level of patients' satisfaction with the nursing care administered by
student nurses at Common View University.

• Determining the distribution of the number of text messages sent per day by
Mathematics in the Modern World (MMW) students.

• Determining the relationship between faculty status and work commitment.

• Predicting the number of MMW students for the next school year, 2019-2020.

Major Categories of Statistics


i. Descriptive Statistics – methods concerned with collecting, describing, and
analyzing a set of data without drawing conclusions (or inferences) beyond the
data.

ii. Inferential Statistics – methods concerned with the analysis of a subset of data
leading to predictions or inferences about the entire set of data, that is, to
generalize results beyond the data collected provided that the data collected is a
part (sample) of a large set of items (population).

Examples of Descriptive Statistics


• Total number of Statistics students weighing at least 50 kilograms.

• The University registrar cited statistics showing an increase in the number of students
during the past five years.

Example of Inferential Statistics


• A new teaching strategy designed to improve the academic performance of
college students was tested on randomly selected college students. Based on the
results, it was concluded that the new teaching strategy is effective in improving the
academic performance of college students.

Key Terms
Universe – is the set of all entities under study, that is, the collection of things or
observational units under study.
Variable – is a characteristic observed or measured on every unit of the universe.
Population – is the set of all possible values of the variable.
Sample – is a subset of the population.
Parameters – are numerical measures that describe the population or universe of interest.
Statistics – are numerical measures of a sample.
Frame – a listing of all the elements in a population.
Census – the process in which information is gathered for all units in the population.
Sample survey or sampling – the process in which information is gathered from only a part of the
population.

“A statistic is to a parameter as a sample is to a population”.
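
The relationship between a parameter and a statistic can be illustrated with a short Python sketch. The population values below are hypothetical and chosen only for illustration, and the fixed random seed is an arbitrary choice so that the run is repeatable.

```python
import random
from statistics import mean

# Hypothetical population: ages of all 10 units in a small universe.
population = [18, 19, 19, 20, 21, 21, 22, 23, 24, 25]

# Parameter: a numerical measure computed from the whole population.
population_mean = mean(population)

# Sample: a subset of the population (here, 4 values drawn at random).
rng = random.Random(0)  # fixed seed, an arbitrary choice
sample = rng.sample(population, 4)

# Statistic: the same measure computed from the sample only.
sample_mean = mean(sample)

print("parameter (population mean):", population_mean)
print("statistic (sample mean):", sample_mean)
```

Drawing a different sample would generally give a different statistic, while the parameter stays fixed.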

Types of Variables and Data


The building blocks of statistical science are data. Specific characteristics (e.g., age, height,
and weight) that we want to assess for a certain population are referred to as variables. Variables
may be categorized further as qualitative and quantitative variables.

Qualitative variables – variables that yield observations by which individuals can be categorized
according to some characteristic or quality.
- e.g., gender, marital status and blood type.

- Are expressed in categories.

Quantitative variables – variables that yield observations that can be measured.


- e.g., weight, height, systolic blood pressure and body mass index.

- Numerical measure exists.

Constant – a variable that assumes only one value.

Data collected on particular variables are classified as either qualitative or quantitative.

Qualitative data have no numerical measure (e.g., gender, marital status and blood type); data
obtained on such variables are usually expressed in categories. Quantitative data are
expressed in numbers (e.g., weight, height, systolic blood pressure and body mass index); data
collected on such variables are measured or counted.

Quantitative data are classified as either discrete or continuous.

• Discrete data – data that can be counted, e.g., number of patients in a hospital,
number of students who obtained a 1.0 grade in MMW. These data assume only a
countable number of values.
• Continuous data – data that can be measured, e.g., systolic blood pressure, weight
and height. These data result from infinitely many possible values that can be
associated with points on a continuous scale in such a way that there are no gaps or
interruptions.

Note: Arithmetical operations on quantitative data have a physical interpretation. Some
variables take numerical values without being quantitative, e.g., zip codes and cellular
telephone numbers: the sum of two zip codes, or the difference between your cellular
telephone number and your seatmate's, does not make sense. The issue is whether
performing arithmetical operations on the data would make any sense.
The figure below illustrates the classification of data collected on particular variables.

Figure 1.1.1 Classification of Data on Particular Variables

Levels of Measurement or Measurement Scales


Measurement is the assignment of numbers to objects or events according to a
predetermined set of rules. For example, to measure a person’s weight in kilograms, we may
assign the number 50 to a person and say that the person’s weight is 50 kilograms. Determining the
level of measurement of gathered data is important because it helps determine which statistical
inference test will be used to analyze the data. There are four types of measurement scales:
nominal, ordinal, interval and ratio scales. They differ in the properties of numbers (identity, order,
additivity, absolute zero) that they possess.

- Identity – is the property that enables a person to distinguish one number from another.
Numbers are recognized by the shapes in which they are written.

- Order – is the property that numbers are arranged in a sequence. For any two numbers
$A$ and $B$, we can determine whether $A > B$, $A = B$, or $B > A$.

- Additivity – is the property that allows numbers to be added. For any real numbers
$A$, $B$, $C$ and $D$, because of the equality of scale, we can determine whether
$A - B = C - D$, $A - B > C - D$ or $A - B < C - D$.

- Absolute zero – is the property that a value of zero means none of the characteristic being
measured is present.

• Nominal scale – the lowest level of measurement; it is most often used with
variables that are qualitative in nature, rather than quantitative.

- Examples: gender, eye color, smoking status and nationality.

- It possesses only the property of identity. Thus, numbers are used only to classify. For
example, in the variable gender, if 1 is assigned to male and 2 to female, it does not
mean that female is better than male.

• Ordinal scale – possesses the properties of identity and order.

- We can rank-order the objects according to whether they possess more, less or the same
amount of the variable being measured. Thus, we can determine whether $A > B$, $A < B$,
or $A = B$.
- We cannot determine how much greater or less A is than B in the attribute being
measured.
- Examples: level of educational attainment, military ranks.

• Interval scale – possesses the properties of identity, order and additivity but does not have
the absolute zero property.
- Examples: Celsius scale measurement of temperature and intelligence score.

• Ratio scale – possesses the properties of identity, order, additivity (equality of scale) and
absolute zero.
- Examples: weight and height.

Index, Subscript, Notation


In statistics, we usually deal with groups of data that result from measuring one or more
variables. The data are often derived from samples and occasionally from populations, but in
either case it is useful to let symbols stand for the variables measured in the study. Most
statistics books use the Roman letter X, and sometimes Y, to stand for the variable(s)
measured.

The number of observations is represented by n, and the i-th observation by $x_i$. The letter i,
which stands for any of the numbers 1, 2, 3, …, n, is called a subscript, or index. Any letter
other than i, such as j, k, v, q or r, could have been used as well.

• The summation symbol $\sum$ – a more compact way of writing the sum of a set of data
values.
- $\sum_{i=1}^{n} x_i$ is defined as

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n$$

Example 1. Consider the ages of a sample of six children as shown in the table below.
Table 1.1.1: Ages of Six Children

Child Number    Age symbol    Age (years)
1               x1            8
2               x2            10
3               x3            7
4               x4            6
5               x5            10
6               x6            12

Find the following:
a. The sum of their ages in compact form.
b. $\left(\sum_{i=1}^{4} x_i\right)^2$
c. $\sum_{i=1}^{4} x_i^2$

Rules of Summation

1. $\sum_{i=1}^{n} (x_i + y_i + z_i) = \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i + \sum_{i=1}^{n} z_i$

2. $\sum_{i=1}^{n} c x_i = c \sum_{i=1}^{n} x_i$, where c is a constant.

3. $\sum_{i=1}^{n} c = nc$, where c is a constant.
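
The three rules above can be checked numerically. The sketch below uses the ages from Example 1 as the x values; the y and z series and the constant c are hypothetical and serve only to exercise the rules.

```python
# Ages from Example 1 (x); y, z and c are hypothetical illustration values.
x = [8, 10, 7, 6, 10, 12]
y = [1, 2, 3, 4, 5, 6]
z = [2, 0, 1, 3, 1, 2]
c = 3
n = len(x)

# Rule 1: the sum of a sum equals the sum of the individual sums.
lhs1 = sum(xi + yi + zi for xi, yi, zi in zip(x, y, z))
rhs1 = sum(x) + sum(y) + sum(z)

# Rule 2: a constant factor can be moved outside the summation.
lhs2 = sum(c * xi for xi in x)
rhs2 = c * sum(x)

# Rule 3: adding a constant n times gives n * c.
lhs3 = sum(c for _ in range(n))
rhs3 = n * c

print(lhs1, rhs1)
print(lhs2, rhs2)
print(lhs3, rhs3)
```

Each printed pair should show the same number twice, which is what the corresponding rule asserts.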

Example 2. Let $y_1 = 1$, $y_2 = 1$, $y_3 = 5$, $y_4 = 4$, $y_5 = 7$ and $y_6 = 6$. Let the $x_i$'s be as defined
in Example 1. Find the following:

a. $\sum_{i=1}^{6} x_i y_i$

b. $\sum_{i=1}^{6} (x_i^2 + y_i^2)$

c. $\sum_{i=1}^{4} (x_i + y_i)^2$

• The factorial symbol ! – a compact way of writing the product of a sequence of positive
integers. The symbol n! is defined as

$$n! = 1 \cdot 2 \cdot 3 \cdots n$$
- n! is the product of all positive integers less than or equal to n.

- $0! = 1$.

Example 3. Solve for n !.

a) n = 5 b) n = 7 c) n = 8 d) n = 10
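
A minimal Python sketch of the factorial definition, applied to the four values of n in Example 3. The function name `factorial` is chosen here for illustration; `math.factorial` from the standard library is used only as a cross-check.

```python
import math

def factorial(n: int) -> int:
    """Return n! = 1 * 2 * ... * n, with 0! defined as 1."""
    if n < 0:
        raise ValueError("n must be a non-negative integer")
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

for n in (5, 7, 8, 10):
    # Cross-check against the standard-library implementation.
    assert factorial(n) == math.factorial(n)
    print(f"{n}! = {factorial(n)}")
```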

Exercises/Problems

1. Give an example of a universe.

2. For the given universe, define at least 3 populations.

3. Through the given populations in question #2, answer the following:


a. Determine whether the variable of interest in the specified populations is discrete or
continuous variable.
b. Determine the level of measurement of the data obtained considering the specific variable
of interest.

4. Investigate the following problems and determine what is more appropriate to use – descriptive or
inferential statistics.
a. The Mathematics Department would like to know the number of BS Mathematics students
interested in the newly revised curriculum of the BS Mathematics program.
b. A biology student studies the mercury content of fishes in Pulangi River and found that the
average mercury content is 400 units.
c. Office of Student Affairs would like to predict the number of students who would like to
stay at the University’s dormitories. However, the enrolment period is a week before the
classes start so the said office randomly selected 100 students and the results were used
as an estimate.
d. Do girls learn to walk at an earlier age than boys?

5. Which of the following statements best describes statistical inference?


a. A decision, estimate, prediction, or generalization about the sample based on information
contained in a population. The population parameters are estimated using the sample.

b. A statement made about a sample based on the measurements in that sample. Statistical
inference helps us draw conclusion about the unknown population characteristics based
on the sample.

c. A decision, estimate, prediction or generalization about the population based on
information contained in a sample.

6. Fill in the missing words to the quote: “Inferential statistics is defined as drawing conclusions about
____________ based on ____________ computed from the _____________.”

7. A random sample of 100 commuter students in CMU was selected and several variables were
recorded for each student. Which of the following is NOT CORRECT?
a. Their average allowance per month is a continuous variable.

b. Socioeconomic status was coded as 1=low income, 2=middle income, 3=high income and
is an interval scaled variable.

c. The primary language used at home is an ordinal scale variable.

8. Identify the following as qualitative or quantitative variable. If quantitative, classify whether it is
discrete or continuous. Also, indicate the appropriate level of measurement required in each.
a. Car ownership (answers the question: Do you own a car?)
b. Citizenship
c. Tuition fees
d. Color of the skin
e. Air temperature of the peak of Mt. Kalayo measured in degree Celsius.
f. Religion

9. The College of Agriculture obtained the following data representing the one-week growth
in centimeters of 33 newly planted tomato plants:
2.3 3.9 3.9 0.8 4.1 1.1 3.1 2.2 2.4 2.4 1.8
2.8 2.4 3.9 1.8 3.9 3.9 4.1 3.9 2.4 4.0 4.2
3.7 1.6 2.3 3.2 2.6 2.6 1.9 2.2 1.7 3.5 1.9

Obtain the following:


a. Sum of the one week growth in centimeters of 24 newly planted tomato plants
in compact form.
b. Let X be the one-week growth of tomato plants. Find $\sum_{i=1}^{30} x_i^2$ and $\left(\sum_{i=1}^{30} x_i\right)^2$.
c. Evaluate

$$\frac{\sum_{i=1}^{30} x_i^2 - \left(\sum_{i=1}^{30} x_i\right)^2}{30 - 1}$$

10. Write each of the following as a summation; that is, in the compact $\sum$ notation:

a. $z_1 + z_2 + z_3 + z_4 + z_5 + z_6$
b. $z_2 + z_3 + z_4 + z_5 + z_6$
c. $x_1 f_1 + x_2 f_2 + x_3 f_3 + x_4 f_4 + x_5 f_5$
d. $x_1^2 + x_2^2 + x_3^2 + x_4^2$
e. $2z_2 + 2z_3 + 2z_4 + 2z_5 + 2z_6$
f. $(x_1 + y_1) + (x_2 + y_2) + (x_3 + y_3)$
g. $(x_4 + 3) + (x_5 + 3) + (x_6 + 3) + (x_7 + 3)$

s
11.  tat 
i 1

12. Solve for n ! .

a) n = 3 b) n = 4 c) n = 1 d) n = 0

13. Determine for each of the following whether it is true or false:

a. $19! = 19 \cdot 18 \cdot 17 \cdot 16!$
b. $\frac{12!}{3!} = 4!$
c. $3! + 0! = 7$
d. $6! + 3! = 9!$
e. $\frac{9!}{7!\,2!} = 36$
f. $15! \cdot 2! = 17!$

SUMMARY MEASURES

Piles of raw data, by themselves, may not be informative, but when data are presented in
summary form, they may be much more interesting and meaningful to us. In most cases, we need to
summarize a given set of data rather than maintain the entire set. Single numbers called summary (or
descriptive) statistics can be calculated for such a purpose. Two kinds of summary statistics are
particularly important to most data users – measures of central tendency and measures of variability.

The figure below shows the summary measures.

[Figure 2.1.1 Summary Measures: tree diagram. Location: minimum, maximum, central tendency (mean, median, mode), percentile, quartile, decile. Variation: range, variance, standard deviation, inter-quartile range, coefficient of variation. Skewness. Kurtosis.]

Measures of location summarize a data set by giving a “typical value” within the range of
the data values that describes its location relative to the entire data set. A measure of
variation is a single value that is used to describe the spread of the distribution. A measure of
central tendency alone does not uniquely describe a distribution. The following are the descriptive
statistics:

Summary Measure for Location

• Minimum is the smallest value in the data set, denoted by MIN.

• Maximum is the largest value in the data set, denoted by MAX.

Example: In a sample data: 7, 8, 10, 4 and 14 the minimum and maximum are _______ and _______,
respectively.
Answer: MIN = 4 and MAX = 14.

• Measures of central tendency are values that are typical, or representative, of a
set of data; they tend to lie centrally within a set of data arranged according to magnitude.
Measures of central tendency are also called averages.
➢ Arithmetic mean, or simply the mean – the most popular measure of central tendency. It is the
sum of a set of measurements divided by the number of measurements in the set.
➢ Population mean – if the set of data $x_1, x_2, \dots, x_N$, not necessarily all distinct, represents a
finite population of size $N$, then the population mean is

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
➢ Sample mean – if the set of data $x_1, x_2, \dots, x_n$, not necessarily all distinct, represents a finite
sample of size $n$, then the sample mean is

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Properties of the Arithmetic Mean


1. May not be an actual observation in the data set.
2. Can be applied to data measured on at least an interval scale.
3. Easy to compute.
4. Every observation contributes to the value of the mean.
5. Subgroup means can be combined to come up with a group mean.
6. Easily affected by extreme values.

Examples:
1. The examination scores of a sample of 5 students are 58, 49, 52, 62 and 65. Find the mean.
Answer: Since the data pertain to a sample, the notation $\bar{x}$ is used. Thus,

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{58 + 49 + 52 + 62 + 65}{5} = 57.2$$
2. Find the mean of these population data: 18, 29, 22, 32 and 15.
Answer: Since the data pertain to a population, the notation $\mu$ is used. Thus,

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{18 + 29 + 22 + 32 + 15}{5} = 23.2$$
Note: Sometimes we associate with the numbers $x_1, x_2, \dots, x_k$ certain weighting factors (or
weights) $w_1, w_2, w_3, \dots, w_k$, depending on the significance or importance attached to the
numbers. In this case,

$$\bar{x} = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_k x_k}{w_1 + w_2 + \dots + w_k}$$
is called the weighted arithmetic mean.

Example:

1. A sample of 40 students took University entrance test. 15 students had a mean score of 75. The
other students had a mean score of 90. What is the average score of these 40 students?
Solution:
$$\bar{x} = \frac{(15 \cdot 75) + (25 \cdot 90)}{15 + 25} = \frac{3375}{40} = 84.375$$
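
The weighted arithmetic mean in the entrance-test example can be sketched as a small Python function. The name `weighted_mean` and the input checks are assumptions made for this illustration.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight

group_means = [75, 90]  # mean score of each subgroup
group_sizes = [15, 25]  # number of students in each subgroup (the weights)

print(weighted_mean(group_means, group_sizes))  # 84.375
```

Using the subgroup sizes as weights is also how subgroup means are combined into a single group mean.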

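• Median is the middle value of a set of observations arranged in increasing or decreasing order
of magnitude. It is the middle value when the number of observations is odd, or the
arithmetic mean of the two middle values when the number of observations is even; i.e., it
is the value such that half of the observations fall above it and half below it. Below,
$x_{(k)}$ denotes the $k$th observation in the ordered data.

a. Population median:

$$\tilde{\mu} = \begin{cases} x_{\left(\frac{N+1}{2}\right)} & \text{if } N \text{ is odd} \\ \frac{1}{2}\left(x_{\left(\frac{N}{2}\right)} + x_{\left(\frac{N}{2}+1\right)}\right) & \text{if } N \text{ is even} \end{cases}$$

b. Sample median:

$$\tilde{x} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \text{ is even} \end{cases}$$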

Properties of Median
1. May not be an actual observation in the data set.
2. Can be applied in at least ordinal level.
3. A positional measure; may not be affected by extreme values.
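
The odd and even cases of the median can be sketched in Python as follows. The function name `median` is chosen for illustration, and the four-value data set in the last call is hypothetical.

```python
def median(data):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th ordered value (index mid when counting from 0).
        return ordered[mid]
    # Even n: mean of the (n / 2)th and (n / 2 + 1)th ordered values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 8, 10, 4, 14]))  # 8
print(median([7, 8, 10, 4]))      # 7.5
```

Replacing 14 with a much larger value in the first call leaves the result unchanged, which illustrates why the median is described as a positional measure.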

• Mode is the value that appears most often, that is, the value with the greatest
frequency. The mode may not exist, and even if it does exist it may not be unique. A
distribution having only one mode is called unimodal.

Properties of the Mode


1. Can be used for qualitative as well as quantitative data.
2. May not be unique.
3. Not affected by extreme values.
4. Can be computed for ungrouped and grouped data.
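
A short Python sketch of finding the mode for ungrouped data. The function name `modes` and all data values are hypothetical. One convention is assumed here: when every distinct value occurs equally often, the function reports that no mode exists.

```python
from collections import Counter

def modes(data):
    """Return the list of values with the greatest frequency (possibly empty)."""
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    # Assumed convention: if all distinct values tie, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return [value for value, c in counts.items() if c == highest]

print(modes([2, 4, 4, 5, 7, 7, 9]))   # two modes: [4, 7]
print(modes(["O", "A", "B", "A"]))    # qualitative data: ['A']
print(modes([1, 2, 3]))               # no mode: []
```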

If a set of data is arranged in order of magnitude, the middle value (or arithmetic mean of the two
middle values) that divides the set into two equal parts is the median. By extending this idea, we can
think of those values which divide the set into four equal parts, 10 equal parts and 100 equal parts
and these are called quartiles, deciles and percentiles, respectively. Collectively, quartiles, deciles,
percentiles and other values obtained by equal subdivisions of the data are called fractiles.

• Percentiles – are values that divide an ordered set of observations into 100 equal parts. These
values, denoted by $P_1, P_2, \dots, P_{99}$, are such that 1% of the data falls below $P_1$, 2% falls below
$P_2$, ..., and 99% falls below $P_{99}$.

• Deciles – are values that divide an ordered set of observations into 10 equal parts. These
values, denoted by $D_1, D_2, \dots, D_9$, are such that 10% of the data falls below $D_1$, 20% falls
below $D_2$, ..., and 90% falls below $D_9$.

• Quartiles – are values that divide an ordered set of observations into 4 equal parts. These
values, denoted by $Q_1$, $Q_2$ and $Q_3$, are such that 25% of the data falls below $Q_1$, 50% falls
below $Q_2$ and 75% falls below $Q_3$.

Procedure to compute these values:

Step 1. Arrange the data in increasing order of magnitude.
Step 2. Solve for the value of $L$, where

$$L = \begin{cases} \dfrac{mn}{100}, & \text{percentiles} \\ \dfrac{mn}{10}, & \text{deciles} \\ \dfrac{mn}{4}, & \text{quartiles} \end{cases}$$

where $m$ is the location of the percentile, decile or quartile, and
$n$ is the number of observations.
Step 3. If $L$ is an integer, the desired quantile is the average of the $L$th and the $(L + 1)$th
observations. If $L$ is fractional, get the next higher integer to find the required location.
The quantile corresponds to the value in that location.
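
The three steps above can be sketched in Python. The function name `quantile`, its parameter names, and the eight data values are hypothetical; the logic follows the procedure as stated, which is one of several quantile conventions in use.

```python
def quantile(data, m, parts):
    """m-th quantile when the ordered data are divided into `parts` equal parts.

    parts = 100 for percentiles, 10 for deciles, 4 for quartiles.
    """
    if not 0 < m < parts:
        raise ValueError("m must satisfy 0 < m < parts")
    ordered = sorted(data)            # Step 1: arrange in increasing order
    n = len(ordered)
    if n == 0:
        raise ValueError("data must not be empty")
    product = m * n                   # Step 2: L = m * n / parts
    if product % parts == 0:
        # Step 3, L is an integer: average the L-th and (L + 1)-th observations.
        L = product // parts
        return (ordered[L - 1] + ordered[L]) / 2
    # Step 3, L is fractional: use the next higher integer as the location.
    L = product // parts + 1
    return ordered[L - 1]

data = [14, 7, 20, 4, 15, 8, 18, 10]  # hypothetical sample, n = 8
print(quantile(data, 1, 4))     # Q1: L = 2, average of 2nd and 3rd values -> 7.5
print(quantile(data, 2, 4))     # Q2: L = 4, average of 4th and 5th values -> 12.0
print(quantile(data, 3, 10))    # D3: L = 2.4, take the 3rd value -> 8
print(quantile(data, 90, 100))  # P90: L = 7.2, take the 8th value -> 20
```

Integer arithmetic is used to test whether L is whole, which avoids rounding problems that can occur when comparing floating-point quotients.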

Summary Measure for Variation

Measures of variation determine whether the set of observations tend to be quite similar
(homogeneous) or whether they vary considerably (heterogeneous).

Range (R) – difference between the largest and the smallest values in the set.

(R) = Highest value – lowest value

Properties of the Range.


1. Computation-wise, it is a quick but rough measure of dispersion.
2. The larger the value of the range, the more dispersed are the observations.
3. It considers only the lowest and highest values.

Variance

Population Variance ($\sigma^2$). Given the finite population $x_1, x_2, \dots, x_N$, the population
variance is:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}.$$

For computational purposes, use the formula

$$\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2 - N\mu^2}{N} \quad \text{or} \quad \sigma^2 = \frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N}.$$

Sample Variance ($s^2$). Given the random sample $x_1, x_2, \dots, x_n$, the sample variance is:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}.$$

For computational purposes, use the formula

$$s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)} \quad \text{or} \quad s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}.$$
Properties of the variance
1. The variance is always non-negative.
2. A large variance corresponds to a highly dispersed set of values.
3. The variance is easy to manipulate for further mathematical computation.
4. The variance makes use of all observations.
5. The variance comes in a unit of measure that is the square of the unit of measure of the
given set of values.

Standard deviation is the positive square root of the variance.
Formulas: a. population standard deviation: $\sigma = \sqrt{\sigma^2}$
b. sample standard deviation: $s = \sqrt{s^2}$

Note: 1. The standard deviation has the same properties as the variance except the last one. Its
unit of measure is the same as the original data.
2. If there is a large amount of variation, then on average, the data values will be far from
the mean. Hence, the standard deviation will be large.
3. If there is only a small amount of variation, then on average, the data values will be
close to the mean. Hence, the standard deviation will be small.
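
The definitional and computational variance formulas should give the same result. The Python sketch below checks this on the five examination scores used earlier (58, 49, 52, 62, 65); the function names are chosen for illustration, and `statistics.variance` from the standard library serves as a cross-check.

```python
import math
from statistics import variance  # standard-library sample variance

def sample_variance(data):
    """Definitional formula: squared deviations from the mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

def sample_variance_computational(data):
    """Computational formula: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

def population_variance(data):
    """Squared deviations from the mean, divided by N."""
    N = len(data)
    mu = sum(data) / N
    return sum((x - mu) ** 2 for x in data) / N

scores = [58, 49, 52, 62, 65]
s2 = sample_variance(scores)
s = math.sqrt(s2)  # sample standard deviation

print(round(s2, 4), round(sample_variance_computational(scores), 4))
print(round(variance(scores), 4), round(s, 4))
```

The standard deviation is in the same unit as the scores, while the variance is in squared units.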

Inter-quartile Range (IQR)


- the difference between the third quartile and the first quartile, i.e.,
$IQR = Q_3 - Q_1$

Properties of the Inter-quartile Range

1. Reduces the influence of extreme values.
2. Not as easy to calculate as the range.

Coefficient of Variation (CV) is the ratio of the standard deviation to the absolute value of the mean,
expressed as a percentage. It is unitless and thus can be used to compare the dispersion of two or
more populations measured in the same or different units.
$$CV = \frac{100\, s}{|\bar{x}|}\%.$$
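
A brief Python sketch of the coefficient of variation for a sample. The function name is an assumption made for this illustration; the second call rescales the same data (as if changing the unit of measure) to show that the CV does not depend on the unit.

```python
from statistics import mean, stdev

def coefficient_of_variation(sample):
    """CV = 100 * s / |x_bar|, expressed as a percentage."""
    x_bar = mean(sample)
    if x_bar == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * stdev(sample) / abs(x_bar)

scores = [58, 49, 52, 62, 65]
rescaled = [x * 100 for x in scores]  # same data in a unit 100 times smaller

print(round(coefficient_of_variation(scores), 2))
print(round(coefficient_of_variation(rescaled), 2))
```

Both calls print the same percentage because the standard deviation and the mean scale by the same factor.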

When data are presented in a frequency distribution, measures for central tendency and
measures of variation can be computed.

Measures of Central Tendency (Grouped data).

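Arithmetic mean:

The computational formula is: $\bar{x}_g = \frac{\sum_{i=1}^{k} f_i x_i}{n}$

where $f_i$ is the class frequency of the $i$th class interval,
$x_i$ is the class mark of the $i$th class interval,
$k$ is the number of class intervals, and
$n$ is the number of observations.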
Note: The arithmetic mean cannot be computed from an open-ended frequency distribution.

Median:

The computational formula is: $\tilde{x}_g = L_m + c\left(\frac{\frac{n}{2} - F_{m-1}}{f_m}\right)$

where $L_m$ is the lower class boundary of the median class (the median
class is the class interval where the $\left(\frac{n}{2}\right)$th value falls),
$F_{m-1}$ is the cumulative frequency of the class interval immediately preceding
the median class,
$f_m$ is the frequency of the median class, and
$c$ is the class width or class size.

The median of grouped data can be calculated even with open-ended intervals, provided the
median class is not open-ended.

Mode:
To locate the modal class, look at the highest number in the frequency column.

$$Mode_g = L_{mo} + c\left(\frac{f_{mo} - f_1}{2 f_{mo} - f_1 - f_2}\right)$$

where $L_{mo}$ is the lower class boundary of the modal class (the modal class is the
class interval with the highest frequency),
$f_{mo}$ is the frequency of the modal class,
$f_1$ is the frequency of the class interval immediately preceding the modal
class,
$f_2$ is the frequency of the class interval immediately following the modal
class, and
$c$ is the class width.

Measures of Variability (Grouped data)

Variance:

The computational formula is: $s_g^2 = \frac{n \sum_{i=1}^{k} f_i x_i^2 - \left(\sum_{i=1}^{k} f_i x_i\right)^2}{n(n - 1)}$

where $n$ is the number of observations,
$f_i$ is the frequency of the $i$th class interval,
$x_i$ is the class mark of the $i$th class interval, and
$k$ is the number of class intervals.

Standard deviation:

The computational formula is: $s_g = \sqrt{s_g^2}$, where $s_g^2$ is the variance.

Coefficient of Variation:

The computational formula is: $CV_g = \frac{s_g}{|\bar{x}_g|}(100\%)$
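
The grouped-data formulas above can be sketched together in Python. The frequency distribution below is hypothetical (it is not taken from this text), and two choices are assumptions of this sketch: the median class is taken as the first class whose cumulative frequency reaches n/2, and a missing neighbor of the modal class is treated as having frequency 0.

```python
import math

# Hypothetical frequency distribution with four classes of equal width.
lower_boundaries = [9.5, 19.5, 29.5, 39.5]    # lower class boundaries
c = 10                                        # class width
f = [3, 7, 6, 4]                              # class frequencies f_i
x = [lb + c / 2 for lb in lower_boundaries]   # class marks x_i
n = sum(f)                                    # number of observations

# Grouped mean: sum(f_i * x_i) / n
sum_fx = sum(fi * xi for fi, xi in zip(f, x))
mean_g = sum_fx / n

# Grouped median: locate the class containing the (n / 2)th value.
F_prev = 0  # cumulative frequency of the classes before the median class
for m, f_m in enumerate(f):
    if F_prev + f_m >= n / 2:
        break
    F_prev += f_m
median_g = lower_boundaries[m] + c * (n / 2 - F_prev) / f_m

# Grouped mode: the modal class has the highest frequency.
mo = f.index(max(f))
f1 = f[mo - 1] if mo > 0 else 0              # class before the modal class
f2 = f[mo + 1] if mo < len(f) - 1 else 0     # class after the modal class
mode_g = lower_boundaries[mo] + c * (f[mo] - f1) / (2 * f[mo] - f1 - f2)

# Grouped variance, standard deviation and coefficient of variation.
sum_fx2 = sum(fi * xi ** 2 for fi, xi in zip(f, x))
var_g = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
sd_g = math.sqrt(var_g)
cv_g = 100 * sd_g / abs(mean_g)

print(mean_g, median_g, mode_g)
print(round(var_g, 4), round(sd_g, 4), round(cv_g, 2))
```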

Measure of skewness describes the degree of departure of the distribution of the data from
symmetry. The degree of skewness is measured by the coefficient of skewness, denoted as SK and
computed as

$$SK = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$$
- if $SK < 0$ the distribution is negatively skewed; if $SK > 0$ it is positively skewed.

A distribution is said to be symmetric about the mean if the distribution to the left of the mean is
the “mirror image” of the distribution to the right of the mean. Likewise, a symmetric distribution has
$SK = 0$ since its mean is equal to its median (and, when the distribution is unimodal, to its mode).

Measure of kurtosis describes the extent of peakedness or flatness of the distribution of
the data. It is measured by the coefficient of kurtosis (K), computed as

$$K = \frac{\sum_{i=1}^{N} (X_i - \mu)^4}{N \sigma^4} - 3$$
- if $K > 0$ it is leptokurtic, if $K < 0$ it is platykurtic and if $K = 0$ it is mesokurtic.
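
The two coefficients can be sketched in Python as follows. The function names and the data sets are hypothetical. The skewness formula does not say which standard deviation to use, so the sample standard deviation is assumed here; the kurtosis function treats its input as a whole population, matching the $N$ and $\sigma$ in the formula.

```python
from statistics import mean, median, pstdev, stdev

def pearson_skewness(sample):
    """SK = 3 * (mean - median) / standard deviation (sample s assumed)."""
    return 3 * (mean(sample) - median(sample)) / stdev(sample)

def kurtosis_coefficient(population):
    """K = sum((X_i - mu)^4) / (N * sigma^4) - 3."""
    N = len(population)
    mu = mean(population)
    sigma = pstdev(population)
    return sum((x - mu) ** 4 for x in population) / (N * sigma ** 4) - 3

symmetric = [2, 3, 4, 5, 6]       # hypothetical, symmetric about 4
right_skewed = [1, 2, 2, 3, 12]   # hypothetical, one large value on the right

print(pearson_skewness(symmetric))               # 0.0, symmetric
print(round(pearson_skewness(right_skewed), 3))  # positive, positively skewed
print(round(kurtosis_coefficient(symmetric), 3)) # negative, platykurtic
```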

Exercises/Problem sets

1. Find the mean, median, mode, range, variance and standard deviation of the following sample
data:
2 3 5 5 5 5 5 6 7

2. Find the median position for


a. n  9;
b. n  20.

3. Find the positions of the median, Q2 and Q3 for


a. n  21 ;
b. n  38 ;
c. n  50.

4. Find the mode of each of the following data, provided, of course, that it exists:
a. 6, 8, 5, 6, 5, 5, 7, 7, 9, 7, 6, 8, 4, and 7;
b. 57, 39, 54, 30, 46, 22, 48, 35, 27, 31, and 23;
c. 11, 15, 13, 14, 13, 12, 10, 11, 12, 13, 11, and 13.

5. The number of hours spent by ten students in studying their lessons per day was recorded as
follows: 2, 2, 2, 3, 3, 4, 4, 4, 4 and 5. Find the mean, median and mode.

6. The University entrance exam scores of a sample of 9 students who joined the varsity team of the
A.Y. 2007-2008 were the following: 74, 87, 85, 80, 84, 84, 75, 79 and 86. Compute mean, median
and mode.

7. Compute the standard deviation in questions 5 and 6.

8. If you have one or more extreme scores in a data set, which measure of central tendency is more
likely to be affected?

9. What is the standard deviation of a data set that has a mean of 20 and a variance of 49?

10. If the distribution of the data values is positively skewed, which of the following is true?
a. The median and the mean are equal. b. The median is less than the mean.
c. The median is greater than the mean. d. The median is half the mean.

11. The College of Agriculture obtained the following data representing the one-week growth
in centimeters of 33 newly planted tomato plants:
2.3 3.9 3.9 0.8 4.1 1.1 3.1 2.2 2.4 2.4 1.8
2.8 2.4 3.9 1.8 3.9 3.9 4.1 3.9 2.4 4.0 4.2
3.7 1.6 2.3 3.2 2.6 2.6 1.9 2.2 1.7 3.5 1.9
Obtain the following: mean, median, mode, range, variance, standard deviation, P50, Q2, D5, Inter-
quartile Range and coefficient of variation.

12. The frequency table below provides the yields in grams of 230 evenly-spaced soybean plants and
their corresponding frequencies.

Yield 3 8 13 18 23 28 33 38 43 48 53 58 63 68
Frequency 7 5 7 18 32 41 37 25 22 19 6 6 4 1

a. Find the range.


b. Find the modal value and the median.
c. Find the minimum and maximum.
d. Compute the most appropriate measure of central tendency.

13. A Rural Bus bound to Cagayan from Davao advertises the following fares for Air conditioned
category:
Type of Passenger Fare
College student: 400 Pesos
High school student: 350 Pesos
At most elementary graduate: 300 Pesos
Senior citizen 300 Pesos
Regular passenger 450 Pesos

The Rural air-conditioned bus has a capacity of 60 passengers. On average, a trip carries
15 college students, 5 high school students, 3 children (at most elementary graduate), 9 senior
citizens and 28 regular passengers. What is the expected amount that the Rural Bus air-conditioned
type would receive per trip?

14. The standard deviation of scores in the Math 15 and Math 34 pre-tests is 5 and the mean score is 18.
Since the result will be recorded as a long exam, the teacher decided to give an automatic bonus
of 10 points, that is, 20% of the total score. What are the mean and standard deviation of the new
scores?

15. Construct the Box-and-Whisker plot in question # 11.

16. Use the results of question # 11 to calculate the measure of skewness. Discuss the symmetry or
skewness of this distribution.

17. Asked whether CMU senior students want to attend University Acquaintance Party, 40 students in
the College of Arts and Sciences replied as follows: rarely, occasionally, never, occasionally,
occasionally, occasionally, rarely, rarely, never, occasionally, never, rarely, occasionally,
frequently, occasionally, rarely, never, occasionally, occasionally, rarely, rarely, never,
occasionally, occasionally, rarely, frequently, rarely, occasionally, occasionally, never, rarely,
frequently, never, rarely, occasionally, occasionally, rarely, rarely, occasionally and never. What is
their modal and median reply?

18. Consider the organized data of the Systolic Blood pressure of Nonsmokers. Find the following: a.
mean b. median c. mode d. variance e. standard deviation f.
coefficient of variation.

Table 3.2.1: Systolic Blood Pressure of Nonsmokers


Class interval   Frequency (f)   Cumulative frequency   Class boundary   Class mark (x)   fx
90-109           10              10                     89.5-109.5       99.5
110-129          24              34                     109.5-129.5      119.5
130-149          18              52                     129.5-149.5      139.5
150-169          9               61                     149.5-169.5      159.5
170-189          2               63                     169.5-189.5      179.5
Total            $\sum_{i=1}^{k} f_i = 63$                                                $\sum_{i=1}^{k} f_i x_i =$

PROBABILITY DISTRIBUTIONS

Probability distributions are used to model the behavior of many variables of interest. A random
variable is a function whose value is a real number determined by each element in the sample space.
It is usually denoted by a capital letter such as X, Y or Z. Its use provides a convenient way of expressing
the elements of a sample space as numbers. The probability that the random variable will take a given value is
equal to the sum of the probabilities of the corresponding outcomes in the sample space.

Types of Random Variable


• Discrete Random Variable – a random variable which can assume only a finite or countably
infinite number of values, most frequently integers. The values of these random variables are
sometimes called count data.

Examples: number of students in a class, number of heads in 2 tosses of a fair coin,
number of chairs in a room.

• Continuous Random Variable – a random variable which can assume all values
between two points in a continuous scale. The values of these random variables
are usually called measured data.
Examples: weight, height, age, speed of a car

Probability Distribution of a Random Variable


• When making estimates about unknown population parameters, values that are
computed only from the sample are usually used.
• When a different sample is taken, a different value results, even though the same
formula is used.
• The computed values are called statistics and are assumed to be values of a random
variable.
• For every value that a statistic takes in a particular sample, a corresponding
probability can be computed. This leads to the idea of a probability distribution of a
random variable.

Illustration: Rolling two dice and observing the number of dots on the upturned faces.
S= {(1,1), (1,2), (1,3),...(6,6)}

Random variable can be defined as the total number of dots on the upturned faces.
(1,1) → 2
(1,2), (2,1) → 3
(1,3), (2,2), (3,1) → 4
...
(6,6) → 12
 The random variable takes on the values 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12.

 Some values correspond to more elements of the sample space than others. For
example, 2 corresponds to only one outcome while 3 corresponds to 2
outcomes.

Example: In the above illustration, what is the probability that the random variable Y, the total
number of dots on the upturned faces, will take the value 4?

Types of Probability Distributions


 Discrete Probability Distribution – this is a table or a formula listing all possible values
that a discrete random variable can take on, along with the associated
probabilities

Example: Refer to the above illustration, the probability distribution of Y is


Y         2      3      4      5    6    7    8    9    10    11    12
P(Y=y)    1/36   2/36   3/36   ...                                  1/36
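
The remaining entries of the table can be checked by enumerating the sample space. The following is a minimal sketch, assuming two fair six-sided dice; the function name `two_dice_distribution` is only illustrative and is not part of the original notes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_distribution():
    """Return {total: probability} for the sum of two fair six-sided dice."""
    # The sample space S has 36 equally likely ordered pairs (a, b).
    sample_space = list(product(range(1, 7), repeat=2))
    # Count how many outcomes map to each value of the random variable Y = a + b.
    counts = Counter(a + b for a, b in sample_space)
    return {y: Fraction(c, len(sample_space)) for y, c in sorted(counts.items())}


dist = two_dice_distribution()
for y, p in dist.items():
    # Fractions are printed in lowest terms, e.g. 3/36 appears as 1/12.
    print(y, p)
```

Because each probability is a count of outcomes divided by 36, the probabilities over all values of Y must add up to 1.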

 Continuous Probability Distribution – the function f(x) is called the probability density
function for a continuous random variable X if the total area under its curve and
above the x-axis is equal to 1 and the area under the curve between the ordinates
X=a and X=b gives the probability that X lies between a and b.

Some Probability Distributions


 Discrete Probability Distributions: Bernoulli, Binomial, Geometric, Hypergeometric,
Negative Binomial.
 Continuous Probability Distributions: Normal, Exponential, Gamma, Beta, Uniform.

Binomial Distribution
Some statistical problems involve repeated trials which are independent and dichotomous
(i.e. each trial has two possible outcomes, often called success and failure). If all trials have an
identical probability of success, the experiment is called a binomial experiment.

A binomial experiment is one that possesses the following properties:


1. The experiment consists of n repeated trials.
2. Each trial results in an outcome that may be classified as a success or failure.
3. The probability of success, denoted by p remains constant from trial to trial.
4. The repeated trials are independent.

If a binomial trial can result in a success with probability p and a failure with
probability $q = 1 - p$, and x is the number of successes in n trials, then the probability
distribution of the binomial random variable X is

$b(x; n, p) = {}_nC_x \, p^x q^{n-x} = \frac{n!}{x!(n-x)!} \, p^x q^{n-x}$ for $x = 0, 1, 2, 3, \ldots, n$

where n is the number of trials,
p is the probability of success,
q is the probability of failure and
x is the number of successes.

Note: A success refers to the event under consideration.

Example: Find the probability of obtaining exactly three 2’s if a balanced die is tossed 5 times.

Example: Find the probability of getting at least 1 head in tossing a fair coin twice.

Example: A fair coin is tossed 3 times and a head is designated as a success. Find the probability
that:
a. 2 heads occur
b. At least 2 heads
c. No head occurs

Example: A multiple choice quiz has 10 questions, each with four possible answers of which only
one is the correct answer. What is the probability that a complete guesswork would
yield at most 1 correct answer?
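
The examples above can be worked by direct substitution into the binomial formula. The following is a minimal sketch using only the Python standard library; the helper names `binom_pmf` and `binom_cdf` are illustrative and not part of the original notes.

```python
from math import comb


def binom_pmf(x, n, p):
    """b(x; n, p) = nCx * p^x * q^(n-x), with q = 1 - p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binom_cdf(x, n, p):
    """P(X <= x): sum of b(k; n, p) for k = 0, 1, ..., x."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))


# Exactly three 2's in 5 tosses of a balanced die (success = a 2, p = 1/6).
three_twos = binom_pmf(3, 5, 1 / 6)

# At least 1 head in 2 tosses of a fair coin: 1 - P(no head).
at_least_one_head = 1 - binom_pmf(0, 2, 0.5)

# Fair coin tossed 3 times, head = success.
two_heads = binom_pmf(2, 3, 0.5)
at_least_two_heads = 1 - binom_cdf(1, 3, 0.5)
no_head = binom_pmf(0, 3, 0.5)

# 10-question quiz, 4 choices each (p = 1/4): at most 1 correct by guessing.
at_most_one_correct = binom_cdf(1, 10, 0.25)

print(three_twos, at_least_one_head, two_heads,
      at_least_two_heads, no_head, at_most_one_correct)
```

"At least" and "at most" events are handled by summing individual binomial probabilities or by taking the complement, whichever needs fewer terms.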

Normal Distributions
 Also known as the Gaussian Distribution in honor of Carl Friedrich Gauss (1777-1855), who
derived its equation from a study of errors in repeated measurements of some quantity.
 The graph of a normal distribution is a bell-shaped curve that extends asymptotically to
the horizontal axis in both directions. It is seldom necessary to extend the tails of the
normal distribution very far because the area under that part of the curve lying more than
4 or 5 standard deviations from the mean is negligible.

 The mathematical equation of the probability distribution of the normal variable depends
on the parameters $\mu$ and $\sigma$, its mean and standard deviation, respectively. The
distribution is denoted by $N(\mu, \sigma^2)$. The normal density function is given by:
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$
where $-\infty < x < \infty$, $\pi = 3.14159\ldots$, and $\exp(u) = e^u$ with $e = 2.71828\ldots$
Properties of Normal Curve:
1. It is symmetric about the vertical axis through the mean  .
2. The mean, median and mode are equal.
3. The tails are asymptotic to the horizontal axis.
4. The total area under the curve and above the horizontal axis is 1 or 100%.
5. About 68% of the area lies within one standard deviation of the mean.
6. About 95% of the area lies within two standard deviations of the mean.
7. About 99.7% of the area lies within three standard deviations of the mean.
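
The 68%, 95% and 99.7% figures can be checked numerically. The sketch below relies on the standard identity that, for a normal variable, the area within k standard deviations of the mean equals erf(k/√2); the function name `area_within` is an illustrative choice.

```python
from math import erf, sqrt


def area_within(k):
    """Area under any normal curve within k standard deviations of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

The result does not depend on the particular mean or standard deviation, which is why the same three percentages apply to every normal curve.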

Remarks:
 It is possible for two or more normal distributions to have the same mean but different
variances.
 It is also possible for two or more normal distributions to have equal variances but different
means.
 There are infinitely many normal curves, obtained by varying $\mu$ and $\sigma$.

Areas under the Normal Distribution


Since there are many normal curves, it is often important to standardize and refer to the
Standard Normal Distribution, where the mean $\mu = 0$ and $\sigma = 1$. The standardized score,
usually denoted by Z, is
$Z = \frac{x - \mu}{\sigma}$.
The effect of this transformation is to change any normal distribution to the Standard Normal
Distribution. For any normally distributed variable, each individual raw score can be
converted into a corresponding Z score. Standardized observations indicate how many standard
deviations an observation falls below or above the mean.

Rules in Computing Probabilities Using the Standard Normal Table

1. $P(Z = a) = 0$
2. $P(Z \le a)$ can be obtained directly from the Z-table; $P(Z \le a) = P(Z < a)$
3. $P(Z \ge a) = 1 - P(Z \le a)$
4. $P(Z \ge -a) = P(Z \le a)$
5. $P(Z \le -a) = P(Z \ge a)$
6. $P(a_1 \le Z \le a_2) = P(Z \le a_2) - P(Z \le a_1)$; $a_2 > a_1$
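
The complement, symmetry and difference rules for table lookups can be illustrated numerically. This is a sketch that uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) in place of a printed Z-table; the cutoffs 0.40 and 0.63 are arbitrary illustrative values.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

a1, a2 = 0.40, 0.63  # illustrative cutoffs

left = Z.cdf(a1)                   # P(Z <= a1), what a Z-table reports
right = 1 - Z.cdf(a1)              # complement rule: P(Z >= a1)
mirrored = Z.cdf(-a1)              # symmetry: P(Z <= -a1) equals P(Z >= a1)
between = Z.cdf(a2) - Z.cdf(a1)    # P(a1 <= Z <= a2)
central = Z.cdf(1.96) - Z.cdf(-1.96)  # P(-1.96 <= Z <= 1.96)

print(round(left, 4), round(right, 4), round(mirrored, 4),
      round(between, 4), round(central, 4))
```

Printed tables round to four decimal places, so hand-computed answers may differ from these values in the last digit.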

Computation of Probabilities Using the Normal Table

Example: Let Z be a standardized variable. Find the following probabilities using the Normal Table.
a. P(Z  0.40)  b. P(Z  0.63)  c. P(0.40  Z  0.63) 
d. P(Z  0.40)  e. P(Z  0.40)  e. P(1.96  Z  1.96) 

Example: Find the value of k , if:


a. P( Z  k )  0.0250 b. P( Z  k )  1.96 c. P(k  Z  k )  0.95

Applications of Normal Distribution

Example: A total of 160 students took the previous midterm examination of Math 15. If their
scores are normally distributed with $\mu = 22$ and $\sigma = 5$, find the following:

a. Proportion of students who obtained a score between 24 and 30.
b. Proportion of students whose scores are greater than 30.
c. If your teacher wishes to give a 1.0 grade to those students who obtain a score in the
90th percentile or higher, what is the minimum score?

Example: The scores on a standardized test for high school students are normally distributed with
mean 500 and standard deviation 100.

a. If you randomly selected a student taking this test, what is the probability that
student would score at least 450?
b. If you randomly selected a student taking this test, what is the probability they
would score between 450 and 600?
c. What score would a student need to get on this test to place him or her in the top
10% of all students?
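
The standardized-test example can be worked as follows. This is a minimal sketch that lets `statistics.NormalDist` do the standardization and the table lookup; the variable names are illustrative.

```python
from statistics import NormalDist

# Test scores: normal with mean 500 and standard deviation 100.
scores = NormalDist(mu=500, sigma=100)

# a. P(X >= 450) = 1 - P(X <= 450)
p_at_least_450 = 1 - scores.cdf(450)

# b. P(450 <= X <= 600) = P(X <= 600) - P(X <= 450)
p_between = scores.cdf(600) - scores.cdf(450)

# c. Score at the 90th percentile, i.e. the cutoff for the top 10%.
cutoff_top_10 = scores.inv_cdf(0.90)

print(round(p_at_least_450, 4), round(p_between, 4), round(cutoff_top_10, 1))
```

By hand, the same answers come from converting 450 and 600 to Z = -0.5 and Z = 1.0, and from converting the 90th-percentile Z value back with x = μ + Zσ.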

Exercises/Problem sets

1. In each case determine whether the given values can serve as the values of the probability
distribution of a random variable X that can take on the values 1, 2, and 3, explain your answers:
a. $P(X = 1) = 0.20$, $P(X = 2) = 0.40$, and $P(X = 3) = 0.40$;
b. $P(X = 1) = 0.50$, $P(X = 2) = 0.45$, and $P(X = 3) = 0.10$;
c. $P(X = 1) = \frac{10}{33}$, $P(X = 2) = \frac{1}{3}$, and $P(X = 3) = \frac{12}{33}$;
d. $P(X = 1) = 0.85$, $P(X = 2) = 0.20$, and $P(X = 3) = 0.05$.

2. For each of the following, determine whether it can serve as the probability distribution of a
random variable X :
a. $P(X = x) = \frac{1}{10}$ for $x = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10$;
b. $P(X = x) = \frac{1}{10}$ for $x = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10$;
x2
c. P( X  x)  for x  1, 2, 3, 4.
18

3. A doctor knows from experience that 12% of the patients to whom he prescribes a certain blood
pressure medication will have undesirable side effects. Use the formula for the binomial
distribution to calculate the probability that none of the four patients to whom he prescribes the
medication will have undesirable side effects.

4. If 40% of the mice used in an experiment will become very aggressive within two minutes after
having been administered an experimental drug, find the probability that exactly four of nine
mice that have been administered the drug will become very aggressive within two minutes.

5. An agricultural cooperative claims that 96% of the watermelons shipped out are ripe and ready to
eat. Find the probabilities that among 20 watermelons shipped out
a. at least 17 are ripe and ready to eat;
b. at least 5 are ripe and ready to eat;
c. at most 2 are ripe and ready to eat;
d. all of them are ripe and ready to eat.

6. Find the area under the normal curve that lies between the given values of Z.

a) Z = 0 and Z = 2.37
b) Z = 0 and Z = -1.94
c) Z = -1.85 and Z = 1.85
d) Z = -0.76 and Z = 1.13
e) Z = 0 and Z = 3.09
f) Z = -2.77 and Z = -0.96

7. If a set of grades in a Statistics examination is approximately normally distributed with a mean of 74
and a standard deviation of 7.9, find the probability that a student received a grade between 75
and 80.

8. If the weights of 600 students are normally distributed with a mean of 50 kilograms and a variance
of 16 square kilograms,
a. Determine the percentage of students with weights lower than 55 kilograms.
b. How many students have weights exceeding 52 kilograms?

9. If a random variable has a normal distribution with $\mu = 77.5$ and $\sigma = 12.4$, find the probabilities
that it will take on a value
a. less than 55.1;
b. greater than 84.3;
c. between 80.0 and 90.0;
d. between 72.4 and 82.6.

10. A random variable has a normal distribution with $\sigma = 10$. If the probability that the random
variable will take on a value less than 82.5 is 0.8212, what is the probability that it will take on a
value greater than 58.3?

11. The LDL cholesterol level of adults follow the normal distribution with mean of 4.8 and a
standard deviation of 0.6.
a. A person has moderate risk if his/her cholesterol level is more than 1 but less than 2
standard deviations above the mean. What proportion of the population has moderate
risk according to this criterion?
b. A person has high risk if his/her cholesterol level is more than 2 standard deviations above
the mean. What proportion of the population has high risk?
c. A person within 1 standard deviation of the mean has normal cholesterol risk. What
proportion of the population has normal risk?
d. What is the cholesterol level that exceeds 90% of the population?

12. Due to increasing environmental awareness in the Philippines, strict adherence to the size of
Lawaan boards being sold in the local market is imposed. In order to monitor and control the size
of the Lawaan boards, a large number of boards are measured periodically. It was found that the
actual thickness of 95% of narra boards with one-inch average thickness ranges between 28/32
inches and 36/32 inches. The thickness of these boards follows a bell-shaped curve. What is the
standard deviation, $\sigma$, of the thickness of these narra boards?

13. The local cable company is installing cable in the next barangay and proceeds to your own
barangay after completion. You are told that the time required is a normally distributed random
variable with $\mu = 24$ days and $\sigma = 2$ days. You are planning to buy a new TV. You don’t want to buy
the TV until you are 95% sure that the installation is completed. How many days should you wait
before buying the TV?

14. Mathematics in the Modern World removal exam had a mean score of 50 and a standard
deviation of 6. Assume a normal distribution.
a. What is the median?
b. What is the Z score of the mean?
c. In order to get 3.0 grade, your Z score must be  1.5 or above. What is the minimum
score necessary?

d. A Z score below -1.6 will be given 5.0 grade. What is the raw score?
e. If your raw score is 60, what is your Z score?
f. What raw score should be at the 95th percentile?

15. Five thousand students took the University entrance test. The scores were normally distributed.
Your score was in the 97.5th percentile.
a. How many people scored at or below your score?
b. Given that the mean score is 79, what is your raw score?
c. Referring to question b, what is the standard deviation if your score is 2 standard
deviations above the mean?

ESTIMATION

Inferential statistics addresses two kinds of problems: Estimation and Test of Hypothesis.
Estimation is concerned with finding a value or range of values for an unknown parameter, while
Hypothesis Testing deals with evaluating a claim or a conjecture about a parameter or distribution
of the population. An estimator of a parameter is a rule or formula for computing an estimate using
the sample data. The value of the estimator is referred to as the estimate. There can be several
estimators for a particular parameter. For example, a population mean can be estimated by any one of
the following: the sample mean, the sample modal value, and the sample median.

A good estimator must be accurate and precise. Accuracy measures the closeness of an estimate
to the true value of the parameter. Precision measures the closeness of the different possible values
of the estimator to each other. Accuracy is measured by the bias, which is the difference between
the expected value of the estimator and the parameter. An estimator whose bias is equal to zero is
said to be an unbiased estimator of the parameter. The precision of an estimator can be measured by
its variance or by its standard error, which is the square root of the variance.

Remark: We want the estimator to be both accurate and precise.

Measure of Accuracy and Precision

 Mean Square Error (MSE) measures both accuracy and precision. This is shown below
$MSE = Bias^2 + Variance$

Example: Given the two estimators below, which estimator would you rather choose if the
parameter of interest has a value equal to 7?

Estimator A Estimator B
E(A) = 5 E(B) = 7
Bias(A) = -2 Bias(B) = 0
V(A) = 11 V(B) = 18
MSE(A) = ? MSE(B) = ?
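
The two missing MSE entries follow directly from the formula MSE = Bias² + Variance. A minimal sketch, with the helper name `mse` chosen only for illustration:

```python
def mse(bias, variance):
    """Mean square error of an estimator: squared bias plus variance."""
    return bias**2 + variance


mse_a = mse(bias=-2, variance=11)  # Estimator A: (-2)^2 + 11
mse_b = mse(bias=0, variance=18)   # Estimator B: 0^2 + 18

print("MSE(A) =", mse_a)
print("MSE(B) =", mse_b)
```

Estimator A is biased but has the smaller MSE, so under the MSE criterion it would be preferred even though Estimator B is unbiased.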

Two Types of Estimator

1. Point estimator – is a formula that gives a single value in estimating a parameter.


2. Interval estimator – is a formula that gives a range of values for estimating a parameter.

Point Estimation Using Simple Random Sampling



- $\bar{x}$ is a point estimator of $\mu$, where
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$; n is the sample size.

Common Population Parameters with its Best Point Estimator


Population Parameter     Point Estimator
$\mu$                    $\bar{x}$
$P$                      $p$
$\sigma^2$               $s^2$

Interval Estimation
- a point estimate with a precision is the concern of interval estimation.
- Interval estimate describes a range of values, constructed from the sample data, within
which a population parameter lies with a predetermined probability or degree of
confidence.
- Confidence interval is the interval estimate.

General format of a confidence interval: Point estimate ± margin of error


The margin of error is a multiple of the standard error (SE), that is, the standard deviation of
the sampling distribution.

In estimating the mean, the margin of error or the maximum error is given by
Margin of error (E) = $Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
In the case of the sample mean, the central limit theorem assures that there is approximately a
68% chance for the sample mean to be within one standard error of its expected value, and
about a 95% chance for the sample mean to be within two standard errors of the population
mean. Such results enable us to attach approximately 68% confidence of covering the population
mean to an interval of the form:
Sample mean ± SE of the mean = $\bar{x} \pm \frac{\sigma}{\sqrt{n}}$
and about 95% confidence of covering the population mean to an interval of the form
Sample mean ± 2(SE of the mean) = $\bar{x} \pm 2\frac{\sigma}{\sqrt{n}}$
A 95% confidence interval for the population mean is the range of values within about 2 standard
errors of the sample mean. In 19 out of 20 sampling experiments, we expect the resulting interval
estimate to contain the true value of the population mean.

Level of confidence
 denoted by $100(1 - \alpha)\%$
 typical levels: 90%, 95%, 99%
 A relative frequency interpretation
- in the long run, $100(1 - \alpha)\%$ of all the confidence intervals that can be constructed
will contain the unknown parameter.
 It is wrong to say that a specific computed interval has a 95% probability of containing
the parameter; once constructed, an interval either contains the parameter or it does not.
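
The relative frequency interpretation can be illustrated by simulation. The sketch below is only an illustration under assumed values (a normal population with mean 50, known standard deviation 10, samples of size 25, and a fixed random seed); none of these numbers come from the notes.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage(mu=50.0, sigma=10.0, n=25, level=0.95, trials=2000, seed=0):
    """Fraction of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # Z_(alpha/2)
    margin = z * sigma / sqrt(n)                   # sigma is treated as known
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / trials


print(coverage())  # expected to be close to 0.95
```

Each simulated interval either contains the true mean or it does not; the 95% describes the long-run share of intervals that do.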

Confidence Interval Estimation


 An ideal estimate is one that is narrow and accurate.
 Interval estimation is based on repeated sampling, but for a given case, we usually
have only a single interval under consideration.

 The single confidence interval is either correct or incorrect, but the confidence level
indicates the proportion of correct intervals that can be expected when the estimation
procedure is repeated.
 Once an interval is constructed, we do not find out if it is actually correct.

Interpreting Confidence Intervals


 The 95% confidence intervals are different for 100 samples.
 For about 95 of them the interval covers the parameter, but about 5 do not.

Confidence Interval Estimation of the Population Mean

Case 1: Confidence interval for $\mu$ ($\sigma$ is known)

Assumptions: Population standard deviation is known and population is normally
distributed.
Confidence interval estimate: $\bar{x} - Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$.
Note: $Z_{\alpha/2}$ is the standard normal deviate whose area above it is $\alpha/2$.

Case 2: Confidence interval for $\mu$ ($\sigma$ is unknown, but large sample)

Assumptions: Population standard deviation is unknown but the sample size is large.
Confidence interval estimate: $\bar{x} - Z_{\alpha/2} \frac{s}{\sqrt{n}} < \mu < \bar{x} + Z_{\alpha/2} \frac{s}{\sqrt{n}}$.

Case 3: Confidence interval for $\mu$ ($\sigma$ is unknown)

Assumptions: Population standard deviation is unknown and population is normally
distributed.
Confidence interval estimate: $\bar{x} - t_{(\alpha/2,\, n-1)} \frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{(\alpha/2,\, n-1)} \frac{s}{\sqrt{n}}$.
Note: (n-1) is the degrees of freedom (df); df is the number of observations that are
free to vary after the sample mean has been computed.
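
A Z-based interval (Case 1, or Case 2 with s in place of σ) can be computed as sketched below. The sample values (mean 50, standard deviation 12, n = 36) are hypothetical and used only for illustration, and the function name `z_interval` is not from the notes. Case 3 needs a t-table value, which the Python standard library does not provide, so it is not shown.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(xbar, sd, n, level=0.95):
    """Confidence interval xbar -/+ Z_(alpha/2) * sd / sqrt(n)."""
    alpha = 1 - level
    z = NormalDist().inv_cdf(1 - alpha / 2)  # area alpha/2 lies above z
    margin = z * sd / sqrt(n)                # margin of error E
    return xbar - margin, xbar + margin


# Hypothetical data: sample mean 50, standard deviation 12, n = 36.
low, high = z_interval(xbar=50, sd=12, n=36, level=0.95)
print(round(low, 2), round(high, 2))
```

Raising the confidence level widens the interval, while increasing the sample size narrows it.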

Confidence Interval Estimation of the Population Mean Difference

Case 1: Confidence interval for the difference of two means ($\sigma_1^2$ and $\sigma_2^2$ are known)
Parameter of interest: $\mu_d = \mu_1 - \mu_2$.
Assumptions: known standard deviations and normally distributed or large samples.
Confidence interval estimate: $(\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
= $(\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$.

Case 2: Confidence interval for the difference of two means ($\sigma_1^2$ and $\sigma_2^2$ are unknown
but $n_1, n_2 \ge 30$)
Parameter of interest: $\mu_d = \mu_1 - \mu_2$.
Assumptions: unknown standard deviations and normally distributed or large samples.
Confidence interval estimate: $(\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
= $(\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$.

Case 3: Confidence interval for the difference of two means ($\sigma_1^2 = \sigma_2^2$ but unknown
and $n_1, n_2 < 30$)
Parameter of interest: $\mu_d = \mu_1 - \mu_2$.
Assumptions: unknown standard deviations but equal and are normally (or nearly
normally) distributed populations.
Confidence interval estimate: $(\bar{x}_1 - \bar{x}_2) \pm t_{(\alpha/2,\, n_1 + n_2 - 2)} \sqrt{\frac{s_{pooled}^2}{n_1} + \frac{s_{pooled}^2}{n_2}}$,
where $s_{pooled}^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ is the estimate of the common
population variance.
= $(\bar{x}_1 - \bar{x}_2) - t_{(\alpha/2,\, n_1 + n_2 - 2)} \sqrt{\frac{s_{pooled}^2}{n_1} + \frac{s_{pooled}^2}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + t_{(\alpha/2,\, n_1 + n_2 - 2)} \sqrt{\frac{s_{pooled}^2}{n_1} + \frac{s_{pooled}^2}{n_2}}$.

Confidence Interval of the Population Proportion


Assumptions: - The variable of interest follows the binomial distribution.
- Normal approximation can be used if $np \ge 5$ and $n(1 - p) \ge 5$.
Confidence interval estimate: $p \pm Z_{\alpha/2} \sqrt{\frac{p(1 - p)}{n}}$
= $p - Z_{\alpha/2} \sqrt{\frac{p(1 - p)}{n}} < P < p + Z_{\alpha/2} \sqrt{\frac{p(1 - p)}{n}}$.

Confidence Interval for two Population Proportions


Assumptions: - We have two independent sets of randomly selected sample data.
- For both samples, the conditions $np \ge 5$ and $nq \ge 5$ are satisfied.
Notations: For population 1: $P_1$ = population proportion; $n_1$ = sample size
$x_1$ = number of successes in the sample
$p_1 = \frac{x_1}{n_1}$ (the sample proportion)
$q_1 = 1 - p_1$
The corresponding meanings are attached to $P_2$, $x_2$, $n_2$, $p_2$ and $q_2$, which come from
population 2.
Confidence interval estimate: $(p_1 - p_2) \pm Z_{\alpha/2} \sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}}$
= $(p_1 - p_2) - Z_{\alpha/2} \sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}} < P_1 - P_2 < (p_1 - p_2) + Z_{\alpha/2} \sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}}$
where $p = \frac{x_1 + x_2}{n_1 + n_2}$; $q = 1 - p$

Confidence Interval for Paired Observations


Assumptions: The sample from one population has an effect on the sample from the
other population.
Steps in constructing confidence interval estimate for paired observations:
1. Take the difference for each pair, that is,
$d_i = x_{1i} - x_{2i}$ for all $i = 1, 2, \ldots, n$, where n is the number of pairs.
2. Get the mean ($\bar{d}$) and standard deviation ($s_d$) of the $d_i$'s, where
$\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i$ and $s_d^2 = \frac{\sum_{i=1}^{n} d_i^2 - \frac{\left(\sum_{i=1}^{n} d_i\right)^2}{n}}{n - 1}$; $s_d = \sqrt{s_d^2}$
3. The confidence interval is then $\bar{d} \pm t_{(\alpha/2,\, v)} \frac{s_d}{\sqrt{n}}$
with v = n-1 degrees of freedom

Sample Size Determination


i. Determining the sample size for Estimating the Population Mean:
$n = \frac{Z_{\alpha/2}^2 \, \sigma^2}{Error^2}$
Example: What sample size is needed to be 95% confident of being correct within $\pm 10$? A pilot
study suggested that the standard deviation $\sigma$ is 35.

ii. Determining the sample size for Estimating the Population Proportion:
$n = \frac{Z_{\alpha/2}^2 \, p(1 - p)}{Error^2}$
Example: A pollster is hired to determine the percentage of voters favoring the opposition
party presidential candidate. If we require 99% confidence that the estimated
value is within two percentage points of the true percentage, how large
should the random sample be?
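
Both sample-size examples can be worked as sketched below. Two assumptions are made here that the notes do not state: the result is always rounded up to the next whole number, and p = 0.5 is used in the proportion formula because no prior estimate of p is given (this choice makes p(1 - p) as large as possible). The function names are illustrative.

```python
from math import ceil
from statistics import NormalDist


def z_crit(level):
    """Z_(alpha/2) for a two-sided confidence level such as 0.95."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)


def n_for_mean(sigma, error, level):
    """n = Z^2 * sigma^2 / Error^2, rounded up."""
    return ceil((z_crit(level) * sigma / error) ** 2)


def n_for_proportion(error, level, p=0.5):
    """n = Z^2 * p(1 - p) / Error^2, rounded up; p = 0.5 is an assumed worst case."""
    return ceil(z_crit(level) ** 2 * p * (1 - p) / error**2)


n_mean = n_for_mean(sigma=35, error=10, level=0.95)
n_prop = n_for_proportion(error=0.02, level=0.99)
print(n_mean, n_prop)
```

The code uses the unrounded critical value; with the rounded table value 2.576 the proportion formula gives 4147.36, so the hand-computed answer rounds up to one more than the code's result.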

Ethical Issues:
i. Confidence interval (reflects sampling error) should always be reported along with
the point estimate.
ii. The level of confidence should always be reported.
iii. The sample size should be reported.
iv. The interpretation of the confidence interval estimate should also be provided.

Exercises/Problem sets

1. Use the given data to find the maximum error of estimate E. Be sure to use the correct expression
for E, depending on whether the normal distribution or Student t distribution applies.
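a. $\alpha = 0.05$, $\sigma = 10$, $n = 50$;
b. $\alpha = 0.05$, $\sigma = 10$, $n = 64$;
c. $\alpha = 0.01$, $\sigma = 10$, $n = 50$;
d. $\alpha = 0.01$, $\sigma = 10$, $n = 64$;
e. $\alpha = 0.05$, $s = 10$, $n = 100$;
f. $\alpha = 0.05$, $\sigma = 10$, $n = 25$;
g. $\alpha = 0.05$, $s = 10$, $n = 20$;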

2. A random sample of the midterm grades of 35 Math students was obtained. The grades are shown
below.
88 74 79 89 93 89 86 79 87 88
85 91 93 71 85 90 86 72 84 88
88 95 87 85 86 79 85 94 93 90
85 88 86 87 91
Answer the following:
a. Estimate the mean midterm grade of Math students;
b. Estimate the standard deviation midterm grade of Math students.
c. Estimate the proportion of students who failed during the midterm if the passing grade is
75.

3. Refer to question #2, find the following:


a. The maximum error of the estimate of the Population Mean midterm score.
b. The confidence interval for $\mu$.

4. Refer to question 2, construct a 95% confidence interval for the proportion of Math students who
pass the midterm examination.

5. A poll of 121 randomly selected car owners revealed that the mean length of time that they plan to
keep their cars is 7.01 years and the standard deviation is 3.75 years. Construct a 95% confidence
interval for the mean length of time all car owners want to keep their cars, include the
interpretation.

HYPOTHESIS TESTING: FOR ONE POPULATION CASE
AND TWO POPULATION CASE

The two areas of Inferential Statistics are Estimation and Hypothesis Testing. Hypothesis Testing is
an area of statistical inference in which one evaluates a conjecture about some characteristic of the
population based upon the information contained in a random sample. Usually the conjecture
concerns one of the unknown parameters of the population. A hypothesis is a claim or statement about
the population parameter.

Steps in Hypothesis Testing


1. State the null and alternative hypotheses.
2. Decide on a level of significance, $\alpha$.
3. Select the appropriate test statistic.
4. Establish the critical region/regions.
5. Compute the actual value of the test statistic from the sample.
6. Make the statistical decision:
a. If decision rule is based on region of rejection: Check if test statistic falls in the
region of rejection. If yes, reject the null hypothesis.
b. If decision rule is based on p-value: Determine the p-value. If the p-value is less
than or equal to $\alpha$, reject the null hypothesis.
7. Interpret results.

Null Hypothesis:
 denoted by $H_0$
 the statement being tested
 it represents what the experimenter doubts to be true
 must contain the condition of equality and must be written with the symbol
$=$, $\le$, or $\ge$.

Alternative Hypothesis:
 denoted by $H_a$
 is the statement that must be true if the null hypothesis is false
 the operational statement or the theory that the experimenter believes to be true
and wishes to prove
 is sometimes referred to as the research hypothesis

Test of Significance:
 A test of significance is a problem of deciding between the null and the alternative
hypothesis on the basis of the information contained in a random sample.
 The goal will be to reject $H_0$ in favor of $H_a$, because the alternative is the
hypothesis that the researcher believes to be true. If we are successful in
rejecting $H_0$, we then declare the results to be “significant”.

Two Types of Errors:


1. Type I Error – the mistake of rejecting the null hypothesis when it is true.

 It is not a miscalculation or procedural misstep; it is an actual error that can occur.
 The probability of rejecting the null hypothesis when it is true is called the significance
level ($\alpha$).
 The value of $\alpha$ is predetermined, and very common choices are
$\alpha = 0.05$ and $\alpha = 0.01$.

2. Type II Error – the mistake of failing to reject the null hypothesis when it is false.
 The symbol $\beta$ (beta) is used to represent the probability of a type II error.

Test Statistic:
 A statistic computed from the sample data that is especially sensitive to the
differences between $H_0$ and $H_a$.
 The test statistic should tend to take on certain values when $H_0$ is true and different
values when $H_a$ is true.
 The decision to reject $H_0$ depends on the value of the test statistic.
 A decision rule based on the value of the test statistic: Reject $H_0$ if the computed
value of the test statistic falls in the region of rejection.

Region of Rejection or Critical Region:

 the set of all values of the test statistic which will lead to the rejection of $H_0$.
Factors that determine the region of rejection:
 The behavior of the test statistic if the null hypothesis were true.
 The alternative hypothesis: the location of the region of rejection depends on the
form of $H_a$.
 Level of significance ($\alpha$): the smaller $\alpha$ is, the smaller the region of rejection.

Critical Value/s:
 The value or values that separate the critical region from the values of the test
statistic that would not lead to rejection of the null hypothesis.
 Depends on the nature of the null hypothesis, the relevant sampling distribution, and
the level of significance.

Types of Tests:
1. Two-tailed Test – if we are primarily concerned with deciding whether the true value of a
population parameter is different from a specified value, then the test should be two-tailed.
For the case of the mean, $H_a: \mu \ne \mu_0$.

2. Left-tailed Test – if we are primarily concerned with deciding whether the true value of a
population parameter is less than a specified value, then the test should be left-tailed. For
the case of the proportion, $H_a: P < P_0$.

3. Right-tailed Test – if we are primarily concerned with deciding whether the true value of a
population parameter is greater than a specified value, then the test should be right-tailed.
For the case of the difference of two population means, $H_a: \mu_1 - \mu_2 > 0$.

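Level of Significance and the Rejection Region:

$H_0: \mu \ge 25$, $H_a: \mu < 25$: rejection region of area $\alpha$ in the left tail.
$H_0: \mu \le 25$, $H_a: \mu > 25$: rejection region of area $\alpha$ in the right tail.
$H_0: \mu = 25$, $H_a: \mu \ne 25$: rejection regions of area $\alpha/2$ in each tail.
[Figure: three sampling distributions centered at 0, each marking the critical value(s) and the shaded rejection region(s).]
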
Probability Value or p-value
 the smallest level of significance at which the null hypothesis will be rejected based
on the information contained in the sample.
 is the actual or observed value of the probability of Type I error.
 the smaller the p-value, the stronger the evidence for rejecting $H_0$.
 an alternative form of decision rule: reject $H_0$ if the p-value is less than or equal to
the level of significance ($\alpha$)
 represents the chance of generating a value as extreme as the observed value of the
test statistic, or something more extreme, if the null hypothesis were true

Common Interpretation of p  values:


Table 8.1.1: P-value and its Interpretation
p-VALUE                 INTERPRETATION
$p < 0.01$              Very strong evidence against the null hypothesis
$0.01 \le p < 0.05$     Moderate evidence against the null hypothesis
$0.05 \le p < 0.10$     Suggestive evidence against the null hypothesis
$p \ge 0.10$            Little or no evidence against the null hypothesis

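Summary of the Tests Concerning the Population Mean
Test Statistic                                     Ho               Ha                 Region of Rejection
Case 1: $\sigma$ is known                          $\mu = \mu_0$    $\mu > \mu_0$      $z_c > z_\alpha$
$z_c = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$                     $\mu < \mu_0$      $z_c < -z_\alpha$
                                                                    $\mu \ne \mu_0$    $z_c < -z_{\alpha/2}$ or $z_c > z_{\alpha/2}$
Case 2: $\sigma$ is unknown and n<30               $\mu = \mu_0$    $\mu > \mu_0$      $t_c > t_{(\alpha, v)}$
$t_c = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$                          $\mu < \mu_0$      $t_c < -t_{(\alpha, v)}$
where $v = n - 1$                                                   $\mu \ne \mu_0$    $t_c < -t_{(\alpha/2, v)}$ or $t_c > t_{(\alpha/2, v)}$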

Remarks: 1. The tests summarized above are exact $\alpha$-level tests for samples from a normal
distribution. However, they provide good approximate $\alpha$-level tests when the
distribution is not normal, provided that the sample size is $n \ge 30$.

2. If $\sigma$ is unknown and $n \ge 30$, use the Z-test but replace $\sigma$ by s, that is,
$z_c = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
Tabulated z-values for the common choices of $\alpha$ for both
one-tailed and two-tailed tests
$\alpha$                             0.01                0.05               0.10
One-tailed test ($Z_\alpha$)         2.33 or -2.33       1.645 or -1.645    1.28 or -1.28
Two-tailed test ($Z_{\alpha/2}$)     2.576 and -2.576    1.96 and -1.96     1.645 and -1.645

Test of Hypothesis Concerning One Population Mean (  is known and large sample) Example

Problem: The mean weight of a sample of 100 persons from the Honolulu Heart Study is 63 kg.
If the ideal weight is known to be 60 kg, is the group significantly overweight? Assume $\sigma = 10$ kg
and $\alpha = 0.05$.
Solution: Step 1. $H_0: \mu = 60$ kg; $H_a: \mu > 60$ kg
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: $z_c = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$, since $\sigma$ is known and the
sample size is large
Step 4. Reject $H_0$ if $z_c > 1.645$
Step 5. $z_c = \frac{63 - 60}{10/\sqrt{100}} = 3$
Step 6. Reject $H_0$ since $z_c > 1.645$.
If decision is based on p-value: p-value = $P(\bar{X} \ge 63)$
= $P\left(Z \ge \frac{63 - 60}{10/\sqrt{100}}\right)$
= $P(Z \ge 3)$
= $1 - P(Z < 3)$
= 1 – 0.9987
= 0.0013
Through the p-value result, $H_0$ is rejected since p-value < 0.05.
Step 7. There is sufficient evidence to warrant rejection of the claim that the mean
body weight is 60 kg.

Revising the above problem, given that it is a two-tailed test:

Solution: Step 1. $H_0: \mu = 60$ kg; $H_a: \mu \ne 60$ kg
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: $z_c = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$, since $\sigma$ is known
and the sample size is large
Step 4. Reject $H_0$ if $z_c < -1.96$ or $z_c > 1.96$
Step 5. $z_c = \frac{63 - 60}{10/\sqrt{100}} = 3$
Step 6. Reject $H_0$ since $z_c > 1.96$.
If decision is based on p-value: p-value = $2P(\bar{X} \ge 63)$
= $2P\left(Z \ge \frac{63 - 60}{10/\sqrt{100}}\right)$
= $2P(Z \ge 3)$
= $2[1 - P(Z < 3)]$
= 2(1 – 0.9987)
= 2(0.0013)
= 0.0026
Through the p-value result, $H_0$ is rejected since p-value < 0.05.
Step 7. There is sufficient evidence to warrant rejection of the claim that the mean body
weight is 60 kg.
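
The computations in both versions of the Honolulu Heart Study example can be checked with a short script. This is a sketch, not part of the original solution; the function name `z_test` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n):
    """Return (z_c, right-tailed p-value, two-tailed p-value) for a one-sample Z-test."""
    z_c = (xbar - mu0) / (sigma / sqrt(n))
    p_right = 1 - NormalDist().cdf(z_c)            # P(Z >= z_c)
    p_two = 2 * (1 - NormalDist().cdf(abs(z_c)))   # 2 * P(Z >= |z_c|)
    return z_c, p_right, p_two


z_c, p_right, p_two = z_test(xbar=63, mu0=60, sigma=10, n=100)
print(z_c, round(p_right, 4), round(p_two, 4))

alpha = 0.05
print("reject H0 (right-tailed):", p_right <= alpha)
print("reject H0 (two-tailed):", p_two <= alpha)
```

The p-values differ slightly from 0.0013 and 0.0026 in the last decimal place because the Z-table rounds P(Z < 3) to 0.9987.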

Test of Hypothesis Concerning Population Mean ($\sigma$ is unknown and n < 30) Example

Problem: A standard final examination in an elementary statistics course produces a mean score
of 75. At the 5% level of significance, test the claim that the following sample scores reflect an
above-average class:
79 79 78 74 82 89 74 75 78 73
74 84 82 66 84 82 82 71 72 83

Solution: Step 1. $H_o: \mu = 75$
                  $H_a: \mu > 75$
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: $t_c = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$, since $\sigma$ is unknown and the
sample size is less than 30.
Step 4. Reject $H_o$ if $t_c > 1.729$.
Step 5. Computation:
\[ \bar{x} = \dfrac{79 + 79 + \dots + 83}{20} = 78.0500 \]
\[ s^2 = \dfrac{79^2 + 79^2 + \dots + 83^2 - \dfrac{(79 + 79 + \dots + 83)^2}{20}}{20 - 1} = 31.3132 \]
\[ s = \sqrt{s^2} = \sqrt{31.3132} = 5.5958 \]
\[ t_c = \dfrac{78.0500 - 75}{5.5958/\sqrt{20}} = 2.438 \]
Step 6. Reject $H_o$ since $t_c > 1.729$.
Step 7. There is sufficient evidence to support that the class is above
average.
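
As a check on Step 5, the sample mean, sample standard deviation, and t statistic can be recomputed from the raw scores. This is a small sketch using only the Python standard library; the critical value 1.729 is copied from Step 4 (t table, 19 degrees of freedom) rather than computed, because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

# The 20 sample scores from the problem
scores = [79, 79, 78, 74, 82, 89, 74, 75, 78, 73,
          74, 84, 82, 66, 84, 82, 82, 71, 72, 83]
mu_0 = 75            # hypothesized mean under H_o
T_CRITICAL = 1.729   # t(0.05, 19), taken from the t table as in Step 4

n = len(scores)
x_bar = mean(scores)          # sample mean
s = stdev(scores)             # sample standard deviation (divisor n - 1)
t_c = (x_bar - mu_0) / (s / sqrt(n))

print(f"x_bar = {x_bar:.4f}, s = {s:.4f}, t_c = {t_c:.3f}")
print("Reject H_o" if t_c > T_CRITICAL else "Fail to reject H_o")
```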

Testing a Claim about a Proportion:

Assumptions: 1. We are testing a claim made about a population proportion, probability, or
percentage.
2. The conditions for a binomial experiment are satisfied (i.e., we have a fixed number
of independent trials having constant probabilities, and each trial has two
outcome categories).
3. The conditions $nP \ge 5$ and $nq \ge 5$ are both satisfied so that the binomial
distribution of sample proportions can be approximated by a normal distribution
with $\mu = nP$ and $\sigma = \sqrt{nPq}$.

Notation in Testing a Claim about a Proportion

$n$ = number of trials
$p = \dfrac{x}{n}$ (sample proportion); x is the number of successes out of n trials.
$P$ = population proportion (used in the null hypothesis)
$q = 1 - P$

Test Statistic for Testing a Claim about a Proportion

\[ z = \dfrac{p - P}{\sqrt{\dfrac{Pq}{n}}} \]
The test statistic above is justified by noting that when using the normal distribution
to approximate a binomial distribution, we substitute $\mu = nP$ and $\sigma = \sqrt{nPq}$ to get
\[ z = \dfrac{x - \mu}{\sigma} = \dfrac{x - nP}{\sqrt{nPq}}. \]
Dividing the numerator and the denominator by n gives the form in terms of the sample proportion.
Remark: To test hypotheses made about proportions simply follow the steps in hypothesis testing
and use the test statistic given above.

Test of Hypothesis Concerning Population Proportion Example

Problem: At the 0.10 significance level, test the claim that the proportion of females P at Central
Mindanao University equals 0.60. Sample data consist of $n = 100$, of which 68 are females.
Solution: Step 1. $H_o: P = 0.6$
                  $H_a: P \ne 0.6$
Step 2. $\alpha = 0.10$
Step 3. Appropriate test statistic: $Z = \dfrac{x - nP}{\sqrt{nP(1 - P)}}$, since the data satisfy the
properties of the binomial experiment and $nP = 60$ and $n(1 - P) = 40$ are both
greater than 5.
Step 4. Reject $H_o$ if $z_c > 1.645$ or $z_c < -1.645$.
Step 5. $z_c = \dfrac{68 - 100(0.60)}{\sqrt{100(0.60)(1 - 0.60)}} = 1.63$
Step 6. Fail to reject $H_o$ since $-1.645 < z_c < 1.645$.
Step 7. There is not sufficient evidence to support that the proportion of
females at Central Mindanao University is not equal to 0.60.
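
The proportion test can be reproduced in a few lines. This is a sketch, assuming Python 3.8+; the function name `one_proportion_z` is hypothetical, and the two-tailed critical value is computed from $\alpha = 0.10$ with the standard normal inverse CDF instead of being read from a table.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(x, n, p_0):
    """z statistic for H_o: P = p_0, using the normal approximation to the binomial."""
    return (x - n * p_0) / sqrt(n * p_0 * (1 - p_0))


# Values from the example: 68 females out of n = 100, claimed proportion 0.60
z_c = one_proportion_z(x=68, n=100, p_0=0.60)

alpha = 0.10
critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
reject = abs(z_c) > critical

print(f"z_c = {z_c:.3f}, critical values = +/-{critical:.3f}")
print("Reject H_o" if reject else "Fail to reject H_o")
```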

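Tests for the Difference between Two Population Means

Case 1
Assumptions: 1. Independent samples
2. Normal populations
3. $\sigma_1$ and $\sigma_2$ are known
Test Statistic: $Z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$

Case 2
Assumptions: 1. Independent samples
2. $n_1 \ge 30$ and $n_2 \ge 30$
3. $\sigma_1$ and $\sigma_2$ are unknown
Test Statistic: $Z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$

Case 3
Assumptions: 1. Independent small samples
2. Approximately normal distributions
3. $\sigma_1$ and $\sigma_2$ are unknown, but equal
Test Statistic: $T = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_{pooled}^2}{n_1} + \dfrac{s_{pooled}^2}{n_2}}}$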

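Summary of the Tests Concerning Two Population Means for Two Independent Samples

Case 1: $\sigma_1$ and $\sigma_2$ known
Test Statistic: $z_c = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
Ho: $\mu_1 = \mu_2$
Ha                        Region of Rejection
$\mu_1 - \mu_2 > 0$       $z_c > z_{\alpha}$
$\mu_1 - \mu_2 < 0$       $z_c < -z_{\alpha}$
$\mu_1 - \mu_2 \ne 0$     $z_c < -z_{\alpha/2}$ or $z_c > z_{\alpha/2}$

Case 2: $\sigma_1$ and $\sigma_2$ are unknown but $n_1 \ge 30$ and $n_2 \ge 30$
Test Statistic: $z_c = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
Ho: $\mu_1 = \mu_2$
Ha                        Region of Rejection
$\mu_1 - \mu_2 > 0$       $z_c > z_{\alpha}$
$\mu_1 - \mu_2 < 0$       $z_c < -z_{\alpha}$
$\mu_1 - \mu_2 \ne 0$     $z_c < -z_{\alpha/2}$ or $z_c > z_{\alpha/2}$

Case 3: $\sigma_1$ and $\sigma_2$ are unknown, but equal, and $n_1 < 30$ and $n_2 < 30$
Test Statistic: $t_c = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_{pooled}^2}{n_1} + \dfrac{s_{pooled}^2}{n_2}}}$
Ho: $\mu_1 = \mu_2$
Ha                        Region of Rejection
$\mu_1 - \mu_2 > 0$       $t_c > t_{(\alpha,\, v)}$
$\mu_1 - \mu_2 < 0$       $t_c < -t_{(\alpha,\, v)}$
$\mu_1 - \mu_2 \ne 0$     $t_c < -t_{(\alpha/2,\, v)}$ or $t_c > t_{(\alpha/2,\, v)}$
where $v = n_1 + n_2 - 2$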

Remark: Two samples are independent if the sample selected from one population has no effect on
the sample selected from the other population. If the two samples are not independent, they are
dependent.

Testing the Difference between Two Population Means (independent samples) Example

Problem: A study was conducted to compare the length of time it took male and female students
from the same year level and college to answer a 50-item IQ test. Independent samples of 50
male students and 50 female students were asked to take the test, in which each person was
timed. The results were as follows:
MALE                          FEMALE
$n_1 = 50$                    $n_2 = 50$
$\bar{x}_1 = 42$ minutes      $\bar{x}_2 = 38$ minutes
$s_1^2 = 18$                  $s_2^2 = 14$
Did the data present sufficient evidence to suggest a difference between the true mean
completion times of male and female students at the 5% level of significance?
Solution: Step 1. $H_o: \mu_1 = \mu_2$
                  $H_a: \mu_1 \ne \mu_2$
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: $z_c = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
Step 4. Reject $H_o$ if $z_c > 1.96$ or $z_c < -1.96$.
Step 5. $z_c = \dfrac{42 - 38}{\sqrt{\dfrac{18}{50} + \dfrac{14}{50}}} = 5.0$
Step 6. Since $z_c > 1.96$, we reject the null hypothesis at the 5% level of
significance.

Step 7. There is sufficient evidence that the mean completion times of male
and female students are significantly different.
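
The two-sample statistic in Step 5 can be verified as follows. This is a minimal sketch with a hypothetical helper name (`two_sample_z`), assuming large independent samples so that the sample variances stand in for the population variances.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(x_bar1, x_bar2, var1, var2, n1, n2):
    """z statistic for H_o: mu1 = mu2 with large independent samples."""
    return (x_bar1 - x_bar2) / sqrt(var1 / n1 + var2 / n2)


# Summary values from the male/female completion-time example
z_c = two_sample_z(x_bar1=42, x_bar2=38, var1=18, var2=14, n1=50, n2=50)

critical = NormalDist().inv_cdf(1 - 0.05 / 2)  # two-tailed critical value at alpha = 0.05
reject = abs(z_c) > critical

print(f"z_c = {z_c:.2f}, critical values = +/-{critical:.2f}")
print("Reject H_o" if reject else "Fail to reject H_o")
```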

Testing the Difference between Two Population Means on Two Related Samples ($\mu_d = 0$)
Follow the steps presented in constructing the confidence interval for paired samples. The
test statistic is $t_c = \dfrac{\bar{d}}{s_d/\sqrt{n}}$.

Example for Testing the Difference for Paired Observations

Problem: A study was conducted to investigate the effectiveness of hypnotism in reducing pain.
Results for randomly selected subjects are given below. At the 0.05 significance level, test the
claim that the sensory measurements are lower after hypnotism (The values are before and after
hypnosis. The measurements are in centimeters on the mean visual analog scale, and the data are
based on “An analysis of Factors that Contribute to the Efficacy of Hypnotic Analgesia,” by Price
and Barber, Journal of Abnormal Psychology, Vol. 96, No. 1.)

Sensory Measurements Before and After Hypnotism


Subject A B C D E F G H
Before 6.6 6.5 9.0 10.3 11.3 8.1 6.3 11.6
After 6.8 2.4 7.4 8.5 8.1 6.1 3.4 2.0
Difference -0.2 4.1 1.6 1.8 3.2 2.0 2.9 9.6

Solution: Since each pair of scores is matched for one particular person, the
values are dependent. Each difference is the “before” score minus the “after” score. If the
hypnotism is effective, we would expect the after scores to be lower, so the mean difference should be significantly greater than
0.
Step 1. $H_o: \mu_d = 0$
        $H_1: \mu_d > 0$
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: $t_c = \dfrac{\bar{d}}{s_d/\sqrt{n}}$
Step 4. Reject $H_o$ if $t_c > t_{\alpha,\, n-1} = t_{0.05,\, 8-1} = 1.895$.
Step 5. $t_c = \dfrac{3.125 - 0}{2.911/\sqrt{8}} = 3.036$
Step 6. Reject $H_o$ at the 0.05 level of significance since $t_c > 1.895$.
Step 7. There is sufficient evidence to support the claim that the after scores are
significantly lower than the before scores.
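
The values $\bar{d} = 3.125$ and $s_d = 2.911$ used in Step 5 come from the difference row of the table. The sketch below recomputes them with the Python standard library; the critical value 1.895 is copied from Step 4 (t table, 7 degrees of freedom) and is not computed here.

```python
from math import sqrt
from statistics import mean, stdev

# Sensory measurements for subjects A through H
before = [6.6, 6.5, 9.0, 10.3, 11.3, 8.1, 6.3, 11.6]
after = [6.8, 2.4, 7.4, 8.5, 8.1, 6.1, 3.4, 2.0]
T_CRITICAL = 1.895  # t(0.05, 7), taken from the t table as in Step 4

# Each difference is the "before" score minus the "after" score
d = [b - a for b, a in zip(before, after)]
n = len(d)
d_bar = mean(d)    # mean of the differences
s_d = stdev(d)     # sample standard deviation of the differences
t_c = d_bar / (s_d / sqrt(n))

print(f"d_bar = {d_bar:.3f}, s_d = {s_d:.3f}, t_c = {t_c:.3f}")
print("Reject H_o" if t_c > T_CRITICAL else "Fail to reject H_o")
```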

Testing the Difference between Two Population Proportions

Assumptions: - We have two independent sets of randomly selected sample data.
- For both samples, the conditions $np \ge 5$ and $nq \ge 5$ are satisfied.
Notations: For population 1: $P_1$ = population proportion; $n_1$ = sample size
$x_1$ = number of successes in the sample
$p_1 = \dfrac{x_1}{n_1}$ (the sample proportion)
$q_1 = 1 - p_1$
The corresponding meanings are attached to $P_2$, $x_2$, $n_2$, $p_2$ and $q_2$, which come from
population 2.

Test Statistic for Two Proportions

\[ z_c = \dfrac{(p_1 - p_2) - (P_1 - P_2)}{\sqrt{\dfrac{\bar{p}\,\bar{q}}{n_1} + \dfrac{\bar{p}\,\bar{q}}{n_2}}} \quad \text{where } \bar{p} = \dfrac{x_1 + x_2}{n_1 + n_2};\ \bar{q} = 1 - \bar{p} \]

Example of Testing the Difference Between Two Proportions

Problem: A survey of 100 women and 100 men indicated that 49 of the women and 35 of the men
said they are trying to lose weight. Is there a significant difference between the proportions of women
and men trying to lose weight? Test at the 0.10 level of significance.
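
No worked solution accompanies this problem, so the following is only a sketch of how the pooled test statistic defined above could be evaluated for the survey data. It assumes a two-tailed test with $H_o: P_1 = P_2$ (so $P_1 - P_2 = 0$); the function name `two_proportion_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H_o: P1 = P2 (hypothesized difference of 0)."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)   # pooled sample proportion
    q_bar = 1 - p_bar
    return (p1 - p2) / sqrt(p_bar * q_bar / n1 + p_bar * q_bar / n2)


# 49 of 100 women and 35 of 100 men said they are trying to lose weight
z_c = two_proportion_z(x1=49, n1=100, x2=35, n2=100)

alpha = 0.10
critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
reject = abs(z_c) > critical

print(f"z_c = {z_c:.3f}, critical values = +/-{critical:.3f}")
print("Reject H_o" if reject else "Fail to reject H_o")
```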

Exercises/Problem sets

1. For each of the following, state the null ($H_o$) and alternative ($H_1$) hypotheses.
a. The mean age of CMU Statistics students is at least 17.
b. The mean waistline of UE Statistics students is at most 29 inches.
c. The mean weekly allowance of Senior High School students is 500 pesos.
d. The mean height of male basketball varsity is 173.00 cm.
e. The mean of girls at birth is at most 7 lbs.
f. The mean age of CMU teachers is more than 31 years.
g. CMU administrative council claims that CMU dormitories mean rental is significantly
lower than 75 pesos that other state universities dormitory in the Philippines.
h. A teacher claims that the mean score of her students’ midterm examination exceeds 10
points above the over-all students who took the exam.
i. The proportion of Statistics students believes that 95% of them will obtain at least 2.25
grade.
j. USEP guidance counselor claims that 60 percent of USEP students have an average IQ.

2. Answer the following with reference to question #1:


a. Identify the parameter of interest for each situation;
b. Identify the type I error and type II error for each claim;
c. Categorize the hypothesis test as a right-tailed test, a left-tailed test, or a two-tailed test.

4. What is the critical value for a test of significance in each of the following situations?
a. right-tailed test, $\alpha = 0.05$, $\sigma$ known, $n = 24$
b. right-tailed test, $\alpha = 0.05$, $\sigma$ unknown, $n = 15$
c. right-tailed test, $\alpha = 0.01$, $\sigma$ known, $n = 24$
d. right-tailed test, $\alpha = 0.01$, $\sigma$ unknown, $n = 15$
e. left-tailed test, $\alpha = 0.05$, $\sigma$ known, $n = 35$
f. left-tailed test, $\alpha = 0.05$, $\sigma$ unknown, $n = 15$
g. left-tailed test, $\alpha = 0.10$, $\sigma$ known, $n = 35$
h. left-tailed test, $\alpha = 0.10$, $\sigma$ unknown, $n = 15$
i. two-tailed test, $\alpha = 0.05$, $\sigma$ known, $n = 35$
j. two-tailed test, $\alpha = 0.05$, $\sigma$ unknown, $n = 15$
k. two-tailed test, $\alpha = 0.05$, $\sigma$ known, $n = 24$
l. two-tailed test, $\alpha = 0.05$, $\sigma$ unknown, $n = 15$

5. For each part of question #4, decide whether you should reject $H_o$ or fail to reject $H_o$ according
to the corresponding computed test statistic.
a. 2.0   b. 2.0   c. 2.0   d. 2.0   e. -1.50   f. -1.50   g. -1.50   h. -1.50
i. 2.0 or -2.0   j. 2.0 or -2.0   k. 1.5 or -1.5   l. 1.5 or -1.5
6. Find the critical Z value(s) for the given conditions. In each case assume that the normal
distribution applies. Also, draw a graph showing the critical value(s) and critical region(s).
a. right-tailed test; $\alpha = 0.05$
b. left-tailed test; $\alpha = 0.05$
c. right-tailed test; $\alpha = 0.10$
d. left-tailed test; $\alpha = 0.10$
e. two-tailed test; $\alpha = 0.05$
f. two-tailed test; $\alpha = 0.10$

7. At $\alpha = 0.05$, decide whether to reject or fail to reject $H_o$ for the following computed
p-values:
a. 0.02524 b. 0.5055 c. 0.1028 d. 0.04987

In each of the following exercises, test the given hypotheses by following the steps in hypothesis
testing.

_
8. Test the claim that   110 given a sample of n  78 for which x  115. Assume that   3, and
test at   0.05 significance level.

_
9. Test the claim that   110 given a sample of n  78 for which x  105 Assume that   3, and
test at   0.05 significance level.

_
10. Test the claim that   110 given a sample of n  20 for which x  115. Assume that   3,
and test at   0.05 significance level.

_
11. Test the claim that   110 given a sample of n  20 for which x  105. Assume that   3,
and test at   0.05 significance level.

_
12. Test the claim that   110 given a sample of n  20 for which x  115. Assume that s  3,
and test at   0.05 significance level.

_
13. Test the claim that   110 given a sample of n  20 for which x  105. Assume that s  3,
and test at   0.05 significance level.

_
14. Test the claim that   110 given a sample of n  20 for which x  115. Assume that s  3,
and test at   0.05 significance level.

_
15. Test the claim that   110 given a sample of n  20 for which x  105. Assume that s  3,
and test at   0.05 significance level.

ANALYSIS OF VARIANCE AND CHI-SQUARE TEST

We’ve learned procedures for testing the hypothesis that two population means are equal
($H_o: \mu_1 = \mu_2$). In this chapter, we test the hypothesis that differences among three or more sample
means are due to chance. A typical null hypothesis will be $H_o: \mu_1 = \mu_2 = \dots = \mu_k$, where k is the
number of means being compared.

Hypothesis Testing for More than Two Population Means Assumptions:


1. We test the hypothesis that three or more samples come from populations with the same
mean.
2. The populations being considered have normal distributions.
3. The populations being considered have the same variance (or standard deviation).
4. The different samples are from populations that are categorized in only one way.
5. The samples are random and independent of each other.

A statistical test to determine if k population means are equal: The One - Way Analysis of Variance
 The analysis of variance is used to test the hypothesis that the means of three or
more populations are the same against the alternative hypothesis that not all
population means are the same.
 It is called the analysis of variance because the test is based on the analysis of
variance in the data obtained from different samples.
 Only one factor or variable is analyzed in one-way Analysis of Variance (ANOVA).

4. One-Way ANOVA Test


 The one-way ANOVA test is applied by calculating two estimates of the variance of
population distributions: variance between samples and the variance within samples.
 The variance between samples is also called the mean square between samples or
MSB. The variance within samples is also called the mean square within samples or
MSW.
 The variance between samples (also called variation due to treatment), MSB, gives
an estimate of $\sigma^2$ based on the variation among the means of samples taken from
different populations.
 The variance within samples (also called variation due to error), MSW , gives an
estimate of $\sigma^2$ based on the variation within the data of different samples.
 The value of the test statistic F for a test of hypothesis is given by the ratio of two
variances, the variance between samples (MSB) and the variance within samples
(MSW ).
\[ F = \dfrac{MSB}{MSW} = \dfrac{n s_{\bar{x}}^2}{s_p^2} \]
where: n is the sample size common to each sample
$s_{\bar{x}}^2$ is the variance of the sample means
$s_p^2$ is the pooled variance (mean of the sample variances)

5. Variance between samples (MSB) measures the variability caused by differences among the
samples means that correspond to the different treatments or categories of classification. From
the above test statistic, with all samples of the same size n,
MSB = $n s_{\bar{x}}^2$

6. Variation within samples (MSW) is the pooled variance obtained by finding the mean of the sample
variances, which provides a good estimate of the common population variance.
MSW = $s_p^2$

7. Interpreting F
 If the two estimates of variance are close, the calculated value of F will be close to 1
and we conclude that there are no significant differences among the sample means.
 If the value of F is excessively large, then reject the claim of equal means.

8. Steps in hypothesis testing for more than two population means follow as presented in unit 8.

9. Testing the Difference for more than two Population Means Example
Problem: The CMU Mathematics Department would like to know if students' Math 11 scores differ when
they are grouped according to college (ABC, ABS, RPN and IBC). A random sample of 10 students per
college was obtained. At the 0.05 level of significance, do the data below provide evidence that
Math 11 scores differ? Assume that Math 11 scores follow the normal distribution, with the different
colleges having homogeneous variance.

Table 9.1.1: Students Math 11 Scores of Different Colleges


ABC ABS RPN IBC
52 56 51 54
51 53 54 52
48 52 54 49
55 53 49 51
43 58 50 45
52 58 45 47
50 55 48 49
59 54 46 50
41 52 48 47
43 59 45 55
Table 9.1.2: Summary Measures of the above Problem
$\bar{x}_1 = 49.4$    $\bar{x}_2 = 55$    $\bar{x}_3 = 49$    $\bar{x}_4 = 49.9$
$s_1 = 5.719$    $s_2 = 2.625$    $s_3 = 3.3$    $s_4 = 3.178$

Solution: Step 1. $H_o: \mu_1 = \mu_2 = \mu_3 = \mu_4$
                  $H_1$: The preceding means are not all equal.
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: $F_c = \dfrac{MSB}{MSW}$
Step 4. Reject $H_o$ if $F_c > F_{\alpha,\,(k-1,\; k(n-1))} = F_{0.05,\,(4-1,\; 4(10-1))} = F_{0.05,\,(3,\,36)} = 2.866$,
where n is the number of observations in each category and k is the number of classifications.
Step 5. Computation:
\[ F_c = \dfrac{n\left[\dfrac{\sum_{i=1}^{4} \bar{x}_i^2 - \dfrac{\left(\sum_{i=1}^{4} \bar{x}_i\right)^2}{k}}{k - 1}\right]}{\dfrac{s_1^2 + s_2^2 + s_3^2 + s_4^2}{4}} = \dfrac{10(7.8825)}{\dfrac{60.5872}{4}} = 5.2040 \]
Step 6. Reject $H_o$ at the 0.05 level of significance since $F_c > 2.866$.
Step 7. There is sufficient evidence that the Math 11 scores of students of
CMU colleges statistically differ.
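
The F statistic can also be computed directly from the raw scores in Table 9.1.1, using the equal-sample-size formulas MSB = $n s_{\bar{x}}^2$ and MSW = $s_p^2$ given earlier. This is a minimal sketch, assuming Python 3.8+ and only the standard library; it does not look up the critical F value.

```python
from statistics import fmean, variance

# Math 11 scores from Table 9.1.1, one list per college
groups = {
    "ABC": [52, 51, 48, 55, 43, 52, 50, 59, 41, 43],
    "ABS": [56, 53, 52, 53, 58, 58, 55, 54, 52, 59],
    "RPN": [51, 54, 54, 49, 50, 45, 48, 46, 48, 45],
    "IBC": [54, 52, 49, 51, 45, 47, 49, 50, 47, 55],
}

n = 10  # common sample size per college
sample_means = [fmean(scores) for scores in groups.values()]
sample_variances = [variance(scores) for scores in groups.values()]

msb = n * variance(sample_means)   # variance between samples
msw = fmean(sample_variances)      # pooled variance (mean of the sample variances)
f_c = msb / msw

print(f"MSB = {msb:.4f}, MSW = {msw:.5f}, F_c = {f_c:.4f}")
```

Because this works from the unrounded sample variances, the result agrees with the computer output to more decimal places than the hand computation, which uses the rounded standard deviations of Table 9.1.2.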

10. Computer output:

ANOVA: Single Factor

SUMMARY
Groups Count Sum Average Variance
ABC 10 494 49.4 32.71111
ABS 10 550 55 6.888889
RPN 10 490 49 10.88889
IBC 10 499 49.9 10.1

ANOVA
Source of Variation    SS         df    MS          F           P-value     F crit
Between Groups         236.475    3     78.825      5.203924    0.004341    2.866266
Within Groups          545.3      36    15.14722

Total                  781.775    39

In the preceding topics, the statistical tools presented are for quantitative data. What about qualitative
data? Qualitative data are data obtained from a variable whose values are usually expressed in
categories. For example, we may classify students into categories such as male or female, or university
scholar or not a university scholar. For qualitative data the result is frequency data, since we
count the number of observations falling in each category. Frequency data are also called categorical
data. Categorical data are presented in a contingency table. A contingency table (or two-way
contingency table) is a table in which frequencies correspond to two variables (one variable is used
to categorize rows, and a second variable is used to categorize columns). The chi-square test is used as a
test of significance when we have data that are expressed in frequencies, or data that are in terms of
percentages or proportions but which can be readily transformed into frequencies. This test is used in
the following:
i. goodness-of-fit test
ii. test for independence
iii. test of homogeneity

The chi-square distribution

 Written as the $\chi^2$ distribution; the symbol $\chi$ is the Greek letter chi, pronounced
“ki”.
 The values of a chi-square distribution are denoted by the symbol $\chi^2$ (read as
chi-square).
 The chi-square distribution has only one parameter, called the degrees of
freedom (df).
 The shape of a chi-square distribution curve is skewed to the right for small df and
becomes symmetric for large df.
 The entire chi-square distribution curve lies to the right of the vertical axis.
 The chi-square distribution assumes nonnegative values.

The goodness-of-fit test is used when we would like to know if the data on hand conform to a theoretical
distribution. The test statistic is
\[ \chi^2 = \sum \dfrac{(O - E)^2}{E} \]
where: O represents the observed frequency of an outcome.
E represents the theoretical or expected frequency of an outcome.
$E = np$; n is the sample size and p is the probability that an element
belongs to the category if the null hypothesis is true.

The number of degrees of freedom (df) is equal to the number of possible outcomes minus
one. Note that close agreement between observed and expected values will lead to a small value
of $\chi^2$. A large value of $\chi^2$ will indicate strong disagreement between observed and expected
values. A significantly large value of $\chi^2$ will therefore cause rejection of the null hypothesis of no
difference between observed and expected frequencies.
Experiments with more than two possible outcomes (or categories) are called multinomial
experiments, and tests of hypotheses about them rely on the four characteristics listed below.
A binomial experiment is a special case of a multinomial experiment.

A multinomial experiment is an experiment with the following characteristics:


i. It consists of n identical trials.
ii. Each trial results in one of k possible outcomes (or categories), where k  2.
iii. The trials are independent.
iv. The probabilities for various outcomes remain constant for each trial.

Multinomial Experiment Assumptions
i. We intend to test a hypothesis that for the k categories of outcomes in a multinomial
experiment, the population proportion for each of the k categories is as claimed.
ii. The sample data consist of frequency counts for the k different categories, and the data
constitute a random sample.
iii. For every one of the k categories, the expected frequency is at least 5.

Goodness-of-fit test example: A case study examined 147 industrial
accidents that required medical attention. The sample data are summarized in Table 9.1.3. At the 0.05
significance level, test the claim that accidents occur on the five days with equal frequencies.

Case Study of 147 Industrial Accidents that Required Medical Attention

Day                   Monday    Tuesday    Wednesday    Thursday    Friday
Observed Accidents    31        42         18           25          31
Expected Accidents    29.4      29.4       29.4         29.4        29.4

Solution: Step 1. $H_o: P_i = 0.20$; $i = 1, 2, 3, 4, 5$
                  $H_1$: at least one of the proportions is different.
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: $\chi^2$ statistic
Step 4. Reject $H_o$ if $\chi_c^2 > \chi^2_{\alpha,\, k-1} = \chi^2_{0.05,\, 5-1} = 9.488$.
Step 5. Computation:
\[ \chi_c^2 = \sum \dfrac{(O - E)^2}{E} = \dfrac{(31 - 29.4)^2}{29.4} + \dfrac{(42 - 29.4)^2}{29.4} + \dots + \dfrac{(31 - 29.4)^2}{29.4} = 10.653 \]
Step 6. Reject $H_o$ at the 0.05 level of significance since $\chi_c^2 > 9.488$.
Step 7. There is sufficient evidence to warrant rejection of the claim that
the proportion of accidents is the same for all five days
of the week.
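
The goodness-of-fit computation can be sketched in Python as follows. The expected frequency is derived from the null hypothesis of equal proportions ($E = np$ with $p = 1/5$), and the critical value 9.488 is copied from Step 4 (chi-square table, 4 degrees of freedom) rather than computed.

```python
# Observed accident counts by day, from the case study table
observed = {"Monday": 31, "Tuesday": 42, "Wednesday": 18, "Thursday": 25, "Friday": 31}
CHI2_CRITICAL = 9.488  # chi-square(0.05, df = 4), taken from the table as in Step 4

n = sum(observed.values())   # total number of accidents
k = len(observed)            # number of categories
expected = n / k             # E = n * p with p = 1/k under H_o

# Sum of (O - E)^2 / E over all categories
chi2 = sum((o - expected) ** 2 / expected for o in observed.values())
df = k - 1

print(f"E = {expected:.1f}, chi-square = {chi2:.3f}, df = {df}")
print("Reject H_o" if chi2 > CHI2_CRITICAL else "Fail to reject H_o")
```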

Test for Independence – is used to determine whether an attribute or characteristic is independent
of another attribute or characteristic.
- In a test of independence in a contingency table, we test the null hypothesis that the two
attributes of the given elements of a given population are not related (that is,
they are independent) against the alternative hypothesis that the two
characteristics are related.
- The test statistic is based on the chi-square distribution; the chi-square test
statistic is given below:
\[ \chi^2 = \sum \dfrac{(O - E)^2}{E} \]
where: O are the observed frequencies
E are the expected frequencies, $E = \dfrac{(\text{row total})(\text{column total})}{\text{grand total}}$
Remarks: 1. The test statistic allows us to measure the degree of disagreement between the
frequencies actually observed and those that we would theoretically expect
when the two variables are independent.
2. Small values of the test statistic result from close agreement between observed
frequencies and frequencies expected with independent row and column
variables.
3. Large values of the test statistic are in the right tail of the chi-square distribution,
and they reflect significant differences between observed and expected
frequencies.

Steps in Performing a Chi-square Test for Independence


Step 1. Organizing the data into a contingency table.
Step 2. Computing all marginal totals and the grand total.
Step 3. Constructing a table of expected frequencies formed from the product of row and
column totals divided by the grand total.
Step 4. Computing the  2 statistic.
Step 5. Identifying a desired level of significance and establishing the critical region based on
$\alpha$:
$\chi_c^2 > \chi^2_{\alpha,\, df}$
where $\chi^2_{\alpha,\, df}$ is the chi-square percentile from the distribution with (number of rows
– 1) times (number of columns – 1) degrees of freedom.

Step 6. Computation of the actual value of the test statistic.


Step 7. Make the statistical decision and results interpretation.

Chi-square Test for Independence Assumptions


i. We intend to test the hypothesis that for a contingency table, the row variable and column
variable are independent
ii. The sample data are randomly selected.
iii. For every cell in the contingency table, the expected frequency is at least 5.

Chi-square Test for Independence Example

Problem: In a study of 1,000 randomly selected deaths of males aged 45-64, the causes of death
are listed along with their smoking habits (see Table 9.1.4, which is based on data from
“Chartbook on Smoking, Tobacco, and Health,” USDHEW Publication CDC75-7511).

Number of 1,000 Randomly Selected Deaths of Males Aged 45-64
by Smoking Status and Cause of Death (numbers enclosed in
parentheses are the expected values)
Smoking Status    Cause of Death                                     Row Total
                  Cancer          Heart Disease    Other
Smoker            135 (123.50)    310 (302.25)     205 (224.25)     650
Nonsmoker         55 (66.50)      155 (162.75)     140 (120.75)     350
Column Total      190             465              345              1,000
Test at the 0.05 level of significance that smoking status is independent of the cause of death.

Solution: Step 1. $H_o$: Smoking status is independent of the cause of death.
                  $H_1$: Smoking status and cause of death are related.
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: $\chi^2$ statistic
Step 4. Critical Region: Reject $H_o$ if $\chi_c^2 > \chi^2_{\alpha,\, df} = \chi^2_{0.05,\, (r-1)(c-1)} = \chi^2_{0.05,\, (2-1)(3-1)} = 5.991$.
Step 5. Computation:
\[ \chi_c^2 = \sum \dfrac{(O - E)^2}{E} = \dfrac{(135 - 123.50)^2}{123.50} + \dfrac{(310 - 302.25)^2}{302.25} + \dots + \dfrac{(140 - 120.75)^2}{120.75} = 8.349 \]
Step 6. Reject $H_o$ at the 0.05 level of significance since $\chi_c^2 > 5.991$.
Step 7. There is sufficient sample evidence to warrant rejection of the claim that
smoking status and cause of death are independent.
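
The expected frequencies in parentheses and the resulting statistic can be reproduced from the observed counts alone. The sketch below assumes plain Python lists for the contingency table; the critical value 5.991 is copied from Step 4 (chi-square table, 2 degrees of freedom).

```python
# Observed counts: rows = Smoker, Nonsmoker; columns = Cancer, Heart Disease, Other
observed = [
    [135, 310, 205],
    [55, 155, 140],
]
CHI2_CRITICAL = 5.991  # chi-square(0.05, df = 2), taken from the table as in Step 4

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# E = (row total)(column total) / grand total for every cell
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print("Expected frequencies:", expected)
print(f"chi-square = {chi2:.3f}, df = {df}")
print("Reject H_o" if chi2 > CHI2_CRITICAL else "Fail to reject H_o")
```

The same computation applies unchanged to a test of homogeneity, since only the interpretation of the rows and columns differs.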

Test of Homogeneity – tests the claim that different populations have the same proportions of some
characteristic. The chi-square test for homogeneity is performed in the same way as the test for
independence.

Chi-square Test for Homogeneity Example

Problem: A survey was conducted in Quezon City and Manila to determine voter sentiment for
two presidential candidates, Digong and Noynoy. Five hundred voters were randomly selected
from each city and the data are given in the table below. At the 0.05 level of significance, test the
null hypothesis that the proportions of voters favoring Digong, Noynoy, and some other candidate are
the same for each city.

Table 9.1.5: Observed and Expected Frequencies for the Voter Sentiment Survey
Voter Sentiment    Quezon City    Manila       Row Total
Favor Digong       204 (216)      228 (216)    432
Favor Noynoy       215 (206)      197 (206)    412
Favor Another      81 (78)        75 (78)      156
Column Total       500            500          1,000

Solution: Step 1. $H_o$: For each voter sentiment (favoring Digong, Noynoy, or
another candidate), the true proportion of voters is the same in
Quezon City and Manila.
$H_1$: For at least one voter sentiment, the proportions differ between the two cities.
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: $\chi^2$ statistic
Step 4. Critical Region: Reject $H_o$ if $\chi_c^2 > \chi^2_{\alpha,\, df} = \chi^2_{0.05,\, (r-1)(c-1)} = \chi^2_{0.05,\, (3-1)(2-1)} = 5.991$.
Step 5. Computation:
\[ \chi_c^2 = \sum \dfrac{(O - E)^2}{E} = \dfrac{(204 - 216)^2}{216} + \dfrac{(228 - 216)^2}{216} + \dots + \dfrac{(75 - 78)^2}{78} = 2.35 \]
Step 6. Fail to reject $H_o$ at the 0.05 level of significance since $\chi_c^2 < 5.991$.
Step 7. There is not sufficient evidence to conclude that the proportions of
voters favoring Digong, Noynoy, or another candidate differ between the
two cities.

Some Issues Regarding the Use of Chi-square Test

 For very small frequencies, the test is not a good approximation. The approximation is
usually considered adequate provided that the expected frequencies in all cells are at
least 5. If some frequencies are below 5, this requirement may be met by combining
two rows or columns before computing the test statistic. A corresponding reduction
in the degrees of freedom must then be made to account for the smaller number of
cells.
 For one degree of freedom, Yates' correction for continuity should be applied:
\[ \chi^2 = \sum \dfrac{(|O - E| - 0.5)^2}{E} \]
 The observations should be independent of one another, that is, the total number of
observed frequencies should not be more than the total number of individuals in the
sample.
 The chi-square test does not apply to data expressed as percentages or proportions,
unless they can be transformed into frequencies.
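
To illustrate Yates' correction for a table with one degree of freedom, the sketch below applies the corrected statistic to a 2 x 2 table. The counts are hypothetical and invented purely for illustration (they do not come from this handout), and the function name `yates_chi_square` is likewise an assumption.

```python
def yates_chi_square(observed):
    """Chi-square statistic with Yates' continuity correction for a 2 x 2 table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected frequency
            chi2 += (abs(o - e) - 0.5) ** 2 / e              # corrected term
    return chi2


# Hypothetical 2 x 2 table of counts (illustration only, not data from the text)
table = [
    [10, 20],
    [20, 10],
]
chi2_yates = yates_chi_square(table)
print(f"Yates-corrected chi-square = {chi2_yates:.3f} with df = 1")
```

Every expected frequency in this hypothetical table is 15, so each cell contributes $(5 - 0.5)^2 / 15$ to the corrected statistic, slightly less than the uncorrected $5^2 / 15$.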

Problems/Exercises

1. What are the critical F values for the $\alpha = 0.05$ level?

a. $F_{1,\,18}$ = _________; $F_{2,\,20}$ = _________; $F_{3,\,18}$ = _________
b. What are these critical values for $\alpha = 0.10$?
c. What are these critical values for $\alpha = 0.01$?

For questions # 2-3, complete the ANOVA table:

2. ANOVA Table
SOURCE SS DF MS F
Between 345 4
Within 260
Total 49

3. ANOVA Table
SOURCE SS DF MS F
Between 3 50
Within 32 12.5
Total 35

4. Forty freshmen students were randomly assigned to four groups of experiment with four different
methods of teaching College Algebra. At the end of the semester, the same test was given to all
40 students. The table gives the scores of students in the four groups.

Students College Algebra Scores of the Different Teaching Methods


METHOD I METHOD II METHOD III METHOD IV
25 26 28 29
28 28 29 25
29 32 34 19
31 29 45 18
25 36 38 30
36 35 39 35
25 29 40 28
38 28 43 26
24 27 32 24
28 36 37 28

Test that the mean scores of all four groups of freshmen students taught by four different
methods are not equal. Assume that all the required assumptions hold true. Use 0.05 level of
significance.

5. The table below provides the data that represent the number of hours of pain relief provided by 5
different brands of headache tablets administered to 25 subjects. The 25 subjects were randomly
selected, divided into 5 groups and each group was treated with a different brand.

Hours of Relief from Headache Tablet


HOURS OF RELIEF
Tablet 1 Tablet 2 Tablet 3 Tablet 4 Tablet 5
3 9 3 4 7
6 7 5 1 4
8 8 2 4 9
5 6 3 3 6
4 9 7 2 7

6. Stain Less advertises that its detergent will remove all stains, except oil-base paint, in any kind of
water. Consumer Action is evaluating this claim. Batches of were run in 5 randomly chosen homes
having a particular type of water – hard, moderate or soft. Each batch contains an assortment of rags
and cloth scraps stained with food products, grease, and dirt over a 150 square inch area. After
washing the number of square inches that were still stained was determined and the following results
were obtained.

Square Inches Left of Stained for Different Types of Water Used


OBSERVATION TYPE OF WATER
Hard Moderate Soft
1 6 5 5
2 4 6 0
3 3 9 2
4 9 4 4
5 7 3 3

7. If you find 10 lemons, 13 apples, and 27 mangoes, what is the value of $\chi^2 = \sum \dfrac{(O - E)^2}{E}$ to test
the hypothesis that the relative frequencies of the three fruits are 25%, 25% and 50%,
respectively?

8. Using the value of the $\chi^2$ statistic in question #7, does the ratio of fruits match that of the 25-25-50
relative frequency? Test at the 0.05 level of significance.

9. A bank has an ATM installed inside the bank and it is available to its customers only from 7:00 AM
to 6:00 PM Monday through Friday. The manager of the bank wanted to investigate if the
percentage of transactions made on this ATM is the same for each of the 5 days (Monday through
Friday) of the week. The manager randomly selected one week and counted the number of
transactions made on this ATM from Monday to Friday. The information the manager obtained is
given in Table 9.2.5, where the number of users represents the number of transactions on this
ATM on these 5 days. For convenience, transactions are referred as “people” or “users”.

ATM Transactions: From Monday to Friday
DAY MONDAY TUESDAY WEDNESDAY THURSDAY FRIDAY
No. of Users 253 197 204 279 267

At the 5% level of significance, can we reject the null hypothesis that the proportion of people
who use this ATM each of the five days of the week is the same? Assume that this week is typical
of all the weeks in regard to the use of this ATM.

10. ABC Brewery manufactures and distributes three types of beer: low calorie, regular beer, and dark
beer. In an analysis of the market segments of the three beers, the firm’s market research group
has raised the question of whether or not preferences for the beers differ between male and
female beer drinkers. If beer preference is independent of the drinker’s gender, then one
advertising campaign will be initiated for all ABC beers. However, if beer preference depends on
gender, the firm will tailor its promotions toward different target markets. A survey for this
study gave the following results:

Beer Preference when Grouped According to Gender


GENDER BEER PREFERENCE TOTAL
Low Calorie Regular Beer Dark Beer
Male 20 40 80 140
Female 30 30 70 130
TOTAL 50 70 150 270

Is there sufficient evidence at   0.05 that beer preference is related to gender of the
drinker?

11. A firm that sells an accessory for new cars has researched where new cars are being sold. The
accompanying table shows new car sales in four areas. At the 0.05 level of significance, test the
claim that, for new car sales, the manufacturer and the area of sale are independent.

New Car Sales in Four Areas by Manufacturer


MANUFACTURER      AREA OF SALES
                  Davao City   Cagayan de Oro City   Iligan City   General Santos City
Manufacturer 1    545          625                   420           485
Manufacturer 2    649          328                   209           345
Manufacturer 3    458          512                   389           515

12. At the 0.05 level of significance, use the data in the following table on music preference and
IQ to test the claim that IQ and music preference are independent.

Music Preference and IQ


MUSIC PREFERENCE IQ
High Medium Low
Classical 4 26 17
Pop 47 59 25
Rock 83 104 79

13. In an experiment to study the dependence of hypertension on diet, the following data were taken
on 200 individuals.

Hypertension and Diet


CONDITION DIET
Vegetarian Diet Non-vegetarian Diet
With Hypertension 15 46
Without Hypertension 80 59

Test the hypothesis that the presence or absence of hypertension is independent of diet. Use
a 0.05 level of significance.

14. A survey was conducted in Central Maramag University (CMU) and Valencia State University (VSU)
to determine students’ level of satisfaction with the learning acquired from their teachers. For each
university, three hundred students were randomly selected and the data are given in Table 9.2.10.
At the 0.01 level of significance, test the null hypothesis that the proportions of students at each
level of satisfaction are the same for the two universities.

Students’ Level of Satisfaction with the Learning Acquired from Their Teachers


LEVEL OF SATISFACTION CMU VSU
Not satisfied 25 50
Satisfied 50 150
Very satisfied 275 100

SIMPLE LINEAR REGRESSION AND CORRELATION

Regression analysis is a statistical method which makes use of the relationship between two or
more quantitative variables so that one variable, called the dependent variable or response variable,
can be predicted from knowledge of the values of the other variable or variables, called the
independent variables or explanatory variables. A mathematical equation that allows us to predict
values of one dependent variable from known values of one or more independent variables is called
a regression equation.

Purposes of Regression Analysis


i. Predicts the value of a dependent variable based on the value of at least one independent
variable.
ii. Explains the effect of the independent variables on the dependent variable.

Types of Regression Models

[Figure: four scatter diagrams illustrating a positive linear relationship, a negative linear
relationship, a relationship that is not linear, and no relationship]

This chapter focuses on the problem of estimating or predicting the value of a dependent variable
$Y$ on the basis of a known measurement of an independent variable $X$. A scatter diagram is a
graphical presentation of the independent variable (plotted on the horizontal axis) and the dependent
variable (plotted on the vertical axis); it is the easiest way to determine whether a relationship exists
between the two variables. A linear relationship between two variables is one in which the
relationship can be most accurately represented by a straight line. Although a graphic solution is
sometimes used for prediction, it is much more common to predict $Y$ from the equation of the
straight line. The general form of the equation, called the linear regression line equation or simple
linear regression, is
$$Y = a + bX$$
For each $X$, the equation $Y = a + bX$ will predict a value of $Y$. The estimated regression line
is defined by the equation

$$\hat{Y} = a + bX$$
where: $\hat{Y}$ is the predicted value of the dependent variable
       $a$ = $Y$ intercept (value of $Y$ when $X = 0$)
       $b$ = slope of the line
$a$ and $b$ are the estimates of the regression parameters, which are calculated from the available
sample points.

Remark: Through the estimated regression line equation we can now predict any Y value just by
knowing the corresponding X value.

Assumptions on Regression Analysis


i. The values of the independent variable X may be “fixed”, that is, X values may be
selected in advance by the researcher, or they may be obtained without the imposition of
any restriction, in which case, X is a random variable.
ii. The values of X are measured without error.
iii. The variances of the subpopulations of the dependent variable, given different values of the
independent variable, are equal.
iv. The subpopulation of the dependent variable $Y$, given each value of the
independent variable $X$, is normally distributed.
v. The means of the subpopulations of $Y$ all lie on the same straight line (assumption of
linearity).

Estimation of Parameters
Given the sample $\{(x_i, y_i),\ i = 1, 2, \ldots, n\}$, the least squares estimates of the parameters in
the regression line are:
$$b = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}; \qquad a = \bar{y} - b\,\bar{x}$$
where
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \quad \text{and} \quad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
are the means of the sample values.
$a$ is the estimate of the population $Y$ intercept $\beta_0$ and $b$ is the estimate of the population slope
coefficient $\beta_1$.
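The two estimates can be computed directly from the sums in the formulas above. The following is a minimal sketch, not a definitive implementation: the function name `least_squares` and the four sample points are hypothetical, chosen to lie exactly on the line $Y = 1 + 2X$ so that the result is easy to check by hand.

```python
def least_squares(xs, ys):
    """Return (a, b) for the fitted line Y = a + bX, using the sum formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    b = (n * sum_xy - sum_x * sum_y) / denominator   # slope estimate
    a = sum_y / n - b * (sum_x / n)                  # intercept: y-bar minus b times x-bar
    return a, b


# Hypothetical illustration data lying exactly on Y = 1 + 2X
a, b = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # the fitted intercept and slope recover 1 and 2
```

Because the denominator is zero when every $x_i$ is the same, the sketch raises an error in that case instead of dividing by zero.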

Interpretation of the Slope and the Intercept


$\beta_0$ is the average value of $Y$ when the value of $X$ is zero.
$\beta_1$ measures the change in the average value of $Y$ as a result of a one-unit change
in $X$.
$a$ is the estimated average value of $Y$ when the value of $X$ is zero.
$b$ is the estimated change in the average value of $Y$ as a result of a one-unit
change in $X$.

Example: Given the data in the following table, do the following:
a. Find the equation of the regression line.
b. Construct the scatter diagram.
c. Find the point estimate of $Y$ when $X = 113$.

IQ Scores and MMW Midterm Scores of 12 College Students


Student Number IQ (X) MMW Score (Y)
1 110 50
2 112 56
3 118 52
4 119 59
5 122 61
6 125 53
7 127 61
8 130 58
9 132 65
10 134 59
11 136 64
12 138 68

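Solution:
$$n = 12, \quad \sum_{i=1}^{12} x_i = 110 + 112 + \cdots + 138 = 1{,}503, \quad \sum_{i=1}^{12} x_i^2 = 110^2 + 112^2 + \cdots + 138^2 = 189{,}187$$
$$\sum_{i=1}^{12} y_i = 50 + 56 + \cdots + 68 = 706, \quad \sum_{i=1}^{12} y_i^2 = 50^2 + 56^2 + \cdots + 68^2 = 41{,}862, \quad \bar{x} = 125.25, \quad \bar{y} = 58.8333$$
$$\sum_{i=1}^{12} x_i y_i = 110(50) + 112(56) + \cdots + 138(68) = 88{,}857$$
$$b = \frac{12(88{,}857) - (1{,}503)(706)}{12(189{,}187) - (1{,}503)^2} = \frac{5{,}166}{11{,}235} = 0.4598$$
$$a = \bar{y} - b\,\bar{x} = 58.8333 - 0.459813(125.25) = 1.2417$$
(the unrounded slope is used in computing $a$)

a. $\hat{Y} = 1.2417 + 0.4598X$

b. Scatter Plot

[Figure: scatter diagram of the 12 students, with IQ on the horizontal axis (100 to 140) and
SCORE on the vertical axis (40 to 70)]

c. $\hat{Y} = 1.2417 + 0.4598(113) = 53.20$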
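The arithmetic of this example can be checked with a short script. This is only a sketch under the assumption that the twelve (IQ, MMW score) pairs are exactly those in the table; the variable names are illustrative. Carrying full precision gives a slope of about 0.4598, an intercept of about 1.2417, and a predicted score of about 53.20 at $X = 113$, so differences in the last decimal place from hand-rounded values are to be expected.

```python
# Data from the table "IQ Scores and MMW Midterm Scores of 12 College Students"
iq = [110, 112, 118, 119, 122, 125, 127, 130, 132, 134, 136, 138]
score = [50, 56, 52, 59, 61, 53, 61, 58, 65, 59, 64, 68]

n = len(iq)
sum_x = sum(iq)                                  # sum of x_i
sum_y = sum(score)                               # sum of y_i
sum_x2 = sum(x * x for x in iq)                  # sum of x_i squared
sum_xy = sum(x * y for x, y in zip(iq, score))   # sum of x_i * y_i

# Least squares estimates of the slope and the intercept
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)

# Point estimate of Y when X = 113
y_hat = a + b * 113

print(f"sums: {sum_x}, {sum_y}, {sum_x2}, {sum_xy}")
print(f"b = {b:.4f}, a = {a:.4f}, predicted Y at X = 113: {y_hat:.2f}")
```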

Correlation analysis attempts to measure the strength of the relationship between two random
variables by means of a single number called the correlation coefficient. It is concerned only with the
strength of the relationship; no causal effect is implied. The Pearson correlation coefficient ($\rho$)
measures the strength of the linear relationship between two variables $X$ and $Y$. The estimated
sample correlation coefficient, denoted by $r$, is given by:
$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}$$
where $n$ is the sample size.

Sample Observations for Various r Values

[Figure: five scatter diagrams of Y against X, for r = -1, r = -0.6, r = 0, r = 0.6, and r = 1]

Features of $\rho$ and $r$
- unit free
- ranges from -1 to 1
- the closer to -1 the stronger the negative linear relationship
- the closer to 1 the stronger the positive linear relationship
- the closer to 0, the weaker the linear relationship

The sample coefficient of determination, $r^2$, is a number that gives the proportion of the total
variation in the values of the variable $Y$ that can be accounted for or explained by the linear
relationship with the values of the variable $X$.

Example: For the example above, find the sample correlation coefficient and the sample coefficient
of determination, and interpret the results.
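One way to work this example is to apply the formula for $r$ to the IQ and MMW score data from the earlier table. The sketch below assumes those twelve pairs; the function name `pearson_r` is illustrative and not part of the text. It yields $r \approx 0.7796$ and $r^2 \approx 0.6078$, which suggests a fairly strong positive linear relationship, with roughly 61% of the variation in MMW midterm scores accounted for by the linear relationship with IQ.

```python
import math

# Data from the table "IQ Scores and MMW Midterm Scores of 12 College Students"
iq = [110, 112, 118, 119, 122, 125, 127, 130, 132, 134, 136, 138]
score = [50, 56, 52, 59, 61, 53, 61, 58, 65, 59, 64, 68]


def pearson_r(xs, ys):
    """Sample correlation coefficient computed from the sum formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r(iq, score)
r_squared = r ** 2   # sample coefficient of determination

print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```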

Test for a Linear Relationship


Hypotheses: $H_0: \rho = 0$ (no correlation)
            $H_1: \rho \neq 0$ (correlation)

Test statistic: $$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$ with $n - 2$ degrees of freedom

Using the above example, is there evidence of a linear relationship between the students’ MMW
midterm scores and IQ scores at the 0.05 level of significance?

Solution: Step 1. $H_0: \rho = 0$ (no association)
          $H_1: \rho \neq 0$ (association)
Step 2. $\alpha = 0.05$
Step 3. Appropriate test statistic: t test
Step 4. Critical region: Reject $H_0$ if $t_c > t_{\alpha/2,\, n-2}$ or $t_c < -t_{\alpha/2,\, n-2}$, that is, if
        $t_c > t_{0.025,\, 10}$ or $t_c < -t_{0.025,\, 10}$
        $t_c > 2.228$ or $t_c < -2.228$
Step 5. Computation: $$t_c = \frac{0.7796 - 0}{\sqrt{\dfrac{1 - 0.7796^2}{12 - 2}}} = 3.94$$
Step 6. Reject $H_0$ since $t_c = 3.94 > 2.228$.
Step 7. There is sufficient sample evidence that there is a significant
linear relationship between students’ IQ scores and their MMW
midterm scores.
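The computation in Step 5 can be reproduced with a few lines of code. The sketch below assumes the values quoted in the example ($r = 0.7796$, $n = 12$) and takes the critical value $t_{0.025,\,10} = 2.228$ from Step 4 instead of computing it; the function name is illustrative. The square root in the denominator matters: with it the statistic is about 3.94, whereas dividing by $(1 - r^2)/(n - 2)$ without the root would give about 19.88. The decision to reject $H_0$ is the same in either case, since both values exceed 2.228.

```python
import math


def correlation_t_statistic(r, n):
    """t = r / sqrt((1 - r^2) / (n - 2)), with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need more than two observations")
    if abs(r) >= 1:
        raise ValueError("the statistic is undefined when |r| = 1")
    return r / math.sqrt((1 - r ** 2) / (n - 2))


r = 0.7796            # sample correlation coefficient from the example
n = 12                # number of students
t_critical = 2.228    # t_{0.025, 10}, as given in Step 4

t_c = correlation_t_statistic(r, n)
reject_h0 = abs(t_c) > t_critical   # two-tailed decision rule

print(f"t_c = {t_c:.2f}, reject H0: {reject_h0}")
```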
