Block-3
Statistical Analysis
BLOCK 3 INTRODUCTION
Block 3 “Statistical Analysis” deals with the measures of central tendency and dispersion,
sampling distribution and statistical analysis.
Unit 7 “Descriptive Statistics-I” deals with different types of measures of central tendency
and measures of dispersion.
Unit 9 “Sampling Distributions” deals with sampling distribution, the concept of standard
error and the Central Limit Theorem.
Unit 10 “Statistical Analysis-I” deals with the concept of hypothesis, the procedure of
testing a hypothesis, and also the test of hypothesis for large samples for the population
mean and the difference between two population means.
Unit 11 “Statistical Analysis-II” deals with the small sample test procedures for a single
population mean and the difference between two population means based on t-
distribution. Also, the unit delves into the chi-square test for the goodness of fit and
independence of attributes. The test for the equality of two variances using F- distribution
is also discussed in the unit.
Unit 12 “Analysis of Variance” delves into the one-way as well as two-way Analysis of
Variance.
7.1 INTRODUCTION
You have already learnt to construct a frequency distribution for a given set of
numerical data and to draw different graphs using the raw data in hand; these, in
fact, are methods of summarising data in order to get a rough or approximate idea
about the nature, properties and pattern of variation among the values of the
variable. But what if we need to observe whether there is some tendency in the
data set to concentrate around a single numerical figure or, in other words,
what would be the appropriate single value of the variable representing most of
the data set? Obviously, we then have to find such a numerical figure based upon
all the values of the characteristic, and not upon only one or a few of them, in
order to make it the best representative figure. The choice of the single number
or summary statistic used to summarise a given set of data depends upon the
particular characteristic we want to describe. The statistical measures which
describe such characteristics are called measures of central tendency or
measures of location.
In this unit, we are going to discuss different measures of central tendency.
Measures of central tendency play a very important and powerful role in
Statistics, since such a measure reflects most of the features of the entire
mass of data as regards its tendency to concentrate around a single value.
Generally, any measure of central tendency is called an “average”. The average
of a set of data is computed to locate the point, within the range of values of
the characteristic, around which the values tend to cluster.
Different measures of central tendency provide us an idea about the central
value around which most of the values of the data set are concentrated.
However, being values of a variable and not of a constant, all the observations
are not of the same magnitude; rather, they are scattered or dispersed over a
given range of values. This scatteredness of variate-values is a natural
characteristic of all variables, whether they are less scattered or highly
scattered. Thus, even if the central values of two different data sets are the
same, their scatteredness may still differ. In this context, we need measures of
this second property, namely the property of scatteredness or spread of the
data, popularly called the “property of dispersion”.
Section 7.3 highlights the meaning of a measure of central tendency, its
significance for the preliminary analysis of data, and the important properties
which a good measure (average) is supposed to satisfy. Section 7.4
describes different types of arithmetic mean which can be computed over
data sets which are tabulated in different manners along with the merits and
demerits of it. Section 7.5 describes the methods of computing median from
data sets which are available in different forms of tabular representation. The
merits and demerits of it have also been mentioned in the section. Section 7.6
describes the methods of computing the mode from data sets which are available
in different forms of tabular representation. Its merits and demerits have also
been mentioned in the section. Section 7.7 will present a discussion on
different measures of dispersion, like, range, mean deviation, standard
deviation, etc. Also, we shall highlight the essential properties of such
measures which are desirable for a good measure of dispersion. Section 7.8
deals with the simplest kind of measure of dispersion: Range, its calculation
and merits and demerits. Section 7.9 is devoted to the description of Mean
Deviation as a measure of dispersion along with its computation and its
merits and demerits. Section 7.10 describes Standard Deviation and Variance
as very important and most popular measures. The methods of computing of
both are explained using different kinds of data.
Descriptive Statistics-I
7.2 OBJECTIVES
After studying this unit, you should be able to
explain the concept of central tendency;
define an average;
explain the significance of a measure of central tendency;
explain the properties of a good average;
know the different types of measures of central tendency;
calculate arithmetic mean, median and mode for different types of
classification of data; and
calculate range, mean deviation and standard deviation for different
types of data.
7.3 MEASURES OF CENTRAL TENDENCY
It is a bare fact that, whatever be the characteristic under study and the
number of values of the characteristic measured; there is always a tendency
of the data set, so obtained, to concentrate or cluster around a particular
value, which is termed as “central value”. This property of the data tending
towards the central value is known as “property of central tendency”. For
example, if the data on age of primary school children is considered, most of
the ages would be around 5 to 6 years, whereas, most of the ages of
undergraduate students would be around 17-18 years. Thus, the central value
for the first data set is about 5 years and that for the second is about 18 years.
Central values of data sets are generally those figures which might be
considered to be representative figures to describe the ‘property of central
tendency’ of the data sets. In this sense, these are also known as “measures of
Central tendency” or more popularly the “average”.
According to Professor Bowley, averages are “statistical constants which
enable us to comprehend in a single effort the significance of the whole”.
They throw light as to how the values are concentrated in the central part of
the distribution. In other words, an average is a single value which is
considered as the most representative for a given set of data. Measures of
central tendency show the tendency of some central value around which data
tend to cluster.
2. To facilitate comparison
Measures of central tendency enable us to compare two or more
populations by reducing the mass of data in each to one single figure. The
comparison can be made either at the same time or over a period of
time. For example, if a subject has been taught in more than two
classes, then by obtaining the average marks of the students of each
class, a comparison of the average performance of students across the
classes can be made.
7.4 ARITHMETIC MEAN
“Arithmetic mean” (also called “mean”) of any set of data is defined as the
sum of all the observations of the data set, divided by the total number of
observations in the set. It is the most popular and widely used measure of
central tendency.
As per the above definition of mean, mathematically it can be written as

Mean = (Sum of all the observations of the data set) / (Number of observations in the set)    … (7.1)
This is the only basic formula for the computation of mean. However, you will
come across some other formulae for computing the mean in the further text;
but these formulae in no way provide any new method of computation of
mean; rather, all of them are, in fact, reducible to this basic formula on
simplification.
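As an illustrative aside (a Python sketch, not part of the original unit), the basic formula translates directly into a few lines of code; the data used here are the observations of Check Your Progress exercise 1:

```python
def mean(values):
    # Basic formula (7.1): sum of all observations / number of observations
    return sum(values) / len(values)

observations = [5, 8, 12, 15, 20, 30]
print(mean(observations))  # 15.0
```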
We have seen that in Statistics, generally the raw data are summarized in a
number of forms, which are
1. Individual series or “ungrouped data”;
2. Discrete frequency distribution and
3. Continuous frequency distribution.
You may already know that the data summarised in forms 2 and 3 are
popularly termed “grouped data”. Accordingly, you will see below that the
basic formula for the computation of mean, which is used for ungrouped data,
is presented in a different form for the grouped data.
(I) For Individual Series or Ungrouped Data
(a) Direct Method
If x₁, x₂, x₃, …, x_N are the N observations in the data set, then

X̄ = (x₁ + x₂ + … + x_N)/N = (Σ xᵢ)/N, … (7.2)

where X̄ stands for the arithmetic mean (or mean) of the set, X being the
characteristic (variable) under study.
(b) Short-Cut Method
Under this method, we arbitrarily choose two constants, say A and h; A is
used to change the origin of the values and h for changing the scale of values.
The constant A is frequently called the “assumed mean”. This technique is very
much used in daily life.
If x₁, x₂, x₃, …, x_N are the N observations of the variable X, and A and h are
the chosen constants with A the assumed mean, then let the deviations of the
x-values from A respectively be d₁, d₂, d₃, …, d_N, where

dᵢ = xᵢ − A, i = 1, 2, …, N.

Further, let uᵢ = dᵢ/h, where u₁, u₂, …, u_N are the N values of the transformed
variable U. Then we have

X̄ = (Σ xᵢ)/N = {Σ (A + dᵢ)}/N = (NA + h Σ uᵢ)/N = A + h (Σ uᵢ)/N = A + hŪ,

where Ū = (Σ uᵢ)/N is the mean of the transformed variable U.

Thus, the relation between the mean of the original variable X and the mean
of the transformed variable U is obtained as

X̄ = A + hŪ. … (7.3)
We see that the mean is affected by change of origin and scale both, since the
relation obtained above, involves both the constants, A, used to change the
origin and h, the constant used for changing the scale.
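As a quick numerical check (an illustrative sketch, not from the unit), the relation X̄ = A + hŪ can be verified in Python for any choice of the constants A and h, using the scores of Example 1 below:

```python
def mean(values):
    return sum(values) / len(values)

x = [15, 20, 25, 19, 12, 11, 13, 17, 18, 20]  # scores from Example 1
A, h = 18, 1                                  # assumed mean and scale (any choice works)
u = [(xi - A) / h for xi in x]                # transformed values u_i = (x_i - A)/h

# Both computations give the same mean, illustrating relation (7.3)
print(mean(x))          # 17.0
print(A + h * mean(u))  # 17.0
```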
Example 1: Calculate the mean of runs scored, as given below, by ten
players in 20-20 cricket match:
Player 1 2 3 4 5 6 7 8 9 10
Score 15 20 25 19 12 11 13 17 18 20
Solution: We have been given the scores of 10 players in a 20-20 cricket
match. Therefore, we have N = 10.
To calculate the arithmetic mean or average score of the 10 players by the
short-cut method, we choose A = 18 and h = 1. The transformed values
uᵢ = xᵢ − 18 are −3, 2, 7, 1, −6, −7, −5, −1, 0, 2, so that Σ uᵢ = −10. Then

X̄ = A + h (Σ uᵢ)/N = 18 + 1 × (−10)/10 = 18 − 1 = 17 runs.
It can be observed in the example that we have taken h = 1, since the
transformed values are then one-digit numbers, which are easy to handle in the
computation process. This is a wise choice of the constant h; we would instead
take h = 10 or 100 if the deviations were two-digit or three-digit figures,
which are not so easy to handle.
(II) For Discrete Frequency Distribution
By a discrete frequency distribution, we mean a grouped data where all the
values of the concerned variable are given separately associated with the
number of times they occur in the data (you know that the number of times a
value occurs in the data is called the “frequency” of the value). Thus, in fact, it
is also a form of the grouped data. For example, suppose that the k distinct
values in the data are x1, x 2 , x3 ,..., xk with associated frequencies f1,f2 ,f3 ,...,fk
respectively.
Since the value xᵢ appears fᵢ times (i = 1, 2, …, k) in the data, the sum of all
the observations in the set is equal to

(x₁ + x₁ + … + x₁) + (x₂ + x₂ + … + x₂) + … + (x_k + x_k + … + x_k)
   [f₁ times]           [f₂ times]                 [f_k times]

= f₁x₁ + f₂x₂ + … + f_k x_k = Σ fᵢxᵢ.

Therefore, using the basic formula (7.1), we have the formula for computing the
mean of a discrete frequency distribution as

Mean = (Σ fᵢxᵢ)/(Σ fᵢ). … (7.4)
Both the methods, namely, direct method and short-cut method for
computing the mean are also applicable in this case without any new concept.
Therefore, without adding any new thing here, we present Examples 3 and 4
below to illustrate how these methods can be applied for computing mean of
a discrete frequency distribution.
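For illustration (a Python sketch, not part of the unit), formula (7.4) is simply a frequency-weighted mean; the data are those of Example 3 below:

```python
def freq_mean(values, freqs):
    # Formula (7.4): sum(f_i * x_i) / sum(f_i)
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)

marks = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
freqs = [1, 2, 4, 9, 15, 21, 20, 15, 9, 4]
print(freq_mean(marks, freqs))  # 5.35
```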
(a) Direct Method
Let us consider the following discrete frequency distribution for illustration:
Example 3: The following table gives the data relating to the marks (out of
10) of 100 students in a statistics test:
Score(X) 0 1 2 3 4 5 6 7 8 9 Total
Frequency (f) 1 2 4 9 15 21 20 15 9 4 100
X̄ = (Σ fᵢxᵢ)/(Σ fᵢ) = 535/100 = 5.35,

which gives the average marks of the students in the statistics test.
(b) Short-cut Method
The short-cut method of calculating the mean, as described in sub-section
7.4.1 (Ib) can also be applied in this case with the same aim to reduce the
complexity of the computation process. We may similarly choose appropriate
constants A and h, where A is the assumed mean. Defining the transformed
Step III: Find the values fᵢuᵢ for each i and hence find their sum, Σ fᵢuᵢ.

Step IV: Find the mean of the transformed variable U, using the formula

Ū = (Σ fᵢuᵢ)/(Σ fᵢ).

Step V: Use the formula X̄ = A + hŪ for computing the mean of the variable X.
Now let us solve the following exercises for illustrating the calculation
process:
Example 4: The following table gives the wages paid to 125 workers in a
factory. Calculate the arithmetic mean of the wages for the data by using the
short-cut method.
Wage (in Rs) 200 210 220 230 240 250 260
No. of Workers 5 15 32 42 15 12 4
Solution: Taking A = 230 and h = 1, the necessary calculations are shown in the
following table:

Wages (X)   No. of Workers (f)   Deviations U = (X − A)/h   fU
200         5                    −30                        −150
210         15                   −20                        −300
220         32                   −10                        −320
230         42                   0                          0
240         15                   10                         150
250         12                   20                         240
260         4                    30                         120
Total       Σ fᵢ = 125                                      Σ fᵢuᵢ = −260

Using the formula (7.3), the arithmetic mean for the variable X will be

X̄ = A + h (Σ fᵢuᵢ)/(Σ fᵢ) = 230 + 1 × (−260)/125 = 230 − 2.08 = 227.92.
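The short-cut computation of Example 4 can be checked with a small Python sketch (illustrative only):

```python
wages = [200, 210, 220, 230, 240, 250, 260]
workers = [5, 15, 32, 42, 15, 12, 4]

A, h = 230, 1                                      # assumed mean and scale from the example
u = [(x - A) / h for x in wages]                   # deviations U = (X - A)/h
N = sum(workers)                                   # 125
sum_fu = sum(f * ui for f, ui in zip(workers, u))  # -260
mean_wage = A + h * sum_fu / N
print(mean_wage)  # 227.92
```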
Solution: For computing the arithmetic mean for the given continuous frequency
distribution, we first obtain the mid-values (X) of all the classes. After
determining these mid-values, formula (7.4) is used for calculating X̄. The
calculations needed are shown in the following table:
Classes   Mid-values (X)   Frequency (f)   fX
20-25     22.5             8               180.0
25-30     27.5             10              275.0
30-35     32.5             12              390.0
35-40     37.5             20              750.0
40-45     42.5             11              467.5
45-50     47.5             4               190.0
50-55     52.5             5               262.5
Total                      Σ fᵢ = 70       Σ fᵢxᵢ = 2515
X̄ = (Σ fᵢxᵢ)/(Σ fᵢ) = 2515/70 = 35.93.

Alternatively, by the short-cut method with A = 37.5 and h = 5, so that
uᵢ = (xᵢ − 37.5)/5 and Σ fᵢuᵢ = −22,

X̄ = A + h (Σ fᵢuᵢ)/(Σ fᵢ) = 37.5 + 5 × (−22)/70 = 37.5 − 1.57 = 35.93.
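The mid-value calculation for a continuous frequency distribution can likewise be sketched in Python (illustrative, not part of the unit):

```python
classes = [(20, 25), (25, 30), (30, 35), (35, 40), (40, 45), (45, 50), (50, 55)]
freqs = [8, 10, 12, 20, 11, 4, 5]

mids = [(lo + hi) / 2 for lo, hi in classes]  # class mid-values 22.5, 27.5, ...
N = sum(freqs)                                # 70
mean = sum(f * m for f, m in zip(freqs, mids)) / N
print(round(mean, 2))  # 35.93
```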
Here, X̄₁₂ denotes the combined mean of the two groups, X̄₁ and X̄₂ the means of
the first and the second group, and N₁ and N₂ their sizes.

Using the formula, the combined mean of all the 200 students will be

X̄₁₂ = (N₁X̄₁ + N₂X̄₂)/(N₁ + N₂) = (120 × 80 + 80 × 90)/(120 + 80) = 16800/200 = 84.
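The combined-mean formula can be expressed as a one-line helper (an illustrative Python sketch):

```python
def combined_mean(n1, m1, n2, m2):
    # Combined mean of two groups: (N1*X1 + N2*X2) / (N1 + N2)
    return (n1 * m1 + n2 * m2) / (n1 + n2)

# 120 students with mean 80 together with 80 students with mean 90
print(combined_mean(120, 80, 80, 90))  # 84.0
```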
CHECK YOUR PROGRESS 1
Note: i) Check your answers with those given at the end of the unit.
1) Find the arithmetic mean of the observations 5, 8, 12, 15, 20, 30.
2) For the following discrete frequency distribution, find arithmetic
mean.
7.5 MEDIAN
Generally, in a raw data set, values of the study variable are given in a
haphazard way, that is, in the order in which values are collected in the
experiment or survey and, therefore, it does not follow any rule of
arrangement. However, for any statistical analysis, it is sometimes advisable
to arrange the data in the form of an ‘array’, that is, arranging the values
either in an ascending or descending order of magnitude. In an array, every
observation holds a certain rank. For example, when arranged in ascending
order of magnitude, the very first value is the smallest one, the second value
is the second smallest value, and so on; but when arranged in the descending
order, the very first value is the largest value, the second value is the second
largest value and so on. Thus, it is the way of arrangement which changes the
rank of values in the series.
“Median” of a set of data is a positional value. The Median is defined as the
middle-most value of the variable, when values are arranged either in
ascending or descending order of magnitude. According to Connor, “The
median is the value of the variable which divides the group or series into two
equal parts, one part comprising all the values greater, and the other, all the
values less than the median”. Since the median denotes the central position in a
series of values, it is called a position average.
Case I: When there is an odd number of values
Let there be an odd number of observations, so that N = 2k + 1 for some k, and
let the values arranged in ascending order of magnitude be
x₁ ≤ x₂ ≤ x₃ ≤ … ≤ xₖ ≤ xₖ₊₁ ≤ xₖ₊₂ ≤ … ≤ x₂ₖ₊₁.
Clearly, in this series the middle-most value is xₖ₊₁, and there is an equal
number of values on both sides of it. According to the definition of
median, therefore, xk+1 is the median of the series. Thus, if there are 2k+1
values (an odd number of values) in the data, the median is the value which
occupies the (k +1)th place or, equivalently, [(N + 1)/2]th position, when
values are arranged either in ascending or descending order of magnitude.
Example 8: Let us consider the following set of raw data:
5, 12, 45, 8, 21, 32, 5, 36, 16, 8, 20, 24, 54, 18, 44, 56, 20, 30, 44
Solution: Here, N = 19 = 2 × 9 + 1, which means k = 9.
Arranging these values in ascending order of magnitude, we get the series as:
5, 5, 8, 8, 12, 16, 18, 20, 20, 21, 24, 30, 32, 36, 44, 44, 45, 54, 56
Then, since the value 21 occupies the middle-most position (that is, 10th
position), it is the median of the set.
Case II: When there are even number of values
Let there be an even number of observations. Then, N can be written as N =
2k for k = 1, 2, 3, … . If these values are arranged in ascending order of
magnitude as
x₁ ≤ x₂ ≤ x₃ ≤ … ≤ xₖ ≤ xₖ₊₁ ≤ xₖ₊₂ ≤ … ≤ x₂ₖ.
In such situations, we observe that there is no single middle-most value in the
series, rather, we will get two middle-most values which would be xk and xk+1
occupying the (N/2)th and [(N/2)+1]th positions. Following the definition of
the median, therefore, we consider neither of the values as the median, but we
take the arithmetic mean of both the middle-most values as the median of the
series. In other words, in such situations the median of the series, denoted by
Md, will be given by
Md = [(N/2)th value + {(N/2) + 1}th value] / 2.
Example 9: Calculate the median for the following data:
7, 8, 9, 3, 4, 10
Solution: First we arrange given data in ascending order as 3, 4, 7, 8, 9, 10
Here, N = 6 = 2 × 3 (even), so k = 3. Therefore, we get the median as

Md = (3rd value + 4th value)/2 = (7 + 8)/2 = 7.5.
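Both the odd-N and even-N cases can be captured in one small Python function (an illustrative sketch, not part of the unit):

```python
def median(values):
    s = sorted(values)                # arrange the values in ascending order
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                 # odd N: the [(N+1)/2]-th value
    return (s[mid - 1] + s[mid]) / 2  # even N: mean of the two middle-most values

print(median([5, 12, 45, 8, 21, 32, 5, 36, 16, 8,
              20, 24, 54, 18, 44, 56, 20, 30, 44]))  # 21  (Example 8)
print(median([7, 8, 9, 3, 4, 10]))                   # 7.5 (Example 9)
```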
(ii) Obtain the “less than type” cumulative frequencies F₁ = f₁, F₂ = f₁ + f₂,
…, Fₖ = f₁ + f₂ + … + fₖ corresponding to each value, and list them in a
separate column. We have F₀ = 0 and Fₖ₊₁ = Fₖ. The last cumulative
frequency will coincide with the total number of values, that is, N;
Now, as before, we shall consider two cases: N is odd and N is even.
(iii) If N is odd, then find the value of (N + 1)/2. If the inequality
Fᵢ₋₁ < (N + 1)/2 ≤ Fᵢ is satisfied for the ith value (i = 1, 2, …, k), then the
ith value will be the median of the data.
(iv) If N is even, then we know that there will be two middle-most values,
the mean of which would be the median. Find the values of N/2 and
(N/2) + 1. The first middle-most value (the ith value) will be obtained using
the inequality Fᵢ₋₁ < N/2 ≤ Fᵢ. The second middle-most value (the jth value)
will be obtained by using the inequality Fⱼ₋₁ < (N/2) + 1 ≤ Fⱼ, where
j = 1, 2, …, k.
Note: Since, for the case of N even, there are two middle-most values,
care must be taken in finding the second middle-most value (the jth value)
after finding the first one (the ith value). It may happen that both are the
same value, that is, j = i, or else j = i + 1.
Example 10: Find the median size of the following series:
Size (X) 4 5 6 7 8 9 10
Frequency (f) 06 12 15 28 20 14 05
Solution: Here, N = Σ fᵢ = 100, and the less-than-type cumulative frequencies
are F₁ = 6, F₂ = 18, F₃ = 33, F₄ = 61, F₅ = 81, F₆ = 95, F₇ = 100.
Since N is even, we compute N/2 and (N/2) + 1, which are 50 and 51.
The value N/2 = 50 satisfies the condition F₃ < 50 ≤ F₄, where F₃ and F₄ are
respectively 33 and 61; hence the first middle-most value is the 4th value, 7.
Also, the value (N/2) + 1 = 51 satisfies F₃ < 51 ≤ F₄, so it too corresponds to
the value 7. Their mean is 7 as well. Thus, the median is 7.
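The cumulative-frequency procedure of Example 10 can be sketched in Python as follows (illustrative only):

```python
from itertools import accumulate

sizes = [4, 5, 6, 7, 8, 9, 10]
freqs = [6, 12, 15, 28, 20, 14, 5]

N = sum(freqs)                 # 100, an even number
cum = list(accumulate(freqs))  # less-than-type cumulative frequencies

def value_at(position):
    # Smallest size whose cumulative frequency reaches the given position
    for size, F in zip(sizes, cum):
        if position <= F:
            return size

md = (value_at(N // 2) + value_at(N // 2 + 1)) / 2
print(md)  # 7.0
```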
(III) For Continuous Frequency Distribution
CHECK YOUR PROGRESS 2
Note: i) Check your answers with those given at the end of the unit.
7.6 MODE
The word ‘Mode’ comes from the French word ‘La Mode’ which means the
fashion. In statistical language, the Mode is that observation in a distribution
which occurs most often, i.e., the value which is most typical. According to
Croxton and Cowden, the mode of a distribution is the value around which
the items tend to be most heavily concentrated. Thus, mode or the modal
value is the value in a series of observations which occurs with highest
frequency. For example, when we say that the average size of shoes which
are sold in maximum number in a shop is 7, we talk about the modal size of
shoes.
In real-life situations, we generally observe data with three types of mode. A
data set may be (i) unimodal, that is, having only one mode; (ii)
bimodal, that is, having two modes; or (iii) multi-modal, that is, having more
than two modes. To illustrate this, let the scores of 10 batsmen be 40, 45, 58, 46,
48, 48, 58, 58, 70 and 81. Then the only mode for the data set is 58 runs,
because it appears three times. So, it is a unimodal data set or unimodal
distribution. Again, let the runs scored by 10 batsmen of other team be 42,
47, 58, 79, 72, 81, 95, 50, 56 and 69. Then, there is no mode in the above
series of scores as every score has equal frequency.
[Figure: frequency bar diagram of a unimodal distribution, with frequency plotted against the values of X.]
Consider then a third series of scores, say: 86, 62, 58, 58, 48, 47, 48, 81, 59
and 50. Here, we observe that there are two modes 48 and 58, as they both
occur twice while other values occur only once. In such a case the
distribution is called bimodal. But still, in case of a bimodal distribution, the
two modes need not always have the same frequency as shown in the figure
given below:
[Figure: frequency bar diagram of a bimodal distribution in which the two modes have unequal frequencies.]
However, it is clear from the above discussion that in some cases mode
may be absent while in others there may be one or more modes. If the
frequency curve of the frequency distribution is drawn, the mode is the value
of the variable at which curve reaches its maximum, i.e., it is the value where
the concentration of observations is maximum. In case of a bimodal
distribution, the concentration of observation occurs at two points and thus,
we have two modes.
Here, it is important to note that the occurrence of more than one mode in a
distribution may be useful for further statistical analysis, but the mode as a
measure of central tendency has little significance in the case of bimodal or a
trimodal data.
Mode is an important ‘average’ in many situations, especially for data related
to marketing studies; for example, when we talk about a brand of shirt which is
most popular amongst the younger generation, a song which is listened to by the
largest section of society, etc.
X 10 12 14 16 18 20 22
f 4 6 10 11 21 10 5
Solution: In the above discrete frequency distribution the value 18 has the
maximum frequency 21. Therefore, 18 is the mode of the given discrete
distribution.
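Picking the value with the highest frequency is a one-liner with Python's `Counter` (an illustrative sketch, not part of the unit), using the distribution of the example above:

```python
from collections import Counter

# Frequency distribution from the example: value -> frequency
counts = Counter({10: 4, 12: 6, 14: 10, 16: 11, 18: 21, 20: 10, 22: 5})
mode = counts.most_common(1)[0][0]  # value with the maximum frequency
print(mode)  # 18
```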
CHECK YOUR PROGRESS 3
Note: i) Check your answers with those given at the end of the unit.
(III) For Continuous Frequency Distribution
Solution: In this example, the highest frequency 45 lies in the class (40-50)
hence, this is the modal class. For determining the mode value in the modal
class, we use the formula as:
Mode = L + h (f₁ − f₀)/(2f₁ − f₀ − f₂)

Here, f₀ = 25, f₁ = 45, f₂ = 11, L = 40 and h = 10.

Thus, Mode = 40 + 10 × (45 − 25)/(2 × 45 − 25 − 11) = 40 + 3.70 = 43.70.
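The grouped-mode formula may be written as a small helper (an illustrative Python sketch; the argument names are ours):

```python
def grouped_mode(L, h, f1, f0, f2):
    # Mode = L + h*(f1 - f0) / (2*f1 - f0 - f2), for the modal class starting at L
    return L + h * (f1 - f0) / (2 * f1 - f0 - f2)

# Modal class 40-50 from the example: L = 40, h = 10, f0 = 25, f1 = 45, f2 = 11
print(round(grouped_mode(40, 10, 45, 25, 11), 2))  # 43.7
```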
Note: i) Check your answers with those given at the end of the unit.
7.8 RANGE
Range is the simplest measure of dispersion. It is defined as the difference
between the largest value (L) and the smallest value (S) of the variable in the
distribution. Its merit lies in its simplicity. It can be defined as
Range (R) = L − S
where, L: Largest value of the variable, and
S: Smallest value of the variable, as given in the data set.
It should be mentioned here that whether the data are given in the form of a
series of the values or in the form of grouped data, in both the cases range is
well defined and, therefore, can be computed easily.
Example 15: Find the range of the distribution 6, 8, 2, 10, 15, 5, 1, 13.
Solution: For the given distribution, the largest value of variable is 15 and
the smallest value is 1.
Hence, Range (R) = Largest value (L) - Smallest value (S) =15-1= 14.
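As a trivial Python sketch (not part of the unit), the range of the data in Example 15 is:

```python
def value_range(values):
    # Range = largest value (L) - smallest value (S)
    return max(values) - min(values)

print(value_range([6, 8, 2, 10, 15, 5, 1, 13]))  # 14
```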
Example 16: Marks of 10 students in Mathematics and Statistics are given
below:
Marks in Mathematics 25 40 30 35 21 45 23 33 10 29
Marks in Statistics 30 39 23 42 20 40 25 30 18 19
Note: i) Check your answers with those given at the end of the unit.
10) Calculate the range for the following data:
60, 65, 70, 12, 52, 40, 48
11) Calculate range for the following frequency distribution:
Class 0-10 10-20 20-30 30-40 40-50
Frequency 2 6 12 7 3
M. D. = (Σ |xᵢ − A|)/N,

where Σ |xᵢ − A| = sum of the absolute deviations of the values taken from the
constant A.

As mentioned above, it can be shown that if we take A = median of the data,
the quantity Σ |xᵢ − A| is least as compared to any other choice of A. But in
practice, the mean deviation is computed by taking the constant A as the
mean of the data, since the mean is the most frequently used measure of central
tendency. If A is the mean of the data, the M. D. would be given by

M. D. = (Σ |xᵢ − X̄|)/N,

where Σ |xᵢ − X̄| = sum of the absolute deviations taken from the mean.

Similarly, taking A as the median Md,

M. D. = (Σ |xᵢ − Md|)/N,

where Σ |xᵢ − Md| = sum of the absolute deviations taken from the median.

In case the data are given in the form of a frequency distribution, the formulae
for M. D. can be put as:

M. D. = (Σ fᵢ |xᵢ − X̄|)/(Σ fᵢ)   and   M. D. = (Σ fᵢ |xᵢ − Md|)/(Σ fᵢ).
Let us compute the mean deviation from the mean for the following frequency
distribution:

X   1   2   3   4   5   6   7
f   3   5   8   12  10  7   5

The necessary calculations are shown in the following table:

X       f     fX     |x − X̄|   f |x − X̄|
1       3     3      3.24      9.72
2       5     10     2.24      11.20
3       8     24     1.24      9.92
4       12    48     0.24      2.88
5       10    50     0.76      7.60
6       7     42     1.76      12.32
7       5     35     2.76      13.80
Total   50    212    12.24     67.44

X̄ = (Σ fᵢxᵢ)/(Σ fᵢ) = 212/50 = 4.24

and M.D. = (Σ fᵢ |xᵢ − X̄|)/(Σ fᵢ) = 67.44/50 = 1.348.
Example 19: Compute the mean deviation (M.D.) from the following data.

Solution: We use the technique of change of origin and scale for calculating
the mean of the data. We choose the assumed mean A as 50 and the constant h
for changing the scale as 20. The transformed values, denoted by
d′ = (x − 50)/20, are computed and presented in the fourth column of the table,
along with the other necessary calculations:

Classes    X     f     d′ = (x − 50)/20   fd′    |x − X̄|   f |x − X̄|
0–20       10    5     −2                 −10    41.07     205.35
20–40      30    50    −1                 −50    21.07     1053.50
40–60      50    84    0                  0      1.07      89.88
60–80      70    32    1                  32     18.93     605.76
80–100     90    10    2                  20     38.93     389.30
100–120    110   6     3                  18     58.93     353.58
Total            N = 187                  Σ fd′ = 10       Σ f |x − X̄| = 2697.37

X̄ = A + h (Σ fd′)/N = 50 + 20 × 10/187 = 50 + 1.07 = 51.07

Mean Deviation = (Σ f |x − X̄|)/N = 2697.37/187 = 14.42.
Note: i) Check your answers with those given at the end of the unit.
12) Following are the marks of 7 students in statistics. Find the mean
deviation.
16, 24, 13, 18, 15, 10, 23
13) Find the mean deviation for the following distribution:
Class 0-10 10-20 20-30 30-40 40-50
Frequency 5 8 15 16 6
RMSD = √[(Σ (xᵢ − A)²)/N],

where A may be any number. Since the deviations are taken from the constant
A, we call this formula the RMSD with respect to the value A.
7.10.2 Standard Deviation (S.D.)
Standard deviation (S.D.), popularly denoted by the letter σ, is a special case
of RMSD. It is defined as

S.D. (σ) = √[(Σ (xᵢ − X̄)²)/N],

where X̄ is the mean of the data set. That is, the standard deviation is the
positive square root of the mean of the squared deviations taken from the mean
of the data. You can observe that S.D. is the same as RMSD when A = mean of the
data (a constant).
The question is that why the constant A has been taken as mean of the
distribution and not any other measure of central tendency. The reason is that
RMSD is the least when A = X . Let us show this fact mathematically as
follows:
We have

RMSD = √[(1/N) Σ (xᵢ − A)²] = √[(1/N) Σ {(xᵢ − X̄) + (X̄ − A)}²]

= √[(1/N) Σ (xᵢ − X̄)² + (X̄ − A)² + (2(X̄ − A)/N) Σ (xᵢ − X̄)]

= √[(1/N) Σ (xᵢ − X̄)² + (X̄ − A)²], since Σ (xᵢ − X̄) = 0.

Now, if A = X̄, this reduces to √[(1/N) Σ (xᵢ − X̄)²], which is defined as S.D.
The term (X̄ − A)² being non-negative, we see that RMSD attains its least value
when the constant A is taken as the mean of the data, and then the RMSD is
called the Standard Deviation (σ).
The concept of standard deviation was first introduced by Karl Pearson. It
satisfies most of the properties of a good measure of dispersion. However,
S.D. is not unit free; its unit is the same as that of the x-values.
Alternative formula for computation of Standard Deviation:
In order to simplify the calculation process and also to reduce the error of
rounding to the minimum extent, we can use an alternative formula for the
calculation purpose of Standard Deviation, which is obtained below:
We have

Standard Deviation (σ) = √[(1/N) Σ (xᵢ − X̄)²] = √[(1/N) Σ (xᵢ² − 2xᵢX̄ + X̄²)]
= √[(1/N) Σ xᵢ² − X̄²].
If the data are given in the grouped data form, either in the discrete or
continuous frequency distribution, the formula for S. D. becomes:
178
1 k Descriptive Statistics-I
S tan dard Deviation( ) fi (x i X)2 i 1,2,...,k
N i 1
k
where, there are k class intervals; N fi and fi is the frequency of the ith
i1
class interval.
Proceeding in the same manner as we did above to find the alternative
formula of standard deviation; we can derive alternative formula for standard
deviation when data are available in the grouped form, which is given as
k k
1 2
S tan dard Deviation ( ) fx i i X2 where, N fi
N i 1 i1
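Both the definitional and the alternative formula for the grouped S.D. can be compared numerically (an illustrative Python sketch, using the frequency data of the mean-deviation example above):

```python
from math import sqrt

def sd_definitional(x, f):
    # sqrt( sum(f * (x - mean)^2) / N )
    N = sum(f)
    mean = sum(fi * xi for xi, fi in zip(x, f)) / N
    return sqrt(sum(fi * (xi - mean) ** 2 for xi, fi in zip(x, f)) / N)

def sd_alternative(x, f):
    # sqrt( sum(f * x^2)/N - mean^2 )
    N = sum(f)
    mean = sum(fi * xi for xi, fi in zip(x, f)) / N
    return sqrt(sum(fi * xi * xi for xi, fi in zip(x, f)) / N - mean ** 2)

x = [1, 2, 3, 4, 5, 6, 7]
f = [3, 5, 8, 12, 10, 7, 5]
print(round(sd_definitional(x, f), 4), round(sd_alternative(x, f), 4))  # both give the same value
```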
7.10.3 Variance
Variance is another frequently used measure of dispersion. Actually, variance
is nothing but the square of the standard deviation and hence, is denoted by
σ2 .
Variance is the average of the square of deviations of the values taken from
the respective mean. Thus, variance is defined as
Variance (σ²) = (1/N) Σ (xᵢ − X̄)²,
where, all symbols have their usual meanings as given for the formula of
standard deviation above. The unit of the variance will be the square of the
variable X.
Alternative formula for computation of Variance:
Since variance is the square of the standard deviation, the same computational
problem arises for variance also. Therefore, an alternative formula for
variance must likewise be used for computation purposes.
Squaring the alternative formula for standard deviation, we have alternative
formula for calculation of the variance, which is given by
σ² = (1/N) Σ xᵢ² − X̄².

Similarly, the alternative formula for computing σ² when the data are given
in the grouped form will be

σ² = (1/N) Σ fᵢxᵢ² − X̄², where X̄ = (Σ fᵢxᵢ)/(Σ fᵢ) and N = Σ fᵢ.

For example, for a data set with N = 10, Σ xᵢ = 140 and Σ xᵢ² = 2234,

X̄ = (Σ xᵢ)/N = 140/10 = 14;

and hence, S.D. (σ) = √[(Σ xᵢ²)/N − X̄²] = √(2234/10 − 14²) = √(223.4 − 196)
= √27.4 = 5.2345.
Solution: Let us use the short-cut method for the calculation. We choose
A = 11 and h = 2. Then the calculations are shown in the table given below:

Age     Frequency (f)   Mid-value (X)   d = (X − 11)/2   fd           fd²
16–18   20              17              3                60           180
…       …               …               …                …            …
Total   N = 550                                          Σ fd = 130   Σ fd² = 1250

σ = h √[(Σ fd²)/N − {(Σ fd)/N}²] = 2 √[1250/550 − (130/550)²]
= 2 √(2.272 − 0.055) = 2.977.
Note: i) Check your answers with those given at the end of the unit.
                                            Worker A   Worker B
Mean time of completing the job (minutes)   25         20
Variance                                    36         16

Solution: The standard deviations are σ₁ = √36 = 6 and σ₂ = √16 = 4.

For worker A, Coefficient of Variation = (σ₁/X̄₁) × 100 = (6/25) × 100 = 24%.

For worker B, Coefficient of Variation = (σ₂/X̄₂) × 100 = (4/20) × 100 = 20%.

Since worker B has the smaller coefficient of variation, worker B is more
consistent.
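The comparison can be reproduced with a short Python sketch (illustrative only):

```python
from math import sqrt

def coeff_of_variation(mean, variance):
    # CV = (sigma / mean) * 100, a unit-free measure of relative consistency
    return sqrt(variance) / mean * 100

print(coeff_of_variation(25, 36))  # worker A: 24.0
print(coeff_of_variation(20, 16))  # worker B: 20.0
```

The smaller coefficient of variation (worker B) indicates the more consistent worker.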
Note: i) Check your answers with those given at the end of the unit.
16) The following table shows the results of an analysis relating to the
prices of two shares X and Y, that were recorded during the year
2000:
17) If n = 10, Σ x = 24 and Σ x² = 200, find the coefficient of variation.
7.15 ANSWERS TO CHECK YOUR PROGRESS
1) For calculating the arithmetic mean, we add all the observations and
divide by 6 as follows:
X̄ = (Σ xᵢ)/n = (5 + 8 + 12 + 15 + 20 + 30)/6 = 90/6 = 15.
2) For calculation of mean, the following table is made:
Marks   f           X        d = (X − 25)/10   fd
0-10    6           5        −2                −12
10-20   9           15       −1                −9
20-30   17          25 = A   0                 0
30-40   10          35       +1                +10
40-50   8           45       +2                +16
Total   Σ fᵢ = 50                              Σ fᵢdᵢ = 5

Now X̄ = A + h (Σ fᵢdᵢ)/N = 25 + 10 × 5/50 = 25 + 1 = 26.
4) (1) Here, N = 7, an odd number. After arranging the values in
ascending order, we get 2, 3, 6, 8, 10, 12, 15 and, therefore, the
Median value will be
Median = value of [(N+1)/2]th item= value of 4th item = 8.
(2) Here N= 8, an even number, so arranging values in ascending
order we get the values as 2, 3, 5, 6, 8, 10, 12, 15 and, therefore,
Median = Mean of [N/2]th and [(N/2) +1]th values = (6+8)/2 = 7.
5) For the given data, we compute the less than cumulative frequencies as
follows:
Marks                  5   10  15  20  25  30  35  40  45  50
No. of Students        2   5   10  14  16  20  13  9   7   4
Cumulative Frequency   2   7   17  31  47  67  80  89  96  100
Since, here N is even, so there will be two middle-most values, the
first is (N/2)th value and the second is [(N/2) + 1]th value.
Here, N/2 = 50 and [(N/2) + 1] = 51. Thus, median is equal to the
mean of the 50th and 51st values. We see that (N/2)th value satisfies the
condition F5< (N/2)< F6.
So, (N/2)th value will be equal to 30. Also, [(N/2)+1] value is 30.
Therefore, their mean is also 30. So, median is 30.
6) First, we shall calculate the cumulative frequency distribution:

Marks    f        Cumulative Frequency
0-10     5        5
10-20    10       15
20-30    15       30 = C
30-40    20 = f   50
40-50    12       62
50-60    10       72
60-70    8        80
         N = 80
Here N/2 = 40. Since, 40 is not in the cumulative frequency so, the
class corresponding to the next cumulative frequency 50 is median
class. Thus 30-40 is median class.
Median = L + ((N/2 − C)/f) × h = 30 + ((40 − 30)/20) × 10 = 35
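The interpolation just used can be written as a short function. A sketch using the same symbols as the formula (L, C, f, h); the parameter names are illustrative:

```python
def grouped_median(lower, n, cum_before, freq, width):
    """Median = L + ((N/2 - C)/f) * h for the median class [L, L + h)."""
    return lower + ((n / 2 - cum_before) / freq) * width

# Answer 6: median class 30-40 with C = 30, f = 20, N = 80, h = 10
print(grouped_median(lower=30, n=80, cum_before=30, freq=20, width=10))  # 35.0
```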
7) For calculation of mode, the following table is made:

Size        1   2   3   4   5   6   7   8   9   10
Frequency   3   3   3   4   2   3   7   2   2   1

Since the size 7 has the maximum frequency 7, the mode is 7.
8) First, we prepare the frequency table as

X   2   3   4   7   9   10   12
f   3   2   2   7   9   4    6

This table shows that the value 9 has the maximum frequency. Thus, the mode is 9.
9) From the given frequency distribution, the highest frequency is 9, so the
modal class is 30-40 and, therefore, we have L = 30, f₁ = 9, f₀ = 7, f₂ = 4
and h = 10. Applying the formula,

Mode = L + ((f₁ − f₀)/(2f₁ − f₀ − f₂)) × h = 30 + ((9 − 7)/(2 × 9 − 7 − 4)) × 10 = 30 + 2.86 = 32.86
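The modal-class formula translates the same way. A sketch with the figures of answer 9; the parameter names are illustrative:

```python
def grouped_mode(lower, f1, f0, f2, width):
    """Mode = L + ((f1 - f0)/(2*f1 - f0 - f2)) * h for the modal class."""
    return lower + ((f1 - f0) / (2 * f1 - f0 - f2)) * width

# Answer 9: modal class 30-40, f1 = 9, f0 = 7, f2 = 4, h = 10
print(round(grouped_mode(lower=30, f1=9, f0=7, f2=4, width=10), 2))  # 32.86
```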
11) First, convert the given inclusive class limits into continuous classes
(exclusive classes) as shown in the table:

Inclusive Classes   Continuous Classes   Frequency
6–10 5.5–10.5 7
11–15 10.5–15.5 8
16–20 15.5–20.5 15
21–25 20.5–25.5 35
26–30 25.5–30.5 18
31–35 30.5–35.5 7
36–40 35.5–40.5 5
12) For calculating the mean deviation, the following table is made:

x     x − X̄   |x − X̄|
16    −1       1
24    +7       7
13    −4       4
18    +1       1
15    −2       2
10    −7       7
23    +6       6
Σxᵢ = 119     Σ|xᵢ − X̄| = 28

X̄ = Σxᵢ/N = 119/7 = 17  and  M.D. = Σ|xᵢ − X̄|/N = 28/7 = 4.
13) For the calculation of mean deviation, the totals obtained from the table
are Σfᵢ = 50, Σfᵢxᵢ = 1350 and Σfᵢ|xᵢ − X̄| = 472. Therefore,

X̄ = Σfᵢxᵢ / Σfᵢ = 1350/50 = 27  and  M.D. = Σfᵢ|xᵢ − X̄| / Σfᵢ = 472/50 = 9.44.
14) We have X̄ = Σxᵢ/N = 210/7 = 30 and, therefore,

σ = √( (1/N) Σxᵢ² − X̄² ) = √(8118/7 − (30)²) = √(1159.7 − 900) = √259.7 ≈ 16.12.
Similarly, for the grouped data, with Σfᵢdᵢ = 100, Σfᵢdᵢ² = 6800 and N = 50,

σ = √( Σfᵢdᵢ²/N − (Σfᵢdᵢ/N)² ) = √(6800/50 − (100/50)²) = √(136 − 4) = √132 = 11.49.
17) We have

σ² = (1/N) Σxᵢ² − X̄² = 200/10 − (4)² = 20 − 16 = 4

so that σ = √4 = 2 and hence C.V. = (σ/X̄) × 100 = (2/4) × 100 = 50%.
UNIT 8 DESCRIPTIVE STATISTICS-II
Structure
8.1 Introduction
8.2 Objectives
8.3 Correlation Analysis
8.3.1 Types of Correlation
8.3.2 Measures of Correlation
8.1 INTRODUCTION
In the previous unit, you have studied various measures, such as, measures of
central tendency and measures of dispersion which are virtually used for
analyzing the nature of a single variable, say X, and its distribution in many
aspects. These measures, in fact, disclose many such aspects of the variable,
which are considered to be basic properties of the variable for its further
study. However, when we come across the simultaneous study of two or
more variables, besides the separate study of these variables, one more thing
is important and necessary to note: whether the variables exhibit some kind
of relationship between them or whether they do not exert any kind of
impact on each other, that is, whether the variables are independent of each
other. For example, take the case of the price, demand and supply of
commodities in a market. We know that, according to the laws of economics,
the variables 'price' and 'demand' exhibit a direct relationship, since
increasing demand results in an increase in price also. On the other hand,
'supply' and 'price' have an indirect relationship, since the more the supply,
the less would be the price of commodities. In statistical terminology, if there
exists any type of relationship between two variables, we say that they are
mutually 'correlated' to each other. Thus, the word 'correlation', in fact,
describes a specific property of two variables.
If we state that the relationship between the two variables under consideration
is linear, it means that this relation may be represented in the
mathematical form Y = b0 + b1X, which is the equation of a straight line,
where Y and X are respectively the dependent and independent variables and
b0 and b1 are some suitable constants (we shall use the word ‘parameter’ for
such constants in further text) so chosen that the straight line passes through
maximum number of points (xi, yi) for i = 1, 2, …, N; on the scatter diagram.
This is the same as saying that if our purpose is to estimate or to find the
value of the dependent variable Y for a given value of the independent
variable X through the straight line obtained, then the actual value of Y and
the corresponding estimated value must be as close to each other as possible,
that is, the difference between the actual and estimated values must be
minimum for each Y-value. Such a line used for this purpose is generally
called a "Regression Line", and the theory behind regression lines and their
applications to practical problems is termed "Linear Regression Analysis".
Section 8.3 of this unit discusses the concepts and different types of
correlations, supported with a number of practical examples that describe the
situations under which they might be observed. Section 8.4 describes the
'scatter diagram', which helps us to explore diagrammatically the nature of
the relationship between the variables. Section 8.5 describes a measure of
correlation, popularly known as the "Correlation Coefficient", which provides
the type as well as the magnitude of the correlation between two variables.
Some of its salient properties are mentioned and theoretically proved.
Section 8.6 discusses the rank correlation coefficient, which provides the
magnitude of the correlation between two attributes. Section 8.7 explains
the basic meaning of the word 'regression'. Section 8.8 describes the
mathematical procedure of obtaining the linear regression lines in the case of
two variables. Some of the properties of linear regression lines with one
independent variable are also discussed. Section 8.9 provides the meaning of
regression coefficients in regression analysis and their significance. Some
important properties of regression coefficients are also discussed.
8.2 OBJECTIVES
After studying this unit, you should be able to:
explain the concept of correlation between two variables and identify
different types of correlations;
describe the scatter diagram and its use in correlation analysis;
explain the computation method and describe the important properties
of the correlation coefficient;
explain regression, regression analysis and linear regression;
derive expressions of lines of regression of Y on X and X on Y
mathematically on the basis of least squares principle;
explain how to apply both the regression lines in practical problems
for predicting the value of one variable given the value of the other;
and
explain the salient properties of regression coefficients.
In all these situations, and some others, we try to find out whether there is
some type of relation between the two variables and, if so, its magnitude.
Correlation analysis then comes into play for answering the question of
whether there is any relationship between one variable and the other.
For instance, generally, the weight of a person is seen to increase with height
up to a certain age; prices of commodities seem to vary with supply; pressure
exerted on a flexible body generally reduces its volume and, thus, are seen to
have been inversely related; cost of industrial production varies with the cost
of raw material and so on. In all these cases, the change in the value of one
variable appears to be accompanied by a change in the values of other
variable. If this happens, the variables are said to be correlated, and this
relationship is called correlation or covariation. The study and measurement
of the extent or degree of relationship between two or more variables, along
with the nature of the relationship between them, is called correlation
analysis.
where Cov(X, Y) stands for the covariance between X and Y, which is
defined as:

Cov(X, Y) = (1/N) Σᵢ₌₁ᴺ (xᵢ − X̄)(yᵢ − Ȳ)

Similarly, V(Y) is defined as

V(Y) = (1/N) Σᵢ₌₁ᴺ (yᵢ − Ȳ)²

where N is the number of paired observations.

Using the definitions of covariance and variance as given above, the
correlation coefficient r may be defined as:

r = Corr(X, Y) = [ (1/N) Σᵢ₌₁ᴺ (xᵢ − X̄)(yᵢ − Ȳ) ] / [ √((1/N) Σᵢ₌₁ᴺ (xᵢ − X̄)²) · √((1/N) Σᵢ₌₁ᴺ (yᵢ − Ȳ)²) ]   … (2)

If V(X) and V(Y) are denoted by the notations σX² and σY² respectively, then
the formula of the coefficient of correlation reduces to

r = Corr(X, Y) = [ (1/N) Σᵢ₌₁ᴺ (xᵢ − X̄)(yᵢ − Ȳ) ] / (σX σY).   … (3)
It can be seen that, given the N values y₁, y₂, …, yN of the variable Y and
the corresponding values x₁, x₂, …, xN of the variable X, it is easy to
compute the quantities Σᵢ₌₁ᴺ (xᵢ − X̄)(yᵢ − Ȳ), Σᵢ₌₁ᴺ (yᵢ − Ȳ)² and
Σᵢ₌₁ᴺ (xᵢ − X̄)², and hence to compute the value of r.
The Karl Pearson's correlation coefficient r is seen to be a unit-free quantity,
the reason being that the numerator and denominator have the same units of
measurement. If X is measured in kilograms and Y in metres, then the unit of
the numerator will be "kilogram metre", whereas in the denominator, the unit
of the standard deviation σx would be kilogram and that of σy would be
metre. Hence, the unit of the numerator is the same as that of the
denominator, and the units cancel each other. Because r does not depend on
any unit, it is called a coefficient.
which is always true for all values of xᵢ and yᵢ, since the left-hand side
(L.H.S.) expression is a squared quantity. Now, considering first the plus sign
between the two terms and expanding the L.H.S., we get

(1/N) Σᵢ₌₁ᴺ ((xᵢ − X̄)/σx)² + (1/N) Σᵢ₌₁ᴺ ((yᵢ − Ȳ)/σy)² + (2/N) Σᵢ₌₁ᴺ (xᵢ − X̄)(yᵢ − Ȳ)/(σx σy) ≥ 0

or,

[ (1/N) Σᵢ₌₁ᴺ (xᵢ − X̄)² ]/σx² + [ (1/N) Σᵢ₌₁ᴺ (yᵢ − Ȳ)² ]/σy² + (2/N) Σᵢ₌₁ᴺ (xᵢ − X̄)(yᵢ − Ȳ)/(σx σy) ≥ 0.   … (4)

Since

(1/N) Σᵢ₌₁ᴺ (xᵢ − X̄)² = V(X) = σx²;  (1/N) Σᵢ₌₁ᴺ (yᵢ − Ȳ)² = V(Y) = σy²

and also (1/N) Σᵢ₌₁ᴺ (xᵢ − X̄)(yᵢ − Ȳ)/(σx σy) = r, from (3),

the inequality (4) reduces to 1 + 1 + 2r ≥ 0, that is, r ≥ −1.   … (5)

Similarly, considering the minus sign, from

(1/N) Σᵢ₌₁ᴺ [ (xᵢ − X̄)/σx − (yᵢ − Ȳ)/σy ]² ≥ 0,

we have

(1/N) Σᵢ₌₁ᴺ ((xᵢ − X̄)/σx)² + (1/N) Σᵢ₌₁ᴺ ((yᵢ − Ȳ)/σy)² − (2/N) Σᵢ₌₁ᴺ (xᵢ − X̄)(yᵢ − Ȳ)/(σx σy) ≥ 0

or, 1 + 1 − 2r ≥ 0, that is, 2 − 2r ≥ 0, that is, r ≤ 1.   … (6)

Therefore, combining (5) and (6), we get the result

−1 ≤ r ≤ 1,   … (7)

which gives the range of the coefficient of correlation. This completes the
proof of Property 1.
Some specific values of r:

We observe that r ∈ [−1, 1]; this means that, depending upon the values of
the two variables, it may assume any value in between −1 and +1, both
values inclusive. Three values of r are important to consider in this range:
−1, 0 and +1.

(i) It assumes the value −1 when there is a perfect negative linear relationship
between the variables, that is, as one variable increases in its values, the
other variable decreases in its values through an exact linear rule.

(ii) Similarly, it assumes the value +1 when there is a perfect positive linear
relationship between the variables, that is, as one variable increases in
its values, the other variable also increases in its values through an
exact linear rule.

(iii) The value of r is zero when there is no linear relationship between the
variables. However, a zero value of r does not necessarily mean that the
variables are unrelated; they may still be connected through a
non-linear relationship.

(iv) Other values of r indicate the extent of the linear relationship.
Since the extreme values −1 and +1 of r lie at the two ends of its range and
zero lies at the middle, the linear relationship becomes weaker and weaker as
r approaches the middle-most point of the range from either side. Depending
upon the values of r in its range, we, therefore, define weak, moderate and
strong relationships as follows:
(a) The linear relationship is said to be a "weak positive (negative)
relationship" if the value of r lies between 0.0 and 0.3 (−0.0 and −0.3).
This happens when the relationship follows only a shaky linear rule.

(b) Values between 0.3 and 0.7 (−0.3 and −0.7) indicate a "moderate positive
(negative) linear relationship" through a fuzzy-firm linear rule.

(c) Lastly, values between 0.7 and 1.0 (−0.7 and −1.0) are an indication of a
"strong positive (negative) linear relationship" through a nearly perfect
linear rule.
Property 2: The coefficient of correlation, r, is independent of change of
origin and scale of the variables. This indicates that if rXY is the coefficient
of correlation between two variables X and Y, and rUV is the same between
the variables U and V, which are respectively obtained by changing the origin
and scale of the variables X and Y, then we have rUV = rXY.
Proof: We know that

rXY = Cov(X, Y)/(σX σY) = [ (1/N) Σᵢ₌₁ᴺ (xᵢ − X̄)(yᵢ − Ȳ) ]/(σx σy).

Now, let us define the variable U = (X − a)/b, which is obtained by changing
the origin and scale of the variable X; the constant a changes the origin of X
and the constant b is used to change the scale of X. Similarly, we define the
variable V = (Y − c)/d, which is obtained by changing the origin and scale of
the variable Y. From these relations, we get X = a + bU and Y = c + dV.

Hence, we see that X̄ = a + bŪ and Ȳ = c + dV̄. Therefore, substituting these
values of X, Y, X̄ and Ȳ in the formula of rXY, we have

rXY = [ (1/N) Σᵢ₌₁ᴺ b(uᵢ − Ū) · d(vᵢ − V̄) ]/(b σU · d σV) = [ (1/N) Σᵢ₌₁ᴺ (uᵢ − Ū)(vᵢ − V̄) ]/(σU σV) = rUV,   … (8)

since the constants b and d cancel.
Solution: In order to find the value of r from this formula, it is clear that we
need the values of the quantities Σ(xᵢ − X̄)(yᵢ − Ȳ), Σ(yᵢ − Ȳ)² and
Σ(xᵢ − X̄)². For this, we first need the values of X̄ and Ȳ. Taking
advertisement expenditure as variable X and profit as variable Y, we see that

X̄ = (1/6) Σᵢ₌₁⁶ xᵢ = 240/6 = 40  and  Ȳ = (1/6) Σᵢ₌₁⁶ yᵢ = 360/6 = 60.

The following table provides the rest of the necessary calculations:

X     Y     X − X̄   (X − X̄)²   Y − Ȳ   (Y − Ȳ)²   (X − X̄)(Y − Ȳ)
30    56    −10      100         −4       16          40
44    55    4        16          −5       25          −20
45    60    5        25          0        0           0
43    64    3        9           4        16          12
34    62    −6       36          2        4           −12
44    63    4        16          3        9           12
240   360   0        202         0        70          32

Therefore, r = 32/√(202 × 70) = 32/118.91 ≈ 0.27.
Further, the standard deviations can be written as

σX = √( (1/N) Σᵢ₌₁ᴺ xᵢ² − X̄² )  and  σY = √( (1/N) Σᵢ₌₁ᴺ yᵢ² − Ȳ² )

and, for the numerator,

(1/N) Σᵢ₌₁ᴺ (xᵢ − X̄)(yᵢ − Ȳ) = (1/N)[ Σᵢ₌₁ᴺ xᵢyᵢ − NX̄Ȳ − NX̄Ȳ + NX̄Ȳ ] = (1/N) Σᵢ₌₁ᴺ xᵢyᵢ − X̄Ȳ;

since Σᵢ₌₁ᴺ xᵢ = NX̄, Σᵢ₌₁ᴺ yᵢ = NȲ and Σᵢ₌₁ᴺ X̄Ȳ = NX̄Ȳ.

Thus, the extended formula for r is given by

r = [ (1/N) Σᵢ₌₁ᴺ xᵢyᵢ − X̄Ȳ ] / [ √((1/N) Σᵢ₌₁ᴺ xᵢ² − X̄²) · √((1/N) Σᵢ₌₁ᴺ yᵢ² − Ȳ²) ].   … (9)

It can be seen that in this formula of r, all the deviation-type terms have been
removed and all the terms can be calculated using the exact values of xᵢ and
yᵢ. Thus, this form of the formula is suitable for calculation purposes, since it
minimizes the rounding error in each of the terms.

Let us now illustrate the steps which should be followed for computing the
value of r using the extended formula obtained in (9). We use the
following example for this purpose:
Example 3: Calculate Karl Pearson's coefficient of correlation between price
and demand for the following data:
Price 17 18 19 20 22 24 26 28 30
Demand 40 38 35 30 28 25 22 21 20
Therefore,

r = [ N Σxᵢyᵢ − (Σxᵢ)(Σyᵢ) ] / [ √(N Σxᵢ² − (Σxᵢ)²) · √(N Σyᵢ² − (Σyᵢ)²) ]
  = (9 × 5605 − 204 × 259) / [ √(9 × 4794 − 204 × 204) · √(9 × 7903 − 259 × 259) ]
  = (50445 − 52836) / [ √(43146 − 41616) · √(71127 − 67081) ]
  = −2391 / (√1530 × √4046) = −2391/2488.05 ≈ −0.96.
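The extended formula (9) translates directly into code. Here is a sketch checked against the price-demand data of Example 3; the function name is illustrative:

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's r via the deviation-free (extended) formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

price  = [17, 18, 19, 20, 22, 24, 26, 28, 30]
demand = [40, 38, 35, 30, 28, 25, 22, 21, 20]
print(round(pearson_r(price, demand), 2))  # -0.96
```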
The Spearman's rank correlation coefficient is given by

rs = 1 − [ 6 Σdᵢ² ] / [ N(N² − 1) ]

where dᵢ is the difference between the two ranks given to each individual and
N is the number of observations.
With the help of rank correlation, we find the association between two
qualitative characteristics. Just as the Karl Pearson's correlation coefficient
gives the intensity of the linear relationship between two variables,
Spearman's rank correlation coefficient gives the degree of association
between two qualitative characteristics. In fact, Spearman's rank correlation
coefficient measures the strength of association between two ranked
variables.
8.6.1 Method of Calculation of Rank Correlation
Let us discuss some problems on rank correlation coefficient. We shall
consider three cases to compute the rank correlation coefficient as follows:
Case I: When Actual Ranks are given
Case II: When Ranks are not given
Case III: When Ranks are repeated.
Case I: When Actual Ranks are given
In this case the following steps are involved:
i) Compute d i R X R Y , i.e., the difference between the two ranks given
to each individual
ii) Compute d i2 , i.e., the squares of differences for each individual.
iii) Apply the formula of the rank correlation coefficient by substituting the
sum of the squared differences and the value of N:
Example 4: Suppose we have the ranks of 8 students of B.Sc. in Statistics and
Mathematics. On the basis of these ranks, we would like to know to what
extent a student's knowledge in Statistics and Mathematics is related.
Rank in Statistics 1 2 3 4 5 6 7 8
Rank in 2 4 1 5 3 8 7 6
Mathematics
Rank in Statistics (Rx)   Rank in Mathematics (Ry)   dᵢ = Rx − Ry   dᵢ²
1                         2                          −1             1
2                         4                          −2             4
3                         1                          2              4
4                         5                          −1             1
5                         3                          2              4
6                         8                          −2             4
7                         7                          0              0
8                         6                          2              4
                                                                   Σdᵢ² = 22

Therefore, rs = 1 − (6 × 22)/(8(64 − 1)) = 1 − 132/504 ≈ 0.74, showing a
fairly high positive association between the ranks in the two subjects.
Solution:

X     Y     Rank of X (Rx)   Rank of Y (Ry)   d = Rx − Ry   d²
78    125   4                4                0             0
89    137   2                2                0             0
97    156   1                1                0             0
69    112   5                6                −1            1
59    107   7                7                0             0
79    136   3                3                0             0
68    124   6                5                1             1
                                                           Σdᵢ² = 2

Spearman's rank correlation formula is

rs = 1 − 6 Σdᵢ² / (N(N² − 1)) = 1 − (6 × 2)/(7(49 − 1)) = 1 − 12/336 ≈ 0.96
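For the no-ties case, the formula is only a few lines of code. A sketch verified against Example 4; the function name is illustrative:

```python
def spearman_rs(rank_x, rank_y):
    """rs = 1 - 6*sum(d^2) / (N*(N^2 - 1)), valid when the ranks are untied."""
    n = len(rank_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Example 4: ranks in Statistics vs ranks in Mathematics (sum of d^2 = 22)
rs = spearman_rs([1, 2, 3, 4, 5, 6, 7, 8], [2, 4, 1, 5, 3, 8, 7, 6])
print(round(rs, 3))  # 0.738
```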
Series: 50 70 80 80 85 90
Ranks: 6 5 3.5 3.5 2 1
In the above example 80 was repeated twice. It may also happen that two or
more values are repeated twice or more than that.
For example, in the following series there is a repetition of 80 and 110. You
observe the values, assign ranks and check with following:
When there is a repetition of ranks, a correction factor m(m² − 1)/12 is added
to Σd² in the Spearman's rank correlation coefficient formula, where m is the
number of times a rank is repeated. It is very important to know that this
correction factor is added for every repetition of rank in both characteristics.
In the first example, the correction factor is added once, which is
2(2² − 1)/12 = 0.5, while in the second example the correction factors are
2(2² − 1)/12 = 0.5 and 3(3² − 1)/12 = 2, which are added to Σd².
Thus, the formula becomes

rs = 1 − 6[ Σd² + m(m² − 1)/12 + … ] / [ N(N² − 1) ]
In the given problem, Σd² = 83.50. Here, rank 6 is repeated three times in the
ranks of X and rank 2.5 is repeated twice in the ranks of Y, so the correction
factor is

3(3² − 1)/12 + 2(2² − 1)/12 = 2 + 0.5 = 2.50.

Hence, the rank correlation coefficient is

rs = 1 − 6[ Σd² + 3(3² − 1)/12 + 2(2² − 1)/12 ] / (N(N² − 1))
   = 1 − 6(83.50 + 2.50)/(8(64 − 1)) = 1 − 516/504 = 1 − 1.024 = −0.024.

There is a negative association between expenditure on advertisement and
profit.
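The tie correction folds into the same computation. A sketch reproducing the worked figures above (Σd² = 83.50, one rank repeated three times and another twice, N = 8); the function and parameter names are illustrative:

```python
def spearman_with_ties(sum_d2, tie_counts, n):
    """rs with a correction m*(m^2 - 1)/12 added to sum(d^2) per group of m tied ranks."""
    correction = sum(m * (m * m - 1) / 12 for m in tie_counts)
    return 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1))

rs = spearman_with_ties(sum_d2=83.50, tie_counts=[3, 2], n=8)
print(round(rs, 3))  # -0.024
```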
CHECK YOUR PROGRESS 4
Note: i) Check your answers with those given at the end of the unit.
The term "regression" was first coined in the nineteenth century by Sir
Francis Galton (1822-1911), a British mathematician, statistician and
biometrician. He used this word while he was working on a biological
phenomenon regarding the heights of parents and their offspring. In fact,
during experimentation he observed that if the parents in a family were
very tall, their children tended to be tall, but shorter than their parents. On
the other hand, if the parents were very short, their children tended to be
short, but taller than their parents. He named this phenomenon "regression
to the mean", with the word "regression" meaning "to come back". Although
Galton used the word regression only for this biological phenomenon, later
on his work was extended by other statisticians, such as Udny Yule, Karl
Pearson and R. A. Fisher, to a more general statistical context.
CHECK YOUR PROGRESS 5
where yi and ŷi respectively denote the actual and predicted values of the
variable Y.
The two normal equations, obtained by minimizing U simultaneously with
respect to the parameters b0 and b1, are

Σᵢ₌₁ᴺ yᵢ = N b0 + b1 Σᵢ₌₁ᴺ x1i;   … (13)

Σᵢ₌₁ᴺ yᵢ x1i = b0 Σᵢ₌₁ᴺ x1i + b1 Σᵢ₌₁ᴺ x1i².   … (14)

Multiplying equation (13) by Σᵢ₌₁ᴺ x1i and equation (14) by N and then
subtracting the resultant second equation from the resultant first equation, we
have

Σᵢ₌₁ᴺ yᵢ Σᵢ₌₁ᴺ x1i − N Σᵢ₌₁ᴺ yᵢ x1i = b1[ (Σᵢ₌₁ᴺ x1i)² − N Σᵢ₌₁ᴺ x1i² ]

or, b1 = [ N Σᵢ₌₁ᴺ yᵢ x1i − Σᵢ₌₁ᴺ yᵢ Σᵢ₌₁ᴺ x1i ] / [ N Σᵢ₌₁ᴺ x1i² − (Σᵢ₌₁ᴺ x1i)² ].   … (15)
This is the estimated value of the parameter b1, denoted by b̂1 , obtained on the
basis of given (yi, x1i ) values.
Now, since we know that Σᵢ₌₁ᴺ yᵢ = NȲ and Σᵢ₌₁ᴺ x1i = NX̄1, we have

b̂1 = [ (1/N) Σᵢ₌₁ᴺ yᵢ x1i − X̄1Ȳ ] / [ (1/N) Σᵢ₌₁ᴺ x1i² − X̄1² ] = Cov(X1, Y)/V(X1) = r σY σX1 / σX1²

or, b̂1 = r σY/σX1; r being the product moment correlation between X1 and Y,
and σY and σX1, respectively, the standard deviations of the Y and X1
variables.

Substituting the value of b̂1 in equation (13) and solving for b0, we get

b̂0 = Ȳ − r (σY/σX1) X̄1.
Finally, substituting these estimated values of the constants b0 and b1 in the
regression model (11), we get

Y − Ȳ = r (σY/σX1)(X1 − X̄1).   … (16)

This is the single variable linear regression model, including the single
independent variable X1 and the dependent variable Y. It represents the linear
relationship between the variables Y and X1, where the degree of correlation
is given by r. The equation can be converted into a linear equation as

Y = [ Ȳ − r (σY/σX1) X̄1 ] + r (σY/σX1) X1 = b̂0 + b̂1 X1;   … (17)
where b̂0 and b̂1 are the estimated values of the parameters b0 and b1
respectively, obtained through the given data set (xi, yi), such that the
regression line (17) provides the best fit to the data, in the sense that the
predicted values of Y would be closest to its corresponding actual values.
Since, through this regression, we predict the values of the dependent
variable Y for given values of the variable X1; it is popularly termed as the
“Regression Line of Y on X1”.
ii) Regression Line of X1 on Y:
As described above, regression line of Y on X1 is used to predict the values
of the dependent variable Y when values of the independent variable X1 are
given. You may argue that, for predicting the values of the X1 variable for
given Y values, the same fitted line (17) can be used if we convert it into the
form

X1 = (1/b̂1) Y − b̂0/b̂1, or X1 = b*0 + b1* Y;

where b*0 = −b̂0/b̂1 and b1* = 1/b̂1.
If you observe, in fact, the (yᵢ − ŷᵢ)'s are the vertical distances of the points
from the fitted line for given X-values. Thus, U is the sum of squares of all
such vertical distances between the actual and predicted Y-values, which has
been minimized with respect to all the parameters.
Now, if instead of Y values, X1-values are to be predicted for known Y-
values, then, according to least square principle, we should minimize the sum
of squares of residuals given by the (x1i − x̂1i)'s, which would be the
horizontal distances between the actual and predicted X1-values for all the
points. Certainly then, we will get a new regression line which will be
different from X1 = b*0 + b1*Y, because the methods of estimation of the
parameters in the two cases are quite different.
Now, since we wish to treat X1-variable as dependent and Y-variable as
independent in order to obtain regression line of X1 on Y using least square
principle, we write the single variable linear regression line of X1 on Y as
X1 = c0 + c1 Y   … (18)
Using the principle of least squares, the sum of squares of residuals

V = Σᵢ₌₁ᴺ (x1i − c0 − c1yᵢ)² = Σᵢ₌₁ᴺ (x1i − x̂1i)²

is minimized with respect to c0 and c1. Solving the resulting normal
equations, exactly as before, gives

ĉ1 = [ N Σᵢ₌₁ᴺ yᵢ x1i − Σᵢ₌₁ᴺ yᵢ Σᵢ₌₁ᴺ x1i ] / [ N Σᵢ₌₁ᴺ yᵢ² − (Σᵢ₌₁ᴺ yᵢ)² ].   … (21)

This is the estimate of the parameter c1, say ĉ1, obtained on the basis of
the given set of data.
The expression (21) can further be written as

ĉ1 = Cov(X1, Y)/σY² = r σX1 σY / σY² = r σX1/σY.

Further, substituting the value of ĉ1 in the equation (19) and solving for c0,
we get

ĉ0 = X̄1 − r (σX1/σY) Ȳ
formulae, but we avoid using them for computation purposes, since they are
not free from approximation errors. We have mentioned there that alternative
formulae, derived from the direct ones, should be used instead. For ready
reference, we present below these formulae, which should be used for
calculation:
Ȳ = (1/N) Σᵢ₌₁ᴺ yᵢ;  X̄1 = (1/N) Σᵢ₌₁ᴺ x1i;

r = [ (1/N) Σᵢ₌₁ᴺ x1i yᵢ − X̄1Ȳ ] / [ √((1/N) Σᵢ₌₁ᴺ x1i² − X̄1²) · √((1/N) Σᵢ₌₁ᴺ yᵢ² − Ȳ²) ];

σY = √( (1/N) Σᵢ₌₁ᴺ yᵢ² − Ȳ² );  σX1 = √( (1/N) Σᵢ₌₁ᴺ x1i² − X̄1² ).
10. Explain why there are two lines of regression. Obtain the expression of
regression line of Y on X1 where Y and X1 are respectively the dependent
and independent variables.
……………………………………………………………………………….
…………………………………………………………………………………
……….………………………………………………………………………
………………….……………………………………………………………
8.9 REGRESSION COEFFICIENTS
We have seen that the regression line of Y on X1 is given by

Y − Ȳ = r (σY/σX1)(X1 − X̄1),

whereas that of X1 on Y is

X1 − X̄1 = r (σX1/σY)(Y − Ȳ).

The multipliers, r σY/σX1 of X1 in the line of Y on X1 and r σX1/σY of Y in
the line of X1 on Y, are called the regression coefficients and are denoted by
bYX1 and bX1Y respectively, so that bYX1 · bX1Y = r²,
that is, the coefficient of correlation, r is the geometric mean of the two
regression coefficients. Thus, given the magnitude of regression coefficients,
the value of the correlation coefficient can be obtained. However, as far as
the sign of the correlation coefficient is concerned, we can observe that
whenever r is negative, both the coefficients bYX1 and bX1Y would be
negative, and whenever r is positive, both would be positive.

Another property is that if one of the regression coefficients is greater than
unity, the other must be less than unity; that is, if bYX1 > 1, then
1/bYX1 < 1. We know that

bYX1 · bX1Y = r² ≤ 1 (From Property 1)

so that bX1Y ≤ 1/bYX1 < 1.

Further, since (√bYX1 − √bX1Y)² ≥ 0 (when both the coefficients are
positive), we have

bYX1 + bX1Y − 2√(bYX1 · bX1Y) ≥ 0,

that is, (bYX1 + bX1Y)/2 ≥ √(bYX1 · bX1Y) = |r|; the arithmetic mean of the
regression coefficients is at least equal to the magnitude of the correlation
coefficient.
Height of Father 65 66 67 67 68 69 70 71
Height of Son 66 68 65 69 74 73 72 70
Find two lines of regression and calculate the estimated average height of son
when the height of father is 68.5 inches.
Solution: Let us denote the father’s height by X1 and son’s height by Y, then
for finding the two lines of regression, we do necessary calculations in the
following table:
X1 Y X12 Y2 X1Y
65 66 4225 4356 4290
66 68 4356 4624 4488
67 65 4489 4225 4355
67 69 4489 4761 4623
68 74 4624 5476 5032
69 73 4761 5329 5037
70 72 4900 5184 5040
71 70 5041 4900 4970
Σx1i = 543   Σyᵢ = 557   Σx1i² = 36885   Σyᵢ² = 38855   Σx1i yᵢ = 37835
Here X̄1 = 543/8 = 67.88 and Ȳ = 557/8 = 69.62. Then,

Standard deviation of X1: σX1 = √( (1/N) Σ(x1i − X̄1)² ) = √( (1/N) Σx1i² − X̄1² ) = √(36885/8 − (67.88)²) = √2.93 = 1.71

Similarly, standard deviation of Y: σY = √( (1/N) Σ(yᵢ − Ȳ)² ) = √( (1/N) Σyᵢ² − Ȳ² ) = √(38855/8 − (69.62)²) = √9.93 = 3.15
Now, the correlation coefficient

r = Corr(X1, Y) = [ N Σx1i yᵢ − (Σx1i)(Σyᵢ) ] / [ √(N Σx1i² − (Σx1i)²) · √(N Σyᵢ² − (Σyᵢ)²) ]
  = 229/√136521 = 229/369.49 = 0.62

Hence, bYX1 = r σY/σX1 = 0.62 × 3.15/1.71 = 1.14 and bX1Y = r σX1/σY =
0.62 × 1.71/3.15 = 0.34, so that the two regression lines are

Y on X1: Y − 69.62 = 1.14 (X1 − 67.88), that is, Y = 1.14 X1 − 7.76

X1 on Y: X1 − 67.88 = 0.34 (Y − 69.62), that is, X1 = 0.34 Y + 44.21

Obviously, the estimate of the height of the son for the father's height of
68.5 inches is obtained from the regression line of Y on X1:

Y = 1.14 × 68.5 − 7.76 = 70.33.

Thus, the estimate of the son's height for the father's height 68.5 inches is
70.33 inches.
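The whole example can be reproduced with a short least-squares sketch. Note that computing without intermediate rounding gives a slope near 0.99 rather than 1.14: the worked solution rounds the standard deviations to two decimals before forming bYX1, which this unrounded sketch avoids:

```python
def regression_y_on_x(x, y):
    """Least-squares line y = b0 + b1*x with b1 = (N*Sxy - Sx*Sy)/(N*Sxx - Sx^2)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b0 = sy / n - b1 * sx / n
    return b0, b1

father = [65, 66, 67, 67, 68, 69, 70, 71]
son    = [66, 68, 65, 69, 74, 73, 72, 70]
b0, b1 = regression_y_on_x(father, son)
print(round(b1, 3))              # 0.991 (exact slope is 229/231)
print(round(b0 + b1 * 68.5, 2))  # 70.24, close to the 70.33 obtained above
```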
CHECK YOUR PROGRESS 7
Note: i) Check your answers with those given at the end of the unit.
X    Y    X − X̄   (X − X̄)²   Y − Ȳ   (Y − Ȳ)²   (X − X̄)(Y − Ȳ)
1    2    −2       4           −4       16          8
2    4    −1       1           −2       4           2
3    6    0        0           0        0           0
4    8    1        1           2        4           2
5    10   2        4           4        16          8
15   30   0        10          0        40          20

Here Σxᵢ = 15, so X̄ = 3, and Σyᵢ = 30, so Ȳ = 6. Therefore,

r = Σ(xᵢ − X̄)(yᵢ − Ȳ) / √( Σ(xᵢ − X̄)² · Σ(yᵢ − Ȳ)² ) = 20/√(10 × 40) = 20/20 = 1
r = [ 6 × 3904 − 168 × 138 ] / [ √(6 × 4744 − 168 × 168) · √(6 × 3214 − 138 × 138) ]
  = (23424 − 23184)/√(240 × 240) = 240/240 = 1
rs = 1 − (6 × 26)/(6(36 − 1)) = 1 − 26/35 = 9/35 ≈ 0.26
7) First, we form the table showing the needed calculations:

x     Rank of x (Rx)   y    Rank of y (Ry)   d = Rx − Ry   d²
70    6.5              90   2                4.5           20.25
70    6.5              90   2                4.5           20.25
80    4                90   2                2             4
80    4                80   4                0             0
80    4                70   5                −1            1
90    2                60   6                −4            16
100   1                50   7                −6            36
                                             Σd² = 97.5

Here, ranks 4 and 6.5 are repeated thrice and twice respectively in the ranks
of x, and rank 2 is repeated thrice in the ranks of y, so the correction factor is

3(3² − 1)/12 + 2(2² − 1)/12 + 3(3² − 1)/12 = 2 + 0.5 + 2 = 4.5

Hence, rs = 1 − 6(97.5 + 4.5)/(7(49 − 1)) = 1 − 612/336 ≈ −0.82.
11) For getting the necessary values for computation purposes, we present the
calculations in the following table:

S.No.   x1i   yᵢ   x1i − X̄1   (x1i − X̄1)²   yᵢ − Ȳ   (yᵢ − Ȳ)²   (x1i − X̄1)(yᵢ − Ȳ)
1       5     9    2           4              −1        1            −2
2       4     8    1           1              −2        4            −2
3       3     10   0           0              0         0            0
4       2     11   −1          1              1         1            −1
5       1     12   −2          4              2         4            −4
Total   15    50   0           10             0         10           −9

X̄1 = Σx1i/n = 15/5 = 3 and Ȳ = Σyᵢ/n = 50/5 = 10. Therefore,

bX1Y = Σ(x1i − X̄1)(yᵢ − Ȳ) / Σ(yᵢ − Ȳ)² = −9/10 = −0.9

bYX1 = Σ(x1i − X̄1)(yᵢ − Ȳ) / Σ(x1i − X̄1)² = −9/10 = −0.9.
UNIT 9 SAMPLING DISTRIBUTIONS
Structure
9.1 Introduction
9.2 Objectives
9.3 Basics of Sampling
9.4 Sampling Distribution
9.4.1 Standard Error
9.4.2 Central Limit Theorem
9.1 INTRODUCTION
In general, extracting information from all the elements or items of a large
population may be time-consuming and expensive, especially if the
population size is very large. Even so, there are many problems attached to
large populations where it becomes necessary to draw inferences about
population parameters. For example, one may wish to estimate the average
height of all the two thousand students in a college; a businessman may be
interested in estimating the proportion of defective items in a production
line; a manufacturer of car tyres may want to estimate the variation in the
diameter of the produced tyres; a pharmacist may want to estimate the
difference between the effects of two types of drugs; and so on. In all such
cases, there is always an unknown population involved whose characteristics
are described through some parameters.
Due to the large sizes of the populations in which we may be interested, for
drawing inferences about the population parameters we generally draw a
sample and determine a function of the sample values, which is called a
statistic. Selection of a sample, that is, of only a part of the population, saves
a lot of time, money and labour, and the results drawn from sample values
are quickly available for interpretation and are sometimes as good as those
obtained on the basis of the entire population. The process of generalising
sample results to the population is called Statistical Inference. Since there
might be a large number of samples of the same size drawn from the
population, the value of a statistic generally varies from sample to sample,
each value being associated with the probability of selection of the particular
sample. Therefore, the sample statistic is a random variable following some
probability distribution. It may, therefore, be a matter of interest for a
statistician to know what distribution the statistic follows if the samples are
assumed to be selected from a theoretical distribution such as a normal
distribution with given mean and variance, a binomial distribution, a gamma
distribution, a Poisson distribution, and so on. In contrast to theoretical
distributions, the probability distribution of a statistic is popularly called a
sampling distribution. In this unit we shall discuss the sampling distributions
of the sample mean, the sample median, the sample proportion, the
difference between two sample means and the difference between two
sample proportions.
Due to this curiosity, Prof. R. A. Fisher, Prof. G. Snedecor and some other
statisticians worked in this area and obtained the exact sampling distributions
followed by some of the important statistics. In the present unit of this
block, we shall also discuss some important sampling distributions, such as χ²
(read as chi-square), t and F. Generally, these sampling distributions are
named after their originators; for instance, Fisher's F-distribution is named
after its inventor, Prof. R. A. Fisher.
9.2 OBJECTIVES
After studying this unit, you should be able to:
explain the concept of sampling and sampling distribution;
explain the concept of standard error and Central Limit Theorem;
define the sampling distribution; and
describe the sampling distribution of sample mean and difference of
two sample means.
As we have discussed in Section 9.1, in many practical and real situations the
population under consideration is either infinitely large in size, or it is
unbounded in the sense that its boundaries are not well-defined, or it is of a
destructive nature. For example, in order to determine the average life of two
hundred produced electric bulbs, it would be necessary to light these bulbs
until they all get fused, resulting in the destruction of the entire lot of
production. Similarly, in order to estimate the proportion of smokers in a
large city, information would have to be gathered from each and every person
dwelling in the city, which would be a very difficult task in terms of the
manpower, money and time required. Sometimes, the population size is not
known because its boundaries are not well-defined, like the number of fish in
a pond or lake. In all such cases, it is not feasible to gather information on all
the units; rather, it becomes almost impossible. Keeping in view the difficulty
of contacting each and every unit of the population for these reasons, and
also to save the time, money and manpower required, generally a part of the
population, popularly known as a "sample", is selected in some pre-assigned
manner, which in turn is used for drawing inferences about the population
itself or about the parameters of the population. The results obtained from the
sample are projected in such a way that they are valid for the entire
population. Therefore, the sample works like a "vehicle" for reaching
(drawing) valid conclusions about the population. In fact, a sample helps us
to reach the "whole" (population) from a "part" (sample) in all types of
statistical studies. Thus, statistical inference can be defined as:

"It is a process of concluding (projecting or inferring) something desired about
a given population on the basis of sample results."
Parameter and Statistic
A parameter is a function of population values which is used to represent certain characteristics of the population. For example, the population total, population mean, population variance, population coefficient of variation, population proportion and population correlation coefficient are all parameters, since their calculation involves all the population values.
A statistic is a function of sample values only and does not contain any unknown population parameter. For example, if X₁, X₂, ..., Xₙ represent the values of the variable X in a random sample of size n taken from a population, then the sample mean

X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ

is a statistic.
of the statistic θ̂ₖ and the probability of its appearance due to the kth sample, k = 1, 2, …, ᴺCₙ.
Generally, in practice only a single random sample is taken from a given population and its mean X̄ is taken as representative of the population mean µ. This sample mean may or may not represent the population mean well, and we cannot judge the proximity of the sample mean to the population mean on the basis of a single random sample; the concept of a sampling distribution lets us study how the sample mean behaves around the population mean. Being a random variable, the statistic must possess a probability distribution, which may be used to answer questions regarding the nature of the statistic; for example, what is the probability that its value, as obtained from the given sample, differs from the parameter value by a margin of 10? The probability distribution of a statistic is called the sampling distribution of the statistic. Let us illustrate how the sampling distribution of a statistic can be obtained for a given population. For instance, suppose we wish to estimate the population mean using the sample mean as an estimator.
Suppose that a baby-sitter has 5 children under her supervision. The ages of the children are 2, 4, 6, 8 and 10 years. Treating this group of children as a population of size 5, we get the population mean as

µ = (1/N) Σᵢ₌₁ᴺ Xᵢ = (2 + 4 + 6 + 8 + 10)/5 = 6

Therefore, the variance of this population is given by:

σ² = (1/N) Σᵢ₌₁ᴺ (Xᵢ − µ)² = [(2 − 6)² + (4 − 6)² + … + (10 − 6)²]/5 = 40/5 = 8
Now, let us take all the possible simple random samples of size 2 without replacement from this population. There are ⁵C₂ = 10 such possible samples, which are listed below along with their respective means:

Sample No.   Sample   Sample Mean
1            2, 4     3
2            2, 6     4
3            2, 8     5
4            2, 10    6
5            4, 6     5
6            4, 8     6
7            4, 10    7
8            6, 8     7
9            6, 10    8
10           8, 10    9
Now, suppose that our selected sample is either (2, 4) with mean 3 or the sample (8, 10) with mean 9. In both cases the sample is not a good representative of the population, since the sample mean is far from the population mean 6. But if the selected sample happens to be either (2, 10) or (4, 8), then it is a good representative of the population, in the sense that both have sample means exactly equal to the population mean. Thus, this example illustrates that a single random sample may or may not be representative enough for the decision maker to reach a meaningful conclusion. However, the grand mean of the distribution of these ten sample means is observed to be equal to the population mean:

(3 + 4 + 5 + 6 + 5 + 6 + 7 + 7 + 8 + 9)/10 = 60/10 = 6

Hence, the mean of the sample means can be considered to represent the population mean for analysis and decision-making purposes.
Now let us put the sample means along with their probabilities of occurrence as follows:
Table 9.2: Probability Distribution

Sample Mean   Frequency   Probability of Occurrence
3 1 0.1
4 1 0.1
5 2 0.2
6 2 0.2
7 2 0.2
8 1 0.1
9 1 0.1
Total 10 1.0
This distribution, which shows how the probabilities are distributed over all the possible values of the sample mean, is referred to as the sampling distribution of the sample mean. Symbolically, it can be denoted as {(x̄, p(x̄))}. As explained above, the sampling distribution of a statistic can be defined as:
"The probability distribution of all possible values of a statistic that would be obtained by drawing all possible samples of the same size from the population is called the sampling distribution of that statistic."
The mean, variance and other measures of a sampling distribution can be obtained in the same way as for a frequency distribution, taking probabilities as frequencies. Thus, the mean of the above sampling distribution will be
µ_x̄ = Σᵢ₌₁⁷ x̄ᵢ p(x̄ᵢ) = 3(0.1) + 4(0.1) + 5(0.2) + 6(0.2) + 7(0.2) + 8(0.1) + 9(0.1) = 6
This value is the same as the population mean µ. The variance of the distribution is obtained as:

σ²_x̄ = Σᵢ (x̄ᵢ − µ)² p(x̄ᵢ) = (3 − 6)²(0.1) + (4 − 6)²(0.1) + … + (9 − 6)²(0.1) = 3

which equals (σ²/n)(N − n)/(N − 1) = (8/2)(3/4) = 3, the variance of the sample mean under sampling without replacement.
SE(X̄) = σ/√n

SE(X̄ − Ȳ) = √(σ₁²/n₁ + σ₂²/n₂)

where σ₁² and σ₂² are the population variances of two different populations and n₁ and n₂ are the sizes of the two independent samples selected from the two populations respectively.

SE(p₁ − p₂) = √(P₁Q₁/n₁ + P₂Q₂/n₂)
From all the above formulae we can see that the standard error is inversely proportional to the square root of the sample size; therefore, as the sample size increases, the standard error decreases.
The standard error is used to express the accuracy or precision of the estimate of a population parameter, because the reciprocal of the standard error is a measure of the reliability or precision of the statistic. The standard error also determines the probable limits, or confidence limits, within which the population parameter may be expected to lie with a certain level of confidence. The standard error is also used in the testing of hypotheses.
9.4.2 Central Limit Theorem
The central limit theorem is one of the most important theorems of Statistics. It was first introduced by de Moivre in the early eighteenth century. The theorem states that regardless of the nature of the distribution of the population, the distribution of the sample mean approaches the normal probability distribution as the sample size increases. In general, the larger the sample size, the closer the distribution of the sample mean is to the normal distribution. In practice, sample sizes of 30 or larger are considered adequate for this purpose. It should be noted, however, that the sampling distribution of the sample mean is always exactly normal if the original population is normally distributed.
X̄ ~ N(µ, σ²/n)

Z = (X̄ − µ)/(σ/√n) ~ N(0, 1)

that is, the variate Z follows the standard normal distribution with mean 0 and variance unity.
Mean of X̄:

E(X̄) = E[(X₁ + X₂ + … + Xₙ)/n]                (by definition of X̄)
     = (1/n)[E(X₁) + E(X₂) + … + E(Xₙ)]

Since E(Xᵢ) = µ for all i = 1, 2, …, n, and similarly Var(Xᵢ) = σ² for all i, we have

E(X̄) = (1/n)(µ + µ + … + µ) (n times) = nµ/n = µ

and variance

Var(X̄) = (1/n²) Var(X₁ + X₂ + … + Xₙ)
       = (1/n²)[Var(X₁) + Var(X₂) + … + Var(Xₙ)]
       = (1/n²)(σ² + σ² + … + σ²) (n times) = nσ²/n² = σ²/n
We therefore conclude that if Xᵢ ~ N(µ, σ²), then

X̄ ~ N(µ, σ²/n)

and

SE(X̄) = SD(X̄) = √Var(X̄) = σ/√n
Let us illustrate these results using some examples.
Example 1: The diameter of a steel ball bearing produced on a semi-automatic machine is known to be normally distributed with mean 12 cm and standard deviation 0.1 cm. If we take a random sample of 10 ball bearings, find the mean and variance of the sampling distribution of the mean.
Solution: Here, we are given that

µ = 12, σ = 0.1, n = 10

Since the sample is taken from the population of ball bearings in which the diameter follows the normal distribution N(12, 0.01), we have

E(X̄) = µ = 12

and

Var(X̄) = σ²/n = (0.1)²/10 = 0.001.
σ² = [(54 − 60)² + (56 − 60)² + … + (66 − 60)²]/5 = 104/5 = 20.8
Now, there are ⁵C₂ = 10 possible samples of size 2 (without replacement), which are shown in the table given below:

Table 9.3: Possible Samples

Sample No.   Sample   Sample Mean
1 54, 56 55
2 54, 60 57
3 54, 64 59
4 54, 66 60
5 56, 60 58
6 56, 64 60
7 56, 66 61
8 60, 64 62
9 60, 66 63
10 64, 66 65
Table 9.4: Probability Distribution

Sample Mean   Frequency   Probability of Occurrence
55 1 0.1
57 1 0.1
58 1 0.1
59 1 0.1
60 2 0.2
61 1 0.1
62 1 0.1
63 1 0.1
65 1 0.1
Total 10 1.0
The mean of this distribution is

Σ x̄ p(x̄) = 55(0.1) + 57(0.1) + 58(0.1) + 59(0.1) + 60(0.2) + 61(0.1) + 62(0.1) + 63(0.1) + 65(0.1) = 60

This value is the same as the population mean µ. The variance of the distribution is obtained as:

σ²_x̄ = Σ (x̄ − µ)² p(x̄) = (55 − 60)²(0.1) + (57 − 60)²(0.1) + … + (65 − 60)²(0.1) = 7.8

which again equals (σ²/n)(N − n)/(N − 1) = (20.8/2)(3/4) = 7.8.
Note: i) Check your answers with those given at the end of the unit.
E(X̄ − Ȳ) = E(X̄) − E(Ȳ) = µ₁ − µ₂

and variance

Var(X̄ − Ȳ) = Var(X̄) + Var(Ȳ) = σ₁²/n₁ + σ₂²/n₂

SE(X̄ − Ȳ) = √Var(X̄ − Ȳ) = √(σ₁²/n₁ + σ₂²/n₂).
and variance

Var(X̄ − Ȳ) = σ₁²/n₁ + σ₂²/n₂ = (200)²/125 + (100)²/125 = 320 + 80 = 400

SE(X̄ − Ȳ) = √Var(X̄ − Ȳ) = √400 = 20.
If the population variances σ₁² and σ₂² are unknown, then we estimate them by the sample variances of the samples taken from the first and second populations respectively. For large sample sizes (n₁, n₂ ≥ 30), the sampling distribution of (X̄ − Ȳ) is very closely normal with mean (µ₁ − µ₂) and variance s₁²/n₁ + s₂²/n₂.
If the population variances σ₁² and σ₂² are unknown but can be assumed equal, σ₁² = σ₂² = σ², then σ² is estimated by the pooled sample variance s_p², where

s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²]/(n₁ + n₂ − 2)

and the variate

t = [(X̄ − Ȳ) − (µ₁ − µ₂)] / [s_p √(1/n₁ + 1/n₂)] ~ t(n₁ + n₂ − 2)
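The pooled-variance procedure above can be sketched numerically. The two small samples below are hypothetical, assumed drawn from normal populations with equal unknown variance, and s² carries the n − 1 divisor as in this unit:

```python
import math
from statistics import mean, variance

# Hypothetical small independent samples from normal populations
# assumed to have equal (unknown) variance.
xs = [10.2, 9.8, 10.5, 10.1, 9.9]
ys = [9.4, 9.7, 9.2, 9.6]

n1, n2 = len(xs), len(ys)

# Pooled sample variance: (n1-1)s1^2 + (n2-1)s2^2 over n1 + n2 - 2.
sp_sq = ((n1 - 1) * variance(xs) + (n2 - 1) * variance(ys)) / (n1 + n2 - 2)

# t-variate under H0: mu1 = mu2, with n1 + n2 - 2 degrees of freedom.
t = (mean(xs) - mean(ys)) / math.sqrt(sp_sq * (1 / n1 + 1 / n2))
print(n1 + n2 - 2)  # 7 degrees of freedom
print(t)            # roughly 3.68 for this data
```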
Carton   Number of Defective Bulbs
A        2
B        4
C        1
D        3

From the above table, we can see that the value of the sample proportion varies from sample to sample. So we consider all possible sample proportions and calculate their probabilities of occurrence. Since there are 6 possible samples, the probability of selecting any one sample is 1/6. We then arrange the possible sample proportions with their respective probabilities in Table 9.7:
Table 9.7: Sampling Distribution of Sample Proportion
E(p) = (1/6)[3/40 + 4/40 + 2(5/40) + 6/40 + 7/40] = (1/6)(30/40) = 1/8

Thus, we have seen that the mean of the sample proportion is equal to the population proportion.
If the elements of a population are divided into two mutually exclusive groups, one containing the elements which possess a certain attribute and the other containing those which do not, then the number of successes X (elements possessing the attribute) in a sample of size n follows a binomial distribution with mean

E(X) = nP

and variance

Var(X) = nPQ, where Q = 1 − P

and P is the probability, or proportion, of successes in the population.
Now we can easily find the mean and variance of the sampling distribution of the sample proportion p = X/n by using the above expressions:

E(p) = E(X/n) = (1/n)E(X) = (1/n)(nP) = P

and variance

Var(p) = Var(X/n) = (1/n²)Var(X)        [since Var(aX) = a²Var(X)]
       = (1/n²)(nPQ) = PQ/n
Also, the standard error of the sample proportion can be obtained as

SE(p) = √Var(p) = √(PQ/n)
If the sampling is done without replacement from a finite population, then the mean and variance of the sample proportion are given by

E(p) = P

and variance

Var(p) = [(N − n)/(N − 1)] (PQ/n)

where N is the population size and the factor (N − n)/(N − 1) is called the finite population correction.
If the sample size is sufficiently large, such that np > 5 and nq > 5, then by the central limit theorem the sampling distribution of the sample proportion p is approximately normal with mean P and variance PQ/n, where Q = 1 − P.
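The results E(p) = P and Var(p) = PQ/n can be verified exactly by enumerating the binomial distribution of X; n = 20 and P = 0.3 below are arbitrary illustrative values:

```python
from math import comb

n, P = 20, 0.3
Q = 1 - P

# Exact probability mass function of X ~ Binomial(n, P); the sample
# proportion is p = X / n, so it inherits this distribution rescaled.
pmf = [comb(n, x) * P ** x * Q ** (n - x) for x in range(n + 1)]

mean_p = sum((x / n) * pmf[x] for x in range(n + 1))
var_p = sum((x / n - mean_p) ** 2 * pmf[x] for x in range(n + 1))

print(mean_p)  # P = 0.3
print(var_p)   # P*Q/n = 0.0105
```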
Let us see an application of the sampling distribution of the proportion with the help of an example.
Example 4: A machine produces a large number of items, of which 15% are found to be defective. If a random sample of 200 items is taken from the population and the sample proportion is calculated, find the mean and standard error of the sampling distribution of the proportion.
Solution: Here, we are given that

P = 15/100 = 0.15, n = 200

We know that when the sample size is sufficiently large, such that np > 5 and nq > 5, the sample proportion p is approximately normally distributed with mean P and variance PQ/n, where Q = 1 − P. Here np = 30 > 5 and nq = 170 > 5, so the conditions of normality hold. The mean of the sampling distribution of the sample proportion is given by

E(p) = P = 0.15

and variance

Var(p) = PQ/n = (0.15 × 0.85)/200 ≈ 0.0006

Therefore, the standard error is given by

SE(p) = √Var(p) = √0.0006375 ≈ 0.025
CHECK YOUR PROGRESS 3
Note: i) Check your answers with those given at the end of the unit.
p₁ ~ N(P₁, P₁Q₁/n₁) and p₂ ~ N(P₂, P₂Q₂/n₂)

and variance

Var(p₁ − p₂) = Var(p₁) + Var(p₂) = P₁Q₁/n₁ + P₂Q₂/n₂

That is,

p₁ − p₂ ~ N(P₁ − P₂, P₁Q₁/n₁ + P₂Q₂/n₂)
Thus, the standard error is given by

SE(p₁ − p₂) = √Var(p₁ − p₂) = √(P₁Q₁/n₁ + P₂Q₂/n₂)
and variance

Var(p₁ − p₂) = P₁Q₁/n₁ + P₂Q₂/n₂ = (0.30 × 0.70)/200 + (0.20 × 0.80)/200 ≈ 0.0019
9.6 EXACT SAMPLING DISTRIBUTION
As we have discussed in Section 9.1, some well-known statisticians, such as Prof. R. A. Fisher and Prof. G. Snedecor, worked on various statistics and determined their exact sampling distributions, along with their properties and applications in different areas. These sampling distributions are named after their originators; for example, the F-distribution is called Fisher's F-distribution, and the t-distribution is called Student's t-distribution after the pen name of Prof. W. S. Gosset. Before describing the exact sampling distributions, we first discuss the term "degrees of freedom", a very useful concept which must be understood before taking up the exact sampling distributions, since these distributions are described with the help of degrees of freedom.
Degrees of Freedom (df)
The term degrees of freedom (df) relates to the independence of the sample observations. In general, the number of degrees of freedom is the total number of observations minus the number of independent constraints or restrictions imposed on the observations. For example, let x₁, x₂, …, xₙ be n independent observations in a sample. Unless some condition is imposed on these values, they have n df. Now let one condition, x₁ + x₂ + … + xₙ = 100, be imposed on this set; then it loses 1 df, that is, the df becomes n − 1, since the last value xₙ (or any other value xᵢ) is determined by the remaining values, so only n − 1 values are independent. Further, if x₁² + x₂² + … + xₙ² = 4000 is imposed as another condition, then the df becomes n − 2, and so on.
For a sample of n observations, if there are k restrictions among observations
(k < n), then the degrees of freedom will be (n–k).
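The practical effect of the lost degree of freedom shows up in variance estimation: dividing the sum of squared deviations about the sample mean by the n − 1 remaining df, rather than by n, gives an unbiased estimate of the population variance. A small simulation sketch (Python standard library; the seed and parameters are arbitrary choices, not from the text):

```python
import random
from statistics import mean

random.seed(1)

mu, sigma2 = 0.0, 4.0   # true population mean and variance
n, reps = 5, 40000      # small samples, many repetitions

biased, unbiased = [], []
for _ in range(reps):
    xs = [random.gauss(mu, 2.0) for _ in range(n)]
    xbar = mean(xs)
    ss = sum((x - xbar) ** 2 for x in xs)
    biased.append(ss / n)          # divides by n: ignores the lost df
    unbiased.append(ss / (n - 1))  # divides by the n - 1 df: unbiased

print(mean(biased))    # close to sigma2*(n-1)/n = 3.2 (underestimates)
print(mean(unbiased))  # close to sigma2 = 4.0
```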
9.6.1 Chi-square Distribution
The chi-square distribution was first discovered by Helmert in 1876 and later independently explained by Karl Pearson in 1900. The chi-square distribution arose mainly as a measure of goodness of fit of a model to a given frequency or probability distribution.
If a random sample X₁, X₂, …, Xₙ of size n is drawn from a normal population having mean µ and variance σ², then the sample variance is defined as

s² = [1/(n − 1)] Σᵢ₌₁ⁿ (xᵢ − x̄)²,  so that  Σᵢ₌₁ⁿ (xᵢ − x̄)² = (n − 1)s² = νs²
Then the variate

χ² = νs²/σ²

which is the ratio of the sample variance multiplied by its degrees of freedom to the population variance, follows the χ²-distribution with ν degrees of freedom.
The probability density function of the χ²-distribution with ν df is given by

f(χ²) = [1/(2^(ν/2) Γ(ν/2))] e^(−χ²/2) (χ²)^(ν/2 − 1);  0 ≤ χ² < ∞   … (1)

where ν = n − 1.
2. The chi-square distribution has only one parameter, ν, that is, the degrees of freedom.
3. The chi-square probability curve is highly positively skewed for smaller values of ν but becomes nearly symmetrical for larger values of ν.
4. The chi-square distribution is uni-modal, that is, it has a single mode.
5. The mean and variance of the chi-square distribution with ν df are ν and 2ν respectively.
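A standard characterisation (not derived in this unit) is that a χ² variate with ν df is the sum of ν squared independent N(0, 1) variates. This gives a quick simulation check of property 5; the seed, ν = 6 and repetition count below are arbitrary illustrative choices:

```python
import random
from statistics import mean, variance

random.seed(7)

nu = 6          # degrees of freedom
reps = 20000    # number of simulated chi-square variates

# Each chi-square variate is a sum of nu squared standard normals.
chi2 = [sum(random.gauss(0, 1) ** 2 for _ in range(nu)) for _ in range(reps)]

print(mean(chi2))      # close to nu = 6
print(variance(chi2))  # close to 2*nu = 12
```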
In general, the standard deviation σ is not known, and in such a situation the only alternative is to estimate it from the sample. The sample variance s² is used for this, where

s² = [1/(n − 1)] Σᵢ₌₁ⁿ (xᵢ − x̄)²

But in this case the variate (X̄ − µ)/(s/√n) is not normally distributed; rather, it follows the t-distribution with (n − 1) df, that is,

t = (X̄ − µ)/(s/√n) ~ t(n − 1)   … (2)
The t-variate is widely used, and its distribution is called Student's t-distribution after the pen name 'Student' of W. S. Gosset. The probability density function of the variable t with (n − 1) = ν degrees of freedom is given by

f(t) = [1/(√ν B(1/2, ν/2))] (1 + t²/ν)^(−(ν+1)/2);  −∞ < t < ∞   … (3)

where B(1/2, ν/2) is the beta function.
The probability curve of the t-distribution is bell-shaped and symmetric about the line t = 0. Probability curves of the t-distribution are shown in Fig. 9.2 for ν = 4 and ν = 12 degrees of freedom.
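As a worked sketch of the t-variate in equation (2), the statistic can be computed for a small sample; the data and hypothesised mean below are hypothetical, not from the unit:

```python
import math
from statistics import mean, stdev

# Hypothetical small sample, assumed drawn from a normal population.
xs = [12.1, 11.8, 12.4, 12.0, 11.7, 12.3]
mu0 = 12.0            # hypothesised population mean

n = len(xs)
xbar = mean(xs)
s = stdev(xs)         # sample SD with the n - 1 divisor, as in the text

t = (xbar - mu0) / (s / math.sqrt(n))
print(n - 1)  # 5 degrees of freedom
print(t)      # roughly 0.447 for this data
```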
9.6.3 F-Distribution
As mentioned in the previous unit, the F-distribution was introduced by Prof. R. A. Fisher and is defined as the ratio of two independent chi-square variates, each divided by its respective degrees of freedom. Suppose we draw a random sample X₁, X₂, …, X_n₁ of size n₁ from a normal population with mean µ₁ and variance σ₁², and another independent random sample Y₁, Y₂, …, Y_n₂ of size n₂ from a normal population with mean µ₂ and variance σ₂². Then ν₁s₁²/σ₁² is distributed as a chi-square variate with ν₁ df, that is,

χ₁² = ν₁s₁²/σ₁² ~ χ²(ν₁)   … (1)

where ν₁ = n₁ − 1, X̄ = (1/n₁) Σᵢ₌₁^n₁ Xᵢ and s₁² = [1/(n₁ − 1)] Σᵢ₌₁^n₁ (Xᵢ − X̄)².

Similarly,

χ₂² = ν₂s₂²/σ₂² ~ χ²(ν₂)   … (2)

where ν₂ = n₂ − 1, Ȳ = (1/n₂) Σᵢ₌₁^n₂ Yᵢ and s₂² = [1/(n₂ − 1)] Σᵢ₌₁^n₂ (Yᵢ − Ȳ)².
Now, if we take the ratio of the chi-square variates given in equations (1) and (2), each divided by its degrees of freedom, we get

F = (χ₁²/ν₁)/(χ₂²/ν₂) = (s₁²/σ₁²)/(s₂²/σ₂²) ~ F(ν₁, ν₂)   … (3)

In the above expression, F stands for the F-distribution, and the suffixes ν₁ and ν₂ are called the degrees of freedom of the F-distribution.
Now, if the variances of both populations are equal, that is, σ₁² = σ₂², then the F-variate reduces to the ratio of the two sample variances:
F = s₁²/s₂² ~ F(ν₁, ν₂)   … (4)
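Equation (4) can be illustrated with two small samples, assumed to come from normal populations with equal variances; the data below are hypothetical:

```python
from statistics import variance

# Hypothetical independent samples from normal populations assumed
# to have equal variances.
xs = [23, 27, 25, 29, 21, 26]   # n1 = 6  ->  nu1 = 5
ys = [30, 28, 33, 35, 29]       # n2 = 5  ->  nu2 = 4

s1_sq = variance(xs)   # sample variance with the n - 1 divisor
s2_sq = variance(ys)

# Under equal population variances, F = s1^2 / s2^2 ~ F(nu1, nu2).
F = s1_sq / s2_sq
print(F)  # roughly 0.96 for this data
```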
Fig. 9.3: Probability curves of the F-distribution for (5, 5), (5, 20) and (20, 5) degrees of freedom.
As the figure shows, the F-distribution is a uni-modal curve. Increasing the first degrees of freedom from ν₁ = 5 to ν₁ = 20 does not change the mean of the distribution (shown by the vertical line), but the probability curve shifts from the tail towards the centre of the distribution; increasing the second degrees of freedom from ν₂ = 5 to ν₂ = 20 decreases the mean, and the probability curve again shifts from the tail towards the centre. One can also get an idea about the skewness of the F-distribution: the probability curve is positively skewed, and it becomes very highly positively skewed when ν₂ is small. Now we shall discuss some of the important properties of the F-distribution.
The F-distribution has the following important properties:
1. The probability curve of the F-distribution is positively skewed. The curve becomes highly positively skewed when ν₂ is small.
2. The F-distribution curve extends along the abscissa from 0 to ∞.
3. The F-distribution is uni-modal, that is, it has a single mode.
4. The square of a t-variate with ν df follows the F-distribution with 1 and ν degrees of freedom.
5. The mean of the F-distribution with (ν₁, ν₂) df is ν₂/(ν₂ − 2) for ν₂ > 2, and its variance is

2ν₂²(ν₁ + ν₂ − 2) / [ν₁(ν₂ − 2)²(ν₂ − 4)]  for ν₂ > 4.
           Population SD (σ)   Sample SD (s)
Women      40                  45
Men        80                  75

F = (s₁²/σ₁²)/(s₂²/σ₂²)

F = (45/40)²/(75/80)² = 1.27/0.88 ≈ 1.44

For the above calculation, the degrees of freedom ν₁ for the women's data are 7 − 1 = 6 and the degrees of freedom ν₂ for the men's data are 12 − 1 = 11.
13) For the purpose of a survey 15 students are selected randomly from
class A and 10 students are selected randomly from class B. At the
stage of the analysis of the sample data, the following information is
available:
15) What are the mean and variance of the F-distribution with ν₁ = 5 and ν₂ = 12 degrees of freedom?
16) Write four applications of F-distribution.
We now end this unit by giving a summary of what we have covered in it.
Table 9.8: Calculation of Sample Mean

Sample Number   Sample Observations   Sample Mean (X̄)
1 4, 6 5
2 4, 8 6
3 4,10 7
4 4, 12 8
5 6, 8 7
6 6, 10 8
7 6, 12 9
8 8, 10 9
9 8, 12 10
10 10, 12 11
Since the arrangement of all possible values of the sample mean with their corresponding probabilities is called the sampling distribution of the mean, we arrange every possible value of the sample mean with its respective probability in Table 9.9 given below:
Table 9.9: Sampling distribution of sample means
µ = 200, σ = 4, n = 10

The mean of the sampling distribution is given by

E(X̄) = µ = 200

and variance

Var(X̄) = σ²/n = (4)²/10 = 16/10 = 1.6
µ = 2550, n = 100, s = 54

First of all, we find the sampling distribution of the sample mean. Since the sample size is large (n = 100 > 30), by the central limit theorem the sampling distribution of the sample mean follows a normal distribution. Therefore, the mean of this distribution is given by

E(X̄) = 2550

and variance

Var(X̄) = s²/n = (54)²/100 = 2916/100 = 29.16
µ₁ = 68, σ₁ = 2.3, n₁ = 35
µ₂ = 65, σ₂ = 2.5, n₂ = 50

To find the mean and standard error, first of all we find the sampling distribution of the difference of the two sample means. Let X̄ and Ȳ denote the mean heights of the male and female workers of the hospital, respectively. Since n₁ and n₂ are large (n₁, n₂ > 30), by the central limit theorem the sampling distribution of (X̄ − Ȳ) follows a normal distribution with mean

E(X̄ − Ȳ) = µ₁ − µ₂ = 68 − 65 = 3

and variance

Var(X̄ − Ȳ) = σ₁²/n₁ + σ₂²/n₂ = (2.3)²/35 + (2.5)²/50 = 0.1511 + 0.1250 = 0.2761

so that SE(X̄ − Ȳ) = √0.2761 ≈ 0.525
N = 10000, X = 7000, so P = X/N = 7000/10000 = 0.70, and n = 100

First of all, we find the sampling distribution of the sample proportion. Here the sample proportion is not given and n is large, so we can assume that the conditions of normality hold. The sampling distribution is then approximately normal with mean

E(p) = P = 0.70

and variance

Var(p) = PQ/n = (0.70 × 0.30)/100 = 0.0021,  where Q = 1 − P
P₁ = 25/100 = 0.25, P₂ = 20/100 = 0.20, n₁ = 250, n₂ = 200

Let p₁ and p₂ be the sample proportions of alcohol drinkers in the two cities A and B respectively. Here the sample proportions are not given, and n₁ and n₂ are large (n₁, n₂ > 30), so we can assume that the conditions of normality hold. The sampling distribution of the difference of proportions is then approximately normal with mean

E(p₁ − p₂) = P₁ − P₂ = 0.25 − 0.20 = 0.05

and variance

Var(p₁ − p₂) = P₁Q₁/n₁ + P₂Q₂/n₂ = (0.25 × 0.75)/250 + (0.20 × 0.80)/200 = 0.00075 + 0.00080 = 0.00155
f(χ²) = (1/96) e^(−χ²/2) (χ²)³;  0 ≤ χ² < ∞
We have the probability density function of the χ² distribution as:

f(χ²) = [1/(2^(ν/2) Γ(ν/2))] e^(−χ²/2) (χ²)^(ν/2 − 1);  0 ≤ χ² < ∞

where ν = n − 1. By comparison we have

ν/2 − 1 = 3  ⟹  ν = 8

(consistently, 2^(ν/2) Γ(ν/2) = 2⁴ × 3! = 96). Thus, n − 1 = 8, so n = 9, and

Mean = ν = 8 and Variance = 2ν = 16.
9) Refer Sub-Section 9.6.1.
10) Here we are given that, for the t-distribution with ν df,

Mean = 0 and Variance = ν/(ν − 2), for ν > 2

In our case ν = 8; therefore,

Mean = 0 and Variance = 8/(8 − 2) = 8/6 ≈ 1.33
F = (s₁²/σ₁²)/(s₂²/σ₂²)

Therefore, we have

F = (60/65)²/(50/45)² = 0.85/1.23 ≈ 0.69
Mean = ν₂/(ν₂ − 2), for ν₂ > 2

and

Variance = 2ν₂²(ν₁ + ν₂ − 2)/[ν₁(ν₂ − 2)²(ν₂ − 4)], for ν₂ > 4

With ν₁ = 5 and ν₂ = 12:

Mean = 12/(12 − 2) = 1.2

Variance = 2(12)²(5 + 12 − 2)/[5(12 − 2)²(12 − 4)] = (288 × 15)/(5 × 100 × 8) = 4320/4000 = 1.08
UNIT 10 STATISTICAL ANALYSIS-I
Structure
10.1 Introduction
10.2 Objectives
10.3 Hypothesis
10.3.1 Null and Alternative Hypothesis
10.1 INTRODUCTION
In the previous unit, we discussed a very important concept of 'Statistical Inference' known as the sampling distribution. We also discussed some basic concepts: population, sample, parameter, statistic, estimator and estimate. We saw that the probability distribution of a sample statistic is known as its sampling distribution. In the previous unit we also discussed the sampling distributions of the mean and the proportion, as well as some of the exact sampling distributions, i.e., the t, chi-square and F distributions, along with their properties and applications. In this unit we shall discuss the concept and meaning of testing of hypotheses. It allows us to draw significant conclusions about the assumption(s) made for a population parameter on the basis of sample-based estimates. The assumptions which are made for testing purposes are known as statistical hypotheses.
In testing a hypothesis, we decide about the truth of a statement made about a population parameter on the basis of observations. For example, a doctor may want to know whether a new medicine is really effective for controlling blood pressure; a manager may want to know whether one brand of electric bulb is better than another; a psychologist may wish to know whether the IQ of students studying in an open university is up to the standard of the students of an IIT. Such statistical decisions can be accepted or rejected only if they are proved or disproved with the help of sample data taken from the concerned population.
Generally, the calculated value of the sample statistic differs from the assumed value of the population parameter, since the statistic is based on a part of the population and not on the entire population. A question then arises: is this difference actually significant, or does it appear only due to fluctuations of sampling? A small difference may sometimes occur due to a real cause or merely due to sampling error.
It therefore seems that some theoretical background should be developed for testing the correctness or incorrectness of the assumed statement (hypothesis), while the statements are to be verified on the basis of some sample values (statistics), which themselves may vary from sample to sample and thus may introduce some kind of error. The procedure applied for this purpose is generally known as testing of hypotheses.
10.2 OBJECTIVES
After studying this unit, you should be able to
define a hypothesis, null hypothesis, alternative hypothesis, simple and composite hypothesis;
define and explain the type-I and type-II errors;
explain the concepts of critical region, level of significance and degrees of freedom;
describe the procedure of testing a hypothesis; and
perform tests of hypotheses for large samples for the population mean and for the difference between two population means.
10.3 HYPOTHESIS
In our daily life, each one of us comes across problems of testing assumptions of some kind that we may have in mind. For example, a housewife cooking rice assumes that if only a few grains of rice are checked and found well-cooked, then the whole quantity of rice may be regarded as fully cooked; thus she verifies and tests her belief on the basis of a few observations and not the whole quantity of rice. A manufacturer of some goods may think that advertising his product will increase sales, and may decide the truth of this idea on the basis of market records. A student, on the basis of records of past students, may believe that certain subjects might help him complete a bachelor's degree course, and so on. Each of these is an example where a statement (assumption or belief) about some system needs to be verified, that is, proved or disproved, on the basis of some data or information.
The above discussion and examples now help us to define a statistical hypothesis. It may be defined in a number of ways: (i) a statistical hypothesis is an assumption about a population or about its parameter; this assumption may or may not be true; (ii) a tentative theory about the natural world; a concept that is not yet verified but that, if true, would explain certain facts or phenomena; (iii) a supposition, proposition or principle which is supposed or taken for granted in order to draw a conclusion or inference in proof of the point in question.
A statistical hypothesis stands in contrast to a simple assertion. If one makes a statement and believes it to be true without verification, it is an assertion and not a hypothesis. For example, the statement "It is cloudy weather outside" is an assertion, because the person believes it to be true and wants other persons also to believe it; whereas the statement "I think it is cloudy weather outside" is a hypothesis, since the speaker allows for the possibility that the statement may be proved or disproved by others. In fact, the intention is to determine at a later stage the truth or falsity, or probability, of the statement.
In many cases, it is necessary to assume some theoretical probability
distributions for testing the hypothesis, particularly when there is no direct
knowledge of the population from which the observations are taken. For
example, a factory owner produces a product of weight exactly 10 g. The producer finds that the weight of the product has reduced a little at present. The owner is worried whether the weight reduction is due to unavoidable causes or due to poor quality of the raw material. He may then frame the hypothesis of no change in weight and use a suitable theoretical probability distribution for the variations in weight.
H₀: θ = θ₀;  H₁: θ = θ₁
For instance,
H₀: the weight of a tomato is 130 gm (i.e. H₀: µ = 130 gm)
H₁: the weight of a tomato is 200 gm (i.e. H₁: µ = 200 gm)
The acceptance of H₀ or H₁ completely specifies the population parameter (tomato weight).
Case II: Simple null versus composite alternative hypothesis, such as
For example,
H₀: the weight of a tomato is 130 gm (i.e. H₀: µ = 130 gm)
H₁: the weight of a tomato is above 130 gm (i.e. H₁: µ > 130 gm)
The acceptance of H₀ specifies the population completely, but the acceptance of H₁ does not, because a value greater than 130 is not a unique value.
and conclude: reject H₀ if t₁₀ > 750, accept H₀ if t₁₀ ≤ 750. Clearly, this indicates that the basic structure of the procedure of testing a hypothesis needs two regions.
The region of rejection has a pre-fixed area, denoted by α, corresponding to a cut-off value in the probability distribution of the test statistic. This pre-fixed probability area α is also called the size of the test, the level of significance, or the probability of type-I error.
From the table we see that the decision of rejecting H₀ when, in fact, it is true is called type-I error. The probability of committing type-I error is called the "size of the test" or "level of significance" and is generally denoted by α. Thus,

α = P[Rejecting H₀ when H₀ is true]
  = P[Rejecting H₀ | H₀ true]
  = P[x ∈ ω | H₀]

where x stands for the value of the statistic under study and ω for the rejection region. Obviously, then, we have

1 − α = 1 − P[Rejecting H₀ | H₀ true] = P[Accepting H₀ | H₀ true] = P[correct decision]

(1 − α) is the probability of a correct decision, and it corresponds to the concept of the 100(1 − α)% confidence interval used in estimation. Theoretically, test procedures are constructed so that the risk of rejecting H₀ when it is true is small.
Type-II error:
The decision of accepting H₀ when it is false (that is, when H₁ is true) is called type-II error. The probability of committing type-II error is denoted by β. Thus,

β = P[Accepting H₀ when H₀ is false]
  = P[Accepting H₀ when H₁ is true]
  = P[Accepting H₀ | H₁ true] = P[x ∈ ω̄ | H₁]
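Both error probabilities can be computed in closed form for a right-tailed test on a normal mean with known σ. The numbers below (n, σ, the hypothesised means and the 1.645 cut-off for α = 0.05) are illustrative assumptions, not taken from the unit:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Test H0: mu = 0 against H1: mu = 1, using the mean of n observations
# from a normal population with known sigma.
n, sigma = 25, 2.0
se = sigma / math.sqrt(n)   # standard error of the sample mean = 0.4
c = 1.645 * se              # cut-off chosen so that alpha = 0.05

alpha = 1 - phi(c / se)     # P(reject H0 | H0 true) = type-I error
beta = phi((c - 1.0) / se)  # P(accept H0 | H1 true) = type-II error

print(alpha)  # 0.05 (by construction of the cut-off)
print(beta)   # roughly 0.196 for these numbers
```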
If this is so, then actually there are now only 49 values which are
independent, because any one value (say, the last one) can be obtained using
the equation
x50 = 1000 − Σ xi (summing over i = 1 to 49)
It implies that by putting one condition on the sample values we lose one
degree of freedom and, in general, if for a sample of size n there are k
restrictions (k < n), the degrees of freedom will be (n − k).
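A minimal sketch of this point, assuming Python and 49 arbitrarily chosen values: once the sum is constrained, the last value is no longer free.

```python
# If 50 sample values are constrained to sum to 1000, only 49 of them
# can be chosen freely -- the last one is forced by the constraint.
free_values = [20.0] * 49          # 49 freely chosen values (assumed data)
x50 = 1000.0 - sum(free_values)    # the 50th value is determined
sample = free_values + [x50]
print(x50, sum(sample))
```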
3. Define a Statistic.
……………………………………………………………………………….
……………………………………………………………………………….
7. Describe the degrees of freedom with an example.
……………………………………………………………………………….
……………………………………………………………………………….
……….………………………………………………………………………
………………….……………………………………………………………
Figure 10.4 shows the position of the critical values and critical regions
for a right-tailed test.
Likewise, if the null hypothesis is formulated for testing the equality of
two parameters as H0: θ1 = θ2, we may have the alternative hypothesis either
H1: θ1 > θ2 or H1: θ1 < θ2. Then, in view of the discussions made above and
from the above table, the test would be a right-tailed test when H1: θ1 > θ2
and a left-tailed test when H1: θ1 < θ2, with level of significance α in each
case. Further, if in conjunction with H0: θ1 = θ2 we have H1: θ1 ≠ θ2, then
it would be a two-tailed test with level of significance α/2 on each side.
9. Describe one-tailed and two-tailed tests with examples.
……………………………………………………………………………….
……………………………………………………………………………….
……….………………………………………………………………………
………………….……………………………………………………………
In Statistics, it has been theoretically proved that if the sample taken from
any population is of large size, whatever be the parent distribution, the
sampling distribution of a statistic based upon this sample can be
approximated by the normal distribution. Therefore, one can apply normal
distribution-based test procedures to obtain the cut-off points at the
pre-defined level of significance in such cases. Accordingly, such tests are
called large sample tests.
Suppose x1, x2, …, xn is a random sample of size n selected from a population
having unknown parameter θ. Generally, for ensuring that n is large enough,
we assume n ≥ 30. With this assumption, suppose we require a test procedure
for the significance of the parameter θ. In this Section, we shall describe a
number of such tests, which are as follows:
H0: μ = μ0 against H1: μ ≠ μ0 or H1: μ > μ0 or H1: μ < μ0
If x1, x2, …, xn is a random sample of size n ≥ 30 taken from a population
with mean μ and finite variance σ2, then by the central limit theorem the
sample mean is asymptotically normally distributed with mean μ and variance
σ2/n, irrespective of the parent population being normal or non-normal.
Therefore, for the mean x̄ = (1/n) Σ xi, i = 1, 2, …, n, we have
E(x̄) = μ and Var(x̄) = V(x̄) = σ2/n
Table 10.1
Nature of Test     Critical (Tabulated) Value at Level of Significance
                   α = 0.01       α = 0.05        α = 0.10
Right-Tail Test    zα = 2.33      zα = 1.645      zα = 1.28
Left-Tail Test     zα = −2.33     zα = −1.645     zα = −1.28
Two-Tail Test      zα/2 = 2.58    zα/2 = 1.96     zα/2 = 1.645
One-Tail Test: When H0: µ = µ0 and either H1: µ > µ0 or H1: µ < µ0.
If Z ≥ zα and H1: µ > µ0 (right-tail test), the null hypothesis is rejected
and significant evidence of departure from the standard value is confirmed.
Otherwise, for Z < zα, H0 is not rejected, with the conclusion that no
significant evidence exists against the null hypothesis. On the other hand,
if Z ≤ −zα and H1: µ < µ0 (left-tail test), we reject H0; otherwise we do
not reject for Z > −zα.
Two-Tail Test: When H0: µ = µ0 and H1: µ ≠ µ0
Z = (x̄ − μ0)/(σ/√n) = (72 − 70)/(4/√100) = 2/0.4 = 5.0
Since the calculated value of the test statistic Z is greater than 3, we
reject our null hypothesis H0 at every level of significance (i.e., at both
5% and 1%). We conclude that the sample is not from a population with mean
70 and standard deviation 4.
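The computation above can be cross-checked with a short sketch (Python is assumed; the document itself gives no code, and the critical values are those of Table 10.1).

```python
import math

def z_test_mean(xbar, mu0, sigma, n):
    """Large-sample Z statistic for H0: mu = mu0 with known sigma."""
    return (xbar - mu0) / (sigma / math.sqrt(n))

# Example 1 figures from the text: xbar = 72, mu0 = 70, sigma = 4, n = 100.
z = z_test_mean(72, 70, 4, 100)
reject_5pct = abs(z) > 1.96   # two-tail critical value at 5%
reject_1pct = abs(z) > 2.58   # two-tail critical value at 1%
print(z, reject_5pct, reject_1pct)
```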
Example 2: A manufacturer of ball point pens claims that a certain pen he
manufactures has a mean writing-life of 500 pages. A purchasing agent
selects a sample of 100 pens and puts them to the test. The mean writing-life
of the sample is found to be 490 pages with standard deviation 50 pages.
Should the purchasing agent reject the manufacturer's claim at 1% level of
significance?
CHECK YOUR PROGRESS 4
Note: i) Check your answers with those given at the end of the unit.
10) A sample of 900 bolts has a mean length 3.4cm. Is the sample
regarded to be taken from a large population of bolts with mean 3.25
cm. and standard deviation 2.61 cm. at 5% level of significance?
11) A big company uses thousands of CFL lights every year. The brand
that the company has been using in the past has an average life of 1200
hours. A new brand is offered to the company at a price lower than
they are paying for the old brand. Consequently, a sample of 100 new
brand CFL lights is tested and yields an average life of 1180 hours
with standard deviation 90 hours. Should the company accept the
new brand?
Fig. 10.6: Two independent samples x11, x12, …, x1n1 and x21, x22, …, x2n2
drawn from two normal populations, with sample means x̄1 and x̄2 distributed
as N(μ1, σ1²/n1) and N(μ2, σ2²/n2) respectively.
Thus, the test statistic is
Z = [(x̄1 − x̄2) − E(x̄1 − x̄2)] / √V(x̄1 − x̄2) ~ N(0, 1).
We assume that the population variances σ1², σ2² are known to us. Thus we have
E(x̄1 − x̄2) = E(x̄1) − E(x̄2) = μ1 − μ2
and
Var(x̄1 − x̄2) = Var(x̄1) + Var(x̄2) = σ1²/n1 + σ2²/n2
For H1: µ1 > µ2 (right-tail test), if Z < zα the null hypothesis is not
rejected and we conclude that there is no significant evidence against the
equality of the means of the two populations, so both means are treated as
the same at the α level of significance. Otherwise, if Z ≥ zα, H0 is
rejected at the α level of significance, the difference between the two
means µ1 and µ2 is established, and we conclude that one mean is greater
than the other.
For H1: µ1 < µ2 (left-tail test), if Z ≤ −zα the null hypothesis is rejected
at the α level of significance; otherwise, for Z > −zα, H0 is not rejected.
For Two-Tail Test: When H0: θ1 = θ2 and H1: θ1 ≠ θ2
For H1: θ1 ≠ θ2, if |Z| < zα/2 the null hypothesis is not rejected, and for
|Z| ≥ zα/2 the null hypothesis is rejected at the α level of significance.
When the two populations have a common known standard deviation σ, the test
statistic becomes
Z = (x̄1 − x̄2) / [σ √(1/n1 + 1/n2)] ~ N(0, 1)
When σ is unknown, the sample variances
s1² = [1/(n1 − 1)] Σ (x1i − x̄1)² ; s2² = [1/(n2 − 1)] Σ (x2i − x̄2)²
are pooled as
σ̂² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
and Z = (x̄1 − x̄2) / √[σ̂² (1/n1 + 1/n2)] ~ N(0, 1)
Example 3: In two samples of women from Punjab and Tamil Nadu, the
mean height of 1000 and 2000 women are 67.5 and 68.0 inches respectively.
Can the samples be regarded as drawn from the same population having
standard deviation 2.5 inches? In other words, can the mean height of Punjab
and Tamil Nadu women be regarded as same?
Solution: Given n1 = 1000, n2 = 2000, x̄1 = 67.5, x̄2 = 68.0, σ = 2.5 inches.
We wish to test the null hypothesis that the samples are drawn from the same
population, that is, the mean heights of Punjab and Tamil Nadu women are the
same:
H0: μ1 = μ2 against H1: μ1 ≠ μ2
It is a two-tail test and the test statistic Z is
Z = (x̄1 − x̄2) / [σ √(1/n1 + 1/n2)]
  = (67.5 − 68.0) / [2.5 √(1/1000 + 1/2000)] = −5.16
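Example 3's computation can be reproduced in a short sketch (Python assumed; the figures are those given in the example).

```python
import math

def z_two_means_common_sigma(x1bar, x2bar, sigma, n1, n2):
    """Z statistic for H0: mu1 = mu2 when both populations share a known sigma."""
    se = sigma * math.sqrt(1.0 / n1 + 1.0 / n2)
    return (x1bar - x2bar) / se

# Heights example: xbar1 = 67.5, xbar2 = 68.0, sigma = 2.5, n1 = 1000, n2 = 2000.
z = z_two_means_common_sigma(67.5, 68.0, 2.5, 1000, 2000)
print(round(z, 2))   # the text reports |Z| = 5.16
```

Since |Z| far exceeds the two-tail critical value 1.96, the equality of the two means is rejected at the 5% level.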
We wish to test the null hypothesis that the mean scores under the two
teaching methods are statistically equal, that is, there is no difference
between the two teaching methods, or
H0: μ1 = μ2 against H1: μ1 ≠ μ2
Z = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
  = (80.4 − 78.3) / √[(12.8)²/50 + (20.5)²/100]
  = 2.1 / √(3.28 + 4.20) = 0.77
CHECK YOUR PROGRESS 5
Note: i) Check your answers with those given at the end of the unit.
12) Two brands of electric bulbs are quoted at the same price. A buyer
tested a random sample of 200 bulbs of each brand and found the
following information:
          Mean Life (hours)    S.D. (hours)
Brand A        1300                 41
Brand B        1280                 46
Is there a significant difference in the quality (life) of the two
brands of electric bulbs at 1% level of significance?
13) A marketing company has undertaken an advertisement campaign for
a popular breakfast food and claimed that it is an improved product. A
survey is conducted to find out the monthly demand of 100 consumers
before and after the campaign. We have following information.
Mean S.D.
Before 120 25
After 138 36
Analyze the above data and test whether the campaign was successful.
Consider a population of size N in which M units possess an attribute A and
the remaining (N − M) units do not. A sample of size n (n < N) drawn from
this population contains x units possessing A and (n − x) units not
possessing it.
One-Tail Test: When testing H0: P = P0 against H1: P > P0 (right-tail test),
we reject H0 if Z > zα. This is equivalent to concluding that the population
proportion possessing A has significantly increased compared to the earlier
P0. On the contrary, when H0: P = P0 is tested against H1: P < P0 (left-tail
test), we reject H0 if Z < −zα at the α level of significance. It indicates
that there might be a significant reduction in the proportion in the present
scenario.
Suppose there are two populations, each having persons (or items) possessing
an attribute A. We are interested in testing whether the proportions
possessing A in both populations are the same.
Let a sample of size n1 from the first population have x1 units, and a
sample of size n2 from the second have x2 units, possessing the attribute A
(say, A = smoking habit). Therefore, from the theory of the sampling
distribution of (p1 − p2), the expected values, variances and standard
errors of the statistics are, respectively:
p1 = x1/n1 ; p2 = x2/n2
E(p1) = P1 ; E(p2) = P2 ; E(p1 − p2) = P1 − P2
V(p1) = P1Q1/n1 ; V(p2) = P2Q2/n2 ; V(p1 − p2) = P1Q1/n1 + P2Q2/n2
SE(p1) = √(P1Q1/n1) ; SE(p2) = √(P2Q2/n2) ;
SE(p1 − p2) = √(P1Q1/n1 + P2Q2/n2)
Under the assumptions of large population and sample sizes, due to the
central limit theorem the statistic (p1 − p2) is distributed asymptotically
normally with mean E(p1 − p2) and variance V(p1 − p2), and the Z-statistic is
Z = [(p1 − p2) − E(p1 − p2)] / √V(p1 − p2) ~ N(0, 1)
Under H0: P1 = P2 = P (given value) and H1: P1 ≠ P2, the test statistic Z is
Z = (p1 − p2) / √[PQ (1/n1 + 1/n2)] ~ N(0, 1), where Q = 1 − P.
For H1: P1 ≠ P2, if |Z| < zα/2 the null hypothesis is not rejected, and for
|Z| ≥ zα/2 the null hypothesis is rejected at the α level of significance.
When P is not given, it is estimated by the pooled sample proportion and the
test statistic becomes
Z = (p1 − p2) / √[p̂q̂ (1/n1 + 1/n2)], where p̂ = (n1p1 + n2p2)/(n1 + n2)
and q̂ = 1 − p̂.
In such a case, for the alternative hypothesis H1: P1 > P2 the test statistic
Z is used (instead of |Z|) and H0: P1 = P2 is rejected when Z > zα
(right-tail test). Otherwise, if H1: P1 < P2, then H0: P1 = P2 is rejected
when Z < −zα (left-tail test).
From the data, the corresponding sample proportions are obtained as
p1 = x1/n1 = 55/100 = 0.55 and p2 = x2/n2 = 45/90 = 0.50
p̂ = (n1p1 + n2p2)/(n1 + n2) = (x1 + x2)/(n1 + n2) = (55 + 45)/(100 + 90)
  = 10/19 ; q̂ = 1 − 10/19 = 9/19.
As per the assumption of H1, it is a two-tail test. The value of the test
statistic Z is
Z = (p1 − p2) / S.E.(p1 − p2) = (p1 − p2) / √[p̂q̂ (1/n1 + 1/n2)]
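The pooled two-proportion statistic can be sketched as follows (Python assumed; the counts x1 = 55 of n1 = 100 and x2 = 45 of n2 = 90 are the worked figures above).

```python
import math

def z_two_proportions(x1, n1, x2, n2):
    """Z statistic for H0: P1 = P2 using the pooled proportion estimate."""
    p1, p2 = x1 / n1, x2 / n2
    p_hat = (x1 + x2) / (n1 + n2)    # pooled estimate of the common P
    q_hat = 1.0 - p_hat
    se = math.sqrt(p_hat * q_hat * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se

z = z_two_proportions(55, 100, 45, 90)
print(round(z, 2))
```

With |Z| well below the two-tail critical value 1.96, the null hypothesis of equal proportions would not be rejected at the 5% level.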
Solution: Here the population is the produced articles, and let attribute A
stand for defective articles. Let P1 and P2 be, respectively, the proportions
of defective articles before and after overhauling. Since the null hypothesis
always negates the claim, we assume that the proportions of defective
articles are the same before and after overhauling, that is, H0: P1 = P2 = P
(say) against H1: P1 > P2 (which means that the proportion of defective
articles after overhauling is smaller than before).
p1 = x1/n1 = 80/400 = 0.20 and p2 = x2/n2 = 45/300 = 0.15
2. Type-I and Type-II errors, Critical region, One tailed and two tailed test;
6. Large sample tests for Population Mean and difference between two
Population Means; and
7. Large sample tests for Population Proportion and difference between two
Population Proportions.
10.8 KEY WORDS
Null Hypothesis: A statistical hypothesis that usually asserts that nothing
special is happening with respect to some characteristic of the underlying
population.
One-tailed (or directional) Test: The rejection region is located in just one
tail of the sampling distribution.
Z = (3.40 − 3.25) / (2.61/√900) = 1.73
Since |Z| < 1.96, we conclude that H0 is not rejected at 5% level of
significance.
11) We are given that
n 2 200, x 2 1280, σ 2 46
We wish to test the null hypothesis
H0: μ1 = μ2 against H1: μ1 ≠ μ2 (i.e., the mean lives of the two brands
differ)
n 2 100, x 2 138, s 2 36
We wish to test the null hypothesis
H0: μ1 = μ2 against H1: μ1 ≠ μ2
The calculated value |Z| = 0.80 is smaller than the tabulated value zα/2 =
2.58 at 1% level of significance, so we do not reject the null hypothesis.
P0 = 80/100 = 0.80 ; Q0 = 20/100 = 0.20 ; p = x/n = 18/20 = 0.9
100 100 n 20
Given
p1 = x1/n1 = 50/100 = 0.50 and p2 = x2/n2 = 40/100 = 0.40
Also p̂ = (n1p1 + n2p2)/(n1 + n2) = (50 + 40)/(100 + 100) = 0.45 and
q̂ = 1 − p̂ = 0.55
Z = (p1 − p2) / S.E.(p1 − p2) = (p1 − p2) / √[p̂q̂ (1/n1 + 1/n2)]
  = (0.50 − 0.40) / √[0.45 × 0.55 × (1/100 + 1/100)] = 0.10/0.070 = 1.42
The calculated value |Z| = 1.42 < 2.58 (the critical value for a two-tail
test at 1% level of significance), so the null hypothesis is not rejected.
UNIT 11 STATISTICAL ANALYSIS-II
Structure
11.1 Introduction
11.2 Objectives
11.3 Procedure for Small Sample Test
11.3.1 Test for Population Mean
11.3.2 Test for Difference of Two Population Means
11.3.3 Paired t-test
11.5 F-Test
11.5.1 Test for equality of Two Variances
11.1 INTRODUCTION
The entire large sample theory, as applied to testing of hypothesis in the
previous unit, is based on the assumption that sample size being large enough
(generally more than 30) the sampling distribution of the test statistic can
always be approximated by the normal distribution owing to the results of
central limit theorem.
You might have observed there that due to this reason, the test statistics used
everywhere was Z, where Z stands for the normal distribution with mean zero
and variance one. However, if the sample size n is small (practically less
than 30), the sampling distributions of many test statistics are far from
normal, and as such the normal distribution is not the basis of those tests. For
small samples, therefore, the assumption of normality of the distributions of
test statistic is not valid and exact sampling distribution of the test statistic
has to be known. Fortunately, in the literature of sampling distributions,
many such distributions are obtained which can be applied for small sample
problems of testing of hypotheses. In this Unit, we will study various tests of
significance based on these statistics.
The small sample tests for testing the significance of population mean and the
test for the equality of two population means are very frequent in real life
problems. We shall discuss here the relevant testing procedures, namely,
t-tests, derived on the basis of the t-distribution.
In testing of hypothesis, or even in the theory of estimation, we generally
assume that the random variable under study follows a particular probability
distribution, such as, normal, binomial, Poisson distribution, etc., but such
assumptions need to be verified as to whether our assumptions are true or not.
As we have already discussed and seen, the genesis of the problem of testing
hypotheses lies in this fact.
In many testing problems we frequently deal with certain variables (that is,
quantitative characteristics), but in many real-world situations of business
and other areas the collected data are qualitative in nature (that is,
attributes), classified into different categories or groups
according to one or more attributes. Such types of data are known as
“categorical data”. For example, the number of persons in a sample may be
categorized into different categories of their age, income, job, etc. Then a
question arises: "How can the inference problems, particularly the problems
of testing of hypothesis, arising out of categorical data, be tackled?" In
this unit, we shall see that the chi-square test, based on the chi-square
sampling distribution, is helpful for such problems. We shall discuss here
the two most widely used chi-square tests.
Sometimes, it is required to test whether the variances of two different
populations are equal or not; similar to the situation of testing the equality of
population means discussed in the previous unit. We shall discuss this
problem and will see how the F-distribution can be applied for this purpose.
11.2 OBJECTIVES
After studying this unit, you should be able to:
describe the small sample test procedures for single population mean
and difference of two population means on the basis of t-distribution;
explain the chi-square test for the goodness of fit and independence
of attributes; and
H0: μ = μ0 against H1: μ ≠ μ0 or H1: μ > μ0 or H1: μ < μ0
Let X be the statistic computed on the basis of the random sample x1, x2, …,
xn, and let E(X) and V(X) respectively be the mean and variance of X. Then,
under the null hypothesis, the test statistic t is given by
t = [X − E(X)] / √V̂(X) = (x̄ − μ0) / (s/√n) ~ t with (n − 1) d.f.
where E(X) is the sample mean x̄ = (1/n) Σ xi and
s² = [1/(n − 1)] Σ (xi − x̄)² is an unbiased estimate of the unknown
population variance σ². It is well known that the statistic t follows a
t-distribution with (n − 1) degrees of freedom (df).
For testing H0: μ = μ0 (= 60) against H1: μ > 60 dozens (right-tail test),
the test statistic t is:
t = (x̄ − μ0) / (s/√n) = (64 − 60) / (6/√16) = 4/1.5 = 2.67
The critical value (or tabulated value) of t for (n − 1) = 15 df at 5% level
of significance (one-tail, right-hand test) is t15,0.05 = 1.753 (obtained
from the t-table). Since the calculated value is greater than the tabulated
value, we reject H0 at 5% level of significance; that is, H1 is accepted and
we conclude that the advertising campaign is effective in increasing the
bath-soap sales in the market.
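The one-sample t computation can be sketched briefly (Python assumed; the summary figures x̄ = 64, μ0 = 60, s = 6, n = 16 are those of the example).

```python
import math

def t_test_mean(xbar, mu0, s, n):
    """Small-sample t statistic for H0: mu = mu0, with (n-1) df."""
    return (xbar - mu0) / (s / math.sqrt(n))

t = t_test_mean(64, 60, 6, 16)
print(round(t, 2))   # compare with the tabulated t for 15 df (right-tail)
```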
Example 2: The mean share price of real estate companies is Rs. 70. After a
month, the share prices of all companies have changed. A sample of 10 real
estate companies was taken and their share prices were noted as: 70, 76, 75,
69, 70, 72, 68, 65, 75, 72. Test whether the mean share price is still the
same.
Solution: Given sample size n = 10 (< 30), so it is a small sample case and
we wish to test the null hypothesis that the population mean is still 70.0
despite all the changes, that is,
S. No.    x    d = (x − 70)    d²
1        70         0           0
2        76         6          36
3        75         5          25
4        69        −1           1
5        70         0           0
6        72         2           4
7        68        −2           4
8        65        −5          25
9        75         4          16
10       72         2           4
Total              11         115
x̄ = a + Σd/n = 70 + 11/10 = 71.1
s² = [1/(n − 1)] [Σd² − (Σd)²/n] = (1/9) [115 − 121/10]
   = (1/9) (115 − 12.1) = 11.43
s = √11.43 = 3.38
Note: i) Check your answers with those given at the end of the unit.
2) Describe the procedure of small sample test for single population mean.
11.3.2 Tests for Equality of Two Population Means
Suppose we have to compare the means of two groups when the sample sizes
taken from these two groups are small. For example, the performance of
English-medium and Hindi-medium school children in a mathematics paper is to
be compared, or the performance of two brand-promotion campaigns for a
product in two different markets is to be compared.
Let us assume that two independent random samples x1, x 2 , ..., x n1 and
y1 , y2 , ..., yn2 of sizes n1 and n2 respectively are drawn from two normal
populations N(1,σ12) and N(2,σ22). Further, suppose the variances of the
both populations are unknown but are equal, i.e., σ12 = σ22 = σ2 (say) where σ2
is an unknown value.
In this situation, we want to test the null hypothesis
H0: μ1 = μ2 against H1: μ1 ≠ μ2 or H1: μ1 > μ2 or H1: μ1 < μ2
Here, x̄ and ȳ are the means of the first and second samples respectively,
and
s² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2), where
s1² = [1/(n1 − 1)] Σ (x − x̄)² and s2² = [1/(n2 − 1)] Σ (y − ȳ)²
Diet A                                    Diet B
x     d1 = (x − a);    d1²               y     d2 = (y − b);    d2²
      a = 12                                   b = 16
12         0            0                14        −2            4
8         −4           16                13        −3            9
14         2            4                12        −4           16
16         4           16                15        −1            1
13         1            1                16         0            0
12         0            0                14        −2            4
8         −4           16                18         2            4
14         2            4                17         1            1
10        −2            4                21         5           25
9         −3            9                15        −1            1
Σd1 = −4    Σd1² = 70                    Σd2 = −5    Σd2² = 65
x̄ = a + Σd1/n1 = 12 + (−4)/10 = 11.6 ; ȳ = b + Σd2/n2 = 16 + (−5)/10 = 15.5
s² = [1/(n1 + n2 − 2)] [(Σd1² − (Σd1)²/n1) + (Σd2² − (Σd2)²/n2)]
   = (1/18) [(70 − 16/10) + (65 − 25/10)] = (68.4 + 62.5)/18 = 7.27
s = √7.27 = 2.70
Putting these values in the test statistic t, we get
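The whole two-sample computation can be reproduced from the raw diet data in a short sketch (Python assumed; the data are those of the worked example above).

```python
import math

# Diet data from the worked example.
diet_a = [12, 8, 14, 16, 13, 12, 8, 14, 10, 9]
diet_b = [14, 13, 12, 15, 16, 14, 18, 17, 21, 15]

n1, n2 = len(diet_a), len(diet_b)
xbar = sum(diet_a) / n1
ybar = sum(diet_b) / n2

# Pooled variance s^2 = [sum(x - xbar)^2 + sum(y - ybar)^2] / (n1 + n2 - 2)
ss_a = sum((x - xbar) ** 2 for x in diet_a)
ss_b = sum((y - ybar) ** 2 for y in diet_b)
s2 = (ss_a + ss_b) / (n1 + n2 - 2)

# t statistic with (n1 + n2 - 2) df
t = (xbar - ybar) / math.sqrt(s2 * (1.0 / n1 + 1.0 / n2))
print(round(xbar, 1), round(ybar, 1), round(s2, 2), round(t, 2))
```

The sketch recovers x̄ = 11.6, ȳ = 15.5 and s² = 7.27 exactly as in the text.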
Example 4: The means of two random samples of sizes 10 and 8 are 210.40
and 208.42 respectively. The sum of squares of the deviations from means is
26.94 and 24.50, respectively. Can the samples be considered to have been
drawn from normal populations having equal means?
So
s² = [1/(n1 + n2 − 2)] [Σ (xi − x̄)² + Σ (yi − ȳ)²]
   = (1/16) (26.94 + 24.50) = 51.44/16 = 3.215
s = √3.215 = 1.79
We wish to test the null hypothesis that both the samples are drawn from the
normal populations having the same mean. Let μ1 and μ2 be means of the first
and second populations respectively, then the null hypothesis will be
H0: 1 = 2 against H1: 1 ≠2.
CHECK YOUR PROGRESS 3
Drug A 5 8 7 10 9 6 -
Drug B 9 10 15 12 14 8 12
Do the two drugs differ significantly with regard to their mean weight
increment?
7) To test the effect of fertilizer on wheat production, 26 plots of land
with equal areas were chosen. Half of these plots were treated with
fertilizer and the other half were untreated. Other conditions were the
same. The mean yield of wheat on the untreated plots was 4.6 quintals
with a standard deviation of 0.5 quintals, while the mean yield of the
treated plots was 5.0 quintals with a standard deviation of 0.3 quintals.
Can we conclude that there is significant improvement in wheat
production because of the effect of fertilizer at 1% level of
significance?
H0: μ1 = μ2, or μD = μ1 − μ2 = 0
where di = xi − yi, i = 1, 2, …, n, and similarly d̄ = (1/n) Σ di = x̄ − ȳ
s_d² = [1/(n − 1)] Σ (di − d̄)² = [1/(n − 1)] [Σ di² − (Σ di)²/n]
We need to calculate only Σdi , Σdi2 for obtaining the value of t. Then we
obtain the critical value (or cut-off value or tabulated value) of test statistic t
from the t-table.
The methods of taking decision about the rejection of H0 or not rejecting H0,
on the basis of calculated and tabulated values of t-statistic, are exactly
similar as mentioned in sub-section 4.2.2 for H1: μ1 > μ2 ; μ1 < μ2 and μ1 ≠ μ2
except that here degrees of freedom (df) is (n – 1) instead of (n1 + n2-2).
Note: Sometimes data are recorded in the form of increments/decrements only,
like: weight of a child increased by 2 kg, marks increased by 3%, temperature
reduced by 3°C, etc. Then, for n objects, the data form the set [d1, d2, d3,
…, dn]. Accordingly, our hypothesis will be
H0: μD = 0 against H1: μD > 0 or H1: μD < 0
The sample with the larger mean may be treated as the first sample and hence
denoted by x. Therefore, before attempting the paired t-test it has to be
checked whether the pairs (xi, yi) are given or only increments/decrements
are given in the data set.
Note: As good practice, if we talk about increments we take di = (yi − xi)
and the alternative hypothesis H1: μD > 0 (one-sided right-tail test). For
decrements we take di = (xi − yi) and the same alternative H1: μD > 0.
Example 5: A group of 12 children was tested to find out how many digits
they would repeat from memory after hearing them once. They were given
practice session for this test. Next week they were retested. The results
obtained were as follows:
Child no.:      1  2  3  4  5  6  7  8  9  10  11  12
Recall before:  6  4  5  7  6  4  3  7  8   4   6   5
Recall after:   6  6  7  7  8  5  5  9  9   7   8   7
Child    Before (x)    After (y)    d = x − y    d²
1            6             6            0         0
2            4             6           −2         4
3            5             7           −2         4
4            7             7            0         0
5            6             8           −2         4
6            4             5           −1         1
7            3             5           −2         4
8            7             9           −2         4
9            8             9           −1         1
10           4             7           −3         9
11           6             8           −2         4
12           5             7           −2         4
Total                               Σd = −19   Σd² = 35
d̄ = Σd/n = −19/12 = −1.58
s_d² = [1/(n − 1)] Σ (di − d̄)² = [1/(n − 1)] [Σ di² − (Σ di)²/n]
s_d² = (1/11) [35 − (−19)²/12] = (1/11) (35 − 30.08) = (1/11)(4.92) = 0.45
s_d = √0.45 = 0.67
Note: i) Check your answers with those given at the end of the unit.
The chi-square ( χ 2 ) test for goodness of fit was given by Karl Pearson in
1900, which is the oldest non-parametric method of testing of hypothesis. In
this test, given a set of observed frequencies for values of a discrete variable,
we test whether the data follows a specific probability distribution or not.
Under the assumption of a particular distribution for the given data (as for
example, binomial, Poisson, etc.), which is considered to be the null
hypothesis to be tested; expected frequencies for the given variable values are
obtained which would be expected to occur if the data follows the assumed
distribution. This test is known as “goodness of fit test” since the aim is to
observe how close are the given values of the variable to the frequencies
yielded by the assumed probability distribution, or, in other words, we judge
the fitness of the data to the assumed theoretical distribution. This is a non-
parametric test since we do not assume any specific probability distribution
beforehand for the data.
Assumptions for the test:
1. Sample observations are independent.
2. The measurement scale is at least nominal.
3. The observations may be classified into non-overlapping categories.
Let a random sample of size N be drawn from a population with unknown
distribution of k characteristics and the data categorized into k groups or
classes. Also, let O1, O2, …, Ok be the observed frequencies (as given in the
data set) and E1, E2, …,Ek be the corresponding expected frequencies as are
expected if the data follows the assumed distribution. We generally put a
linear constraint of equality on the observed and expected frequencies,
namely Σ Oi = Σ Ei.
To perform the chi- square goodness of fit test the steps are as follows:
Step 1: First of all, we form the null and alternative hypothesis. Null
hypothesis is that the given data set follows the given probability distribution
or the pattern specified in the question. Alternative hypothesis contradicts this
assumption.
Step 2: To compute the expected frequencies corresponding to the given
values of the variable under the assumed distribution (mentioned in the null
hypothesis), we compute from the sampled data whatever sample statistic(s)
are needed to estimate the corresponding population parameter(s), and take
each estimate as the theoretical value of the corresponding parameter. This,
in fact, provides an estimated value of the parameter, which is generally
not known in most cases. However, if
value(s) of parameters are given, they can be used directly for the purpose
and there is no need to estimate it on the basis of the given sample data. For
instance, let mean (and/or variance) of the theoretical distribution is needed
for calculating the expected frequencies, we compute mean (and/or variance)
of the sample for this purpose.
Step 3: In the next step, we find the probability that an observation falls
(belongs) to a particular category (to a particular value of the variable) using
the assumed probability distribution.
302
Step 4: The calculated probabilities are then used to find the corresponding Statistical Analysis-II
expected frequencies using the result
Ei = N pi ; for all i = 1, 2, …, k
Step 5: For testing the null hypothesis, compute the test statistic
χ² = Σ (Oi − Ei)² / Ei (summed over i = 1, …, k) ~ χ² with (k − r − 1) d.f.
where r is the number of parameters estimated from the sample data.
Brand Preferred        A     B     C     D     E
Number of Customers   194   205   204   196   202
Test the hypothesis that the preference is uniform over the five brands at the
5% level of significance.
Solution: Here, we are interested in testing the null hypothesis that the
preference of customers over the five brands is uniform, that is, all brands
are equally preferred by customers. Therefore, we have
H0: The preference of customers over the five brands of bath soap is uniform
H1: The preference of customers over the five brands of bath soap is not
uniform
Here, equivalently, we can say that a uniform distribution is assumed for
the data; that is, the proportion of customers for each brand is the same.
In other words, p1 = p2 = p3 = p4 = p5 = p = 1/5,
where pi (i = 1, 2, …, 5) is the proportion of customers for the ith brand.
For testing the null hypothesis, we use the test statistic χ² as mentioned
above.
The theoretical or expected number of customers (frequency) for each brand
is obtained by multiplying the appropriate probability by the total number
of customers, that is, the sample size N. Therefore,
E1 = E2 = E3 = E4 = E5 = Np = 1000 × (1/5) = 200
For calculating the value of test statistic, we prepare the following table:
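The resulting goodness-of-fit calculation can be cross-checked with a short sketch (Python assumed; the expected count of 200 per brand follows the text's N = 1000 and p = 1/5).

```python
# Brand-preference data from the example.
observed = [194, 205, 204, 196, 202]
expected = [200] * 5      # text takes N = 1000 customers, p = 1/5 per brand

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1    # k - 1 = 4 (no parameter estimated from the data)
print(round(chi2, 3), df)
```

Since the computed χ² is far below the tabulated χ²(4, 0.05) = 9.49, the hypothesis of uniform preference would not be rejected.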
Note: i) Check your answers with those given at the end of the unit.
Test whether the accidents are uniformly distributed over the week.
Consider a p × q contingency table in which the attribute A has categories
A1, …, Ap (row totals R1, …, Rp) and the attribute B has categories
B1, …, Bq (column totals C1, …, Cq), with grand total N. Under the null
hypothesis of independence, the probability that an observation belongs to
the cell (Ai, Bj) is estimated as
P(AiBj) = (Ri/N) (Cj/N).
To obtain the expected frequency Eij for each cell, we multiply this
estimated probability by the total sample size. Thus,
Eij = N (Ri/N) (Cj/N) = Ri Cj / N
which reduces to
χ² = Σi Σj (Oij − Eij)² / Eij (summed over i = 1, …, p and j = 1, …, q)
Here the test statistic follows a chi-square distribution with
(p − 1)(q − 1) degrees of freedom.
The decision about rejecting or not rejecting the null hypothesis of
independence of the two attributes against the alternative hypothesis is
exactly same as mentioned in the sub-section 11.4.1 above.
Example 7: Calculate the expected frequencies for the following data
presuming the two attributes condition of child and condition of home and
check whether these are independent.
Condition of Child      Condition of Home
                        Clean    Dirty
Clean                     75       45
Fairly clean              85       15
Dirty                     40       40
Solution: Here, we want to test the null hypothesis that the condition of home
and condition of the child are independent, that is,
H0: Condition of child is independent of condition of home against
H1: Condition of child depends upon the condition of home.
For the calculation of χ2 statistic, we do the following computations for
expected frequencies:
Condition of Child      Clean    Dirty    Total
Clean                     75       45      120
Fairly clean              85       15      100
Dirty                     40       40       80
Total                    200      100      300
E31 = R3C1/N = (80 × 200)/300 = 53.33 ≈ 53 ;
E32 = R3C2/N = (80 × 100)/300 = 26.67 ≈ 27 ;
and the remaining expected frequencies are obtained in the same way.
For calculating the value of χ2 test statistic, we prepare the following table:
Therefore,
χ² = Σi Σj (Oij − Eij)² / Eij = 25.0375
The degrees of freedom will be (p − 1)(q − 1) = (3 − 1)(2 − 1) = 2.
The tabulated value of chi square at 2 df and 5% level of significance is 5.99.
Since calculated value of test statistic is greater than tabulated value, we
reject our null hypothesis and conclude that condition of home does affect the
condition of child.
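The independence test can be reproduced from the observed table alone (Python assumed). Note that the text rounds the expected frequencies to whole numbers before computing χ² and so reports 25.0375; with exact expected frequencies the value is about 26.06. Either way it exceeds the 5% critical value 5.99.

```python
# Observed 3x2 table from Example 7 (condition of child x condition of home).
observed = [[75, 45], [85, 15], [40, 40]]

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
N = sum(row_tot)

# Expected frequency for each cell: E_ij = R_i * C_j / N (kept exact here)
expected = [[r * c / N for c in col_tot] for r in row_tot]
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(3) for j in range(2))
df = (3 - 1) * (2 - 1)
print(round(chi2, 2), df)
```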
Note: i) Check your answers with those given at the end of the unit.
Use chi-square test at 1% level to state whether the two attributes are
independent.
11.5 F-TEST
11.5.1 Test for Equality of Two Variances
While applying the t-distribution for testing H0: μ1 = μ2, the basic
assumption was that the population variances of the two populations are
equal. But, in fact, this is rarely guaranteed in practice. Therefore, it is
sometimes necessary to test the equality of the population variances first.
The F-distribution is used for ascertaining the equality of two population
variances. We shall describe here the F-statistic for this purpose, which
follows the F-distribution.
Let x 1 , x 2 , ..., x n1 be a sample of size n1 from a normal population with
mean 1 and variance σ12.Similarly, let y1 , y 2 , ..., y n 2 be a sample of size n2
from another normal population with mean 2 and variance σ22. We will
show how the two samples are used for defining F-statistic. According to the
objective of the test, the null hypothesis will be
H0: σ1² = σ2²
F = s1²/s2² ~ F(n1 − 1, n2 − 1), where
s1² = [1/(n1 − 1)] Σ (x − x̄)² and s2² = [1/(n2 − 1)] Σ (y − ȳ)².
             Size    Mean    Sum of squares of deviations from mean
Sample I       9      59                   26
Sample II     11      60                   36
Test whether both samples are from the same normal population.
Solution: Since we have to test whether both the samples are from same
normal population, therefore, we will test two things separately:
(i) The equality of two population means,
(ii) The equality of two population variances.
That is, we test H0: μ1 = μ2 and H0: σ1² = σ2².
The equality of the two means will be tested using the t-test, whereas the equality of the two
variances will be tested using F-test. But since t-test is based on the prior
assumption that both population variances are same, therefore, first we apply
F-test and later the t-test (when F-test accepts equality of variance
hypothesis).
n2 = 11, ȳ = 60, Σ (y − ȳ)² = 36
we get
s1² = [1/(n1 − 1)] Σ (x − x̄)² = (1/8)(26) = 3.25 and
s2² = [1/(n2 − 1)] Σ (y − ȳ)² = (1/10)(36) = 3.60
For H0: σ1² = σ2² against H1: σ1² > σ2² (right-tail test), the test statistic
is F = s1²/s2².
Since s2² > s1², we take the reverse ratio
F = s2²/s1² ~ F(n2 − 1, n1 − 1) = 3.60/3.25 = 1.1
The tabulated values at the 5% and 1% levels of significance are
F₁₀,₈(0.05) = 3.34 and F₁₀,₈(0.01) = 5.81 respectively. Since the calculated F
is smaller than the tabulated F at both levels of significance, H₀ is not
rejected and we conclude that the variances of the two populations are the
same. Now, applying the t-test, the equality of the two population means can
be tested as discussed earlier in 4.2.2.
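The variance-ratio computation above can be checked in a few lines of Python. This is an illustrative sketch, not part of the unit; it assumes scipy is available, and reads the critical values from scipy.stats.f instead of a printed F-table.

```python
# Hedged sketch of the variance-ratio (F) test of Example 8, using the
# summary figures quoted above (sizes and sums of squared deviations).
from scipy import stats

n1, ss1 = 9, 26.0    # Sample I: size, sum of squared deviations
n2, ss2 = 11, 36.0   # Sample II

s1_sq = ss1 / (n1 - 1)   # 3.25
s2_sq = ss2 / (n2 - 1)   # 3.60

# Put the larger variance in the numerator, as in the text.
F = s2_sq / s1_sq                           # ~ F(n2 - 1, n1 - 1)
crit_5 = stats.f.ppf(0.95, n2 - 1, n1 - 1)  # upper 5% point of F(10, 8)
crit_1 = stats.f.ppf(0.99, n2 - 1, n1 - 1)  # upper 1% point of F(10, 8)
print(round(F, 2), round(crit_5, 2), round(crit_1, 2))
```

Since the computed F falls well below both critical points, the code reaches the same conclusion as the text: H₀ is not rejected.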
Example 9: The following data relate to the number of items produced in a
shift by two workers A and B for some days:

A 26 37 40 35 30 30 40 26 30 35 45
B 19 22 24 27 24 18 20 19 25

For calculating s₁² and s₂², we first find the two means, which are found to be
34 and 22 respectively. Then we prepare the following table:
Items produced    x − x̄      (x − x̄)²    Items produced    y − ȳ      (y − ȳ)²
by A (x)         (x̄ = 34)                by B (y)         (ȳ = 22)
   26              −8           64           19             −3            9
   37               3            9           22              0            0
   40               6           36           24              2            4
   35               1            1           27              5           25
   30              −4           16           24              2            4
   30              −4           16           18             −4           16
   40               6           36           20             −2            4
   26              −8           64           19             −3            9
   30              −4           16           25              3            9
   35               1            1
   45              11          121
Total 374           0          380          198              0           80
The test statistic is

F = s₁²/s₂², where

s₁² = (1/(n₁ − 1)) Σ(x − x̄)² = (1/10)(380) = 38 and

s₂² = (1/(n₂ − 1)) Σ(y − ȳ)² = (1/8)(80) = 10

Therefore, F = 38/10 = 3.8.
Note: i) Check your answers with those given at the end of the unit.
15) Two samples are drawn from two different normal populations.
Sample I 60 65 71 74 76 82 85 87 - -
Sample II 61 66 67 85 78 63 85 86 88 91
1. Small sample tests for testing the hypotheses using t-tests for (i)
significance of population mean and (ii) equality of two population
means;
2. Paired t-test when the two samples are selected from the same population
on same number of units with the aim to test the equality of means;
4) Given that the past value of the mean weight is 60 kg, which serves as
prior information. We wish to test the null hypothesis

H₀: μ = μ₀ = 60 against H₁: μ ≠ 60

s² = (1/(n − 1)) Σ(x − x̄)² = (1/9)(1252) = 139.11, so s = √139.11 = 11.79

t = (x̄ − μ₀)/(s/√n) = (63 − 60)/(11.79/√10) = 0.80
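As a quick check, the one-sample t computation above can be reproduced directly from the quoted summary figures. This sketch is an addition to the unit, not part of the original answer.

```python
# One-sample t-statistic from summary figures: n = 10, sample mean 63,
# sample s.d. 11.79, hypothesised mean 60 (all quoted in the answer above).
import math

n, xbar, s, mu0 = 10, 63.0, 11.79, 60.0
t = (xbar - mu0) / (s / math.sqrt(n))
print(round(t, 2))  # ≈ 0.80, matching the text
```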
The pooled sample variance is

s² = (1/(n₁ + n₂ − 2)) [ (Σd₁² − (Σd₁)²/n₁) + (Σd₂² − (Σd₂)²/n₂) ]

Substituting the given values,

s² = (1/(6 + 7 − 2)) [17.5 + 36.71] = 54.21/11 = 4.93

s = √4.93 = 2.22
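The pooling step above generalises to any pair of samples; a minimal sketch (added here for illustration, using the corrected sums of squares 17.5 and 36.71 and the sample sizes 6 and 7 quoted in the answer):

```python
# Pooled standard deviation from the two corrected sums of squares,
# as used in the two-sample t-test above.
import math

def pooled_sd(ss1, ss2, n1, n2):
    """Pooled s.d.: sqrt of the combined corrected sum of squares over
    the pooled degrees of freedom (n1 + n2 - 2)."""
    return math.sqrt((ss1 + ss2) / (n1 + n2 - 2))

s = pooled_sd(17.5, 36.71, 6, 7)
print(round(s, 2))  # ≈ 2.22, matching the text
```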
7) Given that s² = 0.18, so that s = √0.18 = 0.43.
The test statistic for the paired t-test is

t = d̄/(s_d/√n) ~ t(n − 1) d.f.

where

d̄ = Σd/n = 70/10 = 7.0

s_d² = (1/(n − 1)) [Σd² − (Σd)²/n] = (1/9) [790 − (70)²/10] = (1/9)(300) = 33.33

s_d = √33.33 = 5.77
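The paired-t quantities above follow mechanically from the two sums; a short illustrative sketch (not part of the original answer) recomputing them from Σd = 70 and Σd² = 790:

```python
# Paired-t ingredients from the given sums: n = 10, Σd = 70, Σd² = 790.
import math

n, sum_d, sum_d2 = 10, 70.0, 790.0
d_bar = sum_d / n                                     # mean difference, 7.0
s_d = math.sqrt((sum_d2 - sum_d**2 / n) / (n - 1))    # √33.33 ≈ 5.77
t = d_bar / (s_d / math.sqrt(n))                      # the paired t-statistic
print(round(d_bar, 1), round(s_d, 2))
```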
The test statistic is

χ² = Σᵢ₌₁ᵏ (Oᵢ − Eᵢ)²/Eᵢ, which follows χ² with (k − 1) d.f.

The theoretical or expected frequency for each day is obtained by
multiplying the appropriate probability by the total number of
accidents, that is, the sample size N. Therefore,

E₁ = E₂ = ⋯ = E₇ = Np = 91 × (1/7) = 13
Therefore, the test statistic is

χ² = Σᵢ₌₁ᵏ (Oᵢ − Eᵢ)²/Eᵢ = 7.6923
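The same goodness-of-fit statistic can be obtained with scipy.stats.chisquare. Since the unit quotes only the total N = 91 and the resulting χ², the observed daily counts below are purely hypothetical, chosen to sum to 91; with no expected frequencies supplied, chisquare assumes a uniform expectation (13 per day), exactly as in the answer above.

```python
# Chi-square goodness-of-fit with a uniform expectation (E = 13 per day).
# The observed counts are illustrative, NOT the data behind χ² = 7.6923.
from scipy import stats

observed = [18, 12, 11, 14, 10, 13, 13]   # hypothetical counts, sum = 91
stat, p = stats.chisquare(observed)        # expected defaults to the mean, 13
print(round(stat, 4))
```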
The test statistic for the independence of attributes is

χ² = Σᵢ₌₁ᵖ Σⱼ₌₁^q (Oᵢⱼ − Eᵢⱼ)²/Eᵢⱼ = 220.98

The degrees of freedom will be (p − 1)(q − 1) = (2 − 1)(2 − 1) = 1. The
tabulated value of χ² for 1 d.f. at α = 1% is 6.63. Since the calculated
value of χ² is greater than the tabulated value at the 1% level of
significance, we reject our null hypothesis.
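For a contingency table, scipy.stats.chi2_contingency computes the same Pearson statistic and its degrees of freedom in one call. The 2×2 table below is hypothetical (the observed frequencies behind χ² = 220.98 are not reproduced in this chunk); correction=False disables Yates' continuity correction so the plain Pearson χ² used in the unit is returned.

```python
# Chi-square test of independence on an illustrative 2x2 table.
from scipy import stats

table = [[30, 70],
         [60, 40]]   # hypothetical observed frequencies
stat, p, dof, expected = stats.chi2_contingency(table, correction=False)
print(round(stat, 2), dof)   # Pearson chi-square and (p-1)(q-1) = 1 d.f.
```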
14) Given that n₁ = 12, s₁² = 125 and n₂ = 10, s₂² = 112. We wish to test

H₀: σ₁² = σ₂² against H₁: σ₁² > σ₂²

The test statistic is

F = s₁²/s₂²

Therefore,

F = 125/112 = 1.116
The tabulated value of F₁₁,₉(0.05) = 3.10 (right-tail) at the 5% level of
significance. Since the calculated value of F is smaller than the
tabulated value of F, we do not reject the null hypothesis.
15) We wish to test the null hypothesis

H₀: σ₁² = σ₂² against H₁: σ₂² > σ₁²

The test statistic is

F = s₂²/s₁² ~ F(n₂ − 1, n₁ − 1)

Here,

x̄ = Σx/n₁ = 600/8 = 75 and ȳ = Σy/n₂ = 770/10 = 77

s₁² = (1/(n₁ − 1)) Σ(x − x̄)² = (1/7)(636) = 90.86

s₂² = (1/(n₂ − 1)) Σ(y − ȳ)² = (1/9)(1200) = 133.33

Therefore,

F = s₂²/s₁² = 133.33/90.86 = 1.47
Now the tabulated value of F9,7 (0.05) = 5.21 (one tail right at 5% level).
Since the calculated value of F is smaller than the tabulated value of
F, therefore, we do not reject the null hypothesis.
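The arithmetic of answer 15 can be recomputed from the quoted corrected sums of squares (636 and 1200) and sample sizes; an illustrative sketch, added here as a check:

```python
# Variance-ratio for answer 15: larger sample variance in the numerator.
n1, n2 = 8, 10
s1_sq = 636 / (n1 - 1)    # 90.86
s2_sq = 1200 / (n2 - 1)   # 133.33
F = s2_sq / s1_sq
print(round(F, 2))  # ≈ 1.47, matching the text
```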
UNIT 12 ANALYSIS OF VARIANCE TESTS
Structure
12.1 Introduction
12.2 Objectives
12.3 Analysis of Variance (ANOVA)
12.3.1 Significance of Analysis of Variance
12.3.2 Degrees of Freedom
12.3.3 Uses of ANOVA
12.1 INTRODUCTION
In Unit 4 of this block, we tested the equality of means of two independent
groups by using the t-test. Sometimes situations may arise where testing of
more than two means is required. For example, in crop-cutting experiments it
may be required to test whether, under similar conditions, the average yield of
some crop in a number of fields is the same or not. For obvious reasons, in
such cases the t-test cannot be applied. Generally, for such situations, the
technique of Analysis of Variance (ANOVA) is used, in which the testing of
the equality of several means is done by dividing the population variability
into different components. The usual F-test is used to test the equality of the
means of several groups.
As its name suggests, the analysis of variance focuses on variability. It
involves the calculation of several measures of variability, where the total
variability of the population is divided into many components, such as
variability within the smaller sub-groups, variability between the smaller
sub-groups, etc. In other words, ANOVA is a technique which splits up the
total variation of the data into components that may be attributed to various
“sources” or “causes” of variation. There may be variation between variables
and also within different levels of variables. In this way, Analysis of Variance
is used to test the homogeneity of several population means by comparing the
variances between the samples and within the samples.
In this unit, we shall discuss the one-way as well as two-way Analysis of
Variance. One-way Analysis of Variance is a technique where only one
independent variable at different levels is considered which affects the
response variable whereas in two-way Analysis of Variance technique, we
will consider two variables at different levels which affect the response
variables.
12.2 OBJECTIVES
After studying this unit, you should be able to:
understand the Analysis of Variance technique;
describe various types of assumptions underlying the Analysis of
Variance technique and its applications;
define various types of linear models used in Analysis of Variance
technique;
understand how to test the hypothesis under One-way Analysis of
Variance; and
explain the method of performing Two-way ANOVA Test.
The second approach to calculate the estimate of σ² is based upon the Central
Limit Theorem and is valid only under the null hypothesis assumption that all
the population means are equal. This means that, in fact, if there are no
differences among the population means, then the value of σ² computed by
the second approach should not differ significantly from the value of σ²
computed by the first approach. Hence, if these two values of σ² are
approximately the same, then we can decide to accept the null hypothesis of
equality of means.
The second approach results in the following computation:
Based upon the Central Limit Theorem, we have previously found that the
standard error of the sample means is calculated by:

σ_x̄ = σ/√n

or, the variance would be:

σ_x̄² = σ²/n,  or  σ² = n σ_x̄²
Thus, by knowing the square of the standard error of the mean (σ_x̄²), we
could multiply it by n and obtain a precise estimate of σ². This approach of
estimating σ² is known as σ²_between. Now, if all the population means are
equal, then the σ²_between value should be approximately the same as the
σ²_within value. A significant difference between these two values would
lead us to conclude that this difference is the result of differences between
the population means. But how do we know whether any difference between
these two values is significant? How do we know whether this difference, if
any, is simply due to random sampling error or due to actual differences
among the population means?
R. A. Fisher developed a test, known as the F-test, to answer the above
question. He determined that the difference between σ²_between and
σ²_within could be expressed as a ratio, designated as the F-value, so that

F = σ²_between / σ²_within

In the above case, if the population means are exactly the same, then
σ²_between will be equal to σ²_within and the value of F will be equal to 1.
In fact, this model decomposes the responses into a mean for each level of a
factor and error term, that is,
Response = A mean for each level of a factor + error term
The analysis of variance provides estimates for each level mean. These
estimated level means are the predicted values of the model and the
difference between the response variable and the estimated/predicted level
means are the residuals.
That is,
y_ij = μ_i + e_ij, implying that e_ij = y_ij − μ_i
12.4.1 Assumptions
The following are the basic assumptions of one-way ANOVA:
1. Dependent variable measured on interval scale;
2. k samples are independently and randomly drawn from the population;
3. Population can be assumed reasonably to have a normal distribution;
4. k samples have approximately equal variance;
5. Various effects are additive in nature; and
6. eij are independently and identically distributed (i.i.d.) normal variables
with mean zero and variance σe2.
Null Hypothesis
We want to test the equality of the population means, that is, to test the
homogeneity of effect of different levels of a factor. Hence, the null
hypothesis is given by
H0: μ1 = μ2 = . . . = μk
against the alternative hypothesis
H1: μ1 ≠ μ2 ≠ . . . ≠ μk (or some μi's are not equal)
Since here F ratio contains only two elements, which are the variances
between the samples and within the samples respectively, as discussed
before, let us recapitulate the calculation of these variances.
If all the means of samples were exactly equal and all samples were exactly
representative of their respective populations so that all the sample means
were exactly equal to each other and to the population mean, then there will
be no variance. However, this can never be the case. We always have
variation both between samples and within samples, even if we take these
samples randomly and from the same population. This variation is known as
the total variation.
The total variation, designated by Σ(X − X̄)², where X represents individual
observations for all samples and X̄ is the grand mean of all sample means and
equals μ, the population mean, is also known as the total sum of squares or
SST, and is simply the sum of squared differences between each observation
and the overall mean. This total variation represents the contribution of two
elements. These elements are:
a) Variance between samples. The variance between samples may be due
to the effect of different treatments, meaning that the population means
may be affected by the factor under consideration, thus making the
population means actually different, and some variance may be due to the
inter-sample variability. This variance is also known as the sum of
squares between samples. Let this sum of squares be designated as SSB.
Then SSB is calculated by the following steps:
Step-I: Take k samples of size n each and calculate the mean of each
sample, designated as X1 , X 2 , X 3 ,... , X k .
Step-II: Calculate the grand mean of all the sample means:

X̄ = (Σᵢ₌₁ᵏ X̄ᵢ)/k

Step-III: Take the difference between the means of the various samples
and the grand mean, which can be denoted as

(X̄₁ − X̄), (X̄₂ − X̄), (X̄₃ − X̄), ... , (X̄ₖ − X̄)

Step-IV: Square these differences, multiply each squared difference by the
corresponding sample size, and sum over all samples:

SSB = Σᵢ₌₁ᵏ nᵢ(X̄ᵢ − X̄)², where nᵢ = size of the ith sample.

Then the variance between samples is

σ²_between = MSB = SSB/(k − 1)
Step-II: Take one sample at a time and take the deviation of each item in
the sample from its mean. Do this for all the samples, so that we
would have a difference between each value in each sample and
their respective means for all values in all samples.
Step-III:Square these differences and take a total sum of all these
squared differences (or deviations). This sum is also known as
SSW or sum of squares within samples.
Step-IV: Divide this SSW by the corresponding degrees of freedom. The
degrees of freedom are obtained by subtracting the total number
of samples from the total number of items.
Thus if N is the total number of items or observations, and k is
the number of samples, then, df = (N – k) which are the degrees
of freedom within samples (If all samples are of equal size n,
then df = k(n – 1), since (n – 1) are the degrees of freedom for
each sample and there are k samples).
b) Calculate the grand mean:

X̄ = (Σ X̄ᵢ)/k = (9 + 11 + 12 + 12)/4 = 11

c) Calculate the value of SSB:

SSB = Σ nᵢ(X̄ᵢ − X̄)² = 5[(9 − 11)² + (11 − 11)² + (12 − 11)² + (12 − 11)²] = 30

MSB = SSB/df = 30/(k − 1) = 30/3 = 10
d) Calculate the variance within samples, as follows:

To find the sum of squares within samples (SSW), we square each
deviation between the individual values of each sample and its mean, for
all samples, and then sum these squared deviations, as follows:

SSW = Σ(Xᵢ₁ − X̄₁)² + Σ(Xᵢ₂ − X̄₂)² + Σ(Xᵢ₃ − X̄₃)² + Σ(Xᵢ₄ − X̄₄)² = 72

MSW = SSW/df = SSW/(N − k) = 72/(20 − 4) = 72/16 = 4.5
Then the F-ratio = MSB/MSW = 10/4.5 = 2.22.

We can construct an ANOVA table for the problem solved above as follows:

ANOVA Table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square     F
Between samples            30                  3                10         2.22
Within samples             72                 16                 4.5
Now, we check for the critical value of F from the table for α = 0.05 and
degrees of freedom as follows:
df(numerator) = ( k – 1) = (4 – 1) = 3
df (denominator) = (N – k) = (20 – 4) = 16
This value of F from the table is given as 3.24. Now, since our calculated
value of F = 2.22 is less than the critical value of F = 3.24, we cannot reject
the null hypothesis.
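The critical value used above can be read from scipy instead of a printed F-table; a small illustrative addition (not part of the unit), assuming scipy is available:

```python
# Upper 5% point of the F distribution with (3, 16) degrees of freedom,
# i.e. the critical value used for the one-way ANOVA above.
from scipy import stats

crit = stats.f.ppf(0.95, 3, 16)   # numerator df = 3, denominator df = 16
print(round(crit, 2))  # 3.24, matching the table value in the text
```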
Annual Household Income ($1,000s)

Area 1   Area 2   Area 3
  70      100       60
  72      110       65
  75      108       57
  80      112       84
  83      113       84
   -      120       70
   -      100        -
Test if the average income per household in all these localities can be
considered as the same at α = 0.01.
Solution: If μi denotes the mean of the ith area ( i= 1, 2, 3) then the null
hypothesis is:
H 0 : 1 2 3
The null hypothesis can be tested by computing the F-ratio for the data given
and then comparing it with the critical ratio of F from the table.
As before, let us first calculate the values of SSB and SSW.
Here: X̄₁ = 380/5 = 76, X̄₂ = 763/7 = 109, X̄₃ = 420/6 = 70

so that X̄ = (76 + 109 + 70)/3 = 255/3 = 85.

Then, SSB = 5(76 − 85)² + 7(109 − 85)² + 6(70 − 85)² = 405 + 4032 + 1350 = 5787

where n₁ = 5, n₂ = 7, n₃ = 6 and k = 3.

SSW = Σᵢ₌₁⁵ (Xᵢ₁ − X̄₁)² + Σᵢ₌₁⁷ (Xᵢ₂ − X̄₂)² + Σᵢ₌₁⁶ (Xᵢ₃ − X̄₃)²
= (70 − 76)² + (72 − 76)² + (75 − 76)² + (80 − 76)² + (83 − 76)²
+ (100 − 109)² + (110 − 109)² + (108 − 109)² + (112 − 109)²
+ (113 − 109)² + (120 − 109)² + (100 − 109)²
+ (60 − 70)² + (65 − 70)² + (57 − 70)² + (84 − 70)² + (84 − 70)² + (70 − 70)²

= 36 + 16 + 1 + 16 + 49 + 81 + 1 + 1 + 9 + 16 + 121 + 81
+ 100 + 25 + 169 + 196 + 196 + 0

= 118 + 310 + 686

Then, SSW = 118 + 310 + 686 = 1114.
Now, MSB = SSB/(k − 1) = 5787/2 = 2893.5

MSW = SSW/(N − k) = 1114/15 = 74.26

Then, F = MSB/MSW = 2893.5/74.26 = 38.96
We can construct an ANOVA table for the problem solved above as follows:
ANOVA Table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square      F
Between samples           5787                 2               2893.5      38.96
Within samples            1114                15                 74.26
The critical value of F from table for α = 0.01 and df 2 and 15 respectively is
6.36. Since our calculated value of F is higher than the table value of F, we
cannot accept the null hypothesis.
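The same data can be fed to scipy.stats.f_oneway as a cross-check (an illustrative addition, assuming scipy is available). One caveat: the unit takes the grand mean as the simple average of the three sample means (85), whereas f_oneway uses the mean of all 18 observations (≈86.83, since the samples are of unequal size), so its F (≈38.6) differs slightly from the 38.96 above; the conclusion at α = 0.01 is the same.

```python
# One-way ANOVA on the household-income data of the three areas.
from scipy import stats

area1 = [70, 72, 75, 80, 83]
area2 = [100, 110, 108, 112, 113, 120, 100]
area3 = [60, 65, 57, 84, 84, 70]

F, p = stats.f_oneway(area1, area2, area3)
print(round(F, 2), round(p, 6))
```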
CHECK YOUR PROGRESS 2
Note: i) Check your answers with those given at the end of the unit.
3) There are three sections of an introductory course in Statistics. Each
section is being taught by a different professor. There are some
complaints that at least one of the professors does not cover the
necessary material. To make sure that all the students receive the same
level of material in a similar manner, the chairperson of the department
has prepared a common test to be given to students of the three
sections. A random sample of seven students is selected from each class
and their test scores out of a total of 20 points are tabulated as follows;
Students Section (1) Section (2) Section (3)
1 20 12 16
2 18 11 15
3 18 10 18
4 16 14 16
5 14 15 16
6 18 12 17
7 15 10 14
Salesman
1 8 6 14
2 9 8 12
3 11 10 18
4 12 4 8
Totals 40 28 52
12.5 TWO-WAY ANALYSIS OF VARIANCE (ANOVA)
In the previous section we considered the case where only one predictor
(independent/explanatory) variable was categorized at different levels. In
this section, let us consider the case with two categorical predictors, each
categorized at different levels, and a continuous response variable. This is
called a two-way classification, and the analysis is called Two-way ANOVA.
In such an ANOVA, generally we have an experiment in which we
simultaneously study the effect of two factors in the same experiment. For
each factor, there will be a number of classes/groups or levels. In the fixed
effect model, there will be only fixed levels of the two factors. We shall first
consider the case of one observation per cell. Let the factors be A and B and
the respective levels be A₁, A₂, …, A_r and B₁, B₂, …, B_s. Let y_ij be the
observation/response under the ith level of factor A and the jth level of
factor B. Further, let μ₁A, μ₂A, …, μ_rA be the means of levels A₁, A₂, …, A_r
and μ₁B, μ₂B, …, μ_sB be the means of levels B₁, B₂, …, B_s in the
population. The observations can then be represented in a table as follows:
Levels    B₁     B₂    …    Bⱼ    …    Bₛ    Total     Mean
A₁        y₁₁    y₁₂   …    y₁ⱼ   …    y₁ₛ    y₁.       ȳ₁.
⋮          ⋮      ⋮          ⋮          ⋮      ⋮         ⋮
A_r       y_r1   y_r2  …    y_rj  …    y_rs   y_r.      ȳ_r.
Total     y.₁    y.₂   …    y.ⱼ   …    y.ₛ   y.. = G
Mean      ȳ.₁    ȳ.₂   …    ȳ.ⱼ   …    ȳ.ₛ     -        ȳ..
Mathematical Model
Here, the mathematical model may be written as
y_ij = μ_ij + e_ij, where e_ij's are error terms.
Step 2: We calculate the Correction Factor (CF) and the Raw Sum of
Squares (RSS) using the formulae given below:

Correction Factor (CF) = G²/N

Raw Sum of Squares (RSS) = Σᵢ₌₁ʳ Σⱼ₌₁ˢ y_ij²

Step 3: We obtain the Total Sum of Squares (TSS) = RSS − CF.

Step 4: We obtain the sums of squares due to the two factors and due to
error:

Sum of Squares due to Factor A (SSA) = (1/s) Σᵢ₌₁ʳ yᵢ.² − CF,
where yᵢ. is the sum of the observations of the ith level of factor A;

Sum of Squares due to Factor B (SSB) = (1/r) Σⱼ₌₁ˢ y.ⱼ² − CF,
where y.ⱼ is the sum of the observations of the jth level of factor B;

Sum of Squares due to Error (SSE) = TSS − SSA − SSB

Step 5: We obtain the mean sums of squares:

Mean sum of squares due to factor A (MSSA) = SSA/(r − 1)

Mean sum of squares due to factor B (MSSB) = SSB/(s − 1)

Mean sum of squares due to error (MSSE) = SSE/((r − 1)(s − 1))
Step 6: We calculate the values of the test statistics using the formulae
given below:

F_A = MSSA/MSSE

F_B = MSSB/MSSE
ANOVA TABLE

Sources of Variation    DF               SS     MSS                            F-Test
Due to Factor A         (r − 1)          SSA    MSSA = SSA/(r − 1)             F_A = MSSA/MSSE
Due to Factor B         (s − 1)          SSB    MSSB = SSB/(s − 1)             F_B = MSSB/MSSE
Due to Error            (r − 1)(s − 1)   SSE    MSSE = SSE/((r − 1)(s − 1))
Total                   rs − 1           TSS
Step 7: We take decisions about the null hypotheses for factor A and
factor B as explained below:

(i) Compare the calculated value of F_A with the tabulated value of F_A at
the respective df's. If the calculated value is greater than the tabulated
value, then reject the hypothesis H0A; otherwise it may be accepted.

(ii) Compare the calculated value of F_B with the tabulated value of F_B at
the respective df's. If the calculated value is greater than the tabulated
value, then reject the hypothesis H0B; otherwise it may be accepted.

Now, after discussing the procedure of the Two-way ANOVA test, let us
practise solving some examples:
Example 3: Future group wishes to enter the frozen shrimp market. They
contract a researcher to investigate various methods of growing shrimp in
large tanks. The researcher suspects that temperature and salinity are
important factors influencing shrimp yield and conducts a two-way analysis
of variance with three levels each of temperature and salinity. That is, the
yield for each combination (for identical gallon tanks) is measured. The
recorded yields are given in the following chart:

                      Salinity (in ppm)
Temperature     700    1400    2100    Total    Mean
60° F             3       5       4       12       4
70° F            11      10      12       33      11
80° F            16      21      17       54      18
Total            30      36      33       99      11
Compute the ANOVA table for the model.
Solution: Since in each cell there is one observation, we will use the model

y_ij = μ + αᵢ + βⱼ + e_ij

where y_ij is the yield corresponding to the ith temperature and jth salinity,
μ is the general mean, αᵢ is the effect due to the ith temperature, βⱼ is the
effect due to the jth salinity, and e_ij ~ i.i.d. N(0, σ²). The hypotheses to be
tested are

H0A: α₁ = α₂ = α₃ = 0 against H1A: at least one αᵢ ≠ 0

H0B: β₁ = β₂ = β₃ = 0 against H1B: at least one βⱼ ≠ 0.
The computations are as follows:

Grand Total (G) = 99
No. of observations (N) = 9
Correction Factor (CF) = (99 × 99)/9 = 1089
Raw Sum of Squares (RSS) = 1401
Total Sum of Squares (TSS) = RSS − CF = 1401 − 1089 = 312
Sum of Squares due to Temperature (SST) = (12)²/3 + (33)²/3 + (54)²/3 − 1089 = 294
Sum of Squares due to Salinity (SSS) = (30)²/3 + (36)²/3 + (33)²/3 − 1089 = 6
Sum of Squares due to Error (SSE) = TSS − SST − SSS = 312 − 294 − 6 = 12
ANOVA TABLE

Sources of Variation    DF    SS     MSS    F-Test
Due to Temperature       2    294    147    F_T = MSST/MSSE = 147/3 = 49
Due to Salinity          2      6      3    F_S = 3/3 = 1
Due to Error             4     12      3
Total                    8    312
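The two-way computations of Example 3 follow directly from row and column totals, so they are easy to reproduce with numpy (an illustrative addition, assuming numpy is available; rows are temperature levels, columns are salinity levels):

```python
# Two-way ANOVA sums of squares for the 3x3 yield table of Example 3.
import numpy as np

y = np.array([[ 3,  5,  4],
              [11, 10, 12],
              [16, 21, 17]], dtype=float)
r, s = y.shape                            # r temperature levels, s salinity levels
G = y.sum()                               # grand total, 99
CF = G**2 / y.size                        # correction factor, 1089
TSS = (y**2).sum() - CF                   # total sum of squares, 312
SSA = (y.sum(axis=1)**2).sum() / s - CF   # temperature (rows), 294
SSB = (y.sum(axis=0)**2).sum() / r - CF   # salinity (columns), 6
SSE = TSS - SSA - SSB                     # error, 12
MSSE = SSE / ((r - 1) * (s - 1))
FA = (SSA / (r - 1)) / MSSE               # 147/3 = 49
FB = (SSB / (s - 1)) / MSSE               # 3/3 = 1
print(TSS, SSA, SSB, SSE, FA, FB)
```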
Note: i) Check your answers with those given at the end of the unit.
Here X̄₁ = 40/4 = 10, X̄₂ = 28/4 = 7, X̄₃ = 52/4 = 13

so that X̄ = (10 + 7 + 13)/3 = 30/3 = 10.

Then, SSB = 4(10 − 10)² + 4(7 − 10)² + 4(13 − 10)² = 0 + 36 + 36 = 72

Degrees of freedom = df = (k − 1) = (3 − 1) = 2.

SSW = Σᵢ₌₁⁴ (Xᵢ₁ − X̄₁)² + Σᵢ₌₁⁴ (Xᵢ₂ − X̄₂)² + Σᵢ₌₁⁴ (Xᵢ₃ − X̄₃)²

= (6 − 7)² + (8 − 7)² + (10 − 7)² + (4 − 7)² + …
H1B: β₁, β₂, β₃, β₄ are not all equal

G = Σ y_ij = Total of all observations
= 7.10 + 3.69 + 4.70 + 1.90 + 10.29 + 4.79 + 4.58 + 2.64 + 8.30
+ 3.58 + 4.90 + 1.80
= 58.28

N = No. of observations = 12

Correction Factor (CF) = G²/N = (58.28 × 58.28)/12 = 283.0465
Raw Sum of Squares (RSS) = 355.5096, so that
TSS = RSS − CF = 355.5096 − 283.0465 = 72.4631

Sum of Squares due to Method of Planting (SSM) = (M₁² + M₂² + M₃²)/4 − CF

= (17.39)²/4 + (22.31)²/4 + (18.58)²/4 − 283.0465

= 286.3412 − 283.0465 = 3.2947

Sum of Squares due to Error (SSE) = TSS − SSD − SSM
= 72.4631 − 3.2947 − 65.8917 = 3.2767

MSSD = SSD/3 = 65.8917/3 = 21.9639,  MSSM = SSM/2 = 3.2947/2 = 1.6473,

MSSE = SSE/6 = 3.2767/6 = 0.5461

For testing H0A: F_M = MSSM/MSSE = 1.6473/0.5461 = 3.02

For testing H0B: F_D = MSSD/MSSE = 21.9639/0.5461 = 40.22
The tabulated value of F2, 6 at 5% level of significance is 5.14 which
is greater than the calculated value of FM (3.02) so H0A is accepted.
So, we conclude that there is no significant difference among the
different methods of planting.
The tabulated value of F₃,₆ at the 5% level of significance is 4.76, which is
less than calculated value of FD (40.22). So, we reject the null
hypothesis H 0 B . Hence there is a significant difference among the
dates of planting.
6) Null Hypotheses are:
H01: There is no significant difference between mean effects of diets.
H02: There is no significant difference between mean effects of
different blocks.
Against the alternative hypothesis
H11: There is significant difference between mean effects of diets
H12: There is significant difference between mean effects of different
blocks.
Blocks Treatments/Diets
A B C D Totals
I 12 8 6 5 31
II 15 12 9 6 42
III 14 10 8 5 37
Totals T1.=41 T2.=30 T3.=23 T4.=16 110
Squares of observations
Blocks Treatments/Diets Totals
A B C D
I 144 64 36 25 269
II 225 144 81 36 486
III 196 100 64 25 385
Totals 565 308 181 86 1140
Correction Factor (CF) = G²/N = (110)²/12 = 1008.3333

Sum of Squares due to Blocks = (31² + 42² + 37²)/4 − CF
= 1023.5 − 1008.3333 = 15.1667