
ELEMENTARY STATISTICS

Frequency Distribution

The source of our scientific knowledge lies in “data”.


It is therefore appropriate to organize data so as to make them more
“comprehensible”, rather than leave them in the raw state in which they
have been collected.
Table:1
Raw Data: Bulb life in hours (obtained by testing the lifetimes of 60 W light bulbs)
963.4 874.8 901.3 822.5 1066.2 939.0 822.9 1023.9
1175.9 1001.7 988.8 950.1 900.0 1092.7 1114.4 1056.2
1074.7 1198.2 1074.1 932.9 1142.2 1132.7 1166.2 1002.4
887.1 1003.4 1109.8 810.1 1152.8 1083.8 1122.8 1187.0
1078.3 1130.2 1087.3 1042.1 1093.1 989.8 1085.1 1023.8
1065.4 1092.7 1114.1 1129.8 1049.9 1021.0 951.9 909.2
1124.8 1143.8 1089.5 995.3 1078.4 1114.1 1001.2 1059.2
1083.8 1021.7 1133.5 1129.8 1021.3 1130.0 956.5 1078.7
882.6 976.1 1072.8 1021.3 1045.0 1121.9 1089.3 1092.1
1055.5 987.2 866.0 1102.1 1123.6 1058.4 1033.2 1066.3
1132.5 1108.4 922.0 970.0 1121.8 1149.9 949.9 984.3
1052.1 1099.3 1121.8 910.0 962.1 1028.1 1043.7 1112.1
1092.2 1075.6 1142.0 1060.1
One way of organizing data is to construct a frequency distribution by placing the
data into classes and recording the number of data points in each class. This latter
quantity is the class frequency. If there are m classes, then the individual class
frequencies will be denoted by f_1, f_2, f_3, …, f_m.
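As a sketch of this bookkeeping, the tallying of class frequencies can be written in a few lines of Python (the function name and the assumption of equal-width classes are illustrative, not from the text):

```python
def frequency_distribution(data, low, width, m):
    """Count how many data points fall in each of m equal-width classes
    [low, low+width), [low+width, low+2*width), and so on."""
    freqs = [0] * m
    for x in data:
        i = int((x - low) // width)   # index of the class containing x
        if 0 <= i < m:
            freqs[i] += 1
    return freqs

# First row of Table:1, with classes of width 50 starting at 800
row = [963.4, 874.8, 901.3, 822.5, 1066.2, 939.0, 822.9, 1023.9]
print(frequency_distribution(row, 800, 50, 8))  # [2, 1, 2, 1, 1, 1, 0, 0]
```

Running the same function over all 100 observations of Table:1 would reproduce the frequency column of Table:2.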

A possible frequency representation of the foregoing bulb data may be given as:

Table:2
Class      Class Mark   Frequency f_i   r_i = f_i/N   Cumulative F_i   R_i = F_i/N
800-850 825 3 0.03 3 0.03
850-900 875 4 0.04 7 0.07
900-950 925 8 0.08 15 0.15
950-1000 975 12 0.12 27 0.27
1000-1050 1025 16 0.16 43 0.43
1050-1100 1075 28 0.28 71 0.71
1100-1150 1125 24 0.24 95 0.95
1150-1200 1175 5 0.05 100 1.00
Here eight classes have been chosen in which we have placed the data. Each
class is of length 50 (hrs) and the extent of the classes is sufficient to include all
of the data. Although the selection of equal-length classes is not necessary, it is
certainly convenient and quite common.
The center of each class is characterized by a class mark 𝑥𝑖 . Since the actual raw
values are not present in the frequency distribution, the class mark 𝑥𝑖 will often
be used to represent each datum in the 𝑖 −th class during computations.
The histogram of the bulb-life distribution is shown below (Fig.1).
It is sometimes useful to consider the frequency distribution derived by
accumulating the individual class frequencies. The resulting compilation is
known as the cumulative frequency distribution. Fig.2 illustrates the cumulative
frequency histogram for the bulb-life distribution.

Fig.1: Frequency histogram of the bulb-life data (Bulb-life (hrs.) on the x axis, Frequency on the y axis).
Fig.2: Cumulative frequency histogram of the bulb-life data.
Sometimes line graphs are employed instead of bar graphs to depict both
frequency and cumulative frequency distributions. To construct the frequency
polygon corresponding to the distribution, we first extend the frequency
distribution by one class in each direction and assign each extra class the frequency
0. We then plot the points (x_i, f_i), i = 0, 1, 2, …, m, m+1, where m is the
original number of classes, and connect each pair of adjacent points by a straight
line.
The cumulative frequency polygon, or ogive, is a piecewise linear representation
of the cumulative frequency distribution in which the cumulative frequency of a
class is plotted at the upper class boundary. The frequency polygon and ogive are
shown in Fig.3 and 4 respectively.
Fig.3: Frequency polygon for the bulb-life data (class marks 825, 875, 925, 975, 1025, 1075, 1125, 1175 on the x axis).
Fig.4: Ogive for the bulb-life data (same class marks on the x axis).
Relative-Frequency Distributions

Given the frequencies f_1, f_2, …, f_m of a frequency distribution possessing m
classes, we can construct a new distribution utilizing the relative frequencies

r_i = f_i / N,  i = 1, 2, 3, …, m

where N = Σ_i f_i is the total frequency of the distribution. The resulting relative-
frequency distribution gives the proportion of the data belonging to each class.
Moreover, a cumulative relative-frequency distribution is defined by the
cumulative relative frequencies
R_i = F_i / N,  i = 1, 2, 3, …, m

where F_i is the cumulative frequency of the i-th class. Observe that the
sum of the relative frequencies is 1. The relative frequencies and cumulative
relative frequencies are given in Table:2.
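The relative and cumulative relative frequencies of Table:2 can be reproduced from the class frequencies alone; the following sketch (variable names are illustrative) mirrors the definitions r_i = f_i/N and R_i = F_i/N:

```python
freqs = [3, 4, 8, 12, 16, 28, 24, 5]   # f_i from Table:2
N = sum(freqs)                          # total frequency, N = 100

r = [f / N for f in freqs]              # relative frequencies r_i = f_i / N

# cumulative frequencies F_i and cumulative relative frequencies R_i = F_i / N
F, total = [], 0
for f in freqs:
    total += f
    F.append(total)
R = [Fi / N for Fi in F]

print(r)   # [0.03, 0.04, 0.08, 0.12, 0.16, 0.28, 0.24, 0.05]
print(R)   # ends in 1.0, since the relative frequencies sum to 1
```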
Relative Frequencies and Probability

The relative-frequency distribution can be given a probabilistic interpretation,
albeit one that is full of potential pitfalls. Intuitively, at least, a set of
measurements represents a sample from the collection of all possible
measurements, the latter being known as the population of measurements. For
instance, the 100 bulb-life measurements of Table:1 represent a sample taken
from the bulb lives of all bulbs produced by the given manufacturer.

Consider the outcome of a measurement process as a random quantity X. If the
sample data faithfully reflect the population (a risky assumption) and if the i-th
class has lower and upper boundaries l_i and u_i respectively, then the
probability that any (future) measurement X lies between the two boundary
values is given by

P(l_i < X < u_i) = f_i / N = r_i

the relative frequency of the i-th class. The relative frequency r_i represents
the proportion of the observed values that are in the i-th class, and we are
simply extrapolating this proportion to denote the proportion of the population
falling within the i-th interval.
The problem with the foregoing reasoning is that it involves an inductive leap on a
grand scale. The frequency distribution represents merely a sample set of
observations, and we have no certainty as to whether or not it yields a faithful
representation of the population.

Given that we understand the intuitive nature of the discussion, suppose we wish
to employ relative frequencies to give meaning to the statement P(a < X < b),
where a and b are not class boundaries. Assuming the data in each class are
uniformly distributed throughout the class interval, it seems reasonable to
estimate the desired probability by finding the fraction of the data that lies
between a and b.
Referring to Fig.5 and letting c denote the common length of the classes, r_i the
relative frequency of the i-th class, and u_i and l_i the upper and lower
boundaries of the i-th class, a straightforward geometrical analysis yields

P(a < X < b) = ((u_2 − a)/c) r_2 + r_3 + ((b − l_4)/c) r_4

Fig.5: The points a, u_2, x_3, l_4, and b marked along the x axis of the histogram.
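The geometric estimate can be sketched as a small helper (a hypothetical function, assuming equal-width classes and uniform spread of data within each class):

```python
def prob_between(a, b, low, c, r):
    """Estimate P(a < X < b) from the relative frequencies r of the
    classes [low, low+c), [low+c, low+2c), ..., assuming the data are
    spread uniformly within each class."""
    p = 0.0
    for i, ri in enumerate(r):
        lo = low + i * c          # lower boundary l_i of class i
        hi = lo + c               # upper boundary u_i of class i
        overlap = max(0.0, min(b, hi) - max(a, lo))
        p += (overlap / c) * ri   # fraction of class i lying inside (a, b)
    return p

# Bulb-life distribution of Table:2
r = [0.03, 0.04, 0.08, 0.12, 0.16, 0.28, 0.24, 0.05]
print(prob_between(840, 1010, 800, 50, r))  # ≈ 0.278
```

Here P(840 < X < 1010) combines two partial classes (a fraction 10/50 of each) with three full ones, exactly as in the formula above.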
Probability as an integral

Again consider the relative-frequency histogram of Fig. , only this time let the y
axis be rescaled by dividing each relative frequency by c, the common class
length, to obtain the normalized relative frequencies

p_i = r_i / c = f_i / (cN)
If the tops of the bars of this normalized relative-frequency histogram are viewed
as portions of a step function s(x), then

∫_a^b s(x) dx = (u_2 − a) p_2 + c p_3 + (b − l_4) p_4
              = ((u_2 − a) r_2)/c + (c r_3)/c + ((b − l_4) r_4)/c
which is precisely the estimate of P(a < X < b) obtained earlier. That is, the
estimate of the probability that X lies between a and b is given by the integral
of the normalized relative-frequency histogram over the interval (a, b).
The step function s(x) depends solely on the sample data. Given a new set of
observations, it is highly likely that a different histogram, and thus a different
step function, would be obtained. Indeed, the result could be a markedly
different step function. The problem is that we really would like some function
that would, at least in theory, give the actual values of the probability
statements, not simply estimates based upon this or that sample. We desire a
function f(x) such that

P(a < X < b) = ∫_a^b f(x) dx
for all values of 𝒂 and 𝒃, 𝒂 < 𝒃.
Fig. illustrates a function that appears to be reasonable for the histogram of
Fig. (after normalization of the relative frequencies). Of course, the “actual”
function that is appropriate for the population might look very different from
the step function s(x) derived from a given set of observations. One role of
statistical analysis is to provide a methodology for deciding on an appropriate
f(x).
Measures of central tendency

The frequency distribution provides a coherent organization of data, and the
histogram gives a geometrical perspective; however, a quantitative description is
also needed.

Empirical Mean (Raw Data)

For a data sample

x_1, x_2, x_3, …, x_N

the empirical mean (arithmetic mean) of the sample is defined by

x̄ = (1/N)(x_1 + x_2 + … + x_N) = (1/N) Σ_{i=1}^{N} x_i
# One performance criterion for the speed of a computer’s CPU is the number
of floating-point operations that can be performed per second (flops). For a
supercomputer, this rate is measured in megaflops. An estimate of this
measure can be obtained by averaging the numbers of megaflops achieved
when employing some collection of benchmark routines. The resulting rates
can be interpreted as constituting a sample from some population of rates
corresponding to a large class of routines.

Megaflop observations (Probability and Statistics for the Engineering, Computing and Physical Sciences; Edward R. Dougherty; Prentice Hall, 1990)

3.9 4.7 3.7 5.6 4.3 4.9 5.0 6.1 5.1 4.5
5.3 3.9 4.3 5.0 6.0 4.7 5.1 4.2 4.4 5.8
3.3 4.3 4.1 5.8 4.4 3.8 6.1 4.3 5.3 4.5
4.0 5.4 3.9 4.7 3.3 4.5 4.7 4.2 4.5 4.8

The empirical mean for the sample is

x̄ = (1/40) Σ_{i=1}^{40} x_i = (3.9 + 4.7 + … + 4.8)/40 = 4.7
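As a quick check, the average can be recomputed over the 40 observations (a sketch; the raw sum is 186.4, so the value of 4.7 is the mean rounded to one decimal place):

```python
data = [3.9, 4.7, 3.7, 5.6, 4.3, 4.9, 5.0, 6.1, 5.1, 4.5,
        5.3, 3.9, 4.3, 5.0, 6.0, 4.7, 5.1, 4.2, 4.4, 5.8,
        3.3, 4.3, 4.1, 5.8, 4.4, 3.8, 6.1, 4.3, 5.3, 4.5,
        4.0, 5.4, 3.9, 4.7, 3.3, 4.5, 4.7, 4.2, 4.5, 4.8]

xbar = sum(data) / len(data)    # empirical mean (1/N) * Σ x_i
print(round(xbar, 1))           # 4.7
```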
Empirical mean (Frequency distribution)

For a frequency distribution with m classes and total frequency N, the empirical
mean is defined as

x̄ = (1/N) Σ_{i=1}^{m} f_i x_i = Σ_{i=1}^{m} r_i x_i
Table:3
Class Class Mark Frequency Relative Freq.
3.25-3.75 3.5 3 0.075
3.75-4.25 4.0 8 0.200
4.25-4.75 4.5 14 0.350
4.75-5.25 5.0 6 0.150
5.25-5.75 5.5 4 0.100
5.75-6.25 6.0 5 0.125

Employing the frequency information results in the empirical mean (or simply mean) as
above:

x̄ = (1/40)(3 × 3.5 + 8 × 4.0 + 14 × 4.5 + 6 × 5.0 + 4 × 5.5 + 5 × 6.0) = 4.6875
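The grouped-mean arithmetic can be verified directly from Table:3 (a minimal sketch):

```python
marks = [3.5, 4.0, 4.5, 5.0, 5.5, 6.0]   # class marks x_i from Table:3
freqs = [3, 8, 14, 6, 4, 5]               # frequencies f_i
N = sum(freqs)                            # total frequency, N = 40

# x̄ = (1/N) * Σ f_i x_i
xbar = sum(f * x for f, x in zip(freqs, marks)) / N
print(xbar)  # 4.6875
```

Note that the grouped mean (4.6875) differs slightly from the raw mean, since each datum is represented by its class mark.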
Median

For a raw data sample x_1, x_2, x_3, …, x_N the median is defined in one of two ways depending on whether N is odd
or even. Let

y_1 ≤ y_2 ≤ … ≤ y_N

be a relisting of the data according to increasing magnitude.

If N is odd, then the median is defined to be the middle value in the relisting:

x̃ = y_{(N+1)/2}

If N is even, then the median is defined to be the mean of the two “middle” values in the relisting:

x̃ = (y_{N/2} + y_{(N+2)/2}) / 2
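The two cases of the definition can be combined into one function (a sketch; the name `median` is illustrative):

```python
def median(data):
    """Median of a raw sample: the middle value if N is odd,
    the mean of the two middle values if N is even."""
    y = sorted(data)                        # relisting y_1 <= ... <= y_N
    n = len(y)
    if n % 2 == 1:
        return y[(n + 1) // 2 - 1]          # y_{(N+1)/2}, zero-based index
    return (y[n // 2 - 1] + y[n // 2]) / 2  # (y_{N/2} + y_{(N+2)/2}) / 2

print(median([1, 3, 2]))     # 2
print(median([1, 2, 4, 3]))  # 2.5
```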

To adapt the above definition to frequency distributions, we find the point on the x axis that divides the area of the
histogram in two. Suppose F_{k−1} ≤ N/2 ≤ F_k; then the point on the x axis that divides the area of the histogram into
two equal portions lies within the k-th class, called the median class.

The median for a frequency distribution is defined to be

x̃ = l_k + ((N/2 − F_{k−1}) / f_k) c

where N is the total frequency, F_{k−1} is the cumulative frequency for the class immediately prior to the median class,
and c, f_k, and l_k are the length, frequency, and lower boundary, respectively, of the median class.
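Applied to the bulb-life distribution of Table:2, N/2 = 50 falls in the class 1050-1100 (F_{k−1} = 43, f_k = 28), giving 1050 + 50 × 7/28 = 1062.5. A sketch (the function name is an assumption, and equal-width classes are assumed):

```python
def grouped_median(freqs, low, c):
    """Median of a frequency distribution with equal-width classes
    [low, low+c), ...: computed as l_k + c * (N/2 - F_{k-1}) / f_k."""
    N = sum(freqs)
    F = 0                                    # cumulative frequency F_{k-1}
    for k, f in enumerate(freqs):
        if F + f >= N / 2:                   # class k is the median class
            l_k = low + k * c                # lower boundary of median class
            return l_k + c * (N / 2 - F) / f
        F += f

# Bulb-life distribution of Table:2
freqs = [3, 4, 8, 12, 16, 28, 24, 5]
print(grouped_median(freqs, 800, 50))  # 1062.5
```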
Variance
For a data sample

x_1, x_2, x_3, …, x_N

the variance of the sample is defined by

σ² = (1/(N − 1)) Σ_{i=1}^{N} (x_i − x̄)²

In the case of a frequency distribution with m classes,

σ² = (1/(N − 1)) Σ_{j=1}^{m} f_j (x_j − x̄)²
Referring to the megaflop data and using x̄ = 4.7, we find that

σ² = (1/39)(3 × 1.44 + 8 × 0.49 + 14 × 0.04 + 6 × 0.09 + 4 × 0.64 + 5 × 1.69) ≈ 0.52
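The computation can be completed and checked in code; the last term, 5 × (6.0 − 4.7)² = 5 × 1.69, follows from the final class of Table:3 (a sketch, using x̄ = 4.7 as in the text):

```python
marks = [3.5, 4.0, 4.5, 5.0, 5.5, 6.0]   # class marks x_j from Table:3
freqs = [3, 8, 14, 6, 4, 5]               # frequencies f_j
N = sum(freqs)                            # total frequency, N = 40
xbar = 4.7                                # mean value used in the text

# σ² = (1/(N-1)) * Σ f_j (x_j - x̄)²
var = sum(f * (x - xbar) ** 2 for f, x in zip(freqs, marks)) / (N - 1)
print(round(var, 2))  # 0.52
```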
