Frequency Distribution
Statistics are a set of numerical statements and facts collected from any field of
enquiry for drawing valid inferences for decision making. Data collection is, in fact, the
most important aspect of a research experiment or statistical survey. After data have been
collected, the next step is to present them in an orderly and logical form so that
their essential features become explicit. Proper presentation is needed because
the mass of collected data in raw form is often too voluminous to be
easily comprehended and analyzed. Therefore, after the collection of data, it
is imperative that the data are classified and presented in a way that brings out
the points of similarity and dissimilarity in the data.
After studying this module, the student shall be able to organize and describe distributions of
data by using a number of different methods, including frequency tables, histograms,
standard line and bar graphs, stem-and-leaf displays, scatter plots, and box-and-whisker
plots.
V. LESSON CONTENT
Collection of Data
To study any problem by statistical methods, the relevant data are first
collected. Sometimes the data are collected from a research experiment or from the primary
sampling units (households). Sometimes, the relevant data may exist in a published or
unpublished form, being collected by a private body or by the Government agency or by some
research organization, for its own use or for supplying popular information. In making use of
such data (called secondary data), one has to be particularly careful about the definitions of
terms and concepts used by the collecting authority and also about the method of collection and
Classification of Data
Classification is the process of arranging the data into different groups or classes
according to some common characteristics. According to Connor, classification may be defined
as the process of arranging things in groups or classes according to their resemblances and
affinities. The functions of classification may be summarized as follows:
It condenses the data.
It facilitates comparisons.
It helps to study the relationships.
It facilitates the statistical treatment of data.
Count the number of observations that fall into each category. The number associated
with each category is called the frequency and the collection of frequencies over all categories
gives the frequency distribution of that variable. Generally, a frequency distribution has 5 to 15
classes.
It presents data in a useful form and allows for a visual interpretation.
It enables analysis of the data set, including where the data are concentrated or clustered,
the range of values, and any extreme values.
Table 1
Frequency Distribution of Time (min)
Example 1. The following data pertain to first lactation milk yield (in kg) of 100 cows
1630 1648 1663 1665 1671 1677 1680 1687 1690 1695
1787 1788 1790 1800 1862 1855 1815 1835 1845 1818
1974 1998 2000 2000 2005 2031 2045 2045 2050 2056
2168 2171 2180 2187 2200 2218 2245 2323 2372 2397
2063 2069 2085 2098 2100 2100 2100 2105 2117 2131
1736 1743 1760 1765 1763 1767 1775 1775 1776 1780
1695 1754 1698 1700 1742 1732 1711 1713 1718 1728
The data given in example 1 are called the raw or ungrouped data which does not give
us any useful information. Our objective will be to express the huge data in a suitable
condensed form which will highlight the significant facts and comparisons and furnish more
useful information without sacrificing any information of interest about the important
characteristics of the distribution.
Array
A much better way of the presentation of the data is to express in the form of a discrete or
ungrouped frequency distribution, where we count the number of times each value of the
variable occurs in the data. The number of times a variate value is repeated is called frequency
of the variate value e.g. suppose there are seven Karan Fries cows having first lactation milk
yield equal to 1900 kg, 7 is the frequency of first lactation yield of 1900 kg.
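As a minimal sketch of an ungrouped frequency distribution, the Python snippet below counts how many times each value occurs. The values are two rows of the Example 1 milk yields, chosen only because they contain repeats; the variable names are illustrative.

```python
from collections import Counter

# Two rows of the Example 1 milk yields (kg), picked because they contain repeats.
yields = [1974, 1998, 2000, 2000, 2005, 2031, 2045, 2045, 2050, 2056,
          2063, 2069, 2085, 2098, 2100, 2100, 2100, 2105, 2117, 2131]

# Frequency of each distinct variate value: value -> number of occurrences.
frequency = Counter(yields)

for value in sorted(frequency):
    print(value, frequency[value])
```

Here the yield 2100 kg has frequency 3, while 2000 kg and 2045 kg each have frequency 2.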
It is a statistical table which shows the values of the variable in groups and also the
corresponding frequencies side by side. In this type of set up, the condensation of data consists
in classifying the data into different classes (or class intervals) by dividing the entire range of the
values of the variable into a suitable number of groups, called classes and then recording the
number of observations in each group. The type of such representation of data is called a
grouped frequency distribution. The groups are called the classes and the boundary ends are
called class limits, e.g. for a class interval 0-10, 0 is the lower limit and 10 is the upper limit. The
difference between upper and lower limit is called magnitude of the class. The number of
observations falling within a particular or defined class is called its frequency or class
frequency. The variate value which lies midway between the upper and lower limits is called
mid value or midpoint of that class.
While preparing the frequency distribution the following points must be kept in mind:
1. The class interval should be uniform i.e. it should be of equal width. A comparison of
different frequency distributions is facilitated if the same class interval is used for all.
The class interval should be an integer as far as possible.
2. The class interval should be so chosen that all the observations should be reflected by
the frequency distribution.
3. The class intervals should be continuous. Open-ended classes (less than a, or greater
than b) should be avoided, because they create difficulty in analysis and
interpretation.
4. The observations corresponding to the common point between two classes should
always be put in the higher class e.g. a number corresponding to the variate 30 is to be
put in the class 30-40 and not in 20-30.
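The boundary rule in point 4 can be sketched in Python as follows. The helper names `class_label` and `midpoint` are hypothetical, and the sketch assumes equal-width classes starting at a known lower limit, with each class including its lower limit and excluding its upper limit.

```python
def class_label(value, lower=0, width=10):
    """Label of the equal-width class [a, a + width) that contains value.

    A value on a common boundary goes to the higher class, as in point 4.
    """
    i = int((value - lower) // width)      # index of the class
    a = lower + i * width                  # lower limit of that class
    return f"{a}-{a + width}"


def midpoint(lower_limit, upper_limit):
    """Mid value of a class: halfway between its lower and upper limits."""
    return (lower_limit + upper_limit) / 2


print(class_label(30))      # boundary value, placed in the higher class
print(class_label(29.9))    # just below the boundary
print(midpoint(30, 40))     # mid value of the class 30-40
```

With these assumptions the value 30 is reported in the class 30-40 and not in 20-30.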
Number of classes
The following formula due to Sturges may be used to determine the number of classes:
$k = 1 + 3.322 \log_{10} N$, where k is the number of classes and N is the total frequency.
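As a quick check of Sturges' formula, the sketch below evaluates it for N = 100, the number of cows stated in Example 1. The function name is illustrative, and rounding up to a whole number of classes is a choice made here, not something the formula prescribes.

```python
import math


def sturges_classes(n):
    """Number of classes suggested by Sturges' formula k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)


k = sturges_classes(100)     # N = 100 observations
print(round(k, 3))           # 7.644
print(math.ceil(k))          # 8, if we round up to a whole number of classes
```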
The choice of class interval depends on the number of classes for a given distribution and size
of the data. As far as possible the class intervals should be of equal size. Prof. Sturges has
given the following formula for determining the size of class intervals
Example 2: If we consider the data given in example 1 let us find its size of class interval and
prepare its frequency distribution
Taking class intervals as 1630-1730, 1730-1830, ----, 2330-2430 the frequency distribution of
first lactation milk yield of 100 Karan Swiss cows is given below in Table 2.1
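The calculation behind Example 2 can be sketched in Python. Two assumptions are made here. First, the class-interval formula is not reproduced above, so the sketch uses the commonly quoted form, range divided by (1 + 3.322 log10 N). Second, only 70 of the 100 yields are listed under Example 1, so the counts printed below cover those 70 values only and will not match a table built from all 100 cows.

```python
import math

# The 70 milk yields (kg) that are listed under Example 1.
yields = [
    1630, 1648, 1663, 1665, 1671, 1677, 1680, 1687, 1690, 1695,
    1787, 1788, 1790, 1800, 1862, 1855, 1815, 1835, 1845, 1818,
    1974, 1998, 2000, 2000, 2005, 2031, 2045, 2045, 2050, 2056,
    2168, 2171, 2180, 2187, 2200, 2218, 2245, 2323, 2372, 2397,
    2063, 2069, 2085, 2098, 2100, 2100, 2100, 2105, 2117, 2131,
    1736, 1743, 1760, 1765, 1763, 1767, 1775, 1775, 1776, 1780,
    1695, 1754, 1698, 1700, 1742, 1732, 1711, 1713, 1718, 1728,
]

N = 100                                   # total frequency stated in Example 1
k = 1 + 3.322 * math.log10(N)             # Sturges' number of classes
data_range = max(yields) - min(yields)    # 2397 - 1630 = 767

# Assumed formula: size of class interval = range / k, rounded to a convenient value.
width = data_range / k                    # roughly 100.3
class_width = 10 * round(width / 10)      # 100

lower = min(yields)                       # first class starts at 1630
num_classes = (max(yields) - lower) // class_width + 1

# Count observations per class; a boundary value goes to the higher class.
counts = [0] * num_classes
for y in yields:
    counts[(y - lower) // class_width] += 1

for i, c in enumerate(counts):
    a = lower + i * class_width
    print(f"{a}-{a + class_width}: {c}")
```

The resulting class width of 100 agrees with the intervals 1630-1730, 1730-1830, and so on, used in the text.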
Advantages of grouping
The cumulative frequency of a class is the total frequency up to and including that class. The
table of cumulative frequencies is called a cumulative frequency distribution table. There are
two types of cumulative frequency distribution. The cumulative frequency distribution of all
values greater than or equal to the lower limit of each class is called more than cumulative
frequency distribution. The cumulative frequency of all values less than or equal to the upper
limit of each class is called less than cumulative frequency distribution. Let us illustrate this
through example 3
Example 3: Prepare the cumulative frequency distribution of the frequency distribution of first
lactation milk yield of Karan Swiss cows given in table 2.1.
Solution: The less than cumulative frequency and more than cumulative frequency distribution
are shown in table 2.2
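Since the class frequencies of table 2.1 are not reproduced here, the sketch below uses hypothetical frequencies for four classes to show how the two kinds of cumulative frequency are computed. Only the method, not the numbers, reflects the text.

```python
from itertools import accumulate

# Hypothetical classes and frequencies, for illustration only.
classes = ["0-10", "10-20", "20-30", "30-40"]
freqs = [3, 7, 6, 4]

# Less-than type: running total up to and including each class.
less_than = list(accumulate(freqs))

# More-than type: total from each class to the last one.
more_than = list(accumulate(reversed(freqs)))[::-1]

for name, f, lt, mt in zip(classes, freqs, less_than, more_than):
    print(name, f, lt, mt)
```

The last less-than value and the first more-than value both equal the total frequency, which is a useful check on the table.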
Graphical Presentation
Histograms
Example:
Mercury contamination can be particularly high in certain types of fish. The mercury content
(ppm) on the hair of 40 fishermen in a region thought to be particularly vulnerable are given
The first step is to group the data. A reasonable choice of class intervals is:
0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70.
The frequency table that results from the use of these intervals is:
Interval Frequency
0-10 5
10-20 11
20-30 10
30-40 9
40-50 2
50-60 1
60-70 2
N.B. By convention, any observation that is at a boundary of a class will be put into the
higher class. For example, an observation of 10 above would be put into the 10-20 category.
To construct the histogram in this situation (i.e. all class widths equal):
Mark boundaries of the class intervals on the horizontal axis.
The height of the bars above each interval can be taken as the frequency for that interval.
[Figure: histogram of frequency against mercury content (ppm).]
Instead of using frequencies to give the heights of the rectangles in a histogram, relative
frequencies may be used. The relative frequency for an interval is that interval's frequency
divided by the total frequency.
[Figure: histogram of relative frequency against mercury content (ppm).]
Notice that the shape of the histograms, whether using frequencies or relative frequencies, is
the same.
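A short Python sketch of the relative frequencies for the mercury data follows. It uses the class frequencies from the table above; the variable names are illustrative.

```python
intervals = ["0-10", "10-20", "20-30", "30-40", "40-50", "50-60", "60-70"]
freqs = [5, 11, 10, 9, 2, 1, 2]

n = sum(freqs)                                  # total frequency
relative = [f / n for f in freqs]               # relative frequency per class

for name, f, r in zip(intervals, freqs, relative):
    print(f"{name}: frequency {f}, relative frequency {r:.3f}")
```

Every frequency is divided by the same total, so the bar heights change scale but keep the same proportions, which is why the two histograms have the same shape.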
There is no hard and fast rule as to how many intervals should be used. Too
many classes produce an uneven distribution, but having too few loses information.
Usually the number of classes is about 6-20. The more observations we have, the more
classes we will usually use.
The width of the intervals defining the histograms need not all be equal. It is
often sensible to choose short intervals where the data is quite dense but intervals with a
longer width where the data is more sparse. This will ensure that we don’t have too
many intervals with zero frequency, yet keeps as much information about the
distributional shape of the data as possible.
When unequal interval widths are used, then the frequency density should be
used on the vertical scale on the histogram, where
Frequency density = Frequency / class width.
Example:
The lengths (in metres) of 250 vehicles aboard a cross-channel ferry are summarised in the
following table:
[Figure: histogram of the vehicle length data; horizontal axis: vehicle length (m).]
Republic of the Philippines
NUEVA VIZCAYA STATE UNIVERSITY
Bayombong, Nueva Vizcaya
INSTRUCTIONAL MODULE
IM No.:_________________________________
Notice that if we had simply defined the heights of the rectangles to be the frequencies, then
the histogram would exaggerate, for example, the incidence of cars between 3 and 4 metres
in length.
An alternative way of producing a histogram in situations where not all class widths are equal is
to set the bar height to be the relative frequency density. This is given by:
Relative frequency density = Relative frequency / class width.
If the histogram is produced in this way, then the total area of all the bars is 1.
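The sketch below illustrates both kinds of density for unequal class widths. The class boundaries and frequencies are hypothetical, because the vehicle length table is not reproduced here. Relative frequency density is taken to be relative frequency divided by class width, which is the definition consistent with a total bar area of 1.

```python
# Hypothetical unequal-width classes: 0-5, 5-10, 10-20, 20-40.
boundaries = [0, 5, 10, 20, 40]
freqs = [4, 6, 8, 2]

n = sum(freqs)
widths = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]

# Frequency density = frequency / class width.
freq_density = [f / w for f, w in zip(freqs, widths)]

# Relative frequency density = (frequency / n) / class width.
rel_freq_density = [(f / n) / w for f, w in zip(freqs, widths)]

# Area of each bar = height * width; the areas should sum to 1.
total_area = sum(d * w for d, w in zip(rel_freq_density, widths))

print(freq_density)
print(rel_freq_density)
print(total_area)
```

In this made-up example the class 10-20 has the largest frequency (8), yet its frequency density is no higher than that of 0-5, which is the kind of distortion that plotting raw frequencies would hide.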
Example (continued)
The relative frequency densities for the car vehicle length data are as follows:
[Figure: histogram of the vehicle length data; vertical axis: frequency density.]
Histograms are very useful for giving some idea of the shape of a density by approximating the
histogram to a smooth curve.
Discrete data is usually illustrated using a bar-line chart (or a bar chart), whilst histograms are
generally used for continuous data. However, when the number of possible values for the
observations is large, a bar diagram would become uninformative. In this case it is acceptable to
group the values into class intervals, much as you would for continuous data.
Example:
Suppose we have the following data:
1 1 2 2 2 3 3 4 4 5 5 5 5 6 6 7 7 7
8 9 9 9 9 10 10 10 10 10 11 11 11 11 12 12 12 12
13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 16 16 16
17 17 17 18 18 19 19 20 21 21 22 22 23 23 24 26 27 29
As there are a large number of different values here, to get a better idea of the shape of the
distribution, we can group the data into classes. Let's consider grouping all observations into
1-3, 4-6 and so on. To draw a histogram we need a continuous scale, so we need to define
our histogram intervals to be 0.5-3.5, 3.5-6.5, and so on. (Remember: a histogram never has
gaps between the bars.)
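The grouping described above can be sketched in Python. Each integer observation is assigned to a class of three consecutive values, and the histogram interval for that class is widened by 0.5 on each side so that neighbouring bars touch. The variable names are illustrative.

```python
data = [
    1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 7,
    8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12,
    13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 16, 16, 16,
    17, 17, 17, 18, 18, 19, 19, 20, 21, 21, 22, 22, 23, 23, 24, 26, 27, 29,
]

class_size = 3          # classes 1-3, 4-6, 7-9, ...
counts = {}             # (lower boundary, upper boundary) -> frequency

for x in data:
    i = (x - 1) // class_size            # 0 for 1-3, 1 for 4-6, ...
    lower = i * class_size + 0.5         # continuous lower boundary, e.g. 0.5
    upper = lower + class_size           # continuous upper boundary, e.g. 3.5
    counts[(lower, upper)] = counts.get((lower, upper), 0) + 1

for (lower, upper), f in sorted(counts.items()):
    print(f"{lower}-{upper}: {f}")
```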
Stem-and-leaf plots are an effective way of providing a visual display of quantitative data with
very little effort. The idea of the plots is to separate each observation into 2 parts - the first part
being the stem and the second the leaf.
Example:
To investigate the efficiency of new air-conditioning equipment installed on Boeing 720 aircraft,
the times (in hours) to first failure of the equipment were obtained from 28 different aircraft:
79 90 10 60 61 49 14 24 56 20 84 44 25 59
46 37 32 76 26 35 29 53 75 25 44 23 27 33
For these data an obvious choice for the stems is the leading digit (tens) and the leaves are
then the second digits (units). So, for example, the first observation of 79 has stem 7 and leaf 9.
The data values range from 10 up to 90, so we have the stem values 1-9.
An unordered stem-and-leaf diagram for the Boeing data (leaves should be in columns):
Stem | Leaves
1 | 0 4
2 | 4 0 5 6 9 5 3 7
3 | 7 2 5 3
4 | 9 4 6 4
5 | 6 9 3
6 | 0 1
7 | 9 6 5
8 | 4
9 | 0
An ordered stem-and-leaf diagram for the Boeing data (leaves have now been put in order):
Stem | Leaves
1 | 0 4
2 | 0 3 4 5 5 6 7 9
3 | 2 3 5 7
4 | 4 4 6 9
5 | 3 6 9
6 | 0 1
7 | 5 6 9
8 | 4
9 | 0
N.B. Rearranging the leaves in ascending order clarifies things and is useful for producing
numerical summaries.
N.B.2 One advantage that stem-and-leaf diagrams have over histograms is that they retain the
detail of the raw data.
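A minimal Python sketch of building the ordered diagram for the Boeing data is given below. The function name `stem_and_leaf` is hypothetical, and the sketch assumes two-digit observations, so that the stem is the tens digit and the leaf is the units digit.

```python
from collections import defaultdict

times = [79, 90, 10, 60, 61, 49, 14, 24, 56, 20, 84, 44, 25, 59,
         46, 37, 32, 76, 26, 35, 29, 53, 75, 25, 44, 23, 27, 33]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its leaves (units digits) in ascending order."""
    plot = defaultdict(list)
    for v in sorted(values):           # sorting first gives ordered leaves
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(times)
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Looping over every stem from the smallest to the largest keeps empty stems in the display, which matters when looking for gaps or outliers.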
Stem-and-leaf plots give a visual display of the rough shape of the distribution of the
variable being measured. We can identify whether the density is a) unimodal or multimodal;
b) symmetric, negatively or positively skewed; c) normal, heavy- or light-tailed.
Stem-and-leaf plots are useful for informal inference. We can find medians and quartiles
easily from the diagrams and obtain estimates of probabilities. For example, in the Boeing
data 10 pieces of equipment lasted under 30 hours so we could estimate the probability of a
new piece of equipment failing within the first 30 hours as 10/28.
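The informal estimates mentioned above can be checked with a short Python sketch using the standard `statistics` module. The median is read off as the average of the 14th and 15th ordered observations, and the probability estimate is the proportion of times under 30 hours.

```python
import statistics

times = [79, 90, 10, 60, 61, 49, 14, 24, 56, 20, 84, 44, 25, 59,
         46, 37, 32, 76, 26, 35, 29, 53, 75, 25, 44, 23, 27, 33]

n = len(times)                                   # 28 aircraft
median = statistics.median(times)                # average of the two middle values

under_30 = sum(1 for t in times if t < 30)       # failures within the first 30 hours
estimate = under_30 / n                          # estimated probability

print(median)
print(under_30, n, round(estimate, 3))
```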
Stem-and-leaf plots are useful for identifying outliers, which are unusually large or small
observations. For example, for the Boeing example, if there had been an extra observation
of 119, then this might be an outlier:
1 | 0 4
2 | 0 3 4 5 5 6 7 9
3 | 2 3 5 7
4 | 4 4 6 9
5 | 3 6 9
6 | 0 1
7 | 5 6 9
8 | 4
9 | 0
10 |
11 | 9 (this could be considered an outlying value)
Example:
To determine the age of a pre-historic settlement in North Wales, 24 small fragments from a
wooden boat found at the settlement were independently radio-carbon dated. The radio-carbon
determinations (in years) of age of fragments are:
4969 5163 5052 5144 4965 5152 4967 4934 4895 5078 5019 4908
5009 5046 4912 5012 4889 5034 4914 5117 4931 5081 4984 4881
Possibility 1: We could round each observation to the nearest one hundred years:
5000 5200 5100 5100 5000 5200 5000 4900 4900 5100 5000 4900
5000 5000 4900 5000 4900 5000 4900 5100 4900 5100 5000 4900
Taking the stem unit to be 1000 years gives the following diagram:
Scale: Stem = 1000's, Leaves = 100's
4 | 9 9 9 9 9 9 9 9
5 | 0 2 1 1 0 2 0 1 0 0 0 0 0 1 1 0
Because we have so few stem values here, we lose a lot of information. We can’t say anything
for example about the shape of the distribution.
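Possibility 2: We could round each observation to the nearest ten years:
4970 5160 5050 5140 4970 5150 4970 4930 4900 5080 5020 4910
5010 5050 4910 5010 4890 5030 4910 5120 4930 5080 4980 4880
Taking the stem unit to be 100 years gives the following diagram:
Scale: Stem = 100's, Leaves = 10's
48 | 9 8
49 | 7 7 7 3 0 1 1 1 3 8
50 | 5 8 2 1 5 1 3 8
51 | 6 4 5 2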
This plot is a little more informative, but we could still do with having slightly more stems.
The diagram is now quite informative about the distribution- there is evidence of a positive skew.
[Note that if the stem unit was taken to be 10s, then the diagram we would get would be poor-
we would then have too many stem values (a lot of the rows would have no values in them).]
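The effect of the rounding unit on the number of stems can be sketched in Python. The helper `round_half_up` is hypothetical; it rounds exact halves upward, which matches the rounded values listed above (for example 4965 becomes 4970), whereas Python's built-in `round` sends exact halves to the nearest even value.

```python
ages = [4969, 5163, 5052, 5144, 4965, 5152, 4967, 4934, 4895, 5078, 5019, 4908,
        5009, 5046, 4912, 5012, 4889, 5034, 4914, 5117, 4931, 5081, 4984, 4881]


def round_half_up(x, unit):
    """Round a non-negative integer x to the nearest multiple of unit, halves upward."""
    return ((x + unit // 2) // unit) * unit


nearest_100 = [round_half_up(a, 100) for a in ages]
nearest_10 = [round_half_up(a, 10) for a in ages]

# Stem = 100's, leaf = 10's, built from the values rounded to the nearest 10.
stems = {}
for r in nearest_10:
    stem, leaf = divmod(r // 10, 10)
    stems.setdefault(stem, []).append(leaf)     # leaves kept in data order

for stem in sorted(stems):
    print(f"{stem} | " + " ".join(str(leaf) for leaf in stems[stem]))
```

Rounding to the nearest 100 leaves only two stems (4 and 5), while rounding to the nearest 10 gives four stems (48 to 51), which is why the second display is more informative.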
If there are 2 sets of data which you wish to compare, then both of these can be put on the
same stem-and-leaf plot with the leaves for one dataset going to the right and the leaves of the
other dataset going to the left.
Example:
Using a technique involving chromium dioxide, the protein assimilation efficiencies (i.e.
percentage of protein intake actually absorbed) were measured on field mice and voles fed on
their natural diets. The assimilation efficiencies (in percentages) are given below:
A.E.'s of voles:
51.7 66.7 72.0 69.8 63.7 77.2 62.6 63.5 69.2 67.5
70.1 67.3 75.2 73.8 59.6 69.9 77.6 74.1 73.7
It is not a good idea to do a back-to-back plot if the 2 variates are not independent. Consider
the following example.
Example:
Fifteen people participated in a short typing course. Their typing speeds (words/min) before
and after the course were recorded:
Subject 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Before 15 18 23 27 36 12 8 19 32 22 17 21 16 15 33
After 26 28 27 26 28 24 26 42 32 36 20 29 21 22 28
These data are an example of matched-pair data (there are two measurements recorded on
each participant). Matched-pair data are likely to be dependent (a person with a fast typing
speed before the course is also likely to have a fast typing speed after the course). By drawing
a stem-and-leaf diagram you lose information about how the measurements pair up. You could
draw a scatter diagram (this would show the pairings). Alternatively, you could produce a stem-
and-leaf diagram of the differences:
Subject 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Change 11 10 4 -1 -8 12 18 23 0 14 3 8 5 7 -5
Scale: Stem = 10's, Leaves = units.
-0 | 1 8 5
0 | 4 0 3 8 5 7
1 | 1 0 2 8 4
2 | 3
A slightly more informative diagram can be obtained by splitting each stem up into two parts
(one for the lower leaves and the other for higher leaves):
Scale: Stem = 10's, Leaves = units.
-0H | 8 5
-0L | 1
0L | 4 0 3
0H | 8 5 7
1L | 1 0 2 4
1H | 8
2L | 3
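The differences and the split stems can be reproduced with the Python sketch below. The labelling rule is an assumption inferred from the diagram: leaves 0 to 4 go to the lower half (L) of a stem and leaves 5 to 9 to the higher half (H), with a leading minus sign for negative changes.

```python
before = [15, 18, 23, 27, 36, 12, 8, 19, 32, 22, 17, 21, 16, 15, 33]
after = [26, 28, 27, 26, 28, 24, 26, 42, 32, 36, 20, 29, 21, 22, 28]

# Change in typing speed for each subject, keeping the pairing intact.
changes = [a - b for b, a in zip(before, after)]


def split_stem_label(d):
    """Label such as '1L', '0H' or '-0H' for a change d (assumed L = 0-4, H = 5-9)."""
    stem, leaf = divmod(abs(d), 10)
    half = "L" if leaf < 5 else "H"
    sign = "-" if d < 0 else ""
    return f"{sign}{stem}{half}"


split = {}
for d in changes:
    split.setdefault(split_stem_label(d), []).append(abs(d) % 10)

print(changes)
for label, leaves in split.items():
    print(label, leaves)
```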
Problems
Stem-and-leaf plots cannot be used for displaying qualitative data and they become impractical
for large numbers of observations.
A cumulative frequency plot also uses classes and frequencies. The cumulative frequency for a
class is the number of observations with values less than the upper boundary for that class.
Example:
Consider the mercury example again. The cumulative frequencies are given in the table below:
In a cumulative frequency polygon the cumulative frequencies are plotted against the upper
class boundaries of the classes. These points are then joined with straight lines.
Example (continued)
For the mercury example we want to plot the points (0, 0), (10, 5), (20, 16),…, (70, 40) and then
join these points:
[Figure: cumulative frequency polygon; cumulative frequency against mercury level.]
Cumulative frequency plots are useful for giving us some idea of the shape of the
distribution function of the variable. They can also be used to obtain estimates of the
median and other quantiles for grouped data.
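As a sketch of how a quantile can be read off a cumulative frequency polygon, the Python code below builds the points for the mercury example and interpolates linearly along the polygon. The function name is hypothetical, and linear interpolation within a class assumes the observations are spread evenly across that class.

```python
from itertools import accumulate

boundaries = [0, 10, 20, 30, 40, 50, 60, 70]
freqs = [5, 11, 10, 9, 2, 1, 2]

# Cumulative frequency at each boundary, starting from 0 at the first boundary.
cumulative = [0] + list(accumulate(freqs))
points = list(zip(boundaries, cumulative))      # (0, 0), (10, 5), (20, 16), ...


def quantile_from_polygon(p):
    """Value at which the polygon reaches a fraction p of the total frequency."""
    target = p * cumulative[-1]
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c1 > c0 and c0 <= target <= c1:
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("p must lie between 0 and 1")


print(points)
print(quantile_from_polygon(0.5))    # estimated median
```

For the median the target cumulative frequency is 20, which falls in the class 20-30 (cumulative frequency 16 to 26), giving an estimate of 20 + (4/10) x 10 = 24 ppm.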
Scatter Plots.
Scatter plots are useful for assessing relationships between 2 variables. To draw a scatter plot
we represent one of the variables by the horizontal axis and the other variable by the vertical
axis. We then simply plot the pairs of data points on the graph.
Example:
Fifteen children were given a visual-discrimination (V) test during the first week at primary
school and a reading-achievement (R) test at the end of their first year of schooling. Scores out
of 100 were calculated for each test.
Child no. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
V-score 75 69 70 62 52 45 42 39 37 34 34 66 54 58 63
R-score 95 90 82 69 58 49 38 35 30 20 31 75 61 64 77
To draw a scatter plot we now want to plot the points (75, 95), (69, 90), (70, 82), …, (63, 77).
[Figure: scatter plot of R-score (vertical axis) against V-score (horizontal axis).]
The plot would suggest that there is a positive relationship between the V-score and the R-
score.
The following graphs give illustrations of variables that are (a) positively and (b) negatively
correlated with each other. Correlation can also be categorised as strong or weak depending
upon how close the points are to lying on a straight line.
[Figure: four scatter plots of y against x, with panels titled "Strong, positive", "Weak, positive", "Strong, negative" and "Weak, negative".]
It is important to realise that scatter plots point to associations between variables. They do not
necessarily show a causal relationship.
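One common way to put a number on the strength of a linear association is Pearson's correlation coefficient. This module does not define it, so the Python sketch below is offered only as an illustration, applied to the V-score and R-score data from the earlier example. Values near +1 or -1 indicate points lying close to a straight line, and values near 0 indicate a weak linear association.

```python
import math

v_score = [75, 69, 70, 62, 52, 45, 42, 39, 37, 34, 34, 66, 54, 58, 63]
r_score = [95, 90, 82, 69, 58, 49, 38, 35, 30, 20, 31, 75, 61, 64, 77]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(v_score, r_score)
print(round(r, 3))
```

A coefficient close to 1 here agrees with the strong positive pattern seen in the scatter plot, but, as noted above, it says nothing about causation.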
Example:
Information about two variables (life expectancy and the number of people per television set) is
available for 12 countries:
[Figure: scatter plot of life expectancy against number of people per TV.]
It is clear that the two variables are negatively correlated. However, it clearly would be wrong to
conclude that simply sending more televisions to countries with low life expectancies would
cause their inhabitants to live longer.
This example illustrates the very important distinction between causation and association. Two
variables may be strongly correlated without a cause-and-effect relationship existing between
them. Often the explanation is that both variables are related to a third variable not being
measured. In the example above, for instance, both life expectancy and the number of
televisions in the population will be related to the country’s wealth.
There is one further type of graph that we will consider later in the chapter (namely box-and-
whisker plots). We first however need to look at numerical summary measures for data.
In the next few sections we will look at some numerical ways of summarising data.
Some notation
Suppose that we would like to learn about the random variable X. To do this we will observe a
random sample of n observations, $X_1, \ldots, X_n$, such that each $X_i$ has the same distribution
as X. The observed values of $X_1, \ldots, X_n$ are then denoted $x_1, \ldots, x_n$.
Example:
Suppose we are interested in the number of units of alcohol students at UKC consumed last
week. To do this we could randomly select 50 students to form a random sample $X_1, \ldots, X_{50}$,
where $X_i$ is the random variable representing the number of units of alcohol consumed by
the ith student. The observed value of $X_i$ is denoted $x_i$.
Example:
Suppose that we have the observations:
$x_1 = 5,\ x_2 = 10,\ x_3 = 2,\ x_4 = 7.$
Then
$x_{(1)} = x_3 = 2,\ x_{(2)} = x_1 = 5,\ x_{(3)} = x_4 = 7,\ x_{(4)} = x_2 = 10.$
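The example above can be reproduced in Python, reading the bracketed subscript as "the i-th smallest observation" (an assumption, since the definition is not reproduced here). Sorting the observations gives the ordered values, and sorting the 1-based indices by value shows which original observation each one came from.

```python
x = [5, 10, 2, 7]                    # x_1, x_2, x_3, x_4

# Ordered observations x_(1) <= x_(2) <= x_(3) <= x_(4).
ordered = sorted(x)

# 1-based index of the original observation behind each ordered value.
source_index = sorted(range(1, len(x) + 1), key=lambda i: x[i - 1])

for rank, (i, value) in enumerate(zip(source_index, ordered), start=1):
    print(f"x_({rank}) = x_{i} = {value}")
```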
When we have frequency data, we will denote the frequency of the kth class by $f_k$ for k = 1,
…, K, where K is the number of classes. Then $\sum_{k=1}^{K} f_k = n$.
Interval Frequency
0-10 5
10-20 11
20-30 10
30-40 9
40-50 2
50-60 1
60-70 2
Here we have 7 classes, so that K = 7. Then $f_1 = 5$, $f_2 = 11$, and so on, such that
$\sum_{k=1}^{7} f_k = 40 = n$.
VI. EVALUATION (Note: Not to be included in the student’s copy of the IM)
VII. ASSIGNMENT
VIII. REFERENCES