2 LESSON 2 Freq Graphs FQ
FREQUENCY DISTRIBUTION
AND GRAPHICAL PRESENTATION
2.1 What is a Frequency Distribution
Collected and classified data are presented in the form of a frequency distribution. A frequency
distribution is simply a table in which the data are grouped into classes on the basis of common
characteristics and the number of cases that fall in each class is recorded. It shows the
frequency of occurrence of different values of a single variable. A frequency distribution is
constructed to satisfy three objectives :
(i) to facilitate the analysis of data,
(ii) to estimate frequencies of the unknown population distribution from the distribution of
sample data, and
(iii) to facilitate the computation of various statistical measures.
Frequency distribution can be of two types :
1. Univariate Frequency Distribution.
2. Bivariate Frequency Distribution.
In this lesson, we shall study the univariate frequency distribution. A univariate distribution
incorporates different values of one variable only, whereas a bivariate frequency distribution
incorporates the values of two variables. The univariate frequency distribution is further
classified into three categories :
(i) Series of individual observations,
(ii) Discrete frequency distribution, and
(iii) Continuous frequency distribution.
A series of individual observations is a simple listing of the value of each observation. If the marks
of 14 students of a class in statistics are given individually, they form a series of individual
observations, which can then be arranged in ascending or descending order.
Marks obtained in Statistics :
Roll Nos. 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Marks : 60 71 80 41 81 41 85 35 98 52 50 91 30 88
Marks in Ascending Order Marks in Descending Order
30 98
35 91
41 88
41 85
50 81
52 80
60 71
71 60
80 52
81 50
85 41
88 41
91 35
98 30
Discrete Frequency Distribution: In a discrete series, the data are presented in such a way that
exact measurements of units are indicated. In a discrete frequency distribution, we count the
number of times each value of the variable occurs in the given data. This is facilitated through the
technique of tally bars.
In the first column, we write all values of the variable. In the second column, we put a vertical bar,
called a tally bar, against a value each time it occurs. Once a particular value has occurred four
times, we mark its fifth occurrence with a cross tally mark ( / ) through the four tally bars to make
a block of 5. Putting a cross tally mark at every fifth repetition facilitates the counting of the
number of occurrences of the value. After putting tally bars for all the values in the data, we
count the number of times each value is repeated and write it against the corresponding value of
the variable in the third column, entitled frequency. This type of representation of the data is
called a discrete frequency distribution.
We are given marks of 42 students:
55 51 57 40 26 43 46 41 46 48 33 40 26 40 40 41
43 53 45 53 33 50 40 33 40 26 53 59 33 39 55 48
15 26 43 59 51 39 15 45 26 15
We can construct a discrete frequency distribution from the above given marks.
Marks of 42 Students
Marks Tally Bars Frequency
15 3
26 5
33 4
39 2
40 5
41 2
43 3
45 2
46 2
48 2
50 1
51 2
53 3
55 3
57 1
59 2
Total 42
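The tally-bar technique described above can be sketched in a few lines of Python. This is only an illustration: the sample values are hypothetical (they are not the marks of the 42 students), and the function names tally_marks and discrete_frequency_table are assumptions made for this sketch.

```python
from collections import Counter


def tally_marks(count):
    """Render a count as tally bars, closing every block of five with a cross stroke."""
    blocks, rest = divmod(count, 5)
    parts = ["||||/"] * blocks      # four bars plus a cross mark = 5 occurrences
    if rest:
        parts.append("|" * rest)    # leftover bars that do not fill a block
    return " ".join(parts)


def discrete_frequency_table(values):
    """Return (value, tally bars, frequency) rows sorted by value."""
    counts = Counter(values)
    return [(v, tally_marks(counts[v]), counts[v]) for v in sorted(counts)]


# Hypothetical observations, used only to illustrate the technique.
sample = [2, 3, 1, 2, 2, 0, 3, 2, 1, 2, 2, 4, 2, 1, 3]

for value, tally, freq in discrete_frequency_table(sample):
    print(f"{value:>5}  {tally:<10} {freq:>3}")
```

Each row holds the value, its tally bars and its frequency, which are the three columns of the table described in the text.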
The presentation of the data in the form of a discrete frequency distribution is better than a
simple arrangement of the individual observations, but it does not condense the data as much as
needed and can still be difficult to grasp and comprehend. It works well when the values of the
variable are repeated often; otherwise there is hardly any condensation.
Continuous Frequency Distribution: If neither the identity of the units about which the information
was collected nor the order in which the observations occur is relevant, then the first step
of condensation is to classify the data into different classes by dividing the entire range of values
of the variable into a suitable number of groups and then recording the number of observations in
each group. Thus, if we divide the total range of values of the variable (marks of 42 students), i.e.
59 − 15 = 44, into groups of 10 each, we get 44/10 ≈ 5 groups, and the distribution of
marks is displayed by the following frequency distribution:
Marks of 42 Students
Marks (x) Tally Bars Number of Students (f)
15—25 3
25—35 9
35—45 12
45—55 12
55—65 6
Total 42
The various groups into which the values of a variable are classified are known as classes, and the
length of the class interval (10) is called the width of the class. The two values specifying a class
are called the class limits. The presentation of the data in continuous classes with the
corresponding frequencies is known as a continuous frequency distribution. There are two methods
of classifying the data according to class intervals :
(i) exclusive method, and
(ii) inclusive method
In the exclusive method, the class intervals are fixed in such a manner that the upper limit of one
class becomes the lower limit of the following class. Moreover, an item equal to the upper limit
of a class is excluded from that class and included in the next class. The following data
are classified on this basis.
Income (Rs.) No. of Persons
200—250 50
250—300 100
300—350 70
350—400 130
400—450 50
450—500 100
Total 500
It is clear from the example that the exclusive method ensures continuity of the data, inasmuch
as the upper limit of one class is the lower limit of the next class. Therefore, 50 persons have
incomes from Rs. 200 to Rs. 249.99, and a person whose income is Rs. 250 is included in the
next class, 250–300.
According to the inclusive method, an item equal to the upper limit of a class is included in that
class itself. The following table demonstrates this method.
Income (Rs.) No. of Persons
200—249 50
250—299 100
300—349 70
350—399 130
400—449 50
450—499 100
Total 500
Hence in the class 200—249, we include persons whose income is between Rs. 200 and Rs. 249.
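The difference between the two methods shows up only at the class limits. The following sketch assigns an income to its class under each method, using the classes of the two income tables above. The function names are assumptions made for illustration, and the inclusive version assumes incomes recorded in whole rupees.

```python
def exclusive_class(value, lowest=200, width=50):
    """Exclusive method: lower <= value < upper, so an upper-limit value moves to the next class."""
    lower = lowest + ((value - lowest) // width) * width
    return (lower, lower + width)


def inclusive_class(value, lowest=200, width=50):
    """Inclusive method (whole-rupee incomes assumed): lower <= value <= upper, e.g. 200 to 249."""
    lower = lowest + ((value - lowest) // width) * width
    return (lower, lower + width - 1)


print(exclusive_class(249.99))  # still in the first class, 200 to 250
print(exclusive_class(250))     # moves to the next class, 250 to 300
print(inclusive_class(249))     # the upper limit 249 stays in 200 to 249
```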
2.2 Principles for Constructing Frequency Distributions
In spite of the great importance of classification in statistical analysis, no hard and fast rules are
laid down for it. A statistician relies on discretion, sound experience, wisdom and skill to arrive
at an appropriate classification of the data.
However, the following guidelines should be considered when constructing a frequency distribution:
1. Type of classes: The classes should be clearly defined and should not lead to any ambiguity.
They should be exhaustive and mutually exclusive so that any value of the variable corresponds to
one and only one class.
2. Number of classes: The choice of the number of classes into which a given frequency
distribution should be divided depends upon the following things:
(i) The total frequency, which means the total number of observations in the distribution.
(ii) The nature of the data, which means the size or magnitude of the values of the variable.
(iii) The desired accuracy.
(iv) The convenience regarding computation of the various descriptive measures of the frequency
distribution such as mean, variance etc.
The number of classes should be neither too small nor too large. If the classes are few, the
classification becomes very broad and rough, which might obscure some important features and
characteristics of the data. The accuracy of the results decreases as the number of classes
becomes smaller. On the other hand, too many classes will leave only a few observations in each
class. This gives an irregular pattern of frequencies across the classes and thus makes the
frequency distribution irregular. Moreover, a large number of classes will render the distribution
too unwieldy to handle. The computational work for further processing of the data will become
quite tedious and time consuming without any proportionate gain in the accuracy of the results.
Hence a balance should be maintained between the loss of information in the first case and the
irregularity of the frequency distribution in the second case, to arrive at a suitable number of classes.
Normally, the number of classes should not be fewer than 5 or more than 20. Prof. Sturges has
given a formula :
$k = 1 + 3.322 \log_{10} n$
where k refers to the number of classes and n refers to the total frequency or number of
observations. The value of k is rounded to the nearest integer :
If n = 100, $k = 1 + 3.322 \log_{10} 100 = 1 + 6.644 = 7.644 \approx 8$
If n = 10,000, $k = 1 + 3.322 \log_{10} 10{,}000 = 1 + 13.288 = 14.288 \approx 14$
However, this rule should be applied only when the number of observations is not very small.
Further, the number of class intervals should be such that they give a uniform and unimodal
distribution, which means that the frequencies in the given classes increase and decrease steadily
and there are no sudden jumps. The number of classes should be an integer, preferably 5 or a
multiple of 5 (10, 15, 20, 25 etc.), which is convenient for numerical computations.
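Sturges' formula for the number of classes is easy to check numerically. The sketch below uses the base-10 logarithm and rounds to the nearest integer, which reproduces both worked values given above (8 and 14). The function name sturges_classes is an assumption made for illustration.

```python
import math


def sturges_classes(n):
    """Number of classes suggested by Sturges' formula, rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n))


for n in (100, 10_000):
    print(n, sturges_classes(n))
```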
3. Size of Class Intervals : Because the size of the class interval is inversely proportional to the
number of classes in a given distribution, the choice of the size of the class interval
depends upon the sound subjective judgment of the statistician. An approximate value of the
magnitude of the class interval, say i, can be calculated with the help of Sturges' Rule :
$i = \dfrac{\text{Range}}{1 + 3.322 \log_{10} n}$
where i stands for class magnitude or interval, Range refers to the difference between the largest
and smallest value of the distribution, and n refers to the total number of observations.
If we are given the following information: n = 400, Largest item = 1300 and Smallest item =
340, then
$i = \dfrac{1300 - 340}{1 + 3.322 \log_{10} 400} = \dfrac{960}{1 + 3.322 \times 2.6021} = \dfrac{960}{9.644} \approx 99.54$
or approximately 100.
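The class-interval form of Sturges' rule divides the range by 1 + 3.322 log n. A minimal sketch with the figures given above (n = 400, largest item 1300, smallest item 340) follows; the function name is an assumption made for illustration.

```python
import math


def sturges_class_width(largest, smallest, n):
    """Approximate class width: the range divided by (1 + 3.322 log10 n)."""
    value_range = largest - smallest
    return value_range / (1 + 3.322 * math.log10(n))


width = sturges_class_width(largest=1300, smallest=340, n=400)
print(round(width, 2))
```

In practice the result would be rounded to a convenient figure, such as 100 here.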
Another rule to determine the size of the class interval is that the length of the class interval should
not be greater than one-fourth of the estimated population standard deviation. If $\hat{\sigma}$ is the
estimate of the population standard deviation, then the length of the class interval is given by:
$i \le \hat{\sigma}/4$.
The size of class intervals should be taken as 5 or a multiple of 5 (10, 15, 20 etc.) for easy
computation of various statistical measures of the frequency distribution. Class intervals should
be so fixed that each class has a convenient mid-point around which all the observations in that
class cluster. This means that the entire frequency of the class is assumed to be concentrated at the
mid-value of the class. It is always desirable to take class intervals of equal or uniform magnitude
throughout the frequency distribution.
4. Class Boundaries: If in a grouped frequency distribution there are gaps between the upper
limit of any class and the lower limit of the succeeding class (as in the case of the inclusive type of
classification), there is a need to convert the data into a continuous distribution by applying a
correction factor for continuity to determine new classes of the exclusive type. The lower and
upper class limits of the new exclusive type classes are called class boundaries.
If d is the gap between the upper limit of any class and the lower limit of the succeeding class, the
class boundaries for any class are given by:
$\text{Upper class boundary} = \text{Upper class limit} + \dfrac{d}{2}$
$\text{Lower class boundary} = \text{Lower class limit} - \dfrac{d}{2}$
d/2 is called the correction factor.
Let us consider the following example, in which d = 1 and the correction factor is 0.5 :
Marks Class Boundaries
20–24 (20 − 0.5, 24 + 0.5) i.e., 19.5–24.5
25–29 (25 − 0.5, 29 + 0.5) i.e., 24.5–29.5
30–34 (30 − 0.5, 34 + 0.5) i.e., 29.5–34.5
35–39 (35 − 0.5, 39 + 0.5) i.e., 34.5–39.5
40–44 (40 − 0.5, 44 + 0.5) i.e., 39.5–44.5
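The conversion from inclusive class limits to class boundaries can be sketched as follows. The sketch assumes at least two classes with the same gap d between every pair of neighbouring classes; the function name class_boundaries is an assumption made for illustration.

```python
def class_boundaries(classes):
    """Convert inclusive class limits to class boundaries using the correction factor d/2."""
    # d is the gap between one class's upper limit and the next class's lower limit.
    d = classes[1][0] - classes[0][1]
    half = d / 2
    return [(lower - half, upper + half) for lower, upper in classes]


marks_classes = [(20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]
for (lower, upper), (lb, ub) in zip(marks_classes, class_boundaries(marks_classes)):
    print(f"{lower}-{upper}: {lb}-{ub}")
```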
5. Mid-value or Class Mark: The mid value or class mark is the value of a variable which is
exactly at the middle of the class. The mid-value of any class is obtained by dividing the sum of
the upper and lower class limits by 2.
$\text{Mid-value of a class} = \dfrac{\text{Lower class limit} + \text{Upper class limit}}{2}$
The class limits should be selected in such a manner that the observations in any class are evenly
distributed throughout the class interval so that the actual average of the observations in any
class is very close to the mid-value of the class.
6. Open End Classes : A classification is termed open end if the lower limit of
the first class or the upper limit of the last class or both are not specified. Classes in
which one of the limits is missing are called open end classes, for example, classes such as
marks less than 20 or age above 60 years. As far as possible, open end classes should be avoided
because in such classes the mid-value cannot be accurately obtained. But if open end classes
are inevitable, then it is customary to estimate the class mark or mid-value for the first class with
reference to the succeeding class. In other words, we assume that the magnitude of the first class
is the same as that of the second class.
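The convention for an open first class can be sketched as below. The class "marks less than 20" comes from the text; the succeeding class 20 to 30 is a hypothetical assumption added for illustration, as are the function names.

```python
def class_mark(lower, upper):
    """Mid-value of a class: half the sum of its lower and upper limits."""
    return (lower + upper) / 2


def open_first_class_mark(first_upper, second_lower, second_upper):
    """Estimate the mid-value of an open first class ("less than first_upper"),
    assuming it has the same width as the second class."""
    width = second_upper - second_lower
    assumed_lower = first_upper - width
    return class_mark(assumed_lower, first_upper)


# "Marks less than 20" followed by a hypothetical class 20 to 30:
# the open class is treated as 10 to 20.
print(open_first_class_mark(first_upper=20, second_lower=20, second_upper=30))
```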
Example: Construct a frequency distribution from the following data by inclusive method taking
4 as the class interval:
10 17 15 22 11 16 19 24 29 18
25 26 32 14 17 20 23 27 30 12
15 18 24 36 18 15 21 28 33 38
34 13 10 16 20 22 29 19 23 31
Solution: Because the minimum value of the variable is 10, which is a very convenient figure for
the lower limit of the first class, and the magnitude of the class interval is given to be 4,
the classes for preparing the frequency distribution by the inclusive method will be 10–13, 14–17,
18–21, 22–25, ..., 38–41.
Frequency Distribution
Class Interval Tally Bars Frequency (f)
10—13 5
14—17 8
18—21 8
22—25 7
26—29 5
30—33 4
34—37 2
38—41 1
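The frequencies in this table can be reproduced with a short script. This is a sketch of the inclusive method for whole-number data, not a general-purpose routine; the function name inclusive_frequency_table is an assumption made for illustration.

```python
data = [
    10, 17, 15, 22, 11, 16, 19, 24, 29, 18,
    25, 26, 32, 14, 17, 20, 23, 27, 30, 12,
    15, 18, 24, 36, 18, 15, 21, 28, 33, 38,
    34, 13, 10, 16, 20, 22, 29, 19, 23, 31,
]


def inclusive_frequency_table(values, start, width):
    """Count whole-number values into inclusive classes start..start+width-1, and so on."""
    table = {}
    lower = start
    while lower <= max(values):          # create classes until the largest value is covered
        table[(lower, lower + width - 1)] = 0
        lower += width
    for v in values:
        class_lower = start + ((v - start) // width) * width
        table[(class_lower, class_lower + width - 1)] += 1
    return table


table = inclusive_frequency_table(data, start=10, width=4)
for (lower, upper), f in table.items():
    print(f"{lower}-{upper}: {f}")
print("Total:", sum(table.values()))
```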
Example: Prepare a statistical table from the following :
Weekly wages (Rs.) of 100 workers of Factory A
88 23 27 28 86 96 94 93 86 99
82 24 24 55 88 99 55 86 82 36
96 39 26 54 87 100 56 84 83 46
102 48 27 26 29 100 59 83 84 48
104 46 30 29 40 101 60 89 46 49
106 33 36 30 40 103 70 90 49 50
104 36 37 40 40 106 72 94 50 60
24 39 49 46 66 107 76 96 46 67
26 78 50 44 43 46 79 99 36 68
29 67 56 99 93 48 80 102 32 51
Solution: The lowest value is 23 and the highest is 107. The difference between the lowest and
highest value is 84. If we take a class interval of 10, nine classes would be made. The first class
should be taken as 20–30 instead of 23–33 as per the guidelines of classification.
Frequency Distribution of the Wages of 100 Workers
Wages (Rs.) Tally Bars Frequency (f)
20—30 13
30—40 11
40—50 18
50—60 10
60—70 6
70—80 5
80—90 14
90—100 12
100—110 11
Total 100
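The same count can be done programmatically by the exclusive method, in which a value equal to an upper limit goes to the next class. The sketch below uses the 100 wages listed above; the function name and its parameters are assumptions made for illustration.

```python
wages = [
    88, 23, 27, 28, 86, 96, 94, 93, 86, 99,
    82, 24, 24, 55, 88, 99, 55, 86, 82, 36,
    96, 39, 26, 54, 87, 100, 56, 84, 83, 46,
    102, 48, 27, 26, 29, 100, 59, 83, 84, 48,
    104, 46, 30, 29, 40, 101, 60, 89, 46, 49,
    106, 33, 36, 30, 40, 103, 70, 90, 49, 50,
    104, 36, 37, 40, 40, 106, 72, 94, 50, 60,
    24, 39, 49, 46, 66, 107, 76, 96, 46, 67,
    26, 78, 50, 44, 43, 46, 79, 99, 36, 68,
    29, 67, 56, 99, 93, 48, 80, 102, 32, 51,
]


def exclusive_frequency_table(values, start, width, classes):
    """Count values into exclusive classes [lower, upper)."""
    counts = [0] * classes
    for v in values:
        counts[(v - start) // width] += 1   # integer division picks the class index
    return [((start + i * width, start + (i + 1) * width), f)
            for i, f in enumerate(counts)]


table = exclusive_frequency_table(wages, start=20, width=10, classes=9)
for (lower, upper), f in table:
    print(f"{lower}-{upper}: {f}")
```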
2.3 Graphs of Frequency Distributions
The guiding principles for the graphic representation of frequency distributions are the same as
for the diagrammatic and graphic representation of other types of data. The information
contained in a frequency distribution can be shown in graphs that reveal important
characteristics and relationships which are not easily discernible on a simple examination of the
frequency tables. The most commonly used graphs for charting a frequency distribution are :
1. Histogram
2. Frequency polygon
3. Smoothed frequency curves
4. Ogives or cumulative frequency curves.
2.3.1. Histogram
The term `histogram' must not be confused with the term `historigram', which relates to time
charts. The histogram is the best way of presenting a simple frequency distribution graphically.
Statistically, a histogram is a graph that represents the class frequencies in a
frequency distribution by vertical adjacent rectangles.
While constructing a histogram, the variable is always taken on the X-axis and the corresponding
frequencies on the Y-axis. Each class is then represented by a distance on the scale that is
proportional to its class-interval. The distance for each rectangle on the X-axis remains the
same if the class-intervals are uniform throughout; if they differ, the widths of the
rectangles change proportionately. The Y-axis represents the frequencies of each class,
which constitute the height of its rectangle. We get a series of rectangles each having a class
interval distance as its width and the frequency distance as its height. The area of the histogram
represents the total frequency.
The histogram should be clearly distinguished from a bar diagram. A bar diagram is
one-dimensional: only the length of the bar matters, not its width. A histogram is
two-dimensional: both the length and the width matter. However, a histogram can be
misleading if the distribution has unequal class intervals and suitable adjustments in frequencies
are not made.
The technique of constructing histogram is explained for :
(i) distributions having equal class-intervals, and
(ii) distributions having unequal class-intervals.
When class-intervals are equal, take frequency on the Y-axis, the variable on the X-axis and
construct rectangles. In such a case the heights of the rectangles will be proportional to the
frequencies.
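For equal class-intervals, a rough text version of a histogram can be produced without any plotting library. The sketch below uses the wage distribution of the 100 workers from section 2.2. The bars are drawn sideways, so the bar length stands in for the height of the rectangle. The function name text_histogram is an assumption made for illustration.

```python
def text_histogram(classes, frequencies, symbol="#"):
    """Return one text bar per class; with equal class widths the bar length equals the frequency."""
    lines = []
    for (lower, upper), f in zip(classes, frequencies):
        lines.append(f"{lower:>3}-{upper:<3} | {symbol * f} {f}")
    return lines


# Wage classes 20-30, 30-40, ..., 100-110 and their frequencies.
classes = [(lower, lower + 10) for lower in range(20, 110, 10)]
frequencies = [13, 11, 18, 10, 6, 5, 14, 12, 11]

lines = text_histogram(classes, frequencies)
print("\n".join(lines))
```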
Histograms
It is often useful to look at the distribution of the data, or the frequency with which values
fall into pre-set bins of specified sizes. The selection of these bins is up to you, but
remember that they should be selected in order to illuminate your data, not obfuscate it.
A histogram is similar to a bar chart. However, histograms are used for continuous (as opposed to
discrete or qualitative) data. The defining property of a histogram is:
The area of each bar is proportional to the frequency.
If each bin has an equal width, then this can be easily done by plotting frequency on the
vertical axis. However, histograms can also be drawn with unequal bin sizes, for which one
can plot frequency density.
To produce a histogram with equal bin sizes:
1. Select a minimum, a maximum, and a bin size. All three of these are up to you. In the
histogram data used above the minimum is 1, the maximum is 110, and the bin size is 10.
2. Calculate your bins and how many values fall into each of them. For the histogram data
the bins are:
1 ≤ x < 10, 16 values.
10 ≤ x < 20, 4 values.
20 ≤ x < 30, 4 values.
30 ≤ x < 40, 2 values.
40 ≤ x < 50, 2 values.
50 ≤ x < 60, 1 value.
60 ≤ x < 70, 0 values.
70 ≤ x < 80, 0 values.
80 ≤ x < 90, 0 values.
90 ≤ x < 100, 0 values.
100 ≤ x < 110, 0 values.
110 ≤ x < 120, 1 value.
3. Plot the counts you figured out above. Do this using a standard bar plot.
Worked Problem
Let's say you are an avid roleplayer who loves to play Mechwarrior, a d6 (6 sided die) based
game. You have just purchased a new 6 sided die and would like to see whether it is biased
(in combination with you when you roll it).
What We Expect
So before we look at what we get from rolling the die, let's look at what we would expect.
First, if a die is unbiased it means that the odds of rolling a six are exactly the same as the
odds of rolling a 1; there wouldn't be any favoritism towards certain values. Using the
standard equation for the arithmetic mean, we find that μ = 3.5. We would also expect the
histogram to be roughly even all of the way across, though it will almost never be perfect
simply because we are dealing with an element of random chance.
What We Get
Here are the numbers that you collect:
15641355641566451436
13642416422434116355
43534225654353315445
12516543242133346113
66146665315634555244
Analysis
The mean of these 100 rolls is 3.71, which is fairly close to the 3.5 we would expect for an
unbiased die. So let's create a histogram to see if there is any significant difference in
the distribution.
The only logical way to divide up dice rolls into bins is by what's showing on the die
face:
Face :   1  2  3  4  5  6
Count : 16  9 17 21 20 17
If we are good at visualizing information, we can simply use a table, such as the one
above, to see what might be happening. Often, however, it is useful to have a visual
representation. As the amount and variety of data we want to display increases, the need
for graphs instead of a simple table increases.
Looking at the above figure, we clearly see that sides 1, 3, and 6 are almost exactly what
we would expect by chance. Sides 4 and 5 are slightly greater, but not too much so, and
side 2 is a lot less. This could be the result of chance, or it could represent an actual
anomaly in the data, and it is something to take note of and keep in mind. We'll address this
issue again in later chapters.
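The counts for the worked problem can be reproduced from the 100 recorded rolls. The sketch below tallies each face, compares it with the count expected from an unbiased die (100/6, about 16.7 per face), and prints a rough text histogram. The variable names are assumptions made for illustration.

```python
from collections import Counter

rolls_text = """
15641355641566451436
13642416422434116355
43534225654353315445
12516543242133346113
66146665315634555244
"""

rolls = [int(ch) for ch in rolls_text if ch.isdigit()]
counts = Counter(rolls)
expected = len(rolls) / 6            # expected count per face for an unbiased die
sample_mean = sum(rolls) / len(rolls)

for face in range(1, 7):
    print(f"{face}: {'#' * counts[face]} {counts[face]} (expected about {expected:.1f})")
print("sample mean:", sample_mean)
```

Whether the shortfall on side 2 is more than chance is a question for a formal test, which this sketch does not attempt.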
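Frequency Density
Another way of drawing a histogram is to work out the frequency density, which is the
frequency divided by the class width. When the classes have unequal widths, plotting the
frequency density as the height keeps the area of each bar proportional to its frequency.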
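A short sketch shows why frequency density is needed when class widths differ. The classes and frequencies below are hypothetical, chosen only for illustration, and the function name frequency_density is likewise an assumption.

```python
import math


def frequency_density(classes, frequencies):
    """Frequency density of each class: frequency divided by class width."""
    return [f / (upper - lower) for (lower, upper), f in zip(classes, frequencies)]


# Hypothetical distribution with unequal class widths (10, 10, 20 and 40).
classes = [(0, 10), (10, 20), (20, 40), (40, 80)]
frequencies = [5, 8, 12, 8]

densities = frequency_density(classes, frequencies)
for (lower, upper), f, d in zip(classes, frequencies, densities):
    area = d * (upper - lower)       # bar area = height x width = frequency
    print(f"{lower}-{upper}: frequency {f}, density {d:.2f}, area {area:.0f}")
```

The classes 10 to 20 and 40 to 80 both hold 8 observations, but the wider class gets a bar one quarter as tall, so the two bars have equal area.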
2.3.2. Frequency Polygon
[Figure: a histogram with an overlaid frequency polygon.]
The midpoints of the tops of the rectangles in a histogram are joined together by
straight lines. This gives a polygon, i.e. a figure with many angles.
It is used when two or more sets of data are to be illustrated on the same diagram, such as death
rates in smokers and non-smokers, or birth and death rates of a population.
One way to form a frequency polygon is to connect the midpoints at the top of the bars of a
histogram with line segments (or a smooth curve). Of course, the midpoints themselves could
easily be plotted without the histogram and be joined by line segments. Sometimes it is
beneficial to show the histogram and frequency polygon together. At other times the frequency
polygon is easier to read than the histogram because its low and high points are easier to
pick out.
Unlike histograms, frequency polygons can be superimposed so as to compare several frequency
distributions. This is the main advantage of the frequency polygon over the histogram: the
polygons of several distributions can be drawn on the same axes, whereas histograms have to be
drawn on separate graphs to be compared.
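The points of a frequency polygon are simply the class mid-values paired with the class frequencies. The sketch below computes them for the wage distribution of the 100 workers from section 2.2; the function name frequency_polygon_points is an assumption made for illustration.

```python
def frequency_polygon_points(classes, frequencies):
    """Return (mid-value, frequency) pairs; joining them with straight lines gives the polygon."""
    return [((lower + upper) / 2, f) for (lower, upper), f in zip(classes, frequencies)]


# Wage classes 20-30, 30-40, ..., 100-110 and their frequencies.
classes = [(lower, lower + 10) for lower in range(20, 110, 10)]
frequencies = [13, 11, 18, 10, 6, 5, 14, 12, 11]

points = frequency_polygon_points(classes, frequencies)
for mid, f in points:
    print(mid, f)
```

To compare several distributions, the point lists of each would be plotted on the same axes.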
2.3.3. Smoothed Frequency Curve
A smoothed frequency curve can be drawn through the various points of the polygon. The curve
is drawn freehand in such a manner that the area included under the curve is approximately
the same as that of the polygon. The object of drawing a smoothed curve is to eliminate all
accidental variations which exist in the original data. While smoothening, the top of the curve
would overtop the highest point of the polygon, particularly when the magnitude of the class interval
is large. The curve should look as regular as possible and all sudden turns should be avoided. The
extent of smoothening would depend upon the nature of the data. For drawing a smoothed
frequency curve it is necessary to first draw the polygon and then smoothen it. We must keep in
mind the following points when smoothening a frequency graph:
(i) Only frequency distribution based on samples should be smoothened.
(ii) Only continuous series should be smoothened.
(iii) The total area under the curve should be equal to the area under the histogram or polygon.
2.3.4. Cumulative Frequency Curves or Ogives
We have discussed the charting of simple distributions where each frequency refers to the
measurement of the class-interval against which it is placed. Sometimes it becomes necessary to
know the number of items whose values are greater or less than a certain amount. We may, for
example, be interested in knowing the number of students whose weight is less than 65 lbs. or
more than say 15.5 lbs. To get this information, it is necessary to change the form of frequency
distribution from a simple to a cumulative distribution. In a cumulative frequency distribution,
the frequency of each class is made to include the frequencies of all the lower or all the upper
classes depending upon the manner in which cumulation is done. The graph of such a
distribution is called a cumulative frequency curve or an Ogive. There are two methods of
constructing ogives, namely:
(i) less than method, and
(ii) more than method.
In the less than method, we start with the upper limit of each class and go on adding the frequencies.
When these frequencies are plotted we get a rising curve.
In the more than method, we start with the lower limit of each class and subtract the frequency of
each class from the total frequency. When these frequencies are plotted, we get a declining curve.
The following example illustrates both types of ogives.
Q4: Draw ogives by both the methods from the following data.
Distribution of weights of the students of a college (lbs.)
Weights No. of Students
90.5—100.5 5
100.5—110.5 34
110.5—120.5 139
120.5—130.5 300
130.5—140.5 367
140.5—150.5 319
150.5—160.5 205
160.5—170.5 76
170.5—180.5 43
180.5—190.5 16
190.5—200.5 3
200.5—210.5 4
210.5—220.5 3
220.5—230.5 1
Less than (Weights) Cumulative Frequency
100.5 5
110.5 39
120.5 178
130.5 478
140.5 845
150.5 1164
160.5 1369
170.5 1445
180.5 1488
190.5 1504
200.5 1507
210.5 1511
220.5 1514
230.5 1515
Plot these frequencies and weights on a graph paper. The curve formed is called an Ogive.
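Both cumulative series for the weights data can be computed as in the sketch below. The less than values match the table above. The more than values are obtained by subtracting each class frequency in turn from the total of 1515. The variable names are assumptions made for illustration.

```python
from itertools import accumulate

# Class boundaries 90.5, 100.5, ..., 230.5 and the frequency of each class.
boundaries = [90.5 + 10 * i for i in range(15)]
frequencies = [5, 34, 139, 300, 367, 319, 205, 76, 43, 16, 3, 4, 3, 1]

total = sum(frequencies)
less_than = list(accumulate(frequencies))                 # running totals, rising
more_than = [total - c for c in [0] + less_than[:-1]]     # remaining totals, declining

# Less than values are plotted against upper limits, more than values against lower limits.
less_than_points = list(zip(boundaries[1:], less_than))
more_than_points = list(zip(boundaries[:-1], more_than))

for (upper, lt), (lower, mt) in zip(less_than_points, more_than_points):
    print(f"less than {upper}: {lt:>5}    more than {lower}: {mt:>5}")
```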
Although the graphs are a powerful and effective method of presenting statistical data, they are
not under all circumstances and for all purposes complete substitutes for tabular and other forms
of presentation. The specialist in this field is one who recognizes not only the advantages but also
the limitations of these techniques. He knows when to use and when not to use these methods
and from his experience and expertise is able to select the most appropriate method for every
purpose.
Q5. The following distribution is with regard to weight in grams of mangoes of a given variety.
If mangoes of weight less than 443 grams be considered unsuitable for foreign market, what is
the percentage of total mangoes suitable for it? Assume the given frequency distribution to be
typical of the variety:
Weight in gms. No. of mangoes
410–419 10
420–429 20
430–439 42
440–449 54
450–459 45
460–469 18
470–479 7
Draw an ogive of `more than' type of the above data and deduce how many mangoes will be
more than 443 grams.
The component parts i.e. sectors have been cut beginning from top in clockwise order.
Note that the percentages in a list may not add up to exactly 100% due to rounding. For example,
if a person spends a third of their time on each of three activities, the rounded shares of 33%,
33% and 33% sum to 99%.
Warning: Pie charts are a poor way of communicating information. The eye is good at judging
linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of
displaying this type of data.
Cleveland (1985), page 264: "Data that can be shown by pie charts always can be shown by a dot
chart. This means that judgments of position along a common scale can be made instead of the
less accurate angle judgments." This statement is based on the empirical investigations of
Cleveland and McGill as well as investigations by perceptual psychologists.
Three-dimensional (3d) pie charts compound perceptual misinterpretation of statistical
information by altering the relative angle of pie slices to create the impression of depth into
a vanishing point. Angles and areas at the bottom of the chart must be exaggerated and the angles
and areas at the top of the chart reduced in order to create the dimensional effect, which gives
a false depiction of the data.
Comparative pie charts are very difficult to read and compare if the ratio of the pie chart is not
given.
Examine our example of color preference for two different groups. How much work does it take to
compare the share of one color, say blue, between the two groups? First, we have to find blue on
each pie, and then remember how large its share is. If we did not include the share for
blue in the label, then we would probably be approximating the comparison. So, if we use multiple
pie charts, we have to expect that comparisons between charts would only be approximate.
What is the most popular color in the left graph? Red. But note that you have to look at all of the
colors and read the label to see which it might be. Also, this author was kind when creating these
two graphs because he used the same color for the same object. Imagine the confusion if, in the
right-hand chart, Red had instead been assigned to whichever slice was largest.
If two shares of data should not be compared via the comparative pie chart, what kind of graph
would be preferred? The stacked bar chart is probably the most appropriate for share-of-the-total
comparisons. Again, exact comparisons cannot be done with graphs, and therefore a table may
supplement the graph with detailed information.
2.4 QUESTIONS: