Frequency Distribution

This instructional module provides content on organizing and presenting quantitative data through classification, frequency distributions, and graphical methods. It discusses collecting and classifying data, creating frequency tables to summarize categorical and continuous variables, and using histograms, bar graphs, and other visualizations to depict distributions. The goal is for students to learn how to properly arrange and display data to facilitate analysis and draw valid statistical inferences.

Uploaded by

jackblack
Copyright: © All Rights Reserved

Republic of the Philippines

NUEVA VIZCAYA STATE UNIVERSITY


Bayombong, Nueva Vizcaya
INSTRUCTIONAL MODULE
IM No.:_________________________________

College: College of Industrial Technology
Campus: Bambang Campus

DEGREE PROGRAM: BSInte                              COURSE NO.:
SPECIALIZATION: Web/Network Management System      COURSE TITLE: Quantitative Methods
YEAR LEVEL: 3          TIME FRAME:          WK NO.:          IM NO.:

I. UNIT TITLE/CHAPTER TITLE


Classification of Data, Frequency Distribution and Graphical Presentation

II. LESSON TITLE (Topics)

III. LESSON OVERVIEW

Statistics are numerical statements and facts collected from any field of enquiry in order
to draw valid inferences for decision making. Data collection is, in fact, the most
important aspect of a research experiment or statistical survey. After data have been
collected, the next step is to present them in an orderly and logical form so that their
essential features become explicit. Proper presentation is needed because collected data
in raw form are often so voluminous that they cannot be easily comprehended or analyzed.
After collection, therefore, the data must be classified and presented in a way that
brings out their points of similarity and dissimilarity.

IV. DESIRED LEARNING OUTCOMES

After studying this module, the student shall be able to organize and describe distributions
of data by using a number of different methods, including frequency tables, histograms,
standard line and bar graphs, stem-and-leaf displays, scatter plots, and box-and-whisker
plots.

V. LESSON CONTENT

Collection of Data

To study any problem by statistical methods, the relevant data must first be collected.
Sometimes the data are collected from a research experiment or from the primary sampling
units (e.g. households). Sometimes the relevant data already exist in published or
unpublished form, having been collected by a private body, a Government agency or a research
organization for its own use or for supplying public information. In making use of such data
(called secondary data), one has to be particularly careful about the definitions of terms and
concepts used by the collecting authority, and about the method of collection and the
reliability of the data. More often, one has to collect data directly from the field of enquiry.
The data are then said to be of the primary type. Primary data may be collected by
interviewing a number of persons and filling in questionnaires relevant to the problem; in a
family income and expenditure survey, for example, one will generally interview the head of
each family. The data collected should be carefully scrutinized before they are subjected to
statistical treatment.

NVSU-FR-ICD-05-00 (081220) Page 1 of __

Classification of Data

Classification is the process of arranging data into different groups or classes according to
some common characteristics. According to Connor, classification may be defined as the
process of arranging things in groups or classes according to their resemblances and
affinities. The functions of classification may be summarized as follows:
• It condenses the data.
• It facilitates comparisons.
• It helps to study the relationships.
• It facilitates the statistical treatment of data.

The classification of data is generally done on a geographical, chronological, qualitative or
quantitative basis, on the following lines:

a) In geographical classification, data are arranged according to places, areas or regions.
b) In chronological classification, data are arranged according to time, i.e. weekly,
monthly, quarterly, half-yearly, annually, etc.
c) In qualitative classification, data are arranged according to attributes such as sex,
marital status, educational standard, region, farm, breed, disease, etc.
d) Quantitative classification means arranging data according to a characteristic that has
been measured, e.g. height, weight, milk yield or the fat content of a dairy product. In
this type of classification, certain classes are formed and the units belonging to these
classes are assigned to them. The quantitative phenomenon under study is known as a
variable, and hence this classification is also sometimes called classification by
variables.

2.4 Frequency Distribution

A frequency or relative frequency table is used to summarize categorical, nominal, and
ordinal data. It can also be used to summarize continuous data once the data set has been
divided into meaningful groups.

To build one, count the number of observations that fall into each category. The number
associated with each category is called the frequency, and the collection of frequencies over
all categories gives the frequency distribution of that variable. Generally, a frequency
distribution has 5 to 15 classes. A frequency distribution:
• presents data in a useful form and allows for a visual interpretation;
• enables analysis of the data set, including where the data are concentrated or clustered,
the range of values, and the presence of extreme values.

The frequency distribution is a statistical table which shows the values of a variable in order
of magnitude, either individually or in groups, with the corresponding frequencies side by
side. The data pertaining to a quantitative phenomenon can be presented in four ways:
• The set or series of individual observations: ungrouped (raw) or arranged (arrayed) data.
• Discrete or ungrouped frequency distribution.
• Grouped frequency distribution.
• Continuous frequency distribution.

Frequency Table for Qualitative Data


Color Preferences of Customers

Frequency Distribution for Quantitative Data

Table 1
Frequency Distribution of Time (min)

Time    Count
110     1
115     2
120     4
125     3
130     5
135     3
140     4
145     2
150     1

Note on Table 1: The table lists 9 classes. The frequency of the first class is 1, i.e. there
is 1 value within the class; the class has a midpoint of 110.

The relative frequency is a number which describes the proportion of observations falling in
a given category. Instead of counts, we report relative frequencies or percentages.

Example 1. The following data pertain to first lactation milk yield (in kg) of 100 cows

1630 1648 1663 1665 1671 1677 1680 1687 1690 1695
1787 1788 1790 1800 1862 1855 1815 1835 1845 1818
1974 1998 2000 2000 2005 2031 2045 2045 2050 2056
2168 2171 2180 2187 2200 2218 2245 2323 2372 2397
2063 2069 2085 2098 2100 2100 2100 2105 2117 2131
1736 1743 1760 1765 1763 1767 1775 1775 1776 1780
1695 1754 1698 1700 1742 1732 1711 1713 1718 1728

1854 1850 1855 1856 1857 1860 1863 1863 1875 1880
1890 1900 1910 1912 1915 1918 1928 1916 1915 1947
1950 1958 1951 1960 1963 1968 1965 1967 1970 1969

The data given in Example 1 are called raw or ungrouped data; in this form they convey little
useful information. Our objective is to express this large mass of data in a suitably
condensed form that highlights the significant facts and comparisons and furnishes more
useful information, without sacrificing anything of interest about the important
characteristics of the distribution.

Array

A better presentation of the above raw data is to arrange them in ascending or descending
order of magnitude, which is called arraying the data. This is better than the raw data, but
it does not reduce the volume of the data.

Discrete or ungrouped frequency distribution

A much better way of presenting the data is to express them in the form of a discrete or
ungrouped frequency distribution, in which we count the number of times each value of the
variable occurs in the data. The number of times a variate value is repeated is called the
frequency of that variate value. For example, if seven Karan Fries cows have a first lactation
milk yield equal to 1900 kg, then 7 is the frequency of the first lactation yield of 1900 kg.

Grouped frequency distribution

A grouped frequency distribution is a statistical table which shows the values of the variable
in groups, with the corresponding frequencies side by side. Here the data are condensed by
dividing the entire range of the values of the variable into a suitable number of groups,
called classes (or class intervals), and then recording the number of observations in each
group. The boundary ends of a class are called class limits, e.g. for the class interval 0-10,
0 is the lower limit and 10 is the upper limit. The difference between the upper and lower
limits is called the magnitude (or width) of the class. The number of observations falling
within a particular class is called its frequency or class frequency. The variate value which
lies midway between the upper and lower limits is called the mid value or midpoint of that
class.

While preparing the frequency distribution, the following points must be kept in mind:
1. The class interval should be uniform, i.e. all classes should be of equal width. A
comparison of different frequency distributions is facilitated if the same class interval
is used for all. The class interval should be an integer as far as possible.
2. The class intervals should be chosen so that every observation is covered by the
frequency distribution.
3. The class intervals should be continuous. Open-ended classes, such as "less than a" or
"greater than b", should be avoided because they create difficulty in analysis and
interpretation.
4. An observation that falls on the common boundary between two classes should always be
put in the higher class, e.g. an observation of 30 is put in the class 30-40 and not in
20-30.
5. There should be neither too many nor too few classes. The number of classes should
never be less than 6 nor more than 30. With fewer classes accuracy may be lost, and with
more classes the computations become tedious. The optimum number of classes is
generally considered to be 15.

Number of classes

The following formula due to Sturges may be used to determine the number of classes:

$k = 1 + 3.322 \log_{10} N$

where k is the number of classes and N is the total frequency.

Size of class intervals

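The choice of class interval depends on the number of classes for a given distribution and on
the size of the data. As far as possible the class intervals should be of equal size. Prof.
Sturges has given the following formula for determining the size of class intervals:

$i = \dfrac{\text{Range}}{1 + 3.322 \log_{10} N}$

where the Range is the difference between the largest and smallest observations and N is the
total frequency.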

Example 2: Consider the data given in Example 1. Let us find the size of the class interval
and prepare the frequency distribution.

Solution: Here N = 100, the largest value is 2397 and the smallest value is 1630, so the
Range = 2397 - 1630 = 767. The number of classes is

$k = 1 + 3.322 \log_{10}(100) = 1 + 3.322 \times 2 = 7.644$

and the size of the class interval is

$i = \dfrac{767}{7.644} \approx 100.3$

which is rounded to 100. Taking the class intervals as 1630-1730, 1730-1830, ..., 2330-2430,
the frequency distribution of first lactation milk yield of 100 Karan Swiss cows is given
below in Table 2.1.

Table 2.1. Frequency distribution of first lactation milk yield of cows

Class interval   Frequency
(in kg)          (fi)
1630-1730        17
1730-1830        19
1830-1930        23
1930-2030        16
2030-2130        14
2130-2230        7
2230-2330        2
2330-2430        2
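
The steps of Example 2 can also be checked by computer. The following Python sketch is only an illustration and is not part of the original module: the function names `sturges_classes` and `grouped_frequencies` are hypothetical, and the starting point 1630 and the width 100 are the choices made in the solution above. A value on a class boundary is counted in the higher class.

```python
import math

# First lactation milk yields (kg) of the 100 cows listed in Example 1.
yields = [
    1630, 1648, 1663, 1665, 1671, 1677, 1680, 1687, 1690, 1695,
    1787, 1788, 1790, 1800, 1862, 1855, 1815, 1835, 1845, 1818,
    1974, 1998, 2000, 2000, 2005, 2031, 2045, 2045, 2050, 2056,
    2168, 2171, 2180, 2187, 2200, 2218, 2245, 2323, 2372, 2397,
    2063, 2069, 2085, 2098, 2100, 2100, 2100, 2105, 2117, 2131,
    1736, 1743, 1760, 1765, 1763, 1767, 1775, 1775, 1776, 1780,
    1695, 1754, 1698, 1700, 1742, 1732, 1711, 1713, 1718, 1728,
    1854, 1850, 1855, 1856, 1857, 1860, 1863, 1863, 1875, 1880,
    1890, 1900, 1910, 1912, 1915, 1918, 1928, 1916, 1915, 1947,
    1950, 1958, 1951, 1960, 1963, 1968, 1965, 1967, 1970, 1969,
]


def sturges_classes(n):
    """Number of classes suggested by Sturges' formula k = 1 + 3.322 log10(n)."""
    return 1 + 3.322 * math.log10(n)


def grouped_frequencies(data, start, width, n_classes):
    """Count observations in classes [lower, upper); boundary values go to the higher class."""
    counts = [0] * n_classes
    for x in data:
        counts[int((x - start) // width)] += 1
    return counts


k = sturges_classes(len(yields))                  # suggested number of classes
suggested_width = (max(yields) - min(yields)) / k  # range divided by k
counts = grouped_frequencies(yields, start=1630, width=100, n_classes=8)

for i, f in enumerate(counts):
    lower = 1630 + 100 * i
    print(f"{lower}-{lower + 100}: {f}")
```

Rounding the suggested width to a convenient integer, as done in the solution, keeps the class limits easy to read.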

Advantages of grouping

(i) In subsequent calculations, much labour is saved in numerical computation by
treating all individuals in a class interval as having the value at the centre of
that interval.
(ii) Where the observed sample is of moderate size and drawn from a large population,
the frequency table is more likely to exhibit a rise or fall of frequency against
class interval.

Cumulative Frequency Distribution

The cumulative frequency of a class is the total frequency up to and including that class. The
table of cumulative frequencies is called a cumulative frequency distribution table. There are
two types of cumulative frequency distribution. The cumulative frequency of all values
greater than or equal to the lower limit of each class gives the "more than" cumulative
frequency distribution. The cumulative frequency of all values less than the upper limit of
each class gives the "less than" cumulative frequency distribution. Let us illustrate this
through Example 3.

Example 3: Prepare the cumulative frequency distributions of the frequency distribution of
first lactation milk yield of Karan Swiss cows given in Table 2.1.

Solution: The less than and more than cumulative frequency distributions are shown in
Table 2.2.

Table 2.2. Cumulative frequency distribution of first lactation milk yield

Class interval   Frequency   Cumulative frequency (c.f.)
(in kg)          (fi)        Less than    More than
1630-1730        17          17           100
1730-1830        19          36           83
1830-1930        23          59           64
1930-2030        16          75           41
2030-2130        14          89           25
2130-2230        7           96           11
2230-2330        2           98           4
2330-2430        2           100          2
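
Both cumulative columns are running totals of the class frequencies, taken from opposite ends of the table. A minimal Python sketch (an illustration only; the variable names are assumptions, not part of the module):

```python
from itertools import accumulate

# Class frequencies for the classes 1630-1730, ..., 2330-2430.
frequencies = [17, 19, 23, 16, 14, 7, 2, 2]

# "Less than" type: running total starting from the first class.
less_than = list(accumulate(frequencies))

# "More than" type: running total starting from the last class.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for f, lt, mt in zip(frequencies, less_than, more_than):
    print(f, lt, mt)
```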

Graphical Presentation

Histograms

Histograms give a visual representation of continuous data. We consider two separate cases:
(i) all the bars in the histogram have the same width; (ii) the intervals are of variable
widths.

Histograms with equal class widths

Example:
Mercury contamination can be particularly high in certain types of fish. The mercury content
(ppm) in the hair of 40 fishermen in a region thought to be particularly vulnerable is given
below (from the paper “Mercury content of commercially imported fish of the Seychelles, and
hair mercury levels of a selected part of the population”, Environ. Research, (1983), 305-312).
13.26 32.43 18.10 58.23 64.00 68.20 35.35 33.92 23.94 18.28
22.05 39.14 31.43 18.51 21.03 5.50 6.96 5.19 28.66 26.29
13.89 25.87 9.84 26.88 16.81 38.65 19.23 21.82 31.58 30.13
42.42 16.51 21.16 32.97 9.84 10.64 29.56 40.69 12.86 13.80

The first step is to group the data. A reasonable choice of class intervals is:
0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70.
The frequency table that results from the use of these intervals is:

Interval   Frequency
0-10       5
10-20      11
20-30      10
30-40      9
40-50      2
50-60      1
60-70      2

N.B. By convention, any observation that is at a boundary of a class will be put into the
higher class. For example, an observation of 10 above would be put into the 10-20 category.

To construct the histogram in this situation (i.e. all class widths equal):
• Mark the boundaries of the class intervals on the horizontal axis.
• The height of the bar above each interval can be taken as the frequency for that interval.

[Figure: A histogram showing mercury contamination in hair. Horizontal axis: Mercury content
(ppm), 0 to 70; vertical axis: Frequency.]

Instead of using frequencies to give the heights of the rectangles in a histogram, relative
frequencies may be used. The relative frequency for an interval is that interval's frequency
divided by the total frequency.

So for the mercury example:

Interval   Frequency   Relative frequency
0-10       5           .125
10-20      11          .275
20-30      10          .250
30-40      9           .225
40-50      2           .050
50-60      1           .025
60-70      2           .050
Total      40          1
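
The grouping and the relative frequencies for the mercury data can be reproduced with a few lines of Python. This is a sketch for illustration only; the variable names are assumptions, and the comparison `lower <= x < upper` implements the convention that a boundary value goes into the higher class.

```python
# Mercury content (ppm) in the hair of the 40 fishermen.
mercury = [
    13.26, 32.43, 18.10, 58.23, 64.00, 68.20, 35.35, 33.92, 23.94, 18.28,
    22.05, 39.14, 31.43, 18.51, 21.03, 5.50, 6.96, 5.19, 28.66, 26.29,
    13.89, 25.87, 9.84, 26.88, 16.81, 38.65, 19.23, 21.82, 31.58, 30.13,
    42.42, 16.51, 21.16, 32.97, 9.84, 10.64, 29.56, 40.69, 12.86, 13.80,
]

# Class boundaries 0-10, 10-20, ..., 60-70.
edges = [0, 10, 20, 30, 40, 50, 60, 70]

counts = [0] * (len(edges) - 1)
for x in mercury:
    for i in range(len(counts)):
        if edges[i] <= x < edges[i + 1]:
            counts[i] += 1
            break

total = sum(counts)
relative = [c / total for c in counts]  # relative frequency = frequency / total

for i, (c, r) in enumerate(zip(counts, relative)):
    print(f"{edges[i]}-{edges[i + 1]}: {c}  {r:.3f}")
```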


The relative frequencies can be expressed as percentages:

[Figure: A relative frequency histogram for the mercury data. Horizontal axis: Mercury content
(ppm), 0 to 70; vertical axis: Relative frequency (%), 0 to 30.]

Notice that the shape of the histograms, whether using frequencies or relative frequencies, is
the same.

Histograms with unequal class widths

There is no hard and fast rule as to how many intervals should be used. Too
many classes produce an uneven distribution, but having too few loses information.
Usually the number of classes is about 6-20. The more observations we have, the more
classes we will usually use.

The width of the intervals defining the histograms need not all be equal. It is
often sensible to choose short intervals where the data is quite dense but intervals with a
longer width where the data is more sparse. This will ensure that we don’t have too
many intervals with zero frequency, yet keeps as much information about the
distributional shape of the data as possible.

When unequal interval widths are used, the frequency density should be used on the
vertical scale of the histogram, where
Frequency density = Frequency ÷ class width.

Example:
The lengths (in metres) of 250 vehicles aboard a cross-channel ferry are summarised in the
following table:

Vehicle length (m)   Class width   Frequency   Frequency density
3.0-4.0              1             90          90
4.0-4.5              0.5           80          160
4.5-5.0              0.5           40          80
5.0-5.5              0.5           24          48
5.5-7.5              2             16          8

[Figure: A histogram showing the lengths of 250 vehicles. Horizontal axis: Vehicle length (m),
2 to 8; vertical axis: Frequency density, 0 to 200.]

Notice that if we had simply defined the heights of the rectangles to be the frequencies, then
the histogram would exaggerate, for example, the incidence of cars between 3 and 4 metres
in length.

An alternative way of producing a histogram in situations where not all class widths are equal
is to set the bar height to be the relative frequency density. This is given by:

Relative freq. density = Relative freq. ÷ class width.

If the histogram is produced in this way, then the total area of all the bars is 1.

Example (continued)
The relative frequency densities for the vehicle length data are as follows:

Vehicle length (m)   Class width   Frequency   Relative freq.   Rel. freq. density
3.0-4.0              1             90          0.36             0.36
4.0-4.5              0.5           80          0.32             0.64
4.5-5.0              0.5           40          0.16             0.32
5.0-5.5              0.5           24          0.096            0.192
5.5-7.5              2             16          0.064            0.032
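
The two density columns, and the statement that the bars then have total area 1, can be checked with the short Python sketch below. It is an illustration only; the tuple layout `(lower, upper, frequency)` and the variable names are assumptions.

```python
# (lower boundary, upper boundary, frequency) for each class of vehicle length (m).
classes = [
    (3.0, 4.0, 90),
    (4.0, 4.5, 80),
    (4.5, 5.0, 40),
    (5.0, 5.5, 24),
    (5.5, 7.5, 16),
]

total = sum(freq for _, _, freq in classes)

rows = []
for lower, upper, freq in classes:
    width = upper - lower
    density = freq / width                # frequency density
    rel_density = (freq / total) / width  # relative frequency density
    rows.append((width, density, rel_density))

# Area of each bar = width x height; with relative frequency densities the areas sum to 1.
area = sum(width * rel_density for width, _, rel_density in rows)

for (lower, upper, freq), (width, density, rel_density) in zip(classes, rows):
    print(f"{lower}-{upper}: width={width}, density={density}, rel. density={rel_density:.3f}")
```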

The corresponding histogram can then be produced:

[Figure: A histogram showing vehicle lengths. Horizontal axis: Length (m), with class
boundaries at 3.0, 4.0, 4.5, 5.0, 5.5 and 7.5; vertical axis: Frequency density, 0.0 to 0.7.]

Histogram shapes

Histograms are very useful for giving some idea of the shape of a density by approximating the
histogram to a smooth curve.

Densities can take many different shapes:

[Figure: example density shapes. Panels: Unimodal, Bimodal, Multimodal; Symmetric, Positive
skew, Negative skew; Normal, Heavy-tailed, Light-tailed.]


Histograms for discrete data

Discrete data is usually illustrated using a bar-line chart (or a bar chart), whilst histograms are
generally used for continuous data. However, when the number of possible values for the
observations is large, a bar diagram would become uninformative. In this case it is acceptable to
group the values into class intervals, much as you would for continuous data.

Example:
Suppose we have the following data:

1 1 2 2 2 3 3 4 4 5 5 5 5 6 6 7 7 7
8 9 9 9 9 10 10 10 10 10 11 11 11 11 12 12 12 12
13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 16 16 16
17 17 17 18 18 19 19 20 21 21 22 22 23 23 24 26 27 29

As there are a large number of different values here, to get a better idea of the shape of the
distribution, we can group data into classes. Let's consider grouping all observations between 1
- 3, 4 - 6 and so on. To draw a histogram we need a continuous scale and so we need to define
our histogram intervals to be 0.5 - 3.5, 3.5 - 6.5, and so on. (Remember: a histogram never has
gaps between the bars).

We then get the following frequency distribution:

Interval       Frequency
0.5 - 3.5      7
3.5 - 6.5      8
6.5 - 9.5      8
9.5 - 12.5     13
12.5 - 15.5    15
15.5 - 18.5    8
18.5 - 21.5    5
21.5 - 24.5    5
24.5 - 27.5    2
27.5 - 30.5    1

The histogram can now be drawn in the normal way.
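
As a check on the grouping step, the following Python sketch counts the listed observations in classes of three consecutive integers (1-3, 4-6, ...) and labels each class with boundaries that extend 0.5 on either side, so that the bars have no gaps. It is an illustration only; the variable names are assumptions.

```python
# The 72 discrete observations listed above.
values = [
    1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 7,
    8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12,
    13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 16, 16, 16,
    17, 17, 17, 18, 18, 19, 19, 20, 21, 21, 22, 22, 23, 23, 24, 26, 27, 29,
]

width = 3  # each class holds three consecutive integers: 1-3, 4-6, ...

counts = {}
for v in values:
    lowest = ((v - 1) // width) * width + 1         # smallest integer in the class of v
    interval = (lowest - 0.5, lowest + width - 0.5)  # continuous class boundaries
    counts[interval] = counts.get(interval, 0) + 1

for (lower, upper), f in sorted(counts.items()):
    print(f"{lower} - {upper}: {f}")
```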

Stem-and-leaf plots

Stem-and-leaf plots are an effective way of providing a visual display of quantitative data with
very little effort. The idea of the plots is to separate each observation into 2 parts - the first part
being the stem and the second the leaf.

To construct a stem-and-leaf plot:

• Select one or more leading digits for the stem values. The following digit or digits
become the leaves.
• List possible stem values in a vertical column.
• Record the leaf value for every observation beside the corresponding stem value.
• Indicate the units for stems and leaves.

Example:
To investigate the efficiency of new air-conditioning equipment installed on Boeing 720 aircraft,
the times (in hours) to first failure of the equipment were obtained from 28 different aircraft:
79 90 10 60 61 49 14 24 56 20 84 44 25 59
46 37 32 76 26 35 29 53 75 25 44 23 27 33

For these data an obvious choice for the stems is the leading digit (tens) and the leaves are
then the second digits (units). So, for example, the first observation of 79 has stem 7 and leaf 9.
The data values range from 10 up to 90, so we have the stem values 1-9.

An unordered stem-and-leaf diagram for the Boeing data (the leaves should be written in
columns):

Stem | Leaves
1 | 0 4
2 | 4 0 5 6 9 5 3 7
3 | 7 2 5 3
4 | 9 4 6 4
5 | 6 9 3
6 | 0 1
7 | 9 6 5
8 | 4
9 | 0

Scale: Stem = 10s, Leaves = units

An ordered stem-and-leaf diagram for the Boeing data (the leaves have now been put in order):

Stem | Leaves
1 | 0 4
2 | 0 3 4 5 5 6 7 9
3 | 2 3 5 7
4 | 4 4 6 9
5 | 3 6 9
6 | 0 1
7 | 5 6 9
8 | 4
9 | 0

Scale: Stem = 10s, Leaves = units

N.B. Rearranging the leaves in ascending order clarifies things and is useful for producing
numerical summaries.

N.B.2 One advantage that stem-and-leaf diagrams have over histograms is that they retain the
detail of the raw data.
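
The construction steps can be sketched in Python as follows. This is an illustration only, assuming two-digit data with the tens digit as stem and the units digit as leaf; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Times (hours) to first failure for the 28 aircraft.
failure_times = [
    79, 90, 10, 60, 61, 49, 14, 24, 56, 20, 84, 44, 25, 59,
    46, 37, 32, 76, 26, 35, 29, 53, 75, 25, 44, 23, 27, 33,
]


def stem_and_leaf(data, stem_unit=10):
    """Return {stem: ordered list of leaves}; stem = value // stem_unit, leaf = remainder."""
    plot = defaultdict(list)
    for x in sorted(data):  # sorting first gives an ordered diagram
        plot[x // stem_unit].append(x % stem_unit)
    return dict(plot)


plot = stem_and_leaf(failure_times)

# List every possible stem, including any that have no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Because the leaves keep the final digit of every observation, the original values can be read back from the diagram.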

Use of stem-and-leaf plots


• Stem-and-leaf plots give a visual display of the rough shape of the distribution of the
variable being measured. We can identify whether the density is a) unimodal or multimodal;
b) symmetric, negatively or positively skewed; c) normal, heavy- or light-tailed.
• Stem-and-leaf plots are useful for informal inference. We can find medians and quartiles
easily from the diagrams and obtain estimates of probabilities. For example, in the Boeing
data 10 pieces of equipment lasted under 30 hours, so we could estimate the probability of a
new piece of equipment failing within the first 30 hours as 10/28.
• Stem-and-leaf plots are useful for identifying outliers, i.e. unusually large or small
observations. For example, if the Boeing data had included an extra observation of 119, then
this might be an outlier:

 1 | 0 4
 2 | 0 3 4 5 5 6 7 9
 3 | 2 3 5 7
 4 | 4 4 6 9
 5 | 3 6 9
 6 | 0 1
 7 | 5 6 9
 8 | 4
 9 | 0
10 |
11 | 9   (this could be considered an outlying value)

Choice of stem unit

Choice of stem unit can be important.

Example:
To determine the age of a pre-historic settlement in North Wales, 24 small fragments from a
wooden boat found at the settlement were independently radio-carbon dated. The radio-carbon
determinations (in years) of the age of the fragments are:
4969 5163 5052 5144 4965 5152 4967 4934 4895 5078 5019 4908
5009 5046 4912 5012 4889 5034 4914 5117 4931 5081 4984 4881

Possibility 1: We could round each observation to the nearest one hundred years:

5000 5200 5100 5100 5000 5200 5000 4900 4900 5100 5000 4900
5000 5000 4900 5000 4900 5000 4900 5100 4900 5100 5000 4900

Taking the stem unit to be 1000 years gives the following diagram:

4 | 9 9 9 9 9 9 9 9
5 | 0 2 1 1 0 2 0 1 0 0 0 0 0 1 1 0

Scale: Stem = 1000's, Leaves = 100's

Because we have so few stem values here, we lose a lot of information. We can’t say anything
for example about the shape of the distribution.

Possibility 2: Round observations to the nearest 10 years.

4970 5160 5050 5140 4970 5150 4970 4930 4900 5080 5020 4910
5010 5050 4910 5010 4890 5030 4910 5120 4930 5080 4980 4880

Taking the stem unit as 100 years gives:

48 | 9 8
49 | 7 7 7 3 0 1 1 1 3 8
50 | 5 8 2 1 5 1 3 8
51 | 6 4 5 2

Scale: Stem = 100's, Leaves = 10's

This plot is a little more informative, but we could still do with having slightly more stems.

Possibility 3: Split the stems into high and low values. In each low (L) category you put
any 0s, 1s, 2s, 3s or 4s; in each high (H) category you write any 5s, 6s, 7s, 8s or 9s.

48L |
48H | 8 9
49L | 0 1 1 1 3 3
49H | 7 7 7 8
50L | 1 1 2 3
50H | 5 5 8 8
51L | 2 4
51H | 5 6

Scale: Stem = 100's, Leaves = 10's

The diagram is now quite informative about the distribution- there is evidence of a positive skew.

[Note that if the stem unit was taken to be 10s, then the diagram we would get would be poor-
we would then have too many stem values (a lot of the rows would have no values in them).]
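
The split-stem idea of Possibility 3 can be sketched in Python as below. This is an illustration only: the function name `split_stem_plot` is hypothetical, and halves are assumed to round upwards (so 4965 becomes 4970), which is why the integer expression `(x + 5) // 10` is used instead of Python's built-in `round`, which rounds halves to the nearest even number.

```python
# Radio-carbon age determinations (years) of the 24 fragments.
ages = [
    4969, 5163, 5052, 5144, 4965, 5152, 4967, 4934, 4895, 5078, 5019, 4908,
    5009, 5046, 4912, 5012, 4889, 5034, 4914, 5117, 4931, 5081, 4984, 4881,
]


def split_stem_plot(data):
    """Stem = hundreds, leaf = tens; each stem is split into L (leaves 0-4) and H (leaves 5-9)."""
    plot = {}
    for x in sorted(data):
        tens = (x + 5) // 10           # value rounded to the nearest 10 years, in units of 10
        stem, leaf = divmod(tens, 10)  # e.g. 497 -> stem 49, leaf 7
        half = "L" if leaf <= 4 else "H"
        plot.setdefault(f"{stem}{half}", []).append(leaf)
    return plot


plot = split_stem_plot(ages)
for label in sorted(plot):
    print(label, "|", " ".join(str(leaf) for leaf in plot[label]))
```

Only stems that contain at least one leaf are printed by this sketch; empty stems would have to be added by hand if they are wanted in the display.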

Back-to-back displays for displaying two independent samples

If there are 2 sets of data which you wish to compare, then both of these can be put on the
same stem-and-leaf plot with the leaves for one dataset going to the right and the leaves of the
other dataset going to the left.

Example:
Using a technique involving chromium dioxide, the protein assimilation efficiencies (i.e.
percentage of protein intake actually absorbed) were measured on field mice and voles fed on
their natural diets. The assimilation efficiencies (in percentages) are given below:

A.E.'s of field mice:


61.3 65.4 71.7 62.6 63.6 76.3 67.8 61.9
57.8 70.6 70.5 68.9 62.6 69.7 74.6

A.E.'s of voles:
51.7 66.7 72.0 69.8 63.7 77.2 62.6 63.5 69.2 67.5
70.1 67.3 75.2 73.8 59.6 69.9 77.6 74.1 73.7

Rounding observations to the nearest integer gives us:

An unordered back-to-back stem-and-leaf diagram for the protein data

A.E.s for field mice | Stem | A.E.s for voles
                     |  5L  | 2   (Outlier?)
                   8 |  5H  |
           3 2 4 3 1 |  6L  | 4 3 4 0
               9 8 5 |  6H  | 7 9 8 7
             0 1 1 2 |  7L  | 2 0 0 4 0 4 4
                 5 6 |  7H  | 7 5 8

Scale: Stem = 10's, Leaves = 1's


Then ordering the leaves we get:

An ordered back-to-back stem-and-leaf diagram showing the protein data

A.E.s for field mice | Stem | A.E.s for voles
                     |  5L  | 2
                   8 |  5H  |
           4 3 3 2 1 |  6L  | 0 3 4 4
               9 8 5 |  6H  | 7 7 8 9
             2 1 1 0 |  7L  | 0 0 0 2 4 4 4
                 6 5 |  7H  | 5 7 8

Scale: Stem = 10's, Leaves = 1's

Stem-and-leaf diagrams for matched-pair data

It is not a good idea to do a back-to-back plot if the 2 variates are not independent. Consider
the following example.

Example:
Fifteen people participated in a short typing course. Their typing speeds (words/min) before
and after the course were recorded:

Subject  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Before  15 18 23 27 36 12  8 19 32 22 17 21 16 15 33
After   26 28 27 26 28 24 26 42 32 36 20 29 21 22 28

These data are an example of matched-pair data (there are two measurements recorded on
each participant). Matched-pair data are likely to be dependent (a person with a fast typing
speed before the course is also likely to have a fast typing speed after the course). By drawing
a stem-and-leaf diagram you lose information about how the measurements pair up. You could
draw a scatter diagram (this would show the pairings). Alternatively, you could produce a stem-
and-leaf diagram of the differences:

Subject  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Change  11 10  4 -1 -8 12 18 23  0 14  3  8  5  7 -5
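
Each change is simply the "after" speed minus the "before" speed for the same subject, which keeps the pairing intact. A minimal Python sketch (an illustration only; the variable names are assumptions):

```python
# Typing speeds (words/min) for subjects 1 to 15, in the same order in both lists.
before = [15, 18, 23, 27, 36, 12, 8, 19, 32, 22, 17, 21, 16, 15, 33]
after = [26, 28, 27, 26, 28, 24, 26, 42, 32, 36, 20, 29, 21, 22, 28]

# zip pairs the two measurements taken on the same subject.
changes = [a - b for b, a in zip(before, after)]

improved = sum(1 for c in changes if c > 0)  # number of subjects whose speed increased

print(changes)
print(improved, "of", len(changes), "subjects improved")
```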

A stem-and-leaf diagram showing the change in typing speeds after a short course

-0 | 1 8 5
 0 | 4 0 3 8 5 7
 1 | 1 0 2 8 4
 2 | 3

Scale: Stem = 10's, Leaves = units.

A slightly more informative diagram can be obtained by splitting each stem up into two parts
(one for the lower leaves and the other for higher leaves):

A stem-and-leaf diagram showing the change in typing speeds after a short course

-0H | 8 5
-0L | 1
 0L | 4 0 3
 0H | 8 5 7
 1L | 1 0 2 4
 1H | 8
 2L | 3

Scale: Stem = 10's, Leaves = units.

Each diagram could then be ordered.

Problems

Stem-and-leaf plots cannot be used for displaying qualitative data and they become impractical
for large numbers of observations.

Cumulative Frequency Plots

A cumulative frequency plot also uses classes and frequencies. The cumulative frequency for a
class is the number of observations with values less than the upper boundary for that class.

Example:
Consider the mercury example again. The cumulative frequencies are given in the table below:

Interval   Frequency   Cumulative frequency
0-10       5           5
10-20      11          16
20-30      10          26
30-40      9           35
40-50      2           37
50-60      1           38
60-70      2           40

In a cumulative frequency polygon the cumulative frequencies are plotted against the upper
class boundaries of the classes. Consecutive points are then joined with straight lines.

Example (continued)
For the mercury example we want to plot the points (0, 0), (10, 5), (20, 16), …, (70, 40) and
then join these points:

A cumulative frequency polygon for the mercury data
[Figure: cumulative frequency (vertical axis, 0 to 40) plotted against mercury level (horizontal axis, 0 to 70).]

A cumulative frequency plot gives us some idea of the shape of the
distribution function of the variable. It can also be used to obtain estimates of the
median and other quantiles for grouped data.
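
One way to obtain such an estimate is to read the quantile off the cumulative frequency polygon by linear interpolation. The Python sketch below does this for the mercury data. It assumes that observations are spread evenly within each class and that the median sits at cumulative frequency n/2; the function name `grouped_quantile` is hypothetical.

```python
from itertools import accumulate

def grouped_quantile(boundaries, frequencies, p):
    """Estimate the quantile at proportion p (0 <= p <= 1) from grouped data.

    boundaries holds the lower boundary of the first class followed by the
    upper boundary of every class, so it has one more entry than frequencies.
    The estimate is read off the cumulative frequency polygon.
    """
    cumulative = [0] + list(accumulate(frequencies))
    target = p * cumulative[-1]  # cumulative frequency to look up
    for k, freq in enumerate(frequencies):
        if freq > 0 and cumulative[k + 1] >= target:
            left, right = boundaries[k], boundaries[k + 1]
            # Interpolate along the polygon segment for class k.
            return left + (target - cumulative[k]) / freq * (right - left)
    raise ValueError("the frequency table contains no observations")

boundaries = [0, 10, 20, 30, 40, 50, 60, 70]
frequencies = [5, 11, 10, 9, 2, 1, 2]

median = grouped_quantile(boundaries, frequencies, 0.5)
print(median)  # 24.0
```

Here n/2 = 20 lies between the cumulative frequencies 16 (at 20) and 26 (at 30), so the estimate is 20 + (20 - 16)/10 × 10 = 24.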

Scatter Plots

Scatter plots are useful for assessing relationships between two variables. To draw a scatter
plot, we represent one of the variables on the horizontal axis and the other on the vertical
axis. We then plot the pairs of data points on the graph.

Example:
Fifteen children were given a visual-discrimination (V) test during the first week at primary
school and a reading-achievement (R) test at the end of their first year of schooling. Scores out
of 100 were calculated for each test.

Child no. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
V-score 75 69 70 62 52 45 42 39 37 34 34 66 54 58 63
R-score 95 90 82 69 58 49 38 35 30 20 31 75 61 64 77

To draw a scatter plot we now want to plot the points (75, 95), (69, 90), (70, 82), …, (63, 77).

A scatter plot depicting primary school test results
[Figure: R-score (vertical axis, 20 to 100) plotted against V-score (horizontal axis, 30 to 100).]

The plot suggests that there is a positive relationship between the V-score and the R-score.
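
The direction and strength of such a relationship can also be summarised numerically. The sketch below computes the Pearson correlation coefficient for the fifteen (V, R) pairs. This coefficient is not introduced in the module itself, so it is brought in here only as an illustration; the function name `pearson_r` is hypothetical.

```python
import math

v_scores = [75, 69, 70, 62, 52, 45, 42, 39, 37, 34, 34, 66, 54, 58, 63]
r_scores = [95, 90, 82, 69, 58, 49, 38, 35, 30, 20, 31, 75, 61, 64, 77]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(v_scores, r_scores)
print(round(r, 3))
```

A value near +1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little or no linear relationship.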


Positive/negative correlation

The following graphs give illustrations of variables that are (a) positively and (b) negatively
correlated with each other. Correlation can also be categorised as strong or weak depending
upon how close the points are to lying on a straight line.

[Figure: four scatter plots of y against x, titled "Strong, positive", "Weak, positive", "Strong, negative" and "Weak, negative".]

Correlation does not imply causation

It is important to realise that scatter plots point to associations between variables. They do not
necessarily show a causal relationship.

Example:
Information about two variables (life expectancy and the number of people per television set) is
available for 12 countries:


Life expectancy plotted against number of people per TV
[Figure: life expectancy (vertical axis, 40 to 80) plotted against number of people per TV (horizontal axis, 0 to 200).]

The two variables are clearly negatively correlated. However, it would be wrong to
conclude that simply sending more televisions to countries with low life expectancies would
cause their inhabitants to live longer.

This example illustrates the very important distinction between causation and association. Two
variables may be strongly correlated without a cause-and-effect relationship existing between
them. Often the explanation is that both variables are related to a third variable that is not
being measured. In the example above, for instance, life expectancy and the number of
televisions in the population will both be related to the country’s wealth.

There is one further type of graph, the box-and-whisker plot, that we will consider later in the
chapter. First, however, we need to look at numerical summary measures for data.

Numerical summaries of data

In the next few sections we will look at some numerical ways of summarising data.

Some notation

Suppose that we would like to learn about the random variable X. To do this we will observe a
random sample of n observations, $X_1, \ldots, X_n$, such that each $X_i$ has the same distribution
as X. The observed values of $X_1, \ldots, X_n$ are then denoted $x_1, \ldots, x_n$.

Example:
Suppose we are interested in the number of units of alcohol that students at UKC consumed last
week. To find out, we could randomly select 50 students to form a random sample $X_1, \ldots, X_{50}$,
where $X_i$ is the random variable representing the number of units of alcohol consumed by
the ith student. The observed value of $X_i$ is denoted $x_i$.

Now suppose that we order the observed sample $x_1, \ldots, x_n$. We let:

$x_{(1)}$ denote the smallest observation;
$x_{(2)}$ denote the second smallest observation;
$x_{(i)}$ denote the ith smallest observation;
$x_{(n)}$ denote the largest observation.

Then $x_{(i)}$ is called the ith order statistic and the following relation holds:
$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.

Example:
Suppose that we have the observations:
$x_1 = 5$, $x_2 = 10$, $x_3 = 2$, $x_4 = 7$.
Then
$x_{(1)} = x_3 = 2$, $x_{(2)} = x_1 = 5$, $x_{(3)} = x_4 = 7$, $x_{(4)} = x_2 = 10$.
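
In code, the order statistics are simply the sorted observations. The short Python sketch below reproduces the example and also records which original observation each order statistic came from; the variable names are illustrative.

```python
observations = [5, 10, 2, 7]  # x1, x2, x3, x4

# The order statistics x(1) <= x(2) <= ... <= x(n).
order_statistics = sorted(observations)

# Zero-based positions of the observations, arranged from smallest to largest value.
positions = sorted(range(len(observations)), key=lambda i: observations[i])

for rank, pos in enumerate(positions, start=1):
    print(f"x({rank}) = x{pos + 1} = {observations[pos]}")
```

Running this prints x(1) = x3 = 2, x(2) = x1 = 5, x(3) = x4 = 7 and x(4) = x2 = 10, in agreement with the example.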

When we have frequency data, we will denote the frequency of the kth class by $f_k$ for k = 1,
…, K, where K is the number of classes. Then $\sum_{k=1}^{K} f_k = n$.

Example:
Consider the mercury example again. Here we have the frequency table given by:

Interval Frequency
0-10 5
10-20 11
20-30 10
30-40 9
40-50 2
50-60 1
60-70 2

Here we have 7 classes, so that K = 7. Then $f_1 = 5$, $f_2 = 11$, and so on, such that
$\sum_{k=1}^{7} f_k = 40 = n$.

VI. EVALUATION (Note: Not to be included in the student’s copy of the IM)
VII. ASSIGNMENT

VIII. REFERENCES

