Lecture-02 Data Organization and Presentation

Statistics and Data Science

Data organization and presentation

When conducting a statistical study, the researcher must gather
data for the particular variable under study.

For example, if a researcher wishes to study the number of people
who were bitten by poisonous snakes in a specific geographic
area over the past several years, he or she has to gather the data
from various doctors, hospitals, or health departments.

To describe situations, draw conclusions, or make inferences
about events, the researcher must organize the data in some
meaningful way.

The most convenient method of organizing data is to construct a
frequency distribution.

After organizing the data, the researcher must present them so
they can be understood by those who will benefit from reading
the study.

The most useful method of presenting the data is by constructing
statistical charts and graphs.

This chapter explains how to organize data by constructing
frequency distributions and how to present the data by
constructing charts and graphs.

Presentation of Data

• Tabulation / Frequency Distribution

• Graphical Presentation

    • Bar chart

    • Pie chart

    • Histograms

    • Frequency polygons

    • Ogives

    • Stem-and-Leaf Display

Tabulation / Frequency Distribution

In statistics, a frequency distribution is a table that displays the
frequency of various outcomes in a sample. Each entry in the
table contains the frequency or count of the occurrences of values
within a particular group or interval, and in this way, the table
summarizes the distribution of values in the sample.

Frequency Distribution and Graphical Presentation for Qualitative Data

A frequency distribution for qualitative data lists all categories
and the number of elements that belong to each of the categories.

Example

A sample of 30 employees from large companies was selected,
and these employees were asked how stressful their jobs were.
The responses of these employees are recorded below, where "very"
represents very stressful, "somewhat" means somewhat stressful,
and "none" stands for not stressful at all.

Construct a frequency distribution table for these data.

Solution This variable is classified into three categories: very
stressful, somewhat stressful, and not stressful at all. We record
these categories in the first column of Table 2.4. Then we read
each employee's response from the given data and mark a tally,
denoted by the symbol |, in the second column of Table 2.4 next
to the corresponding category. For example, the first employee's
response is that his or her job is somewhat stressful. We show this
in the frequency table by marking a tally in the second column
next to the category somewhat. Note that the tallies are marked in
blocks of five for counting convenience. Finally, we record the
total of the tallies for each category in the third column of the
table. This column is called the column of frequencies and is usually
denoted by f. The sum of the entries in the frequency column
gives the sample size or total frequency. In Table 2.4, this total is
30, which is the sample size.
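The tallying procedure can be sketched in a few lines of Python. The original list of responses is not reproduced in this text, so the list below is a hypothetical arrangement that only matches the category totals implied by the percentages quoted for these data (10 very, 14 somewhat, 6 none). The variable names are illustrative.

```python
from collections import Counter

# Hypothetical responses: the original list is not shown in this text, so
# this arrangement only reproduces the category totals (10, 14, 6; n = 30).
responses = ["somewhat"] * 14 + ["very"] * 10 + ["none"] * 6

# Counter plays the role of the tally column: one count per category.
frequency = Counter(responses)

for category in ("very", "somewhat", "none"):
    print(f"{category:<10}{frequency[category]:>3}")

# The frequencies must add up to the sample size.
total = sum(frequency.values())
print(f"{'Sum':<10}{total:>3}")
```

The final line mirrors the check described above: the frequency column sums to the sample size.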

Relative Frequency and Percentage Distributions

The relative frequency of a category is obtained by dividing the
frequency of that category by the sum of all frequencies. Thus, the
relative frequency shows what fractional part or proportion of the
total frequency belongs to the corresponding category. A relative
frequency distribution lists the relative frequencies for all
categories.

The percentage for a category is obtained by multiplying the
relative frequency of that category by 100. A percentage distribution
lists the percentages for all categories.

Calculating Percentage

$$\text{Percentage} = (\text{Relative frequency}) \times 100$$

Example

Determine the relative frequency and percentage distributions for
the data in Table 2.4.

Solution The relative frequencies and percentages from Table 2.4
are calculated and listed in Table 2.5. Based on this table, we can
state that .333, or 33.3%, of the employees said that their jobs are
very stressful. By adding the percentages for the first two
categories, we can state that 80% of the employees said that their
jobs are very or somewhat stressful. The other numbers in Table
2.5 can be interpreted the same way. Notice that the sum of the
relative frequencies is always 1.00 (or approximately 1.00 if the
relative frequencies are rounded), and the sum of the percentages
is always 100 (or approximately 100 if the percentages are
rounded).
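As a rough illustration, the two distributions can be computed as follows. The frequencies 10, 14 and 6 are inferred from the figures quoted above (33.3% of 30 is 10, and 80% of 30 is 24, which leaves 14 for somewhat and 6 for none). The tables themselves are not reproduced in this text, so treat the numbers as a reconstruction.

```python
# Frequencies inferred from the percentages quoted in the text (n = 30).
frequency = {"very": 10, "somewhat": 14, "none": 6}
total = sum(frequency.values())

# Relative frequency = frequency / sum of all frequencies.
relative = {c: f / total for c, f in frequency.items()}

# Percentage = relative frequency * 100.
percentage = {c: 100 * r for c, r in relative.items()}

for c in frequency:
    print(f"{c:<10}{frequency[c]:>4}{relative[c]:>8.3f}{percentage[c]:>8.1f}")

print(f"{'Sum':<10}{total:>4}{sum(relative.values()):>8.3f}"
      f"{sum(percentage.values()):>8.1f}")
```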
Graphical Presentation of Qualitative Data

All of us have heard the adage "a picture is worth a thousand
words." A graphic display can reveal at a glance the main
characteristics of a data set. The bar graph and the pie chart are two
types of graphs that are commonly used to display qualitative
data.

Bar Graph

A graph made of bars whose heights represent the frequencies of
respective categories is called a bar graph.

The bar graphs for relative frequency and percentage
distributions can be drawn simply by marking the relative
frequencies or percentages, instead of the frequencies, on the
vertical axis.

Sometimes a bar graph is constructed by marking the categories
on the vertical axis and the frequencies on the horizontal axis.

Pie Charts

A pie chart is more commonly used to display percentages,
although it can be used to display frequencies or relative
frequencies. The whole pie (or circle) represents the total sample
or population. Then we divide the pie into different portions that
represent the different categories.

Definition
A circle divided into portions that represent the relative
frequencies or percentages of a population or a sample belonging
to different categories is called a pie chart.

As we know, a circle contains 360 degrees. To construct a pie
chart, we multiply 360 by the relative frequency of each category
to obtain the degree measure or size of the angle for the
corresponding category. Table 2.6 shows the calculation of angle
sizes for the various categories of Table 2.5.

Figure 2.2 shows the pie chart for the percentage distribution of
Table 2.5, which uses the angle sizes calculated in Table 2.6.
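The angle calculation is simple enough to sketch directly. The frequencies below are the ones inferred from the percentages quoted earlier (10, 14 and 6 out of 30), not values copied from Table 2.6, which is not reproduced in this text.

```python
# Frequencies inferred from the quoted percentages; n = 30 (an assumption,
# since Tables 2.5 and 2.6 are not reproduced in this text).
frequency = {"very": 10, "somewhat": 14, "none": 6}
total = sum(frequency.values())

# Angle for a category = 360 * relative frequency of that category.
angles = {c: 360 * f / total for c, f in frequency.items()}

for c, angle in angles.items():
    print(f"{c:<10}{angle:>7.1f} degrees")

# The slices together must fill the whole circle.
print(f"{'Sum':<10}{sum(angles.values()):>7.1f} degrees")
```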
Organizing and Graphing for Quantitative Data

Frequency Distributions

A frequency distribution for quantitative data lists all the classes
and the number of values that belong to each class. Data
presented in the form of a frequency distribution are called
grouped data.

Table 2.6 Weekly Earnings of 100 Employees of a Company


Number of classes:

Usually the number of classes for a frequency distribution table
varies from 5 to 20, depending mainly on the number of
observations in the data set. It is preferable to have more classes
as the size of a data set increases. The decision about the number
of classes is arbitrarily made by the data organizer.
Or

A useful recipe to determine the number of classes (k) is the "2 to
the k rule". This guide suggests you select the smallest number (k)
for the number of classes such that

$$2^k \ge n.$$
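A minimal sketch of this rule, assuming n is the number of observations. The function name `number_of_classes` is illustrative and does not come from the lecture.

```python
def number_of_classes(n: int) -> int:
    """Return the smallest k such that 2**k >= n (the "2 to the k rule")."""
    if n < 1:
        raise ValueError("n must be a positive number of observations")
    k = 0
    while 2 ** k < n:
        k += 1
    return k


# n = 30: 2**4 = 16 is too small, 2**5 = 32 is enough.
print(number_of_classes(30))
# n = 100: 2**6 = 64 is too small, 2**7 = 128 is enough.
print(number_of_classes(100))
```

The rule only gives a starting point; the organizer may still adjust k to obtain convenient class limits.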

Class interval or width

Generally, the class interval or width should be the same for all
classes. The classes all taken together must cover at least the
distance from the lowest value in the data up to the highest
value. Expressing this as a formula:

$$i \ge \frac{H - L}{k}$$

where i is the class interval, H is the highest observed value, L is
the lowest observed value and k is the number of classes.

or

$$\text{Approximate class width} = \frac{\text{largest value} - \text{smallest value}}{\text{number of classes}}$$
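The width formula can be sketched as below, using H = 29, L = 5 and k = 5, the values that appear in the worked solution in this lecture. Rounding the raw width up to the next whole number is one common choice, assumed here so that the classes are certain to cover the full range.

```python
import math


def approximate_class_width(highest: float, lowest: float, k: int) -> float:
    """Raw class width: (largest value - smallest value) / number of classes."""
    return (highest - lowest) / k


raw_width = approximate_class_width(29, 5, 5)  # (29 - 5) / 5
width = math.ceil(raw_width)                   # round up to a convenient width

print(raw_width, width)

# With this width, k classes starting at L reach at least as far as H.
assert 5 + 5 * width >= 29
```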

Methods of classifying the data according to class interval

There are two methods of classifying the data according to class
intervals, namely:

• Exclusive method

• Inclusive method

Exclusive Method

When the class intervals are so fixed that the upper limit of one
class is the lower limit of the next class, it is known as the
'Exclusive' method of classification. The following data are
classified on this basis:

Student class interval    Frequency
10-20                     3
20-30                     6
30-40                     6

It is clear that the exclusive method ensures continuity of data,
inasmuch as the upper limit of one class is the lower limit of the
next class. This type of classification may be used with fractional
values like age, height, weight.

Inclusive method

In this method, the overlapping of the class intervals is avoided.
Both the lower and upper limits are included in the class interval.
This type of classification may be used for a grouped frequency
distribution for a discrete variable like members in a family,
number of workers in a factory etc., where the variable may take
only integral values.

Student class interval    Frequency
10-20                     5
21-31                     6
32-42                     15
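The difference between the two methods shows up when a value sits exactly on a class limit. The sketch below assumes the usual convention that, in the exclusive method, a class contains its lower limit but not its upper limit. The function names are illustrative.

```python
def find_class_exclusive(value, classes):
    """Exclusive method: a value belongs to a class if lower <= value < upper."""
    for lower, upper in classes:
        if lower <= value < upper:
            return (lower, upper)
    return None  # value falls outside every class


def find_class_inclusive(value, classes):
    """Inclusive method: a value belongs to a class if lower <= value <= upper."""
    for lower, upper in classes:
        if lower <= value <= upper:
            return (lower, upper)
    return None  # value falls outside every class


exclusive_classes = [(10, 20), (20, 30), (30, 40)]
inclusive_classes = [(10, 20), (21, 31), (32, 42)]

# A score of 20 goes to 20-30 under the exclusive method...
print(find_class_exclusive(20, exclusive_classes))
# ...but to 10-20 under the inclusive method.
print(find_class_inclusive(20, inclusive_classes))
# A fractional value such as 20.5 has no class in the inclusive scheme.
print(find_class_inclusive(20.5, inclusive_classes))
```

The last call illustrates why the inclusive method suits variables that take only integral values.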
Solution

Number of classes

Here n = 30.

We use $2^k \ge n$.

If k = 5, then $2^5 = 32 \ge 30$, whereas k = 4 gives only $2^4 = 16 < 30$.

So, the number of classes is 5.

Class interval or width:

Here, the highest value is H = 29 and the lowest value is L = 5,
so the class interval is

$$i \ge \frac{H - L}{k} = \frac{29 - 5}{5} = 4.8 \approx 5$$

The frequency distribution is shown below.

Relative Frequency and Percentage Distributions


Another procedure:
Graphing Grouped Data

Grouped (quantitative) data can be displayed in a histogram or a
polygon.

Histograms

A histogram can be drawn for a frequency distribution, a relative
frequency distribution, or a percentage distribution. To draw a
histogram, we first mark classes on the horizontal axis and
frequencies (or relative frequencies or percentages) on the vertical
axis. Next, we draw a bar for each class so that its height
represents the frequency of that class. The bars in a histogram are
drawn adjacent to each other with no gap between them. A
histogram is called a frequency histogram, a relative frequency
histogram, or a percentage histogram depending on whether
frequencies, relative frequencies, or percentages are marked on
the vertical axis.

Definition

A histogram is a graph in which classes are marked on the
horizontal axis and the frequencies, relative frequencies, or
percentages are marked on the vertical axis. The frequencies,
relative frequencies, or percentages are represented by the heights
of the bars. In a histogram, the bars are drawn adjacent to each
other.

Figures 2.3 and 2.4 show the frequency and the relative frequency
histograms, respectively, for the data of Tables 2.9 and 2.10 of
Sections 2.3.2 and 2.3.3. The two histograms look alike because
they represent the same data. A percentage histogram can be
drawn for the percentage distribution of Table 2.10 by marking
the percentages on the vertical axis. In Figures 2.3 and 2.4, we
have used class limits to mark classes on the horizontal axis.
However, we can show the intervals on the horizontal axis by
using the class boundaries instead of the class limits.

Class Boundary

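The class boundary is given by the midpoint of the upper limit of
one class and the lower limit of the next class.

When the upper limit of one class and the lower limit of the next
differ by 1, as with whole-number limits in the inclusive method,
we adjust the classes by deducting 0.5 from each lower limit and
adding 0.5 to each upper limit of all the classes.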
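A short sketch of this adjustment, using the inclusive class limits 2-6, 7-11, 12-16, 17-21 and 22-26 from the example in this lecture. Taking half of the gap between adjacent limits is a small generalisation of the 0.5 rule, assumed here so that the same code would work for limits that differ by something other than 1.

```python
# Inclusive class limits taken from the refreshments example in this lecture.
limits = [(2, 6), (7, 11), (12, 16), (17, 21), (22, 26)]

# Gap between the upper limit of one class and the lower limit of the next.
gap = limits[1][0] - limits[0][1]
half_gap = gap / 2  # 0.5 when the limits differ by 1

# Lower boundary = lower limit - half gap; upper boundary = upper limit + half gap.
boundaries = [(lower - half_gap, upper + half_gap) for lower, upper in limits]

for (lower, upper), (b_low, b_high) in zip(limits, boundaries):
    print(f"{lower}-{upper}: {b_low} to less than {b_high}")
```

Each upper boundary coincides with the next lower boundary, which is what lets histogram bars touch.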

Example

The following data give the amounts (in dollars) spent on
refreshments by 20 spectators randomly selected from those who
patronized the concession stands at a recent Major League
Baseball game.

4.95 5.80 4.50 4.85 6.992 12.35 7.75 10.45 21.77 18.00

25.99 8.00 2.99 16.60 9.00 15.75 9.50 3.05 5.65 21.00

Construct a frequency distribution table using inclusive method.

Solution

Number of classes: $C = 1 + 3.3 \log_{10} 20 = 5.29 \approx 5$

Approximate class interval: $(25.99 - 2.99)/5 = 4.6 \approx 5.0$

Class limits    Class boundaries           Class width    Frequency
2-6             1.5 to less than 6.5       5              7
7-11            6.5 to less than 11.5      5              6
12-16           11.5 to less than 16.5     5              2
17-21           16.5 to less than 21.5     5              3
22-26           21.5 to less than 26.5     5              2
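The solution can be checked with a short script. The data are typed exactly as printed in this text (including the value 6.992), and the class boundaries are the ones listed in the table. The row of asterisks is only a rough text stand-in for a histogram, not part of the original solution.

```python
import math

# Amounts as printed in the text (20 values).
amounts = [4.95, 5.80, 4.50, 4.85, 6.992, 12.35, 7.75, 10.45, 21.77, 18.00,
           25.99, 8.00, 2.99, 16.60, 9.00, 15.75, 9.50, 3.05, 5.65, 21.00]

# Number of classes from C = 1 + 3.3 * log10(n), rounded to a whole number.
c = 1 + 3.3 * math.log10(len(amounts))
k = round(c)

# Approximate class interval: range divided by the number of classes.
raw_width = (max(amounts) - min(amounts)) / k

# Class boundaries from the table: lower <= x < upper.
boundaries = [(1.5, 6.5), (6.5, 11.5), (11.5, 16.5), (16.5, 21.5), (21.5, 26.5)]
frequency = [sum(1 for x in amounts if low <= x < high)
             for low, high in boundaries]

print(f"C = {c:.2f}, k = {k}, raw width = {raw_width:.1f}")
for (low, high), f in zip(boundaries, frequency):
    print(f"{low:>5} to less than {high:<5}{f:>3}  {'*' * f}")
```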


EXERCISES

1. A sample of 80 adults was taken, and these adults were
   asked about the number of credit cards they possess. The
   following table gives the frequency distribution of their
   responses.

   a. Find the class boundaries and class midpoints.
   b. Prepare the relative frequency and percentage
      distribution columns.
   c. Draw a histogram.

2. The following table gives the frequency distribution of ages
   for all 50 employees of a company.

   a. Find the class boundaries and class midpoints.
   b. Prepare the relative frequency and percentage
      distribution columns.
   c. Draw a histogram.

3. The following data give the numbers of computer keyboards
   assembled at the Twentieth Century Electronics Company
   for a sample of 25 days.

   a. Make the frequency distribution table for these data.
   b. Calculate the relative frequencies for all classes.
   c. Construct a histogram for the frequency distribution.
Shapes of Histogram

• Symmetric

• Skewed

• Uniform or Rectangular

Symmetric histogram

Skewed histogram: (a) skewed to the right; (b) skewed to the left.

Uniform histogram

Stem-and-Leaf Display

• A stem-and-leaf plot is a way of organizing quantitative data in
  a graphical format, similar to a histogram, to assist in visualizing
  the shape of a distribution.
• It is useful for displaying the relative density and shape of the data,
  giving the reader a quick overview of the distribution.
• It is also useful for highlighting outliers and finding the mode.
• However, stem-and-leaf displays are only useful for moderately
  sized data sets (around 15–150 data points).
• With very small data sets, a stem-and-leaf display can be of little use,
  as a reasonable number of data points is required to establish
  definitive distribution properties.
• With very large data sets, a stem-and-leaf display will become very
  cluttered, since each data point must be represented numerically.
  A box plot or histogram may become more appropriate as the data
  size increases.

Definition

In a stem-and-leaf display of quantitative data, each value is
divided into two portions: a stem and a leaf. The leaves for each
stem are shown separately in a display.
Example

The following are the scores of 30 college students on a statistics test:

75 52 80 96 65 79 71 87 93 95

69 72 81 61 76 86 79 68 50 92

83 84 77 64 71 87 72 92 57 98

Construct a stem-and-leaf display.

Solution

To construct a stem-and-leaf display for these scores, we split
each score into two parts. The first part contains the first digit,
which is called the stem. The second part contains the second
digit, which is called the leaf. We observe from the data that the
stems for all scores are 5, 6, 7, 8, and 9 because all the scores fall in
the range 50 to 98.

After we have listed the stems, we read the leaves for all scores
and record them next to the corresponding stems on the right side
of the vertical line. The complete stem-and-leaf display for scores
is shown in Figure 2.14.
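The splitting step can be sketched in Python for the 30 scores above. The leaves are sorted within each stem here, which gives an ordered display. Whether Figure 2.14 shows ordered or unordered leaves cannot be told from this text, so the ordering is an assumption.

```python
from collections import defaultdict

scores = [75, 52, 80, 96, 65, 79, 71, 87, 93, 95,
          69, 72, 81, 61, 76, 86, 79, 68, 50, 92,
          83, 84, 77, 64, 71, 87, 72, 92, 57, 98]

# Split each two-digit score into a stem (tens digit) and a leaf (units digit).
stems = defaultdict(list)
for score in sorted(scores):  # sorting first yields ordered leaves
    stem, leaf = divmod(score, 10)
    stems[stem].append(leaf)

# Stems to the left of the vertical line, leaves to the right.
for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```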
Features of distributions: using stem and leaf plot

When you assess the overall pattern of any distribution (which is the
pattern formed by all values of a particular variable), look for these
features:

• number of peaks

• general shape (skewed or symmetric)

• centre

• spread

Number of peaks

• Line graphs are useful because they readily reveal some characteristics
  of the data. The first characteristic that can be readily seen from a line
  graph is the number of high points or peaks the distribution has.

• While most distributions that occur in statistical data have only one
  main peak (unimodal), other distributions may have two
  peaks (bimodal) or more than two peaks (multimodal).

Examples of unimodal, bimodal and multimodal line graphs are shown
below:
General shape

The second main feature of a distribution is the extent to which it
is symmetric.

A perfectly symmetric curve is one in which both sides of the distribution
would exactly match the other if the figure were folded over its central
point. An example is shown below:

A symmetric, unimodal, bell-shaped distribution, which is a relatively
common occurrence, is called a normal distribution.

If the distribution is lop-sided, it is said to be skewed.

A distribution is said to be skewed to the right, or positively skewed, when
most of the data are concentrated on the left of the distribution.
Distributions with positive skews are more common than distributions
with negative skews.

Income provides one example of a positively skewed distribution. Most
people make under Tk40,000 a year, but some make quite a bit more, with
a smaller number making many millions a year. Therefore, the
positive (right) tail on the line graph for income extends out quite a long
way, whereas the negative (left) tail stops at zero. The right tail
clearly extends farther from the distribution's centre than the left tail, as
shown below:

A distribution is said to be skewed to the left, or negatively skewed, if most of
the data are concentrated on the right of the distribution. The left tail
clearly extends farther from the distribution's centre than the right tail, as
shown below:
Centre and spread

Locating the centre (median) of a distribution can be done by counting half
the observations up from the smallest. Obviously, this method is
impracticable for very large sets of data. A stem and leaf plot makes this
easy, however, because the data are arranged in ascending order.
The mean is another measure of central tendency. (See the chapter
on central tendency for more detail.)

The amount of distribution spread and any large deviations from the
general pattern (outliers) can be quickly spotted on a graph.

Example: Using stem and leaf plots as a graph

The results of 41 students' math tests (with a best possible score of 70) are
recorded below:

31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48
55, 60, 40, 35, 54, 26, 57, 37, 43, 65, 50
55, 18, 53, 41, 50, 34, 67, 56, 44, 4, 54
57, 39, 52, 45, 35, 51, 63, 42
1. Prepare an ordered stem and leaf plot for the data and briefly describe
what it shows.

2. Are there any outliers? If so, which scores?


3. Look at the stem and leaf plot from the side. Describe the
distribution's main features such as:

a. number of peaks

b. symmetry

c. value at the centre of the distribution

Solution:
A test score is a discrete variable. For example, it is not possible to have a
test score of 35.74542341....

The lowest value is 4 and the highest is 67. Therefore, the stem and leaf
plot that covers this range of values looks like this:

Table 10. Math scores of 41 students

Stem | Leaf
0    | 4
1    | 8 9
2    | 3 4 6
3    | 1 2 4 5 5 7 9
4    | 0 1 2 3 4 5 5 8 9
5    | 0 0 0 1 1 2 3 4 4 5 5 6 7 7
6    | 0 2 3 5 7

Note: The notation 2|4 represents stem 2 and leaf 4.


The stem and leaf plot reveals that most students scored in the interval
between 50 and 59. The large number of students who obtained high
results could mean that the test was too easy, that most students knew
the material well, or a combination of both.

The result of 4 could be an outlier, since there is a large gap between this
and the next result, 18.

If the stem and leaf plot is turned on its side, it will look like the
following:

The distribution has a single peak within the 50–59 interval.

Although there are only 41 observations, the distribution shows that
most data are clustered at the right. The left tail extends farther from the
data centre than the right tail. Therefore, the distribution is skewed to the
left or negatively skewed.
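These observations can be cross-checked numerically with the 41 scores listed in the example. Comparing the mean with the median is only a rule of thumb for the direction of skew, used here as a rough check and not as a formal test.

```python
import statistics
from collections import Counter

scores = [31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48,
          55, 60, 40, 35, 54, 26, 57, 37, 43, 65, 50,
          55, 18, 53, 41, 50, 34, 67, 56, 44, 4, 54,
          57, 39, 52, 45, 35, 51, 63, 42]

# Centre: with 41 ordered values the median is the 21st one.
median = statistics.median(scores)
mean = statistics.mean(scores)

# Peak: the stem (tens digit) holding the most observations.
stem_counts = Counter(score // 10 for score in scores)
peak_stem = max(stem_counts, key=stem_counts.get)

# Gap between the two smallest scores, which flags 4 as a possible outlier.
ordered = sorted(scores)
gap_at_bottom = ordered[1] - ordered[0]

print(f"median = {median}, mean = {mean:.2f}")
print(f"peak in the {peak_stem}0-{peak_stem}9 interval "
      f"({stem_counts[peak_stem]} scores)")
print(f"gap between the two lowest scores = {gap_at_bottom}")
```

A mean that falls below the median is consistent with the long left tail described above.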
Line Chart

Line charts are especially useful in statistics and science and are among
the most widely used graphs, because their visual characteristics
reveal data trends clearly and they are easy to create.

A line chart is a visual comparison of how two variables (shown on the x-
and y-axes) are related or vary with each other. It shows related
information by drawing a continuous line between all the points on a grid.

Line charts compare two variables: one is plotted along the x-axis
(horizontal) and the other along the y-axis (vertical). The y-axis in a line
chart usually indicates quantity (e.g. dollars, litres) or percentage, while the
horizontal x-axis often measures units of time. As a result, the line chart is
often viewed as a time series graph. For example, if you wanted to graph
the height of a baseball pitch over time, you could measure the time
variable along the x-axis, and the height along the y-axis. Although they do
not present specific data as well as tables do, line charts are able to show
relationships more clearly than tables do. Line charts can also depict
multiple series and hence are usually the best candidate for time series data
and frequency distributions.

In summary, line charts:

• show specific values of data well,
• reveal trends and relationships between data,
• compare trends in different groups.
Example:

Chart 5.5.1 shows one obvious trend, the fluctuation in the labour force
from January to July. The number of students at Andrew's high school who
are members of the labour force is scaled using intervals on the y-axis,
while the time variable is plotted on the x-axis.

The number of students participating in the labour force was 252 in
January, 252 in February, 255 in March, 256 in April, 282 in May, 290 in
June and 319 in July. When examined further, the line chart indicates that
the labour force participation of these students was at a plateau for the first
four months (January to April), and for the next three months (May to July)
the number increased steadily.
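The plateau and the later rise can be read off the month-to-month changes, sketched below with the figures quoted above. The variable names are illustrative.

```python
months = ["January", "February", "March", "April", "May", "June", "July"]
labour_force = [252, 252, 255, 256, 282, 290, 319]

# Change from each month to the next: small changes mark the plateau,
# larger positive changes mark the rise from May onward.
changes = [later - earlier
           for earlier, later in zip(labour_force, labour_force[1:])]

for month, change in zip(months[1:], changes):
    print(f"{month:<9}{change:+d}")

# Overall change from January to July.
total_change = labour_force[-1] - labour_force[0]
print(f"{'Total':<9}{total_change:+d}")
```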

Chart 5.5.2 is a single line chart comparing two items. In this example, time
is not a factor. The chart compares the average number of dollars donated
by the age of the donors. According to the trend in the chart, the older the
donor, the more money he or she donates. The 17-year-old donors donate,
on average, $84. For the 19-year-olds, the average donation increased by
$26 to make the average donation of that age group $110.
