EECM3724_Unit_1_Ch2_slides_2022
21/07/2022
Overview
From chapter 2:
• Construct and interpret graphs representing the distribution of variables in a data set;
• A frequency distribution is a tabular summary of data showing the number (frequency) of items in each of
several non-overlapping classes.
• The objective of using a frequency distribution is to provide insights about the data that cannot be quickly
obtained by looking only at the raw data.
• We find the frequency of qualitative data by counting the number of observations that fall into a given
class/category.
• Think of your close family relatives. How many are female? How many are male? The counts you get for
these two categories are the frequencies.
Example: Bains game lodge
• Mandisa was hired by Bains game lodge to conduct a survey of the quality of accommodation offered by
the lodge. She designed a questionnaire and asked guests staying at the lodge to rate the quality of their
accommodation as being excellent, above average, average, below average, or poor.
Rating Frequency
Poor 2
Below Average 3
Average 5
Above Average 9
Excellent 1
Total 20
• The relative frequency of a class is the fraction or proportion of the total number of data
items belonging to the class.
• The relative frequency distribution is a tabular summary of a set of data showing the
relative frequency for each class.
In the next two slides, we will provide the relative frequencies for the Bains Game Lodge data. Before you get to those
slides, can you attempt to calculate the relative frequency for each class/category of ratings?
Percentage frequency distribution
• A percentage frequency distribution is a tabular summary of a set of data showing the percentage
frequency for each class.
In the next slide, we will provide the percentage frequency distribution for the Bains Game Lodge data. Before
you get to the slide, can you attempt to calculate the percentage frequency for each class/category of ratings?
Relative Frequency and Percentage Frequency Distributions
Rating           Relative Frequency   Percentage Frequency
Poor             0.10                 10
Below Average    0.15                 15
Average          0.25                 25
Above Average    0.45                 45
Excellent        0.05                 5
Total            1.00                 100
Example calculations: relative frequency of Excellent = 1/20 = 0.05; percentage frequency of Poor = 0.10(100) = 10.
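The relative and percentage frequencies for the Bains game lodge ratings can be checked with a short script. This is a minimal sketch in Python; the variable names are illustrative and not part of the course material. The counts come from the frequency table for the 20 surveyed guests.

```python
# Frequency distribution of the Bains game lodge quality ratings (20 guests).
frequencies = {
    "Poor": 2,
    "Below Average": 3,
    "Average": 5,
    "Above Average": 9,
    "Excellent": 1,
}

total = sum(frequencies.values())

# Relative frequency = class frequency / total number of observations.
relative = {rating: count / total for rating, count in frequencies.items()}

# Percentage frequency = relative frequency x 100 (computed from the counts
# directly to avoid floating-point noise).
percentage = {rating: 100 * count / total for rating, count in frequencies.items()}

for rating in frequencies:
    print(f"{rating:<15} {relative[rating]:.2f} {percentage[rating]:.0f}")
```

The relative frequencies should sum to 1 and the percentage frequencies to 100, which is a quick check on any table of this kind.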
Display of qualitative data
• Pie chart
• Bar chart
First find the frequency distribution.
• The pie chart is a circle subdivided into a number of slices corresponding to the relative
frequency for each class.
• In other words, the size of each slice is proportional to the percentage corresponding to the
category it represents
When the Finance Minister presents a budget, he is sharing the economic cake among different functions.
The cake is the revenue. Each function, e.g. learning and culture, gets its share.
Pie Chart
Bains game lodge Quality Ratings
[Pie chart: Poor 10%, Below Average 15%, Average 25%, Above Average 45%, Excellent 5%]
• Half of the customers surveyed gave Bains a quality rating of “above average” or higher (45% + 5%).
• 10% of the guests gave a poor rating, while 15% gave a below average rating.
• Given these findings, what can Mandisa tell the manager of Bains game lodge?
Bar Chart
• A bar chart, or bar graph, is a graphical tool for depicting qualitative data.
• On one axis (usually the horizontal axis), we specify the labels that are used for each of the classes.
• A frequency, relative frequency, or percentage frequency scale can be used for the other axis (usually the vertical axis).
• Using a bar of fixed width drawn above each class label, we extend the height appropriately.
• The bars are separated to emphasize the fact that each class is a separate category.
Bar Graph
Bains game lodge Quality Ratings
[Bar graph: Frequency on the vertical axis (scale 1 to 10), one bar per quality rating]
• From raw data to:
• Frequency distribution
• Relative frequency
• Percentage relative frequency
• Pie chart
• Bar graph
• This explains why we must have this course in our degree programme.
• The costs of parts, rounded to the nearest Rand for the 50 customers, are listed below:
91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
• Daniel cannot quickly get insight from the raw data above.
• What can Daniel do? Can he do frequency
• Data sets with a larger number of elements usually require a larger number of classes.
• Do not use so many classes that some contain only a few data items or nothing at all.
Frequency Distribution – Number of classes
• To determine the classes for a frequency distribution with quantitative data we need to determine:
(1) the number of non-overlapping classes; (2) the width of each class; and (3) the class limits.
• The number of classes is generally determined by trial and error, but to minimize the trials one can use the $2^k$
rule.
• The rule states that $2^k \geq n$, where $k$ is the number of classes and $n$ is the number of observations. From
Maths class, how do you solve for $k$? Remember: $k \log 2 \geq \log n$.
• For our example, $n = 50$ and by trial: $2^5 = 32$; $2^6 = 64$. So $2^6 \geq 50$, meaning $k = 6$.
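The trial-and-error search for the number of classes can be sketched in a few lines of Python. The function name `classes_by_2k_rule` is hypothetical, chosen only for this illustration; it returns the smallest k with 2^k ≥ n, and the logarithm version mirrors the "k log 2 ≥ log n" hint.

```python
import math


def classes_by_2k_rule(n):
    """Return the smallest k such that 2**k >= n (trial and error, as on the slide)."""
    k = 0
    while 2 ** k < n:
        k += 1
    return k


n = 50
k_by_trial = classes_by_2k_rule(n)

# Same answer from logarithms: k >= log(n) / log(2), rounded up to a whole number.
k_by_logs = math.ceil(math.log(n) / math.log(2))

print(k_by_trial, k_by_logs)
```

For n = 50 both approaches give k = 6, since 2^5 = 32 is too small and 2^6 = 64 is the first power of 2 that reaches 50.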
Frequency Distribution – Class width
For Bloem Auto Repair, if we choose 6 classes, with class width of 10:
But what are those other statistics that we need midpoint values for? We will talk about them soon.
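Grouping costs into 6 classes of width 10 can be sketched as follows. This is an illustration under stated assumptions: only the 20 cost values legible in this section are used (the full data set has 50), so the counts will not match the full Bloem Auto Repair table, and the first class is assumed to start at R50, which is consistent with the upper class limits 59, 69, ..., 109 used in the cumulative distributions.

```python
# The 20 parts costs (in Rand) legible in this section; the full data set has 50 values.
costs = [91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
         71, 69, 72, 89, 66, 75, 79, 75, 72, 76]

NUM_CLASSES = 6
CLASS_WIDTH = 10
LOWER_LIMIT = 50  # assumption: classes are 50-59, 60-69, ..., 100-109

# Tally each cost into its class.
counts = [0] * NUM_CLASSES
for cost in costs:
    class_index = (cost - LOWER_LIMIT) // CLASS_WIDTH
    counts[class_index] += 1

for i, count in enumerate(counts):
    lower = LOWER_LIMIT + i * CLASS_WIDTH
    upper = lower + CLASS_WIDTH - 1
    midpoint = (lower + upper) / 2  # halfway between the class limits
    print(f"{lower}-{upper}  midpoint {midpoint}  frequency {count}")
```

The midpoint of each class (for example 54.5 for 50-59) is the value that stands in for every observation in that class when statistics are later computed from grouped data.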
Relative Frequency and
Percentage Relative Frequency Distributions
• A rectangle is drawn above each class interval with its height corresponding to the interval’s
frequency, relative frequency, or percentage frequency.
• Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.
• The bars within a histogram do not correspond to named categories, as in the bar chart.
• If the class width is made too narrow, the histogram looks “spiky”, and if the
width is too wide, the histogram is “blurred”.
Histogram - Bloem Auto Repair 6 classes, with class width of 10
[Histogram: Tune-up Parts Cost; vertical axis: Frequency, scale 2 to 18]
What can you say about this histogram?
Apart from drawing it, you should be able to interpret what you see in the graph.
• While a symmetrical distribution is the ideal, histograms for real data are never perfectly symmetrical; they
might be roughly symmetrical.
Histogram
• Symmetrical
• Left tail is the mirror image of the right tail
• Examples: heights and weights of people
• The mean and the median are equal
[Histogram: symmetrical shape; vertical axis: Relative Frequency, 0 to .35]
Histogram
• Moderately Skewed Left (negative skewness)
• A longer tail to the left
• Example: exam scores
• The mean is less than the median, and they are both less than the mode
Actual calculations will confirm this aspect of mean < median < mode.
[Histogram: moderately skewed left; vertical axis: Relative Frequency, 0 to .35]
Histogram
• Moderately Right Skewed (Positive Skewness)
• A longer tail to the right
• Example: housing values
• The mean is the largest, while the mode is the smallest.
Actual calculations will confirm this aspect of mean > median > mode.
[Histogram: moderately skewed right; vertical axis: Relative Frequency, 0 to .35]
Histogram
• Highly Skewed Right (Positive Skewness)
• A very long tail to the right
• Example: executive salaries
[Histogram: highly skewed right; vertical axis: Relative Frequency, 0 to .35]
Bell shaped histogram
▪ Drawing the histogram helps verify the shape of the population in question
• Cumulative relative frequency distribution – shows the proportion of items with values less than
or equal to the upper limit of each class.
• Cumulative percentage frequency distribution – shows the percentage of items with values less
than or equal to the upper limit of each class.
• The cumulative frequency distribution uses the number of classes, class width and class limits
adopted for the frequency distribution.
• The last entry in the cumulative relative frequency distribution always equals 1, and the last entry in the
cumulative percentage frequency distribution always equals 100.
Cumulative Distributions
• Bloem Auto Repair
Cost (R)   Cumulative Frequency   Cumulative Relative Frequency   Cumulative Percentage Frequency
≤ 59       2                      0.04                            4
≤ 69       15                     0.30                            30
≤ 79       31                     0.62                            62
≤ 89       38                     0.76                            76
≤ 99       45                     0.90                            90
≤ 109      50                     1.00                            100
Example calculations for the ≤ 69 row: 2 + 13 = 15; 15/50 = 0.30; 0.30(100) = 30.
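The cumulative columns can be rebuilt from the class frequencies with a running total. In this sketch the class frequencies (2, 13, 16, 7, 7, 5) are not listed in this section; they are recovered here by differencing the cumulative frequency column of the table, so treat them as derived values. Variable names are illustrative.

```python
from itertools import accumulate

upper_limits = [59, 69, 79, 89, 99, 109]

# Class frequencies derived by differencing the cumulative frequencies
# 2, 15, 31, 38, 45, 50 shown in the table (an inference, not quoted data).
class_frequencies = [2, 13, 16, 7, 7, 5]

# Running total of the class frequencies.
cumulative = list(accumulate(class_frequencies))
n = cumulative[-1]

cumulative_relative = [c / n for c in cumulative]
cumulative_percentage = [100 * c / n for c in cumulative]

for limit, c, rel, pct in zip(upper_limits, cumulative,
                              cumulative_relative, cumulative_percentage):
    print(f"up to {limit}: {c}  {rel:.2f}  {pct:.0f}")
```

The last running total equals the number of observations (50), which is why the final cumulative relative frequency is 1 and the final cumulative percentage is 100.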
Cross-tabulations
• So far, we have focused on presentations that are used to summarize the data for one variable at a time.
• Of the 28 most expensive restaurants (R400-490), none has a good rating and most have an excellent
rating (22/28 = 78.6%).
• Of the 78 least expensive restaurants (R100-190), only 2 have an excellent rating (2/78 = 2.6%), but 42 of
them have a good rating (42/78 = 53.8%).
• The right and bottom margins of the cross-tabulation provide the frequency distributions of quality rating
and meal price.
• The right margin (in Red) is the frequency distribution of quality rating and the bottom margin (in Green) is
the frequency distribution of meal price.
• Dividing the totals in the bottom margin (78, 118, 76 and 28) by the total for that row (300) provides relative
and percentage frequency distributions for meal price.
• Looking at R100-190: 78/300 = 0.26, implying that 26% of the restaurants were charging a meal price of R100-190.
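The marginal distribution of meal price can be computed directly from the bottom-margin totals. In this sketch the labels of the two middle price classes are not given in this section, so they are shown as placeholders rather than guessed; only the totals 78, 118, 76 and 28 come from the text.

```python
# Bottom-margin totals of the cross-tabulation (number of restaurants per price class).
# The two middle labels are placeholders: the section does not state those price ranges.
price_totals = {
    "R100-190": 78,
    "second price class": 118,
    "third price class": 76,
    "R400-490": 28,
}

grand_total = sum(price_totals.values())

# Relative and percentage frequency distributions for meal price.
price_relative = {label: count / grand_total for label, count in price_totals.items()}
price_percentage = {label: 100 * count / grand_total for label, count in price_totals.items()}

for label in price_totals:
    print(f"{label:<20} {price_relative[label]:.2f} {price_percentage[label]:.1f}%")
```

The same calculation applied to the right-margin totals would give the marginal distribution of quality rating.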
Cross-tabulations
• The frequency and relative frequency distributions derived from the margins of a cross-
tabulation provide information about each variable.
• They do not shed light on the relationship between the two variables.
• To explore the relationship between the 2 variables we need to convert the entries in a
cross-tabulation into row or column percentages.
• The example in the textbook looked at the relationship based on row percentages.
• Each column in the table is a percentage frequency distribution of quality rating for one
of the meal prices (for the first column, 100*(42/78) = 53.8).
• You are expected to know how to derive the row and column
percentage tables and interpret the results.
Cross-tabulation: Column Percentages
• Of the restaurants charging a low price (R100-190), the greatest proportion are rated good (54%).
• Of the restaurants charging a high price (R400-490), the greatest proportion are rated excellent (79%).
• Based on this, higher meal prices appear to be positively associated with higher quality ratings.
• In doing this example, did you notice the mix of qualitative (quality) and quantitative (price)
variables?
• While qualitative variables are categorical already, for quantitative variables we need to create classes
before using the variable in a crosstab.
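Column percentages divide each cell by its column total. The sketch below uses only the counts quoted in the text for the cheapest and the most expensive price classes. The "Remaining ratings" entry is a residual (column total minus the quoted counts) introduced for this illustration, because the other rating categories are not itemised in this section; the function name is hypothetical.

```python
def column_percentages(column_counts):
    """Convert one column of a cross-tabulation into percentages of the column total."""
    column_total = sum(column_counts.values())
    return {rating: 100 * count / column_total
            for rating, count in column_counts.items()}


# R100-190 column: 78 restaurants, of which 42 are rated good and 2 excellent.
low_price = {"Good": 42, "Excellent": 2, "Remaining ratings": 78 - 42 - 2}

# R400-490 column: 28 restaurants, of which 0 are rated good and 22 excellent.
high_price = {"Good": 0, "Excellent": 22, "Remaining ratings": 28 - 0 - 22}

low_pct = column_percentages(low_price)
high_pct = column_percentages(high_price)

print({rating: round(pct, 1) for rating, pct in low_pct.items()})
print({rating: round(pct, 1) for rating, pct in high_pct.items()})
```

Row percentages work the same way, except that each cell is divided by its row total instead of its column total.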
Cross-tabulation: Simpson’s Paradox
• Data in two or more cross-tabulations are often aggregated to produce a summary cross-
tabulation.
• Patterns seen in the aggregated data may be reversed or disappear altogether in the
disaggregated data.
• For example, the results of a cross-tabulation of health status and youth status might change
once we disaggregate youth status by gender.
• We must be careful in drawing conclusions about the relationship between the two variables in
the aggregated cross-tabulation.
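A small numerical illustration may help show how such a reversal can arise. The counts below are hypothetical, invented purely for this sketch (they are not from the textbook or from any survey): within each gender the youth group has the higher share in good health, yet the aggregated table points the other way.

```python
# Hypothetical counts, invented for illustration only: (number healthy, group size).
disaggregated = {
    "Female": {"Youth": (81, 87), "Non-youth": (234, 270)},
    "Male": {"Youth": (192, 263), "Non-youth": (55, 80)},
}


def healthy_rate(healthy, size):
    return healthy / size


# Aggregate over gender by adding up the counts for each youth status.
aggregated = {}
for by_status in disaggregated.values():
    for status, (healthy, size) in by_status.items():
        prev_healthy, prev_size = aggregated.get(status, (0, 0))
        aggregated[status] = (prev_healthy + healthy, prev_size + size)

for gender, by_status in disaggregated.items():
    for status, counts in by_status.items():
        print(f"{gender:<7} {status:<10} {healthy_rate(*counts):.3f}")

for status, counts in aggregated.items():
    print(f"All     {status:<10} {healthy_rate(*counts):.3f}")
```

The reversal here comes from the unequal group sizes: most youths in this made-up example are male, and males have the lower rates overall, which drags the aggregated youth rate down.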
Simpson’s paradox
[Figure with two panels: Aggregated data; Disaggregated data]
Scatter Diagram and Trend Line
• One variable is shown on the horizontal axis and the other variable is shown on the vertical
axis.
• The general pattern of the plotted points suggests the overall relationship between the
variables.
• A Positive Relationship
[Scatter diagram: y on the vertical axis, x on the horizontal axis]
Scatter Diagram and Trend Line
• A Negative Relationship
[Scatter diagram: y on the vertical axis, x on the horizontal axis]
Scatter Diagram and Trend Line
• No Apparent Relationship
[Scatter diagram: y on the vertical axis, x on the horizontal axis]
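Whether a scatter diagram shows a positive, negative, or no apparent relationship can be read from the slope of a trend line. The sketch below fits a least-squares line, which is one common choice of trend line; the slides do not say which method is intended, so treat this as an assumption. The (x, y) values are hypothetical and exist only to make the example runnable.

```python
def trend_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = trend_line(xs, ys)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

A slope clearly above zero suggests a positive relationship, a slope clearly below zero a negative one, and a slope near zero no apparent linear relationship.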
Summary: Tabular and Graphical Procedures
Data