EECM3724_Unit_1_Ch2_slides_2022

The document discusses methods for summarizing both categorical and quantitative data, including frequency distributions, relative frequencies, and graphical representations like pie charts and histograms. It emphasizes the importance of using descriptive statistics and visual tools to derive insights from raw data, illustrated through examples such as guest ratings at a lodge and parts costs at an auto repair shop. Additionally, it covers concepts like Simpson's paradox and the selection of class limits and widths for quantitative data analysis.

Uploaded by

johannesbotle

Unit 1 (continued):

Graphical and Numerical Summaries of Data

Anderson et al., ch. 2

21/07/2022
Overview
From chapter 2:

• Summarizing categorical data

• Summarizing quantitative data

• Summarizing relationships between two categorical variables

• Summarizing relationships between two quantitative variables

• Explaining Simpson’s paradox (use your own numerical example)


Objectives
• Use descriptive statistics to describe and summarize qualitative and quantitative data.

• Construct and interpret graphs representing the distribution of variables in a data set;

• Construct and interpret cross-tabulations and scatter diagrams;

• Explain Simpson’s paradox.


Summarizing qualitative data
• Frequency distribution

• Relative frequency distribution

• Percentage frequency distribution

• Bar charts

• Pie charts

These are the commonly used ways of summarizing this type of data. In this unit, we describe the techniques and illustrate them. Because the reader must not struggle to get the sense of the summary of the data, the choice of which technique to use, and when, is always important.
Frequency distribution

• A frequency distribution is a tabular summary of data showing the number (frequency) of items in each of
several non-overlapping classes.

• The objective of using frequency distribution is to provide insights about the data that cannot be quickly
obtained by looking only at the raw data.

• We find the frequency of qualitative data by counting the number of observations that fall into a given
class/category.

• Think of your close family relatives. How many are female? How many are male? The counts you get for
these two categories are the frequencies.
Example: Bains game lodge
• Mandisa was hired by Bains game lodge to conduct a survey of the quality of accommodation offered by
the lodge. She designed a questionnaire and asked guests staying at the lodge to rate the quality of their
accommodation as being excellent, above average, average, below average, or poor.

• The ratings provided by a sample of 20 guests are:

Below Average    Average          Above Average
Above Average    Above Average    Above Average
Above Average    Below Average    Below Average
Average          Poor             Poor
Above Average    Excellent        Above Average
Average          Above Average    Average
Above Average    Average

• What insights can Mandisa draw from these responses?


• Mandisa cannot quickly get meaningful insights from the raw data above.
• What if she constructs a frequency distribution?
Frequency distribution

Rating Frequency
Poor 2
Below Average 3
Average 5
Above Average 9
Excellent 1
Total 20

• Now Mandisa has a story to tell the management of the lodge.


• The ‘Above Average’ category is the mode, i.e. the rating with the highest frequency.
• ‘Excellent’ is the least selected rating: only one guest chose it.
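The counting step can be sketched in a few lines of Python. This is only an illustration: the variable names (`ratings`, `frequency`, `order`) are assumptions made here, and `collections.Counter` is one of several ways to tally categories.

```python
from collections import Counter

# The 20 guest ratings from the Bains game lodge sample, in the order listed.
ratings = [
    "Below Average", "Average", "Above Average",
    "Above Average", "Above Average", "Above Average",
    "Above Average", "Below Average", "Below Average",
    "Average", "Poor", "Poor",
    "Above Average", "Excellent", "Above Average",
    "Average", "Above Average", "Average",
    "Above Average", "Average",
]

# Count how many observations fall into each category.
frequency = Counter(ratings)

# Print the table from the lowest to the highest rating.
order = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]
for rating in order:
    print(f"{rating:<14}{frequency[rating]:>3}")
print(f"{'Total':<14}{sum(frequency.values()):>3}")

# The mode is the category with the highest frequency.
mode_rating, mode_count = frequency.most_common(1)[0]
print("Mode:", mode_rating, "with", mode_count, "guests")
```

The printed counts should match the frequency distribution table: 2, 3, 5, 9 and 1, with a total of 20.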
Relative frequency distribution

• The relative frequency of a class is the fraction or proportion of the total number of data
items belonging to the class.

• Relative frequency of each class $= \dfrac{\text{Frequency of the class}}{n}$

• n is the number of observations.

• The relative frequency distribution is a tabular summary of a set of data showing the
relative frequency for each class.

In the next two slides, we will provide relative frequency for the Bains Game Lodge data. Before you get to those
slides, can you attempt to calculate relative frequencies for each class/category of ratings?
Percentage frequency distribution

• The percentage frequency of a class is the relative frequency multiplied by 100

• Percentage frequency of each class = Relative frequency × 100

• A percentage frequency distribution is a tabular summary of a set of data showing the percentage
frequency for each class.

In the next slide, we will provide percentage relative frequency distribution for the Bains Game Lodge data. Before
you get to the slide, can you attempt to calculate percentage relative frequencies for each class/category of ratings?
Relative Frequency and
Percentage Frequency Distributions
Rating           Relative Frequency    Percentage Frequency
Poor             0.10                  10
Below Average    0.15                  15
Average          0.25                  25
Above Average    0.45                  45
Excellent        0.05                  5
Total            1.00                  100

Worked entries: for Excellent, 1/20 = 0.05; for Poor, 0.10(100) = 10.
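As a rough check of the table, the two formulas can be applied to the Bains game lodge frequencies in Python. The dictionary names used here are assumptions for illustration only.

```python
# Frequencies taken from the Bains game lodge frequency distribution.
frequency = {
    "Poor": 2,
    "Below Average": 3,
    "Average": 5,
    "Above Average": 9,
    "Excellent": 1,
}

n = sum(frequency.values())  # number of observations (20)

# Relative frequency = frequency of the class / n
relative = {rating: count / n for rating, count in frequency.items()}

# Percentage frequency = relative frequency x 100
percentage = {rating: 100 * value for rating, value in relative.items()}

for rating in frequency:
    print(f"{rating:<14}{relative[rating]:>6.2f}{percentage[rating]:>6.0f}")
print(f"{'Total':<14}{sum(relative.values()):>6.2f}{sum(percentage.values()):>6.0f}")
```

The relative frequencies should sum to 1 and the percentage frequencies to 100, which is a quick way to catch arithmetic slips.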
Display of qualitative data
• Pie chart
• Bar chart
• For both, first find the frequency distribution.

• These graphical tools are most appropriate when the raw data can be naturally categorized in a meaningful
manner.
• “A picture is worth a thousand words”
• These charts provide important insights into this data
Pie Chart
• The pie chart is a very popular tool for presenting relative frequency distributions for
qualitative data.

• The pie chart is a circle subdivided into a number of slices corresponding to the relative
frequency for each class.

• In other words, the size of each slice is proportional to the percentage corresponding to the
category it represents

When the Finance Minister presents a budget, he is sharing the economic cake among different functions.
The cake is the revenue. Each function, e.g. learning and culture, gets its share.
Pie Chart
[Pie chart: Bains game lodge Quality Ratings. Poor 10%, Below Average 15%, Average 25%, Above Average 45%, Excellent 5%]
• Half of the customers surveyed gave Bains a quality rating of “above average” or higher (45% + 5%).
• 10% of the guests gave a poor rating, while 15% gave a below average rating.
• Given these findings, what can Mandisa tell the manager of Bains game lodge?
Bar Chart
• A bar chart, or bar graph, is a graphical tool for depicting qualitative data.

• On one axis (usually the horizontal axis), we specify the labels that are used for each of the classes.

• A frequency, relative frequency, or percentage frequency scale can be used for the other axis (usually the vertical axis).

• Using a bar of fixed width drawn above each class label, we extend the height appropriately.

• The bars are separated to emphasize the fact that each class is a separate category.
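To make the construction concrete, here is a small text-only sketch in Python. It is an assumption-laden illustration: the helper `text_bar_chart` is hypothetical, and for simplicity the bars are drawn horizontally (class labels on the vertical axis), which is the less usual orientation mentioned above.

```python
# Frequencies from the Bains game lodge frequency distribution.
frequency = {
    "Poor": 2,
    "Below Average": 3,
    "Average": 5,
    "Above Average": 9,
    "Excellent": 1,
}


def text_bar_chart(freq, symbol="#"):
    """Return one line per class: the label, then a bar whose length is the frequency."""
    width = max(len(label) for label in freq)
    return [
        f"{label:<{width}} | {symbol * count} {count}"
        for label, count in freq.items()
    ]


# Each class gets its own separate bar, one per line.
for line in text_bar_chart(frequency):
    print(line)
```

Swapping the counts for relative or percentage frequencies changes only the scale of the bars, not their relative lengths.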
Bar Graph
[Bar graph: Bains game lodge Quality Ratings. Horizontal axis: Rating (Poor, Below Average, Average, Above Average, Excellent); vertical axis: Frequency, 1 to 10]
• From raw data to:
• Frequency distribution
• Relative frequency
• Percentage relative frequency
• Pie chart
• Bar graph
• This explains why we must have this course in our degree programme.
Summarizing Quantitative Data
• Frequency Distribution

• Relative Frequency Distribution

• Percentage Frequency Distribution

• Histogram

• Cumulative Distributions

As you work through this part, see the similarities and differences with the categorical data.
Example: Bloem Auto Repair
• Daniel is the manager of Bloem Auto and he would like to have a better understanding of the cost of
parts used in the engine tune-ups performed in the shop.

• He examines 50 customer invoices for tune-ups.

• The costs of parts, rounded to the nearest Rand for the 50 customers, are listed below.

 91   78   93   57   75   52   99   80   97   62
 71   69   72   89   66   75   79   75   72   76
104   74   62   68   97  105   77   65   80  109
 85   97   88   68   83   68   71   69   67   74
 62   82   98  101   79  105   79   69   62   73

• Daniel cannot quickly get insight from the raw data above.
• What can Daniel do? Can he construct a frequency distribution?
• But then there are no classes/categories. How do we solve that problem?
Frequency Distribution for quantitative data
• Guidelines for Selecting Number of Classes

• Use between 5 and 20 classes.

• Data sets with a larger number of elements usually require a larger number of classes.

• Smaller data sets usually require fewer classes

• Use enough classes to show the variation in the data.

• Do not use so many classes that some contain only a few data items or nothing at all.
Frequency Distribution – Number of classes
• To determine the classes for a frequency distribution with quantitative data we need to determine
(1) the number of non-overlapping classes; (2) the width of each class and (3) the class limits.

• The number of classes is generally determined by trial and error, but to minimize the trials one can use the $2^k$
rule.

• The rule chooses the smallest $k$ for which $2^k \ge n$, where $k$ is the number of classes and $n$ is the number of observations. From
Maths class, how do you solve for $k$? Remember: $k \log 2 \ge \log n$.

• For our example, $n = 50$ and by trial: $2^5 = 32$; $2^6 = 64$. Since $2^5 < 50$ and $2^6 \ge 50$, $k = 6$.
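The trial-and-error search and the logarithm form of the rule can both be checked with a short Python sketch. The function name `number_of_classes` is an assumption made for this illustration.

```python
import math


def number_of_classes(n):
    """Smallest integer k such that 2**k >= n, found by trial as on the slide."""
    k = 1
    while 2 ** k < n:
        k += 1
    return k


n = 50
k = number_of_classes(n)

# Logarithm form: k * log(2) >= log(n), so k >= log(n) / log(2).
# Rounding up gives the same k here; for n that is an exact power of 2,
# floating-point rounding could make this shortcut unreliable.
k_from_logs = math.ceil(math.log(n) / math.log(2))

print("k by trial:", k)
print("k from logs:", k_from_logs)
```

For the Bloem Auto Repair data both approaches give 6 classes.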
Frequency Distribution – Class width

• Guidelines for Selecting Width of Classes


• The choices for the number of classes and class width are not independent as more classes mean
smaller class width.
• For consistency and reducing the chances of inappropriate interpretations, use the same class width.
• The number of observations is $n = 50$.
• The minimum = 52 and the maximum = 109.
• $L$ is the approximate class width:
• $L = \dfrac{\text{largest value} - \text{smallest value}}{\text{number of classes}} = \dfrac{109 - 52}{6} = 9.5 \approx 10$

• We round the approximate class width up to 10.


• Now we need to create the frequency distribution….
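The class width calculation is simple arithmetic, sketched here in Python under the assumption that the width is rounded up to the next whole number and that the first class starts at 50 (a convenient round value at or below the minimum of 52). The variable names are illustrative.

```python
import math

smallest, largest = 52, 109   # minimum and maximum parts cost
k = 6                         # number of classes from the 2^k rule

# Approximate class width L = (largest - smallest) / number of classes
approximate_width = (largest - smallest) / k

# Round up so that k classes are wide enough to cover the whole range.
class_width = math.ceil(approximate_width)

# Assumed starting point: 50, a round number not above the minimum.
first_lower_limit = 50
lower_limits = [first_lower_limit + i * class_width for i in range(k)]
upper_limits = [lower + class_width - 1 for lower in lower_limits]

print("Approximate width:", approximate_width)
print("Class width used:", class_width)
print("Classes:", [f"{lo}-{up}" for lo, up in zip(lower_limits, upper_limits)])
```

With these choices the last class ends at 109, so every observation falls into exactly one class.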
Frequency Distribution

For Bloem Auto Repair, if we choose 6 classes, with class width of 10:

Parts Cost (R)    Frequency
50-59             2
60-69             13
70-79             16
80-89             7
90-99             7
100-109           5
Total             50

There is some picture now. Where do we see high frequency? Low frequency?

You can proceed and draw a bar graph with the parts cost classes on the x-axis. The bar graph can show frequency, relative
frequency or percentage relative frequency. Do not hesitate to ask if you do not understand.
Frequency Distribution – Class limit
⚫ Class limits must be chosen so that each data item belongs to one and only one class.
⚫ The lower class limit identifies the smallest possible data value assigned to the class.
⚫ The upper class limit identifies the largest possible data value assigned to the class.
⚫ In the table above, 52 is the lowest value in the data, thus we select 50 as the lower class limit and 59 as the upper class limit of the first class.
⚫ The difference between the lower class limits of adjacent classes is the class width.
⚫ Using the lower class limits of the last two classes, 90 and 100, we see that the class width is 100 - 90 = 10.
⚫ To have the frequency distribution, we count the number of values in each class.
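The counting step for the Bloem Auto Repair data can be sketched as follows. This is an illustrative Python sketch; the names `costs` and `frequency` are assumptions, and the class limits are the ones chosen above (50-59 up to 100-109).

```python
# Parts costs (R) from the 50 tune-up invoices.
costs = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]

class_width = 10
frequency = {}
for lower in range(50, 110, class_width):
    upper = lower + class_width - 1
    # Count the values with lower limit <= value <= upper limit.
    frequency[f"{lower}-{upper}"] = sum(lower <= cost <= upper for cost in costs)

for label, count in frequency.items():
    print(f"{label:<10}{count:>3}")
print(f"{'Total':<10}{sum(frequency.values()):>3}")
```

Because the classes do not overlap and together cover 50 to 109, the frequencies must add up to the 50 invoices.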
Frequency Distribution – Class midpoint
• To derive some of the statistics, we need to derive the midpoints of each class in a frequency
distribution for quantitative data.
• The class midpoint is the value halfway between the lower- and upper class limits.
• For example, for the class 90-99, the midpoint is (90 + 99)/2 = 94.5.

But what are those other statistics that we need midpoint values for? We will talk about them soon.
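As a small illustration, the midpoint definition can be applied to every class of the Bloem Auto Repair frequency distribution (50-59 up to 100-109). The names `classes` and `midpoints` are assumptions for this sketch.

```python
# (lower class limit, upper class limit) for each of the six classes.
classes = [(50, 59), (60, 69), (70, 79), (80, 89), (90, 99), (100, 109)]

# Class midpoint = value halfway between the lower and upper class limits.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

for (lower, upper), midpoint in zip(classes, midpoints):
    print(f"{lower}-{upper}: midpoint {midpoint}")
```

Note that adjacent midpoints differ by the class width of 10.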
Relative Frequency and
Percentage Relative Frequency Distributions

Parts Cost (R)    Relative Frequency    Percentage Frequency
50-59             0.04                  4
60-69             0.26                  26
70-79             0.32                  32
80-89             0.14                  14
90-99             0.14                  14
100-109           0.10                  10
Total             1.00                  100

Worked entries: for R50-59, 2/50 = 0.04 and 0.04(100) = 4; for R90-99, 7/50 = 0.14 and 0.14(100) = 14.

• Only 4% of the parts costs are in the R50-59 class.
• 30% of the parts costs are under R70.
• The greatest percentage (32% or almost one-third) of the parts costs are in the R70-79 class.
• 10% of the parts costs are R100 or more.
Graphical Techniques for Quantitative Data - Histogram

• A common graphical presentation of quantitative data is a histogram.

• The variable of interest is placed on the horizontal axis.

• A rectangle is drawn above each class interval with its height corresponding to the interval’s
frequency, relative frequency, or percentage frequency.

• Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.

• The bars within a histogram do not correspond to named categories, as in the bar chart.

• In the histogram the bars correspond to intervals on the number line.

• These intervals are constructed so that they are all of equal length.


Graphical Techniques for Quantitative Data - Histogram
• In constructing a histogram, avoid class widths of awkward length.

• If the class width is made too narrow, the histogram looks “spikey”, and if the
width is too wide, the histogram is “blurred”.
Histogram - Bloem Auto Repair: 6 classes, with class width of 10
[Histogram: Tune-up Parts Cost. Horizontal axis: Parts Cost (R), classes 50-59, 60-69, 70-79, 80-89, 90-99, 100-109; vertical axis: Frequency, 2 to 18]
What can you say about this histogram? Apart from drawing it, you should be able to interpret what you see in the graph.
Histogram
• From analyzing a histogram we can get information about the shape of the distribution of the data.

• This can be symmetrical, negatively (left) skewed or positively (right) skewed.

• A histogram is negatively skewed if its tail extends further to the left.

• It is positively skewed if the tail extends to the right.

• It is symmetrical if the right tail mirrors the left tail.

• While a symmetrical distribution is the ideal, histograms for real data are never perfectly symmetrical; they
might be roughly symmetrical.
Histogram
• Symmetrical
• Left tail is the mirror image of the right tail
• Examples: heights and weights of people
• The mean and the median are equal

[Histogram: symmetrical distribution. Vertical axis: Relative Frequency, 0 to .35]
Histogram
• Moderately Skewed Left (negative skewness)
• A longer tail to the left
• Example: exam scores
• The mean is less than the median, and they are both less than the mode.
• Actual calculations will confirm this aspect of mean < median < mode.

[Histogram: moderately left-skewed distribution. Vertical axis: Relative Frequency, 0 to .35]
Histogram
• Moderately Right Skewed (Positive Skewness)
• A longer tail to the right
• Example: housing values
• The mean is the largest, while the mode is the smallest.
• Actual calculations will confirm this aspect of mean > median > mode.

[Histogram: moderately right-skewed distribution. Vertical axis: Relative Frequency, 0 to .35]
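The ordering of mean, median and mode under skewness can be checked with a quick calculation. The two small data sets below are hypothetical, made up purely for illustration (they are not from the slides or the textbook), and the ordering shown is typical of skewed data rather than guaranteed for every data set.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, a few are large.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

# Mirror image of the same sample, giving a left-skewed sample.
left_skewed = [16 - value for value in right_skewed]

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    mode = statistics.mode(data)
    print(f"{name}: mean={mean}, median={median}, mode={mode}")
```

For the right-skewed sample the long right tail pulls the mean above the median and the mode; mirroring the data reverses the order.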
Histogram
• Highly Skewed Right (Positive Skewness)
• A very long tail to the right
• Example: executive salaries

[Histogram: highly right-skewed distribution. Vertical axis: Relative Frequency, 0 to .35]
Bell shaped histogram

▪ Many statistical techniques require that the population be bell-shaped.

▪ A bell shape suggests that the data is normally distributed

▪ Drawing the histogram helps verify the shape of the population in question

Using real data, do you think we can get a bell-shaped histogram? For real data, it is never perfectly symmetrical; it might be roughly symmetrical.
Cumulative Distributions
• Cumulative frequency distribution - shows the number of items with values less than or equal to
the upper limit of each class

• Cumulative relative frequency distribution – shows the proportion of items with values less than
or equal to the upper limit of each class.

• Cumulative percentage frequency distribution – shows the percentage of items with values less
than or equal to the upper limit of each class.

• The cumulative frequency distribution uses the number of classes, class width and class limits
adopted for the frequency distribution.

• The last entry in the cumulative relative frequency distribution always equals 1, and the last entry in the
cumulative percentage frequency distribution always equals 100.
Cumulative Distributions
• Bloem Auto Repair

Cost (R)    Cumulative Frequency    Cumulative Relative Frequency    Cumulative Percentage Frequency
≤ 59        2                       0.04                             4
≤ 69        15                      0.30                             30
≤ 79        31                      0.62                             62
≤ 89        38                      0.76                             76
≤ 99        45                      0.90                             90
≤ 109       50                      1.00                             100

Worked entries for the ≤ 69 row: 2 + 13 = 15; 15/50 = 0.30; 0.30(100) = 30.
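The cumulative columns are running totals of the frequency distribution, which can be sketched with `itertools.accumulate` from the Python standard library. The list names are assumptions for this illustration; the frequencies are those of the Bloem Auto Repair classes.

```python
from itertools import accumulate

upper_limits = [59, 69, 79, 89, 99, 109]   # upper class limits
frequency = [2, 13, 16, 7, 7, 5]           # class frequencies
n = sum(frequency)                         # 50 observations

# Running total of the frequencies up to and including each class.
cumulative = list(accumulate(frequency))
cumulative_relative = [count / n for count in cumulative]
cumulative_percentage = [100 * count / n for count in cumulative]

for limit, c, r, p in zip(upper_limits, cumulative,
                          cumulative_relative, cumulative_percentage):
    print(f"<= {limit:<4}{c:>4}{r:>7.2f}{p:>6.0f}")
```

The last row is a useful check: the cumulative frequency must equal n, the cumulative relative frequency 1 and the cumulative percentage 100.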
Cross-tabulations
• So far, we have focused on presentations that are used to summarize the data for one variable at a time.

• A cross-tabulation is a tabular summary of data for two variables simultaneously.

• It allows us to determine the relationship between two variables.

• Cross-tabulation can be used when:

• One variable is qualitative, and the other is quantitative

• Both variables are qualitative

• Both variables are quantitative


Cross-tabulations - Textbook example p 37

• Cross-tabulation of quality rating and meal price for 300 restaurants in Bloemfontein
• The left and top margin labels define the classes for the two variables.
• Quality rating (rows) is the qualitative variable; meal price (columns) is the quantitative variable.

                  Meal Price
Quality Rating    R100-190    R200-290    R300-390    R400-490    Total
Good              42          40          2           0           84
Very Good         34          64          46          6           150
Excellent         2           14          28          22          66
Total             78          118         76          28          300

• The Total column is the frequency distribution for the quality rating; the Total row is the frequency distribution for the meal price.
Insights Gained from Preceding Cross-tabulation.
• The greatest number of restaurants (64) have a very good rating and a meal price of R200-290.

• For the most expensive restaurants (R400-490), none has a good rating and most have an excellent
rating (22/28 = 78.6%).

• Of the 78 least expensive restaurants (R100-190), only 2 have an excellent rating (2/78 = 2.6%), while 42 of
them have a good rating (42/78 = 53.8%).

• The right and bottom margins of the cross-tabulation provide the frequency distributions of quality rating
and meal price.

• The right margin (in Red) is the frequency distribution of quality rating and the bottom margin (in Green) is
the frequency distribution of meal price.

• Dividing the totals in the bottom margin (78, 118, 76 and 28) by the total for that row (300) provides relative
and percentage frequency distributions for meal price.

• Looking at R100-190 – (78/300=0.26), implying 26% of the restaurants were charging a meal price of R100-
Cross-tabulations

• The frequency and relative frequency distributions derived from the margins of a cross-
tabulation provide information about each variable.

• They do not shed light on the relationship between the two variables.

• To explore the relationship between the 2 variables we need to convert the entries in a cross-
tabulation into row or column percentages.

• The example in the textbook looked at the relationship based on row percentages.

• I will do the column percentages to have a complete picture of this example.



Cross-tabulation: Column Percentages

• Each column in the table is a percentage frequency distribution of quality rating for one
of the meal prices. For the first column, the Good entry is 100 × (42/78) = 53.8.

• You are expected to know how to derive the row and column
percentage tables and interpret the results.
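Both percentage tables can be derived from the restaurant cross-tabulation with a few lines of Python. This is a sketch only; the nested-list layout and the variable names are assumptions made for illustration.

```python
ratings = ["Good", "Very Good", "Excellent"]
prices = ["R100-190", "R200-290", "R300-390", "R400-490"]

# Cell counts: one row per quality rating, one column per meal price class.
counts = [
    [42, 40, 2, 0],
    [34, 64, 46, 6],
    [2, 14, 28, 22],
]

row_totals = [sum(row) for row in counts]
column_totals = [sum(column) for column in zip(*counts)]

# Column percentages: each cell divided by its column total.
column_pct = [
    [100 * cell / total for cell, total in zip(row, column_totals)]
    for row in counts
]

# Row percentages: each cell divided by its row total.
row_pct = [
    [100 * cell / total for cell in row]
    for row, total in zip(counts, row_totals)
]

print("Column percentages")
for rating, row in zip(ratings, column_pct):
    print(f"{rating:<10}" + "".join(f"{value:>8.1f}" for value in row))

print("Row percentages")
for rating, row in zip(ratings, row_pct):
    print(f"{rating:<10}" + "".join(f"{value:>8.1f}" for value in row))
```

Each column of the column-percentage table, and each row of the row-percentage table, should add up to 100.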
Cross-tabulation: Column Percentages

• Of the restaurants charging low price (R100-190), the greatest proportion are rated good (54%)

• For restaurants charging high price (R400-490), the greatest proportion are rated highly (79%
have an excellent rating).

• Based on this, it seems that meal price is positively associated with quality rating.

• In doing this example, did you see the mix of qualitative (quality) and quantitative (price)
variables?

• While qualitative variables are categorical already, for quantitative we need to create classes
before using the variable in a crosstab.
Cross-tabulation: Simpson’s Paradox

• Simpson’s Paradox: A phenomenon in statistics in which the conclusions based upon an
aggregated cross-tabulation can be completely reversed if we look at the disaggregated data.

• Data in two or more cross-tabulations are often aggregated to produce a summary cross-tabulation.

• Patterns seen in the aggregated data may be reversed or disappear altogether in the
disaggregated data.

• For example, the results of a cross-tabulation of health status and youth status might change
once we disaggregate youth status by gender.

• Health status of male and female youths is likely to differ.

• We must be careful in drawing conclusions about the relationship between the two variables in
the aggregated cross-tabulation.
Simpson’s paradox
[Tables: aggregated data and disaggregated data]
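A numerical example may help to see the reversal. The sketch below uses entirely hypothetical numbers invented for this illustration (two lodges, guest satisfaction split by season); they are not taken from the slides or the textbook.

```python
# Hypothetical data: (satisfied guests, total guests) per lodge and season.
data = {
    "Lodge A": {"off-peak": (18, 20), "peak": (48, 80)},
    "Lodge B": {"off-peak": (64, 80), "peak": (10, 20)},
}


def rate(satisfied, total):
    """Percentage of guests who were satisfied."""
    return 100 * satisfied / total


# Disaggregated view: one satisfaction rate per lodge and season.
disaggregated = {
    lodge: {season: rate(*cell) for season, cell in seasons.items()}
    for lodge, seasons in data.items()
}

# Aggregated view: add the seasons together before computing the rate.
aggregated = {
    lodge: rate(
        sum(satisfied for satisfied, _ in seasons.values()),
        sum(total for _, total in seasons.values()),
    )
    for lodge, seasons in data.items()
}

for lodge in data:
    print(lodge, "by season:", disaggregated[lodge],
          "overall:", aggregated[lodge])
```

In this made-up example Lodge A has the higher satisfaction rate in each season, yet Lodge B has the higher overall rate, because most of Lodge A's guests came in the harder peak season. That reversal between the aggregated and disaggregated tables is the paradox.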
Scatter Diagram and Trend Line

• A scatter diagram is a graphical presentation of the relationship between two quantitative variables.

• One variable is shown on the horizontal axis and the other variable is shown on the vertical
axis.

• The general pattern of the plotted points suggests the overall relationship between the
variables.

• A trend line is an approximation of the relationship, which can be positive, negative or absent.

A scatter diagram can indicate 3 possible relationships between 2 variables:
• A positive relationship, shown by an upward sloping trend line.
• A negative relationship, shown by a downward sloping trend line.
• No apparent relationship, shown by a trend line that is close to horizontal.
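The three cases can be illustrated numerically. The slides do not say how the trend line is obtained; the sketch below assumes a least-squares line, which is one common choice, and uses small hypothetical data sets invented for this example. The function name `trend_line` is also an assumption.

```python
def trend_line(xs, ys):
    """Least-squares slope and intercept of y on x (one common trend line)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data for the three possible patterns.
x = [1, 2, 3, 4, 5]
examples = {
    "positive": [2, 4, 5, 4, 6],
    "negative": [6, 4, 5, 4, 2],
    "no apparent": [4, 5, 3, 5, 4],
}

for label, y in examples.items():
    slope, intercept = trend_line(x, y)
    print(f"{label:<12} slope={slope:+.2f} intercept={intercept:.2f}")
```

A clearly positive slope corresponds to an upward sloping trend line, a clearly negative slope to a downward sloping one, and a slope near zero to a roughly horizontal line.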
Scatter Diagram and Trend Line

• A Positive Relationship
[Scatter diagram with trend line. Horizontal axis: x; vertical axis: y]
Scatter Diagram and Trend Line

• A Negative Relationship
[Scatter diagram with trend line. Horizontal axis: x; vertical axis: y]
Scatter Diagram and Trend Line

• No Apparent Relationship
[Scatter diagram with trend line. Horizontal axis: x; vertical axis: y]
Summary: Tabular and Graphical Procedures

Data

Qualitative Data

Tabular Methods
• Frequency Distribution
• Relative Freq. Distribution
• Percent Freq. Distribution
• Cross-tabulation

Graphical Methods
• Bar Graph
• Pie Chart

Quantitative Data

Tabular Methods
• Frequency Dist.
• Rel. Freq. Dist.
• % Freq. Dist.
• Cum. Freq. Dist.
• Cum. Rel. Freq. Distribution
• Cum. % Freq. Distribution
• Cross-tabulation

Graphical Methods
• Dot Plot
• Histogram
• Stem-and-Leaf Display
• Scatter Diagram
End of Chapter 2

Attempt questions provided at the end of the chapter in the textbook.


Next: UNIT 1, CONT…..
UNIT 1: DESCRIPTIVE STATISTICS
