Chapter 2 Descriptive Statistics Ver 3
Chapter 2
DESCRIPTIVE STATISTICS
Chapter Goals
After completing this chapter, you should be able to:
• Create and interpret graphs to describe categorical variables: frequency distribution, bar chart, pie chart, Pareto diagram
• Create and interpret graphs to describe numerical variables
• Create and interpret graphs to describe relationships between variables
• Display data graphically
Chapter Goals
(continued)
After completing this chapter, you should be able to:
• Compute and interpret the mean, median, and mode for a set of data
• Find the range, variance, standard deviation, and coefficient of variation and know what these values mean
• Apply the empirical rule to describe the variation of values around the mean
• Measure the linear relationship between two variables
Chapter Topics
• Measures of central tendency, variation, and shape
  • Mean, median, mode, geometric mean
  • Quartiles
  • Range, interquartile range, variance and standard deviation, coefficient of variation
• Population summary measures
Chapter Topics
(continued)
• Five number summary and box-and-whisker plots
• Covariance and coefficient of correlation
• Pitfalls in numerical descriptive measures and ethical considerations
2.1 Summarising data for a categorical variable
Categorical Data
• Tabulating Data: Frequency Distribution Table
• Graphing Data: Bar Chart, Pie Chart, Pareto Diagram
2.1.1 The Frequency Distribution Table
Hospital Unit     Number of Patients
Cardiac Care      1,052
Emergency         2,245
Intensive Care      340
Maternity           552
Surgery           4,630

(Variables are categorical)
2.1.2 Charts
Bar and Pie Charts
• Bar charts and pie charts are often used for categorical data
• Height of bar or size of pie slice shows the frequency or percentage for each category
Bar Chart Example
Hospital Unit     Number of Patients
Cardiac Care      1,052
Emergency         2,245
Intensive Care      340
Maternity           552
Surgery           4,630

[Figure: bar chart of number of patients by hospital unit; vertical axis from 0 to 5000]
Pie Chart Example
Hospital Unit     Number of Patients   % of Total
Cardiac Care      1,052                11.93
Emergency         2,245                25.46
Intensive Care      340                 3.86
Maternity           552                 6.26
Surgery           4,630                52.50

[Figure: pie chart "Hospital Patients by Unit": Surgery 53%, Emergency 25%, Cardiac Care 12%, Maternity 6%, Intensive Care 4%]
(Percentages are rounded to the nearest percent)
Pareto Diagram
• Used to portray categorical data
• A bar chart, where categories are shown in descending order of frequency
• A cumulative polygon is often shown in the same graph
• Used to separate the "vital few" from the "trivial many"
Pareto Diagram Example
Example: 400 defective items are examined

Source of Manufacturing Error   Number of defects
Bad Weld                         34
Poor Alignment                  223
Missing Part                     25
Paint Flaw                       78
Electrical Short                 19
Cracked case                     21
Total                           400
Pareto Diagram Example (continued)
Step 1: Sort by defect cause, in descending order of frequency
Step 2: Determine % in each category

Source of Manufacturing Error   Number of defects   % of Total Defects
Poor Alignment                  223                 55.75
Paint Flaw                       78                 19.50
Bad Weld                         34                  8.50
Missing Part                     25                  6.25
Cracked case                     21                  5.25
Electrical Short                 19                  4.75
Total                           400                100
Step 3: Show results graphically

[Figure: Pareto diagram "Cause of Manufacturing Defect". Bars show the % of defects in each category (left axis, 0% to 60%); a line graph shows the cumulative % (right axis, 0% to 100%). Categories in order: Poor Alignment, Paint Flaw, Bad Weld, Missing Part, Cracked case, Electrical Short]
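
The three Pareto steps can be sketched in a few lines of Python. This is only an illustration using the defect counts from the example; the function name `pareto_table` is made up for this sketch, and it produces the table behind the chart rather than the chart itself.

```python
# Defect counts from the example (400 defective items in total).
defects = {
    "Bad Weld": 34,
    "Poor Alignment": 223,
    "Missing Part": 25,
    "Paint Flaw": 78,
    "Electrical Short": 19,
    "Cracked case": 21,
}


def pareto_table(counts):
    """Return rows (category, count, percent, cumulative percent), largest count first."""
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    # Step 1: sort the categories by count, in descending order.
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        percent = 100 * count / total   # Step 2: % of all defects in this category
        cumulative += percent           # running total for the cumulative line
        rows.append((name, count, percent, cumulative))
    return rows


# Step 3 would plot the percent column as bars and the cumulative column as a line.
for name, count, percent, cumulative in pareto_table(defects):
    print(f"{name:<18}{count:>5}{percent:>8.2f}{cumulative:>8.2f}")
```

The first two categories (Poor Alignment and Paint Flaw) already account for about three quarters of all defects, which is the kind of concentration a Pareto diagram is meant to reveal.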
1.4 Graphs for Time-Series Data
• A line chart (time-series plot) is used to show the values of a variable over time
• Time is measured on the horizontal axis
• The variable of interest is measured on the vertical axis
Line Chart Example

[Figure: line chart "Magazine Subscriptions by Year"; horizontal axis: year, 1990 to 2006; vertical axis: thousands of subscribers, 0 to 350]
2.2 Summarising data for a quantitative variable
using tables and graphics
Numerical Data
• Frequency Distributions and Cumulative Distributions: Histogram, Ogive
• Stem-and-Leaf Display
2.2.1 The Frequency Distribution Table
• A frequency distribution is a list or table containing class groupings (categories or ranges within which the data fall) and the corresponding frequencies with which data fall within each class
Why Use Frequency Distributions?
• A frequency distribution is a way to summarize data
• The distribution condenses the raw data into a more useful form and allows for a quick visual interpretation of the data
Class Intervals
and Class Boundaries
• Each class grouping has the same width
• Determine the width of each interval by

$$ w = \text{interval width} = \frac{\text{largest number} - \text{smallest number}}{\text{number of desired intervals}} $$

• Round the width up to obtain convenient interval endpoints
Frequency Distribution Example
Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature:

24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
32, 13, 12, 38, 41, 43, 44, 27, 53, 27
Frequency Distribution Example
(continued)
• Sort raw data in ascending order:
  12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Find range: 58 - 12 = 46
• Select number of classes: 5 (usually between 5 and 15)
• Compute interval width: 10 (46/5 = 9.2, then round up)
• Determine interval boundaries: 10 but less than 20, 20 but less than 30, . . . , 50 but less than 60
Frequency Distribution Example
(continued)
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Interval               Frequency   Relative Frequency   Percentage
10 but less than 20    3           .15                  15
20 but less than 30    6           .30                  30
30 but less than 40    5           .25                  25
40 but less than 50    4           .20                  20
50 but less than 60    2           .10                  10
Total                  20          1.00                 100
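
As a rough check on the table, the counting step can be sketched in Python. The helper name `frequency_table` and its parameters are assumptions made for this illustration; the class rule "lower but less than upper" follows the example.

```python
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]


def frequency_table(data, start, width, n_classes):
    """Count observations in classes of the form 'lower but less than upper'."""
    rows = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width
        freq = sum(1 for x in data if lower <= x < upper)
        rows.append((lower, upper, freq, freq / len(data)))
    return rows


data_range = max(temps) - min(temps)   # 58 - 12 = 46
width = 10                             # 46 / 5 = 9.2, rounded up to a convenient value

for lower, upper, freq, rel in frequency_table(temps, start=10, width=width, n_classes=5):
    print(f"{lower} but less than {upper}: {freq:2d}  {rel:.2f}  {100 * rel:.0f}%")
```

Starting the first class at 10 rather than at the minimum (12) is a judgment call that gives round endpoints.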
• A graph of the data in a frequency distribution is called a histogram
• The interval endpoints are shown on the horizontal axis
• The vertical axis is either frequency, relative frequency, or percentage
• Bars of the appropriate heights are used to represent the number of observations within each class
Histogram Example
Interval               Frequency
10 but less than 20    3
20 but less than 30    6
30 but less than 40    5
40 but less than 50    4
50 but less than 60    2

[Figure: histogram "Daily High Temperature"; horizontal axis: interval endpoints from 0 to 70; vertical axis: frequency; bar heights 3, 6, 5, 4, 2 for the five classes]
(No gaps between bars)
1. How wide should each interval be?
2. How should the endpoints of the intervals be determined?
• Often answered by trial and error, subject to user judgment
• The goal is to create a distribution that is neither too "jagged" nor too "blocky"
• Many (narrow) class intervals
  • May yield a very jagged distribution with gaps from empty classes
  • Can give a poor indication of how frequency varies across classes
• Few (wide) class intervals
  • May compress variation too much and yield a blocky distribution
  • Can obscure important patterns of variation

[Figure: two histograms of temperature, one with many narrow class intervals and one with few wide class intervals]
(X axis labels are upper class endpoints)
The Cumulative Frequency Distribution
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class                  Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
10 but less than 20    3           15           3                      15
20 but less than 30    6           30           9                      45
30 but less than 40    5           25           14                     70
40 but less than 50    4           20           18                     90
50 but less than 60    2           10           20                     100
Total                  20          100
2.2.3 The Ogive
Graphing Cumulative Frequencies
Interval               Upper interval endpoint   Cumulative Percentage
Less than 10           10                        0
10 but less than 20    20                        15
20 but less than 30    30                        45
30 but less than 40    40                        70
40 but less than 50    50                        90
50 but less than 60    60                        100

[Figure: ogive "Daily High Temperature"; horizontal axis: interval endpoints, 10 to 60; vertical axis: cumulative percentage, 0 to 100]
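
The points plotted on the ogive are running totals of the class frequencies. A minimal Python sketch, assuming the class frequencies 3, 6, 5, 4, 2 from the temperature example (variable names are illustrative only):

```python
from itertools import accumulate

frequencies = [3, 6, 5, 4, 2]            # classes 10-20, 20-30, ..., 50-60
upper_endpoints = [20, 30, 40, 50, 60]

cumulative_freq = list(accumulate(frequencies))              # running totals
total = cumulative_freq[-1]
cumulative_pct = [100 * c / total for c in cumulative_freq]  # as percentages

# The ogive starts at 0% at the lower endpoint of the first class,
# then passes through (upper endpoint, cumulative %) for every class.
ogive_points = [(10, 0.0)] + list(zip(upper_endpoints, cumulative_pct))
print(ogive_points)
```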
2.2.4 Stem-and-Leaf Diagram
• A simple way to see how the values in a data set are distributed
• METHOD: Separate the sorted data series into leading digits (the stems) and the trailing digits (the leaves)
Example
Data in ordered array:
21, 24, 24, 26, 27, 27, 30, 32, 38, 41

• Here, use the 10's digit for the stem unit:

                   Stem | Leaf
  21 is shown as      2 | 1
  38 is shown as      3 | 8
Example
(continued)
• Completed stem-and-leaf diagram:

Stem | Leaves
2    | 1 4 4 6 7 7
3    | 0 2 8
4    | 1
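
The stem-and-leaf method can be sketched as follows. The function `stem_and_leaf` and its `leaf_unit` parameter are hypothetical names for this illustration: with `leaf_unit=1` the 10's digit is the stem, and with `leaf_unit=10` values are first rounded to the nearest 10 so that the 100's digit becomes the stem.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Group values by stem; each leaf is the digit in the leaf_unit place."""
    stems = defaultdict(list)
    for v in sorted(values):
        # Round to a whole number of leaf units (Python rounds exact halves to even).
        scaled = round(v / leaf_unit)
        stem, leaf = divmod(scaled, 10)
        stems[stem].append(leaf)
    return dict(stems)


data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]
for stem, leaves in stem_and_leaf(data).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Calling `stem_and_leaf([613, 776, 1224], leaf_unit=10)` gives stems 6, 7, and 12 with leaves 1, 8, and 2.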
Using other stem units
• Using the 100's digit as the stem:
  • Round off the 10's digit to form the leaves

                        Stem | Leaf
  613 would become         6 | 1
  776 would become         7 | 8
  ...
  1224 becomes            12 | 2
Using other stem units
(continued)
• Using the 100's digit as the stem:
• The completed stem-and-leaf display:

Data:
613, 632, 658, 717, 722, 750, 776, 827, 841, 859, 863, 891, ..., 1169, 1224

Stem | Leaves
6    | 1 3 6
7    | 2 2 5 8
8    | 3 4 6 6 9 9
9    | 1 3 3 6 8
11   | 4 7
12   | 2
2.3 Summarising data for two variables using tables and graphics
Relationships Between Variables
• Graphs illustrated so far have involved only a single variable
• When two variables exist, other techniques are used:
  • Categorical (qualitative) variables: cross tables
  • Numerical (quantitative) variables: scatter plots
2.3.1 Cross Tables
• Cross tables (or contingency tables) list the number of observations for every combination of values for two categorical or ordinal variables
• If there are r categories for the first variable (rows) and c categories for the second variable (columns), the table is called an r x c cross table
Cross Table Example
• 4 x 3 Cross Table for Investment Choices by Investor (values in $1000's)

Category    (one column per investor)    Total
Stocks      46.5    55    27.5           129
Bonds       32.0    44    19.0            95
CD          15.5    20    13.5            49
2.3.2 Charts
• Scatter diagrams are used for paired observations taken from two numerical variables
• The scatter diagram: one variable is measured on the vertical axis and the other variable is measured on the horizontal axis
x     Cost per Day
23    125
26    140
29    146
33    160
38    167
42    170
50    188
55    195
60    200

[Figure: scatter diagram of the paired values; horizontal axis from 0 to 70; vertical axis: Cost per Day, 0 to 250]
• Side by side bar charts

[Figure: side-by-side bar chart "Comparing Investors"; categories: Savings, CD, Bonds, Stocks; horizontal axis from 0 to 60]
• Sales by quarter for three sales territories:

         1st Qtr   2nd Qtr   3rd Qtr   4th Qtr
North    45.9      46.9      45        43.9

[Figure: side-by-side bar chart of sales by quarter for the East, West, and North territories; vertical axis from 0 to 60]
Graphical
Presentation of Data
(continued)
• Techniques reviewed in this chapter:

Categorical Variables        Numerical Variables
• Frequency distribution     • Line chart
• Bar chart                  • Frequency distribution
• Pie chart                  • Histogram and ogive
                             • Scatter plot
2.4 Summarizing data for a quantitative variable
using numerical measures
Describing Data Numerically
• Central Tendency: Arithmetic Mean, Median, Mode
• Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
2.4.1 Measures of center location
(measures of central tendency)
Overview

Central Tendency
• Mean: the arithmetic average

$$ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} $$

• Median: the midpoint of the ranked values
• Mode: the most frequently observed value
Arithmetic Mean
• The arithmetic mean (mean) is the most common measure of central tendency
• For a population of N values:

$$ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} $$

  where the x_i are the population values and N is the population size
• For a sample of size n:

$$ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} $$

  where the x_i are the observed values and n is the sample size
Arithmetic Mean
(continued)
• The most common measure of central tendency
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)

Values 1, 2, 3, 4, 5:    Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Values 1, 2, 3, 4, 10:   Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
Median
• In an ordered list, the median is the "middle" number (50% above, 50% below)
• Not affected by extreme values

[Figure: two dot plots on a scale from 0 to 10, each with Median = 3]
Finding the Median
• The location of the median:

$$ \text{Median position} = \frac{n+1}{2} \text{ position in the ordered data} $$

• If the number of values is odd, the median is the middle number
• If the number of values is even, the median is the average of the two middle numbers
• Note that (n+1)/2 is not the value of the median, only the position of the median in the ranked data
Mode
• A measure of central tendency
• Value that occurs most often
• There may be no mode
• There may be several modes

[Figure: two dot plots, one on a scale from 0 to 14 with Mode = 9 and one on a scale from 0 to 6 with no mode]
Review Example:
Summary Statistics
• Five houses on a hill by the beach

House Prices:
$2,000,000
   500,000
   300,000
   100,000
   100,000
Sum 3,000,000

• Mean: ($3,000,000/5) = $600,000
• Median: middle value of ranked data = $300,000
• Mode: most frequent value = $100,000
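
The three summary statistics for the house prices can be checked with Python's standard `statistics` module. This is just a sketch of the review example; `multimode` returns a list because a data set may have several modes.

```python
from statistics import mean, median, multimode

prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

print(mean(prices))       # sum of the values divided by the number of values
print(median(prices))     # middle value of the ranked data
print(multimode(prices))  # list of the most frequent value(s)
```

The single $2,000,000 house pulls the mean to twice the median, which is why the median is often preferred when outliers are present.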
Which measure of location
is the “best”?
• Mean is generally used, unless extreme values (outliers) exist
• Then median is often used, since the median is not sensitive to extreme values
Shape of a Distribution
• Describes how data are distributed
• Measures of shape: symmetric or skewed
2.4.2 Measures of dispersion
(measures of variability)
Variation
• Range
• Interquartile Range
• Variance
• Standard Deviation
• Coefficient of Variation

• Measures of variation give information on the spread or variability of the data values

[Figure: two distributions with the same center, different variation]
Range
• Simplest measure of variation
• Difference between the largest and the smallest observations:

  Range = Xlargest – Xsmallest

Example:

[Figure: dot plot on a scale from 0 to 14]
Range = 14 - 1 = 13
Disadvantages of the Range
• Ignores the way in which data are distributed

[Figure: two dot plots on a scale from 7 to 12 with different distributions]
Range = 12 - 7 = 5 in both cases

• Sensitive to outliers

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
Interquartile Range
• Can eliminate some outlier problems by using the interquartile range
• Eliminate high- and low-valued observations and calculate the range of the middle 50% of the data
• Interquartile range = 3rd quartile – 1st quartile

  IQR = Q3 – Q1
Interquartile Range
Example:

X minimum   Q1    Median (Q2)   Q3    X maximum
12          30    45            57    70

(25% of the values fall in each of the four segments)

Interquartile range = 57 – 30 = 27
Quartiles
• Quartiles split the ranked data into 4 segments with an equal number of values per segment (25% in each, separated by Q1, Q2, and Q3)
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
• Q2 is the same as the median (50% are smaller, 50% are larger)
• Only 25% of the observations are greater than the third quartile
Quartile Formulas
Find a quartile by determining the value in the appropriate position in the ranked data, where

First quartile position:    Q1 = 0.25(n+1)
Second quartile position:   Q2 = 0.50(n+1) (the median position)
Third quartile position:    Q3 = 0.75(n+1)

where n is the number of observed values
Quartiles
• Example: Find the first quartile

(n = 9)
Q1 is in the 0.25(9+1) = 2.5 position of the ranked data,
so use the value half way between the 2nd and 3rd values,
so Q1 = 12.5
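
The position rule can be sketched in Python as below. The sample values are hypothetical, chosen only so that the 2nd and 3rd ranked values are 12 and 13 and n = 9. The helper names are made up for this illustration, and fractional positions are handled by linear interpolation, which reduces to "half way between" at position 2.5.

```python
def value_at_position(ranked, position):
    """Value at a 1-based, possibly fractional, position in ranked data."""
    lower = int(position)              # whole part of the position
    fraction = position - lower        # how far to move toward the next value
    if lower < 1:
        return ranked[0]
    if lower >= len(ranked):
        return ranked[-1]
    return ranked[lower - 1] + fraction * (ranked[lower] - ranked[lower - 1])


def quartiles(data):
    """Q1, Q2, Q3 using the positions 0.25(n+1), 0.50(n+1), 0.75(n+1)."""
    ranked = sorted(data)
    n = len(ranked)
    return tuple(value_at_position(ranked, p * (n + 1)) for p in (0.25, 0.50, 0.75))


sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]   # hypothetical sample, n = 9
q1, q2, q3 = quartiles(sample)
print(q1, q2, q3, "IQR =", q3 - q1)
```

Other software may use slightly different position rules, so quartiles from a spreadsheet or statistics package will not always match these values exactly.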
Population Variance
• Average of squared deviations of values from the mean
• Population variance:

$$ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} $$

  where μ is the population mean and N = population size
Sample Variance
• Average (approximately) of squared deviations of values from the mean
• Sample variance:

$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$

  where \bar{x} is the sample mean and n = sample size
Population Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
• Population standard deviation:

$$ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} $$
Sample Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
• Sample standard deviation:

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$
Calculation Example:
Sample Standard Deviation
Sample Data (x_i):   10   12   14   15   17   18   18   24

n = 8     Mean = \bar{x} = 16

$$ s = \sqrt{\frac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + (14 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}} $$

$$ = \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} = \sqrt{\frac{130}{7}} = 4.3095 $$

A measure of the "average" scatter around the mean
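
The calculation can be reproduced step by step in Python. This is a sketch with illustrative variable names; summing the eight squared deviations directly gives 36 + 16 + 4 + 1 + 1 + 4 + 4 + 64 = 130.

```python
import math

data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
x_bar = sum(data) / n                               # sample mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)    # sum of squared deviations
sample_variance = sum_sq_dev / (n - 1)              # divide by n - 1 for a sample
sample_sd = math.sqrt(sample_variance)

print(x_bar, sum_sq_dev, round(sample_sd, 4))
```

Dividing by n instead of n - 1 would give the population formula, which is appropriate only when the eight values are the whole population.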
Measuring variation
[Figure: two distributions, one with a small standard deviation and one with a large standard deviation]
Comparing Standard Deviations
[Figure: three dot plots on a scale from 11 to 21]

Data A:   Mean = 15.5   s = 3.338
Data B:   Mean = 15.5   s = 0.926
Data C:   Mean = 15.5   s = 4.570
Advantages of Variance and
Standard Deviation
• Each value in the data set is used in the calculation
• Values far from the mean are given extra weight (because deviations from the mean are squared)
Coefficient of Variation
• Measures relative variation
• Always in percentage (%)
• Shows variation relative to mean
• Can be used to compare two or more sets of data measured in different units

$$ CV = \left( \frac{s}{\bar{x}} \right) \cdot 100\% $$
Comparing Coefficient
of Variation
• Stock A:
  • Average price last year = $50
  • Standard deviation = $5

$$ CV_A = \left( \frac{s}{\bar{x}} \right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\% $$

• Stock B:
  • Average price last year = $100
  • Standard deviation = $5

$$ CV_B = \left( \frac{s}{\bar{x}} \right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\% $$

Both stocks have the same standard deviation, but stock B is less variable relative to its price
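
The comparison of the two stocks amounts to one division each. A minimal Python sketch, where `coefficient_of_variation` is a name chosen for this illustration:

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(5, 50)     # Stock A: s = $5, mean = $50
cv_b = coefficient_of_variation(5, 100)    # Stock B: s = $5, mean = $100

print(f"CV_A = {cv_a:.0f}%, CV_B = {cv_b:.0f}%")
```

Because the units cancel in the ratio, the same function could compare, say, prices in dollars with weights in kilograms.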
Chebychev’s Theorem
• For any population with mean μ and standard deviation σ, and k > 1, the percentage of observations that fall within the interval

  [μ ± kσ]

  is at least

$$ 100 \left[ 1 - \frac{1}{k^2} \right] \% $$
Chebychev’s Theorem
(continued)
• Regardless of how the data are distributed, at least (1 - 1/k²) of the values will fall within k standard deviations of the mean (for k > 1)
• Examples:

At least                     within
(1 - 1/1.5²) = 55.6%         k = 1.5 (μ ± 1.5σ)
(1 - 1/2²) = 75%             k = 2 (μ ± 2σ)
(1 - 1/3²) = 88.9%           k = 3 (μ ± 3σ)
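
The bound is easy to tabulate for other values of k. A small Python sketch, with `chebyshev_lower_bound` as an illustrative name:

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 100 * (1 - 1 / k**2)


for k in (1.5, 2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1f}% within k standard deviations")
```

These are guaranteed minimums for any distribution; for bell-shaped data the actual percentages are considerably higher.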
The Empirical Rule
• If the data distribution is bell-shaped, then the interval:
  • μ ± 1σ contains about 68% of the values in the population or the sample

[Figure: bell-shaped curve centered at μ, with 68% of the area within μ ± 1σ]
The Empirical Rule
  • μ ± 2σ contains about 95% of the values in the population or the sample
  • μ ± 3σ contains almost all (about 99.7%) of the values in the population or the sample

[Figure: two bell-shaped curves, with 95% of the area within μ ± 2σ and 99.7% within μ ± 3σ]
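
The 68%, 95%, and 99.7% figures can be checked against an exactly normal distribution, which is one specific bell-shaped curve. This sketch assumes normality and uses `NormalDist` from Python's standard `statistics` module; real bell-shaped data will only match approximately.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Percentage of a normal population within k standard deviations of the mean."""
    return 100 * (standard_normal.cdf(k) - standard_normal.cdf(-k))


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, pct in shares.items():
    print(f"within {k} standard deviation(s): {pct:.1f}%")
```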
2.4.3 Weighted Mean
• The weighted mean of a set of data is

$$ \bar{x} = \frac{\sum_{i} w_i x_i}{n} = \frac{w_1 x_1 + w_2 x_2 + \cdots}{n} $$

  where w_i is the weight of the ith observation and n = \sum_{i} w_i
• Use when data is already grouped into classes, with w_i values in the ith class
Approximations for Grouped Data
Suppose data are grouped into K classes, with frequencies f_1, f_2, . . ., f_K, and the midpoints of the classes are m_1, m_2, . . ., m_K

• For a sample of n observations, the mean is

$$ \bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n} \quad \text{where } n = \sum_{i=1}^{K} f_i $$
Approximations for Grouped Data
Suppose data are grouped into K classes, with frequencies f_1, f_2, . . ., f_K, and the midpoints of the classes are m_1, m_2, . . ., m_K

• For a sample of n observations, the variance is

$$ s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1} $$
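
As an illustration, the grouped-data formulas can be applied to the daily high temperature frequency distribution from section 2.2 (frequencies 3, 6, 5, 4, 2 for classes of width 10 starting at 10). The midpoints and variable names below are assumptions of this sketch.

```python
frequencies = [3, 6, 5, 4, 2]
midpoints = [15, 25, 35, 45, 55]   # class midpoints, e.g. (10 + 20) / 2 = 15

n = sum(frequencies)
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
grouped_variance = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints)
) / (n - 1)

# Exact mean from the 20 raw observations, for comparison.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]
raw_mean = sum(temps) / len(temps)

print(grouped_mean, round(grouped_variance, 2), raw_mean)
```

The grouped mean (33) is close to, but not equal to, the mean of the raw data (32.4), because every observation is treated as if it sat at its class midpoint.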
2.6 Measures of association between two
quantitative variables
• The covariance measures the strength of the linear relationship between two variables
• The population covariance:

$$ \mathrm{Cov}(x, y) = \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N} $$

• The sample covariance:

$$ \mathrm{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$

• Only concerned with the strength of the relationship
• No causal effect is implied
Interpreting Covariance
• Covariance between two variables:

Cov(x,y) > 0    x and y tend to move in the same direction
Cov(x,y) < 0    x and y tend to move in opposite directions
Cov(x,y) = 0    x and y have no linear relationship
Coefficient of Correlation
• Measures the relative strength of the linear relationship between two variables
• Population correlation coefficient:

$$ \rho = \frac{\mathrm{Cov}(x, y)}{\sigma_X \sigma_Y} $$

• Sample correlation coefficient:

$$ r = \frac{\mathrm{Cov}(x, y)}{s_X s_Y} $$
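
The sample formulas can be sketched in Python using the paired observations from the scatter diagram example in section 2.3.2. The function names are illustrative, and the first variable is simply called `x` because its label is not given.

```python
import math


def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))   # Cov(x, x) is the sample variance of x
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


x = [23, 26, 29, 33, 38, 42, 50, 55, 60]
cost_per_day = [125, 140, 146, 160, 167, 170, 188, 195, 200]

cov_xy = sample_covariance(x, cost_per_day)
r = sample_correlation(x, cost_per_day)
print(f"covariance = {cov_xy:.2f}, r = {r:.3f}")
```

The covariance depends on the units of both variables, while r is unit free, so r is the easier of the two to interpret.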
Features of
Correlation Coefficient, r
• Unit free
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker any linear relationship
Scatter Plots of Data with Various
Correlation Coefficients
[Figure: six scatter plots of Y against X, with r = -1, r = -.6, r = 0 in the top row and r = +1, r = +.3, r = 0 in the bottom row]
Chapter Summary
• Reviewed types of data and measurement levels
• Data in raw form are usually not easy to use for decision making; some type of organization is needed:
  ¨ Table   ¨ Graph
• Techniques reviewed in this chapter:
  • Frequency distribution
  • Bar chart
  • Pie chart
  • Pareto diagram
  • Line chart
  • Stem-and-leaf display
  • Scatter plot
  • Cross tables and side-by-side bar charts
Chapter Summary
(continued)
• Described measures of central tendency
  • Mean, median, mode
• Described the shape of a distribution
  • Symmetric, skewed
• Described measures of variation
  • Range, interquartile range, variance and standard deviation, coefficient of variation
• Described measures of association between two variables
  • Covariance and correlation coefficient