Lecture 3 - Numerical Summary - Part 1
Reading materials:
Chap 4 (Keller)
Arithmetic mean from raw data
• Arithmetic mean from population:
  $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
• Arithmetic mean from sample:
  $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Where: $X_i$, $x_i$ - the value of each item
       $N$, $n$ - total number of items

Arithmetic mean from frequency table
• Apply this formula for the sample:
  $\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}$
Where: $x_i$ - the value of class i
       $f_i$ - frequency of class i
       $k$ - number of classes
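
The two formulas can be checked against each other with a short Python sketch. The function names and the small data set are illustrative assumptions, not taken from the lecture; the frequency-table version should give the same answer as the raw-data version when the table summarises the same observations.

```python
def mean_raw(values):
    """Arithmetic mean from raw data: sum of the items divided by their count."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


def mean_from_frequency_table(class_values, frequencies):
    """Arithmetic mean from a frequency table: sum(x_i * f_i) / sum(f_i)."""
    if len(class_values) != len(frequencies):
        raise ValueError("each class value needs exactly one frequency")
    total = sum(frequencies)
    if total == 0:
        raise ValueError("total frequency must be positive")
    weighted_sum = sum(x * f for x, f in zip(class_values, frequencies))
    return weighted_sum / total


# Hypothetical data: the raw list and the table describe the same observations.
raw = [2, 4, 4, 6]
table_values = [2, 4, 6]
table_freqs = [1, 2, 1]

print(mean_raw(raw))
print(mean_from_frequency_table(table_values, table_freqs))
```

Both calls work on the same four observations, so they should agree.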
Advantages and disadvantages of the arithmetic mean
• Advantages:
– Easy to understand and calculate
– The value of every item is included => representative of the whole data set
• Disadvantages
– Sensitive to outliers:
Sample: (43, 38, 37, ..., 27, 34) => $\bar{x} = 33.5$
Contaminated sample: (43, 38, 37, ..., 27, 1934) => $\bar{x} = 71.5$
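
The size of this jump follows from the definition of the mean: replacing one observation changes the sum by (new - old), so the mean moves by (new - old) / n. The sketch below is a minimal Python illustration. The sample size n = 50 is not stated in the notes; it is inferred from the two reported means (1900 / 38 = 50), assuming those means are exact rather than rounded. The function name is hypothetical.

```python
def mean_after_replacement(old_mean, n, old_value, new_value):
    """Mean after one observation is replaced, without re-summing the data.

    The sum changes by (new_value - old_value), so the mean shifts by that
    difference divided by the number of observations n.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of observations")
    return old_mean + (new_value - old_value) / n


# n = 50 is an inferred assumption, see the note above.
print(mean_after_replacement(33.5, 50, 34, 1934))
```

A single mistyped value in a sample of this size is enough to more than double the mean.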
Median is the value of the observation which is located in the middle of the data set

Steps to find median:
1. Arrange the observations in order of size (normally ascending order)
2. Find the number of observations and hence the middle observation
3. The median is the value of the middle observation

• If the data has an odd number of observations:
– Middle observation: the $\frac{n+1}{2}$-th
  Median $= x_{\left(\frac{n+1}{2}\right)}$
• If the data has an even number of observations:
– There are two observations located in the middle, and
  Median $= \left( x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right) / 2$
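
The three steps and the odd/even rule can be sketched in Python as follows. The function name is an illustrative choice; the two data sets are the ones used in the examples of these notes. Python lists are indexed from 0, so the 1-based position (n+1)/2 becomes index n // 2.

```python
def median(observations):
    """Median by the steps above: sort, locate the middle, read off the value."""
    ordered = sorted(observations)        # step 1: ascending order
    n = len(ordered)                      # step 2: number of observations
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th observation, which is index n // 2.
        return ordered[mid]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([11, 11, 13, 14, 17]))       # odd number of observations
print(median([11, 11, 13, 14, 16, 17]))   # even number of observations
```

For the first data set the middle observation is the 3rd one (13); for the second, the 3rd and 4th observations (13 and 14) are averaged.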
Example
• E.g. 1. Raw data: 11, 11, 13, 14, 17 => find median
• E.g. 2. Raw data: 11, 11, 13, 14, 16, 17 => find median

Advantages and disadvantages of median
• Advantages:
– Easy to understand and calculate
– Not affected by outlying values => thus can be used when the mean would be misleading
• Disadvantages
– Value of one observation => fails to reflect the whole data set
– Not easy to use in other analysis
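
To see the robustness claim in numbers, the sketch below compares the mean and median of a small data set before and after one value is corrupted. It uses Python's standard `statistics` module. The clean data are the first example's values; the contaminated value 170 is a hypothetical typo introduced purely for illustration.

```python
from statistics import mean, median


def centre_summary(data):
    """Return (mean, median) of the data as a tuple."""
    return mean(data), median(data)


clean = [11, 11, 13, 14, 17]
contaminated = [11, 11, 13, 14, 170]   # hypothetical: 17 mistyped as 170

clean_mean, clean_median = centre_summary(clean)
bad_mean, bad_median = centre_summary(contaminated)

print("clean:        mean =", clean_mean, " median =", clean_median)
print("contaminated: mean =", bad_mean, " median =", bad_median)
```

The median stays at the middle observation in both cases, while the mean is pulled towards the outlier.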
Mode
• Steps to find mode
1. Draw a frequency table for the data
2. Identify the mode as the most frequent value

Example to calculate mode
Value    Frequency
8        3
12       7
16       12
17       8
19       5
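
The two steps map directly onto `collections.Counter` from the Python standard library. This is a minimal sketch: the function name is illustrative, and the raw data are a hypothetical expansion of the example's number pairs, assuming each pair reads as (value, frequency). The function returns a list because a data set can have more than one mode.

```python
from collections import Counter


def modes(data):
    """Return all most frequent values, in ascending order."""
    table = Counter(data)                 # step 1: frequency table
    if not table:
        raise ValueError("mode of an empty data set is undefined")
    highest = max(table.values())         # step 2: the largest frequency
    return sorted(value for value, count in table.items() if count == highest)


# Hypothetical raw data rebuilt from the pairs, read as (value, frequency).
raw = [8] * 3 + [12] * 7 + [16] * 12 + [17] * 8 + [19] * 5

print(Counter(raw))
print(modes(raw))
```

Under that reading of the table, 16 occurs 12 times, more often than any other value.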
Which measure of centre is best?
• The mean is generally the most commonly used measure
• However, the mean is sensitive to extreme values
• If the data are skewed or extreme values are present, the median is better, e.g. real estate prices
• The mode is generally best for categorical data, e.g. restaurant service quality (ordinal data, table below): the mode is "Very good"
Rating          # customers
Excellent       20
Very good       50
Good            30
Satisfactory    12
Poor            10
Very Poor       6
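
For categorical data the mode is simply the category with the largest count, which needs no arithmetic on the labels themselves. A minimal Python sketch using the ratings table; the function and variable names are illustrative.

```python
def modal_category(frequencies):
    """Return the category with the highest frequency."""
    if not frequencies:
        raise ValueError("no categories given")
    return max(frequencies, key=frequencies.get)


ratings = {
    "Excellent": 20,
    "Very good": 50,
    "Good": 30,
    "Satisfactory": 12,
    "Poor": 10,
    "Very Poor": 6,
}

print(modal_category(ratings))
```

A mean cannot be computed from labels such as "Good" or "Poor" without assigning them numeric scores, which is why the mode is the natural choice here.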