Chapter 03. Numerical Measures

Chapter 3 discusses descriptive statistics, focusing on numerical measures such as measures of central tendency and measures of variability. It explains the three common measures of central tendency: mean, median, and mode, along with their calculations and applications. Additionally, the chapter covers percentiles and their significance in data analysis.

Chapter 3.

Descriptive Statistics:
Numerical Measures
Content of Chapter 3

• 3.1. Measures of central tendency

• 3.2. Measures of Variability

Measures of central tendency
• Measures of central tendency are also called measures of central location.
What does “center location” mean, and how should it be understood correctly?

3.1 Measures of central tendency -1
The figures show where most values tend to occur, that is, the location of a typical value.

• Measures of central tendency refer to the center point or typical value of a dataset. These measures indicate where most values tend to occur.

• The three most common measures of central tendency (location) are:
– Mean
– Median
– Mode

Measures of central tendency -2

• Mean
• Weighted Mean
• Median
• Mode
• Percentiles
• Quartiles

Measure of mean -1

Measure of mean -2
When the dataset is grouped data with k classes, the mean is determined by

$$\bar{x} = \frac{\sum_{i=1}^{k} x_i^{*} f_i}{\sum_{i=1}^{k} f_i} = \frac{\sum_{i=1}^{k} x_i^{*} f_i}{n},$$

Mean - Grouped data -3

• where $f_i$ is the frequency for class $i$, and $x_i^{*}$ is the midpoint for class $i$, determined by

$$x_i^{*} = \frac{x_{\min}^{(i)} + x_{\max}^{(i)}}{2},$$

• where $x_{\min}^{(i)}$ and $x_{\max}^{(i)}$ are the lower limit and upper limit for class $i$, respectively.

Measure of mean -4
Example 1:
The numbers of new orders received by a company on each of the last 25 working days were recorded as follows (sample data):

3, 0, 1, 4, 4, 4, 2, 5, 3, 6, 4, 5, 1, 4, 2, 3, 0, 2, 0, 5,
4, 2, 3, 3, 1

Question: Compute the mean value.

ANS:
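The mean for Example 1 can be checked with a few lines of Python. This is only a minimal sketch: the names `orders` and `sample_mean` are illustrative and do not come from the slides.

```python
# Number of new orders on each of the last 25 working days (Example 1).
orders = [3, 0, 1, 4, 4, 4, 2, 5, 3, 6, 4, 5, 1, 4, 2, 3, 0, 2, 0, 5,
          4, 2, 3, 3, 1]


def sample_mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    return sum(values) / len(values)


mean_orders = sample_mean(orders)
print(f"n = {len(orders)}, sum = {sum(orders)}, mean = {mean_orders:.2f}")
# prints: n = 25, sum = 71, mean = 2.84
```

So the company received 2.84 new orders per working day on average.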

Weighted Mean -1

• Example 2: Construction Wages


– Ron Butler, a home builder, is looking over the
expenses he incurred for a house he just built.
– Listed below are the categories of worker he employed,
along with their respective wage and total hours worked.
– For the purpose of pricing future projects, he would like
to know the average wage ($/hour) he paid the workers
he employed.

Weighted Mean -2
• Construction Wages:

Worker        Wage ($/hour)   Time (hours)
Carpenter     21.60           520
Electrician   28.72           230
Laborer       11.80           410
Painter       19.75           270
Plumber       24.16           160

FYI, equally-weighted (simple) mean = $21.21

Weighted Mean -2
• Sol:

Worker        Wage ($/hour)   Time (hours)   Wage * Time
Carpenter     21.60           520            11232.0
Electrician   28.72           230            6605.6
Laborer       11.80           410            4838.0
Painter       19.75           270            5332.5
Plumber       24.16           160            3865.6
Total                         1590           31873.7

FYI, equally-weighted (simple) mean = $21.21

Weighted Mean -3
• Sol: With the hours worked as weights, the weighted mean wage is 31873.7 / 1590 ≈ $20.05 per hour.

FYI, equally-weighted (simple) mean = $21.21
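
The weighted-mean calculation for the construction wages can be reproduced with the short sketch below. The dictionary `wages` and the function `weighted_mean` are illustrative names, not part of the slides; the weights are the hours worked.

```python
# (wage in $/hour, hours worked) for each category of worker.
wages = {
    "Carpenter": (21.60, 520),
    "Electrician": (28.72, 230),
    "Laborer": (11.80, 410),
    "Painter": (19.75, 270),
    "Plumber": (24.16, 160),
}


def weighted_mean(pairs):
    """Sum of value * weight divided by the sum of the weights."""
    pairs = list(pairs)
    total_weight = sum(weight for _, weight in pairs)
    return sum(value * weight for value, weight in pairs) / total_weight


average_wage = weighted_mean(wages.values())
simple_mean = sum(wage for wage, _ in wages.values()) / len(wages)
print(f"weighted mean = {average_wage:.2f}, simple mean = {simple_mean:.2f}")
# prints: weighted mean = 20.05, simple mean = 21.21
```

The weighted mean is lower than the simple mean because the lowest-paid category (Laborer) accounts for many of the hours.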

Mean - Grouped data -1
Suppose we are given a frequency distribution summarizing a sample of 65 customer satisfaction ratings for a consumer product. Determine the sample mean.

Mean - Grouped data -2

Satisfaction rating   Frequency (f_i)   Class midpoint (x_i*)   f_i * x_i*
36-38                 4                 37                      4(37) = 148
39-41                 15                40                      600
42-44                 25                43                      1075
45-47                 19                46                      874
48-50                 2                 49                      98
Total                 65                                        2795

Sample mean = 2795 / 65 = 43.

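The grouped-data mean for the satisfaction ratings can be sketched in Python as follows. The list `classes` and the function `grouped_mean` are illustrative names; each class is represented by its midpoint, so the result approximates the mean of the raw ratings.

```python
# (lower limit, upper limit, frequency) for each class of satisfaction rating.
classes = [
    (36, 38, 4),
    (39, 41, 15),
    (42, 44, 25),
    (45, 47, 19),
    (48, 50, 2),
]


def grouped_mean(class_table):
    """Mean of grouped data: sum of midpoint * frequency over total frequency."""
    n = sum(freq for _, _, freq in class_table)
    weighted_sum = sum((low + high) / 2 * freq for low, high, freq in class_table)
    return weighted_sum / n


rating_mean = grouped_mean(classes)
print(rating_mean)  # prints: 43.0
```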
Mean - Grouped data -3
Example 2: Marks obtained by 50 students are given

Question: Compute mean score.

Discussion - measure of mean -1
Discussion: the mean value is located at the
center.

Discussion - measure of mean -2
Discussion: The mean doesn’t always locate the center
of the data accurately. Observe the histograms below
where we display the mean in the distributions.

Median -1
Median: The middle value in an ordered data set.

• Intuitively, the median divides a data set into two roughly equal parts.

How to determine the median value?

• Step 1: Arrange the data from smallest to largest.

• Step 2: Determine the number of observations n. It could be odd or even.

Median -2
• Step 3: Determine the median. There are two cases:
– If n is odd, write n = 2*m + 1 and solve for m. Then the median is the value at the (m+1)th position in the ordered data.
– If n is even, write n = 2*m and solve for m. Then the median is the average of the values at the mth and (m+1)th positions in the ordered data.

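The three steps above can be sketched in Python as follows. The function name `median` is illustrative; Python lists are indexed from 0, so the (m+1)th position is index `m`.

```python
def median(values):
    """Median by the odd/even rule described in Steps 1 to 3."""
    ordered = sorted(values)        # Step 1: arrange from smallest to largest
    n = len(ordered)                # Step 2: number of observations
    m = n // 2                      # n = 2*m + 1 (odd) or n = 2*m (even)
    if n % 2 == 1:
        return ordered[m]           # value at the (m+1)th position
    return (ordered[m - 1] + ordered[m]) / 2  # average of mth and (m+1)th


print(median([26, 18, 27, 12, 14, 27, 19]))      # prints: 19
print(median([26, 18, 27, 12, 14, 27, 30, 19]))  # prints: 22.5
```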
Median -3
Example 1:

• For an odd number of observations:

26 18 27 12 14 27 19 (7 observations)

12 14 18 19 26 27 27 (in ascending order)

Here n = 7 = 2*3 + 1, so the median is the value at the fourth position.

Median = 19

Median -4
Example 2:
• For an even number of observations:

26 18 27 12 14 27 30 19 (8 observations)

12 14 18 19 26 27 27 30 (in ascending order)

Here n = 8 = 2*4, so the median is the average of the two values at the 4th and 5th positions.
Median = (19 + 26)/2 = 22.5

Mode -1
The third measure of the central tendency of a population/sample is the mode.
• The mode is the value that occurs with the highest frequency.

• It’s possible to have no mode, one mode, or more than one mode.

Mode -2
• What is the mode?

Mode -3
If only one data value occurs with the greatest frequency, the data
set is unimodal. For instance,

Mode -4

• The data have two modes:

1, 3, 3, 3, 4, 4, 6, 6, 6, 9.

• Modes = 3 and 6 (each occurs three times).

• There is no mode in the data set 1, 1, 2, 2, 3, 3, 4, 4, because every value occurs equally often.

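The cases of no mode, one mode, and several modes can be illustrated with a small Python sketch. The function `modes` is an illustrative name, and the sketch adopts the convention used in the example above: when all distinct values occur equally often, the data set is treated as having no mode.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes; an empty list means 'no mode'."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Convention assumed here: several distinct values all tied means no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 3, 3, 3, 4, 4, 6, 6, 6, 9]))  # prints: [3, 6]
print(modes([1, 1, 2, 2, 3, 3, 4, 4]))        # prints: []
```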
Mode -5

• Example: Apartment Rents

450 occurred most frequently (7 times)

Mode = 450
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

Note: Data is in ascending order.


Examples
Example 1:
The numbers of new orders received by a company on each of the last 25 working days were recorded as follows:

3, 0, 1, 4, 4, 4, 2, 5, 3, 6, 4, 5, 1, 4, 2, 3, 0, 2, 0, 5,
4, 2, 3, 3, 1

Q1: Compute the median value and explain its meaning.

Q2: Determine the mode and explain its meaning.

Examples

Example 2: The following data set is given:

40 25 35 30 20 40 30 20 40 10 30 20 10 5 20

Q1: Compute the median value and explain its meaning.

Q2: Determine the mode and explain its meaning.

Comparison of the Mean, Median, and
Mode -7
ANS 1 (hint):
• Ordered data:
• The median value:
• Explain:

Comparison of the Mean, Median, and
Mode -8
ANS 2 (hint):
• The mode value:
• Explain:

Comparison of the Mean, Median, and
Mode -1

• Which is the best: the mean, the median, or the mode?

• When real data have a symmetric, single-peaked distribution, the mean, median, and mode are approximately equal.

• If the distribution is skewed, the median is often the best measure of central tendency, because outliers and skewed data have a smaller effect on the median than on the mean.

• For ordinal data, the median or the mode is usually the best choice. For categorical data, you have to use the mode.

Comparison of the Mean, Median, and
Mode -4
• The median may be a better indicator of the most typical value if a data set has an outlier. An outlier is an extreme value that differs greatly from the other values.

• However, when the sample is large and contains no outliers, the mean usually provides a better measure of central tendency.

Comparison of the Mean, Median, and
Mode -5
Example 1: Suppose that in a small town of 50 people, one person earns $5,000,000 per year and the other 49 each earn $30,000. Which is the better measure of the “center”: the mean or the median?

ANS: The median is a better measure of the “center” than the mean. The median is $30,000, the income of 49 of the 50 residents, whereas the mean is (5,000,000 + 49 × 30,000)/50 = $129,400, which is pulled far above the typical income by the single outlier of $5,000,000.

Percentile -1

Percentile -2

Discussion: Suppose 80% of the people in a group are shorter than you. Then your height is the 80th percentile of that group.

How is such a percentile determined?

• pth percentile (0 < p < 100): Given ordered data, a value x* is called the pth percentile if at least p% of the values (observations) are less than or equal to x* and at least (100 − p)% are greater than or equal to x*.

Percentile -3
Step 1: Order the dataset from smallest to largest.

Step 2: Compute the index i = (p/100)n, where n is the number of observations.

Step 3: Determine the percentile. There are two cases:

• Case #1: If i is an integer, the pth percentile is the average of the values at the ith and (i+1)th positions in the ordered dataset.

• Case #2: If i is not an integer, round i up. The pth percentile is then the value at that rounded-up position.

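The percentile rule (order the data, compute the index i = (p/100)n, then average two neighbors or round up) can be sketched in Python as follows. The function name `percentile` is illustrative, and the sketch assumes p is a whole number so that the index can be handled with exact integer arithmetic. Statistical libraries often use other interpolation rules, so their results can differ slightly from this one.

```python
def percentile(values, p):
    """pth percentile (whole-number p, 0 < p < 100) by the index rule."""
    if not 0 < p < 100:
        raise ValueError("p must satisfy 0 < p < 100")
    ordered = sorted(values)                   # order the data
    q, r = divmod(p * len(ordered), 100)       # index i = q + r/100
    if r == 0:
        # i is an integer: average the ith and (i+1)th values (1-based).
        return (ordered[q - 1] + ordered[q]) / 2
    # i is not an integer: round up to position q + 1 (0-based index q).
    return ordered[q]


# Starting salary data used elsewhere in this chapter.
salaries = [3710, 3755, 3850, 3880, 3880, 3890,
            3920, 3940, 3950, 4050, 4130, 4325]
print(percentile(salaries, 85))  # prints: 4130
print(percentile(salaries, 25))  # prints: 3865.0
```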
Percentile -4
Example: A researcher has obtained the number of hours worked per week during the summer for a sample of fifteen students.

40 25 35 30 20 40 30 20 40 10 30 20 10 5 20

Percentile -5
Step 1: Order the data: 5 10 10 20 20 20 20 25 30 30 30 35 40 40 40

Percentiles -6
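• Example: The 85th percentile for the starting salary data
– Step 1. Arrange the data in ascending order: 3710, 3755, 3850, 3880, 3880, 3890, 3920, 3940, 3950, 4050, 4130, 4325.
– Step 2. Compute the index: i = (p/100)n = (85/100)12 = 10.2.
– Step 3. Because i is not an integer, round up to 11. The 85th percentile is the value at the 11th position, 4130.
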
80th Percentile
• Example: Apartment Rents
i = (p/100)n = (80/100)70 = 56
Averaging the 56th and 57th data values:
80th Percentile = (535 + 549)/2 = 542
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

Note: Data is in ascending order.


80th Percentile
• Example: Apartment Rents
– “At least 80% of the items take on a value of 542 or less.” 56/70 = 0.8 or 80%
– “At least 20% of the items take on a value of 542 or more.” 14/70 = 0.2 or 20%

Quartiles -1

• Quartiles are values that divide an ordered data set into four parts that each contain approximately one-fourth of the data values. Quartiles are specific percentiles.
• First Quartile (Q1) = 25th Percentile: twenty-five percent of the data values fall at or below the first quartile.
• Second Quartile (Q2) = 50th Percentile = Median
• Third Quartile (Q3) = 75th Percentile

Quartiles -2

• Example: The starting salary data

• Arrange the data in ascending order.
3710, 3755, 3850, 3880, 3880, 3890, 3920, 3940, 3950, 4050, 4130, 4325.

1. For Q1, i = (p/100)n = (25/100)12 = 3. Because i is an integer, Q1 is the average of the 3rd and 4th values: Q1 = (3850 + 3880)/2 = 3865.

2. For Q2, i = (p/100)n = (50/100)12 = 6. Because i is an integer, Q2 is the average of the 6th and 7th values: Q2 = (3890 + 3920)/2 = 3905.

3. For Q3, i = (p/100)n = (75/100)12 = 9. Because i is an integer, Q3 is the average of the 9th and 10th values: Q3 = (3950 + 4050)/2 = 4000.

Quartiles -3

• The quartiles divide the starting salary data into four parts, with each part containing 25% of the observations.

Quartiles -4
• Example: Apartment Rents
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

Third quartile = 75th percentile

i = (p/100)n = (75/100)70 = 52.5, which is rounded up to 53.
Third quartile = the 53rd value = 525

Note: Data is in ascending order.

3.2 Measures of Variability (-1)
To introduce the idea of variability, consider this
example. The following data are recorded:

Dataset01: 1, 2, 3, 3, 5, 4.
mean = 3, median = 3, mode = 3.

Dataset02: 2, 3, 3, 3, 3, 4.
mean = 3, median = 3, mode = 3.

Their dotplots:

They have the same center, but what about their spreads?
Which one has more spread?

Measures of Variability (-4)

• The spread of a distribution refers to the variability of the data.
• If the observations cover a wide range, the spread is larger. If the observations are clustered around a single value, the spread is smaller.

Measures of Variability (-5)

• Therefore, we need measures of variation to express how two or more distributions differ. There are many ways to describe variability or spread, including:
• Range,
• Inter-quartile Range,
• Variance,
• Standard Deviation,
• Coefficient of variation.

Range (-1)

• The range of a data set is the difference between the largest and smallest data values.

• That means R = x(max) – x(min).

• It is the simplest measure of variability.

• It is very sensitive to (affected by) the smallest and largest data values. Thus, the range is not the best measure of a data set’s variation.

Range (-2)
• Example: The starting salary data

– Arrange the data in ascending order.

3710, 3755, 3850, 3880, 3880, 3890, 3920, 3940, 3950, 4050,
4130, 4325.

– Range = 4325 – 3710 = 615.

– The range is very sensitive to the smallest and largest data values. For example, change only the two extreme values:

– Data: 3510, 3755, 3850, 3880, 3880, 3890, 3920, 3940, 3950, 4050, 4130, 4825.

– Range = 4825 – 3510 = 1315.

Range -3

• Example: Apartment Rents


Range = largest value - smallest value

Range = 615 - 425 = 190

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

Note: Data is in ascending order.


Inter-quartile Range -1

• The inter-quartile range of a data set is the difference between the third quartile and the first quartile.

• It is the range for the middle 50% of the data.

• It overcomes the sensitivity to extreme data values.

• Inter-quartile Range (IQR) = Q3 – Q1

• For the data on monthly starting salaries, the quartiles are Q3 = 4000 and Q1 = 3865. The inter-quartile range is 4000 – 3865 = 135.

Inter-quartile Range -2

• Example: Apartment Rents


3rd Quartile (Q3) = 525
1st Quartile (Q1) = 445

Inter-quartile Range = Q3 – Q1 = 525 – 445 = 80


425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

Note: Data is in ascending order.


Variance and standard deviation (-1)

• The variance is a measure of variability that utilizes all the data.

• Variance: the average squared distance from the mean.

Variance and standard deviation (-2)
The variance is the average of the squared differences between each data value and the mean.

• The variance is computed as follows:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 \quad \text{for a sample}, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2 \quad \text{for a population}.$$

• The standard deviation is computed as follows:

$$s = \sqrt{s^2} \quad \text{for a sample}, \qquad \sigma = \sqrt{\sigma^2} \quad \text{for a population}.$$

Variance and Standard Deviation -3

• An alternative formula for the computation of the sample variance is

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$

• The standard deviation is a commonly used measure of the risk associated with investing in stock and stock funds. It provides a measure of how monthly returns fluctuate around the long-run average return.

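The definition of the sample variance and the alternative (shortcut) formula give the same result, which the following Python sketch illustrates on Dataset01 and Dataset02 from the start of Section 3.2. The function names are illustrative.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Alternative formula: (sum of x_i^2 - n * mean^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return (sum(x * x for x in values) - n * mean ** 2) / (n - 1)


dataset01 = [1, 2, 3, 3, 5, 4]
dataset02 = [2, 3, 3, 3, 3, 4]

for name, data in (("Dataset01", dataset01), ("Dataset02", dataset02)):
    var = sample_variance(data)
    print(f"{name}: s^2 = {var:.2f}, s = {math.sqrt(var):.3f}")
# prints: Dataset01: s^2 = 2.00, s = 1.414
# prints: Dataset02: s^2 = 0.40, s = 0.632
```

Both datasets have mean 3, but Dataset01 has the larger variance, which matches the impression that its values are more spread out.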
Coefficient of Variation
• The coefficient of variation (CV) is a measure of relative variability. It is the ratio of the standard deviation to the mean, expressed as a percentage:

$$CV = \frac{s}{\bar{x}} \times 100\%$$

• Example: The weekly sales of two products A and B were recorded as given below:

Product A   59    75    27    63    27    28    56
Product B   150   200   125   310   330   250   225

Find out which of the two shows greater fluctuation in sales.

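One way to approach the example is to compute the coefficient of variation of each product and compare them. The sketch below uses Python's standard `statistics` module with the sample standard deviation; the function name `coefficient_of_variation` is illustrative.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return statistics.stdev(values) / statistics.mean(values) * 100


product_a = [59, 75, 27, 63, 27, 28, 56]
product_b = [150, 200, 125, 310, 330, 250, 225]

cv_a = coefficient_of_variation(product_a)
cv_b = coefficient_of_variation(product_b)
print(f"CV(A) = {cv_a:.1f}%, CV(B) = {cv_b:.1f}%")
```

Product B has the larger standard deviation in absolute terms, but relative to its mean, Product A fluctuates more (a CV of roughly 42% against roughly 34%).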
Distribution Shape: Skewness -1
• An important measure of the shape of a distribution is called skewness.

• In practice, Pearson’s coefficient of skewness of sample data is:

$$\text{Skewness} = \frac{3(\bar{x} - \text{Median})}{s}$$

• where s is the sample standard deviation.

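Pearson's coefficient of skewness can be sketched with Python's standard `statistics` module. The function name `pearson_skewness` and the small example data sets are illustrative and not taken from the slides.

```python
import statistics


def pearson_skewness(values):
    """Pearson's coefficient of skewness: 3 * (mean - median) / s."""
    s = statistics.stdev(values)  # sample standard deviation
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return 3 * (statistics.mean(values) - statistics.median(values)) / s


symmetric = [1, 2, 3, 4, 5]       # mean equals median
right_tail = [1, 2, 2, 3, 12]     # one large value pulls the mean up
left_tail = [1, 8, 9, 10, 10]     # one small value pulls the mean down

for data in (symmetric, right_tail, left_tail):
    print(data, round(pearson_skewness(data), 2))
```

The sign of the result indicates the direction of the longer tail: zero for the symmetric data, positive for the right-tailed data, and negative for the left-tailed data.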
Distribution Shape: Skewness -2

• If the skewness is negative, the data are skewed left, meaning that the left tail is longer.

• If the skewness is positive, the data are skewed right, meaning that the right tail is longer.

• If the skewness is zero, the data are symmetric.

Distribution Shape: Skewness -3
• Moderately Skewed Left
– Skewness is negative.
– Mean will usually be less than the median.

[Figure: relative frequency histogram of a moderately left-skewed distribution, Skewness = –0.31]

Distribution Shape: Skewness -4

• Highly Skewed Right
– Skewness is positive (often above 1.0).
– Mean will usually be more than the median.

[Figure: relative frequency histogram of a highly right-skewed distribution, Skewness = 1.25]

Distribution Shape: Skewness -5

• Symmetric (not skewed)
– Skewness is zero.
– Mean and median are equal.

[Figure: relative frequency histogram of a symmetric distribution, Skewness = 0]

Boxplot -1
• We can use a box plot to present quantitative data.

• A box plot shows the five-number summary of a dataset: the minimum value, Q1, Q2, Q3, and the maximum value.

• A box plot divides the data into sections that each contain approximately 25% of the observations in that set.

Boxplot -2

• A boxplot is often used to detect outliers.

• When drawing a box plot, an outlier is defined as a data point that is located outside the whiskers of the box plot.

• An observed value is called an outlier if it falls outside the following interval:
[ Q1 − 1.5×IQR, Q3 + 1.5×IQR ]
where IQR = Q3 – Q1

Boxplot -3

[Figure: box plot whose whiskers end at the smallest non-outlier and the largest non-outlier]

Boxplot -4
How to draw a boxplot?
Step 1: Calculate the three-number summary {Q1; Q2; Q3}.
Step 2: Detect outliers.
Step 3: A box plot can be presented horizontally or vertically. In a box plot, we draw:
• a line (front whisker) from the minimum (excluding outliers) to Q1 and a line (back whisker) from Q3 to the maximum (excluding outliers);
• a box from Q1 to Q3;
• a vertical/horizontal line through the box at the median.

Boxplot -5
Example: Consider the data: 27, 89, 63, 61, 78, 87, 74, 72,
54, 88, 62, 81, 78, 73, 63, 56, 83, 86, 83, 93.
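
The numbers needed to draw the box plot for this example can be computed with the sketch below. It assumes the quartiles are found with the percentile index rule from earlier in this chapter; the helper `percentile` and the variable names are illustrative. Plotting software often uses a different quartile rule, so its box edges may differ slightly.

```python
def percentile(values, p):
    """pth percentile (whole-number p) by the index rule i = (p/100)n."""
    ordered = sorted(values)
    q, r = divmod(p * len(ordered), 100)
    if r == 0:
        return (ordered[q - 1] + ordered[q]) / 2  # i is an integer: average
    return ordered[q]                             # otherwise round i up


data = [27, 89, 63, 61, 78, 87, 74, 72, 54, 88,
        62, 81, 78, 73, 63, 56, 83, 86, 83, 93]

# Step 1: three-number summary.
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))

# Step 2: outliers fall outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in sorted(data) if x < lower_fence or x > upper_fence]

# Step 3: whiskers end at the smallest and largest non-outliers.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"interval = [{lower_fence}, {upper_fence}], outliers = {outliers}")
print(f"whiskers from {whisker_low} to {whisker_high}")
# prints: Q1 = 62.5, Q2 = 76.0, Q3 = 84.5, IQR = 22.0
# prints: interval = [29.5, 117.5], outliers = [27]
# prints: whiskers from 54 to 93
```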

Boxplot -6
The shape of a box plot shows whether the distribution of the dataset is roughly symmetric or skewed.
• When the median is in the middle of the box, and the whiskers are about the same length on both sides of the box, the distribution is symmetric.

Boxplot -7
• When the median is closer to the bottom of the box, and the whisker is shorter on the lower end of the box, the distribution is positively skewed (skewed right).

Boxplot -8
• When the median is closer to the top of the box, and the whisker is shorter on the upper end of the box, the distribution is negatively skewed (skewed left).

End of Chapter 3.
