Chapter 1.3 Data description (B)
MEASURES OF DISPERSION
• The more spread out (dispersed) the data, the larger the range,
the quartile deviation, the variance and the standard deviation.
Range
• Range is the difference between the largest and the smallest
observations in a data set.
Range = largest value – smallest value
For instance, if the largest value in a data set is 45 and the
smallest is 10, Range = 45 – 10 = 35.
Example
The following table shows the daily outputs of 80 workers in a
factory. Determine the range.
Daily outputs Number of workers
10 – 19 6
20 – 29 10
30 – 39 30
40 – 49 20
50 – 59 10
60 – 69 4
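Solution: For grouped data, use the upper boundary of the highest
class and the lower boundary of the lowest class.
Range = 69.5 – 9.5 = 60 units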
• Advantage:
It is easy to understand and simple to calculate.
• Disadvantage:
Since only the largest and the smallest values are considered, it
can be very much influenced by them especially if they are
unrepresentative extreme values.
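The two range calculations above can be sketched in Python. This is only an illustration: the function names `data_range` and `grouped_range` are made up for this sketch, the raw list is a hypothetical data set chosen so that its extremes are 10 and 45, and class boundaries are assumed to lie half a unit outside the stated class limits.

```python
# Sketch: range for raw data and for a grouped frequency distribution.
# Function names are illustrative, not part of the notes.

def data_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)

def grouped_range(class_limits, gap=1):
    """Upper boundary of the last class minus lower boundary of the first.

    class_limits: ordered list of (lower_limit, upper_limit) pairs.
    gap: distance between one class's upper limit and the next class's
         lower limit (assumed 1 here, e.g. 19 -> 20), so each boundary
         sits gap/2 outside the stated limit.
    """
    lowest_boundary = class_limits[0][0] - gap / 2
    highest_boundary = class_limits[-1][1] + gap / 2
    return highest_boundary - lowest_boundary

hypothetical_raw = [10, 18, 27, 33, 45]   # made-up values, for illustration
daily_outputs = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]

print(data_range(hypothetical_raw))   # 35
print(grouped_range(daily_outputs))   # 60.0
```

Changing one extreme value in `hypothetical_raw` changes the result immediately, which is the disadvantage noted above.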
Quartile deviation (Semi-inter quartile range)
• The disadvantage of the range can be overcome by ignoring the
extreme values. This is done by ignoring the top and the bottom
quarters and considering only the range between the quartiles
(called the inter-quartile range).
Inter-quartile range = Q3 − Q1
• If the inter-quartile range is divided by two, the figure obtained is
called semi inter-quartile range or quartile deviation which gives
the average amount by which Q1 and Q3 differ from the median.
\[ \text{Quartile deviation} = \text{semi-inter-quartile range} = \frac{Q_3 - Q_1}{2} \]
• Definition of quartiles:
(a) The first quartile, also called the lower quartile, is denoted
by Q1. It is defined as the value of the item one-quarter of the
way through a distribution.
(b) The third quartile, also called the upper quartile, is
denoted by Q3. It is defined as the value of the item
three-quarters of the way through a distribution.
Quartiles divide the data into 4 equal parts. Thus, with the
quartiles known, we can say that a quarter of the observations
lie below the first quartile, a quarter lie above the third quartile,
and half of the observations lie between the two quartiles.
(a) Ungrouped data
\[ Q_1 = \text{value of the } \frac{n+1}{4}\text{th item} \]
\[ Q_3 = \text{value of the } \frac{3(n+1)}{4}\text{th item} \]
where n = number of items in the data set.
Example
The following array shows the daily wages (in RM) of ten factory
workers: 20, 25, 26, 30, 32, 36, 38, 38, 40, 45
Calculate (1) the quartiles;
(2) the inter-quartile range;
(3) the quartile deviation.
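Solution
n = 10
(1)
\[ Q_1 = \text{value of the } \frac{n+1}{4}\text{th item} = \text{value of the } \frac{10+1}{4}\text{th item} = \text{value of the 2.75th item} \]
\[ Q_1 = \text{2nd item} + 0.75\,(\text{3rd item} - \text{2nd item}) = 25 + 0.75(26 - 25) = 25.75 \text{ (RM)} \]
\[ Q_3 = \text{value of the } \frac{3(n+1)}{4}\text{th item} = \text{value of the } \frac{3(10+1)}{4}\text{th item} = \text{value of the 8.25th item} \]
\[ Q_3 = \text{8th item} + 0.25\,(\text{9th item} - \text{8th item}) = 38 + 0.25(40 - 38) = 38.5 \text{ (RM)} \]
(2) The inter-quartile range = Q3 – Q1 = 38.5 – 25.75 = 12.75 (RM)
(3) The quartile deviation = 12.75 / 2 = 6.375 (RM)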
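The positional rule used in this solution can be checked with a short Python sketch. The helper names `item_at` and `quartiles` are invented for illustration, and the code assumes at least three observations so that both quartile positions fall inside the data.

```python
# Sketch of the (n + 1)/4 positional rule with linear interpolation.
# Helper names are illustrative, not part of the notes.

def item_at(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    lower = int(position)           # whole part, e.g. 2 for the 2.75th item
    fraction = position - lower     # fractional part, e.g. 0.75
    value = ordered[lower - 1]
    if fraction > 0:
        # Move `fraction` of the way towards the next item.
        value += fraction * (ordered[lower] - ordered[lower - 1])
    return value

def quartiles(values):
    """Return (Q1, Q3) using positions (n+1)/4 and 3(n+1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    return item_at(ordered, (n + 1) / 4), item_at(ordered, 3 * (n + 1) / 4)

wages = [20, 25, 26, 30, 32, 36, 38, 38, 40, 45]
q1, q3 = quartiles(wages)
iqr = q3 - q1
print(q1, q3, iqr, iqr / 2)   # 25.75 38.5 12.75 6.375
```

Other quartile conventions exist (statistical software often interpolates differently), so results from a library may not match these figures unless it uses the same (n + 1)-based positions.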
• Advantages:
It can be computed even though the end values of the
distribution are not known, as with the open-ended classes.
It is not influenced by the extreme values.
• Disadvantage:
It is not fully representative of a set of measurements as it is
not based on all the information available.
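(b) Grouped data
\[ Q_1 = L_{Q_1} + \frac{c_{Q_1}}{f_{Q_1}} \left( \frac{n}{4} - \sum f_{Q_1 - 1} \right) \]
\[ Q_3 = L_{Q_3} + \frac{c_{Q_3}}{f_{Q_3}} \left( \frac{3n}{4} - \sum f_{Q_3 - 1} \right) \]
where, for the Q3 class (and similarly for the Q1 class),
L_{Q_3} = lower class boundary of the Q3 class
c_{Q_3} = class size of the Q3 class
f_{Q_3} = frequency of the Q3 class
\sum f_{Q_3 - 1} = cumulative frequency of the class preceding the Q3 class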
Example
The following frequency distribution shows the daily production
level.
Production (units) No. of days
13 – 17 2
18 – 22 22
23 – 27 10
28 – 32 14
33 – 37 3
38 – 42 4
43 – 47 6
48 – 52 1
Calculate the quartile deviation using
(a) the linear interpolation method;
(b) an ogive.
Solution
Production (units)   Class boundaries   f   Cumulative f
13 – 17 12.5 – 17.5 2 2
18 – 22 17.5 – 22.5 22 24
23 – 27 22.5 – 27.5 10 34
28 – 32 27.5 – 32.5 14 48
33 – 37 32.5 – 37.5 3 51
38 – 42 37.5 – 42.5 4 55
43 – 47 42.5 – 47.5 6 61
48 – 52 47.5 – 52.5 1 62
Total 62
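n = 62
\[ Q_1 = \text{value of the } \frac{62}{4}\text{th item} = \text{value of the 15.5th item} \]
\[ Q_3 = \text{value of the } \frac{3(62)}{4}\text{th item} = \text{value of the 46.5th item} \]
(a) Q1 class boundaries: 17.5 – 22.5
\[ Q_1 = 17.5 + \frac{5}{22}(15.5 - 2) = 20.57 \text{ units} \]
Q3 class boundaries: 27.5 – 32.5
\[ Q_3 = 27.5 + \frac{5}{14}(46.5 - 34) = 31.96 \text{ units} \]
\[ \text{Quartile deviation} = \frac{31.96 - 20.57}{2} = 5.70 \text{ units} \]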
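The linear interpolation method for this frequency distribution can be sketched as follows. The function name `grouped_quantile` is hypothetical; it locates the class containing the n/4-th or 3n/4-th item from the cumulative frequencies and then interpolates within that class, following the grouped-data formula.

```python
# Sketch: quartiles of grouped data by linear interpolation.
# `grouped_quantile` is an illustrative name, not part of the notes.

def grouped_quantile(boundaries, freqs, fraction):
    """Interpolate inside the class that contains the target item.

    boundaries: list of (lower_boundary, upper_boundary), one per class
    freqs:      class frequencies in the same order
    fraction:   0.25 for Q1, 0.75 for Q3
    """
    target = fraction * sum(freqs)   # n/4 or 3n/4
    cumulative = 0                   # cumulative frequency before this class
    for (lower, upper), f in zip(boundaries, freqs):
        if cumulative + f >= target:
            return lower + (upper - lower) / f * (target - cumulative)
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")

boundaries = [(12.5, 17.5), (17.5, 22.5), (22.5, 27.5), (27.5, 32.5),
              (32.5, 37.5), (37.5, 42.5), (42.5, 47.5), (47.5, 52.5)]
freqs = [2, 22, 10, 14, 3, 4, 6, 1]

q1 = grouped_quantile(boundaries, freqs, 0.25)
q3 = grouped_quantile(boundaries, freqs, 0.75)
print(round(q1, 2), round(q3, 2), round((q3 - q1) / 2, 2))   # 20.57 31.96 5.7
```

These interpolated values agree, to one decimal place, with the readings Q1 = 20.6 and Q3 = 32.0 taken from the ogive.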
(b)
[Figure: '<' ogive: Production at a factory for a period of 62 days. Vertical axis: No. of days (cum. freq.), 0 to 70. Horizontal axis: Production (units), marked at the class boundaries 12.5, 17.5, ..., 52.5.]
From the ‘<’ ogive, Q1 = 20.6 units, which shows that on 25% of the
days production was less than or equal to 20.6 units and on the
other 75% of the days production was more than or equal to
20.6 units.
From the ‘<’ ogive, Q3 = 32.0 units, which shows that on 75% of the
days production was less than or equal to 32.0 units and on the
other 25% of the days production was more than or equal to
32.0 units.
\[ \text{Quartile deviation} = \frac{Q_3 - Q_1}{2} = \frac{32.0 - 20.6}{2} = 5.7 \text{ units} \]
Standard deviation
• Standard deviation is the root-mean-square deviation between
the individual values and the mean in a distribution.
• Consider a set of data: x_1, x_2, …, x_n
Let the mean of the data be \bar{x}.
Root-mean-square deviation:
\[ \sqrt{\frac{\sum (x - \bar{x})^2}{n}} \]
• Computation of the standard deviation:
(a) Raw data
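Population standard deviation:
\[ \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \]
Sample standard deviation:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
Alternatively:
\[ \sigma = \sqrt{\frac{\sum x^2}{N} - \left( \frac{\sum x}{N} \right)^2}, \qquad s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} \]
Variance
• The variance is the square of the standard deviation: \sigma^2 for a
population and s^2 for a sample.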
Example
Find the standard deviation and variance for the following data:
2, 12, 7, 5, 9
Solution
N=5
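∑x = 2 + 12 + 7 + 5 + 9 = 35
∑x² = 2² + 12² + 7² + 5² + 9² = 303
Population standard deviation,
\[ \sigma = \sqrt{\frac{\sum x^2}{N} - \left( \frac{\sum x}{N} \right)^2} = \sqrt{\frac{303}{5} - \left( \frac{35}{5} \right)^2} = \sqrt{11.6} = 3.41 \]
Population variance,
\[ \sigma^2 = 3.41^2 \approx 11.6 \]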
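As a check on this example, the sketch below computes the population standard deviation both from the definition (root-mean-square deviation from the mean) and from the shortcut formula, then compares the variance with Python's standard-library `statistics.pvariance`. Variable names are illustrative.

```python
import math
import statistics

data = [2, 12, 7, 5, 9]
N = len(data)

# Definition: square root of the mean squared deviation from the mean.
mu = sum(data) / N
sigma_definition = math.sqrt(sum((x - mu) ** 2 for x in data) / N)

# Shortcut: sqrt(sum(x^2)/N - (sum(x)/N)^2), as used in the solution.
sigma_shortcut = math.sqrt(sum(x * x for x in data) / N - (sum(data) / N) ** 2)

print(round(sigma_definition, 2), round(sigma_shortcut, 2))   # 3.41 3.41
print(round(statistics.pvariance(data), 1))                    # 11.6
```

Both routes give the same value; the shortcut only needs the two totals ∑x and ∑x², which is why it is convenient for hand calculation.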
Example
During a particular summer month, the number of central air-
conditioning units sold by a random sample of 5 salespersons from
a heating and air-conditioning firm were as follows:
8, 11, 5, 12, 8
Find the sample standard deviation and the sample variance.
Solution
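n = 5
∑x = 8 + 11 + 5 + 12 + 8 = 44
∑x² = 8² + 11² + 5² + 12² + 8² = 418
Sample variance,
\[ s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{418 - \frac{44^2}{5}}{5 - 1} = \frac{30.8}{4} = 7.7 \]
Sample standard deviation,
\[ s = \sqrt{7.7} = 2.77 \text{ units} \]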
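The sample formulas (divisor n − 1) can be applied to these totals in a few lines of Python. This is a sketch with illustrative variable names; the last line cross-checks against the standard library's sample functions `statistics.variance` and `statistics.stdev`.

```python
import math
import statistics

units_sold = [8, 11, 5, 12, 8]
n = len(units_sold)
sum_x = sum(units_sold)                    # 44
sum_x2 = sum(x * x for x in units_sold)    # 418

# Shortcut formula with the n - 1 divisor used for samples.
sample_variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(round(sample_variance, 2), round(sample_sd, 2))   # 7.7 2.77
print(round(statistics.variance(units_sold), 2),
      round(statistics.stdev(units_sold), 2))           # 7.7 2.77
```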
(b) Grouped data
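Population standard deviation:
\[ \sigma = \sqrt{\frac{\sum f x^2}{\sum f} - \left( \frac{\sum f x}{\sum f} \right)^2} \]
Sample standard deviation (with n = \sum f):
\[ s = \sqrt{\frac{\sum f x^2 - \frac{(\sum f x)^2}{n}}{n - 1}} \]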
Example
Find the mean and standard deviation of the following frequency
distribution.
Class interval Frequency
0–6 2
6 – 12 4
12 – 18 10
18 – 24 12
24 – 30 8
30 – 36 4
Solution
Class interval   Class mark, x   f   fx   fx²
0–6 3 2 6 18
6 – 12 9 4 36 324
12 – 18 15 10 150 2250
18 – 24 21 12 252 5292
24 – 30 27 8 216 5832
30 – 36 33 4 132 4356
Total 40 792 18072
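Population mean,
\[ \mu = \frac{\sum f x}{\sum f} = \frac{792}{40} = 19.8 \]
Population standard deviation,
\[ \sigma = \sqrt{\frac{\sum f x^2}{\sum f} - \left( \frac{\sum f x}{\sum f} \right)^2} = \sqrt{\frac{18072}{40} - 19.8^2} = \sqrt{59.76} = 7.73 \]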
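The column totals in the solution table, and the mean and population standard deviation that follow from them, can be reproduced with a short Python sketch. It assumes, as the table does, that every observation in a class sits at the class mark; variable names are illustrative.

```python
import math

class_marks = [3, 9, 15, 21, 27, 33]   # midpoints of 0-6, 6-12, ..., 30-36
freqs = [2, 4, 10, 12, 8, 4]

sum_f = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, class_marks))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, class_marks))

mean = sum_fx / sum_f
sigma = math.sqrt(sum_fx2 / sum_f - mean ** 2)   # population formula

print(sum_f, sum_fx, sum_fx2)    # 40 792 18072
print(mean, round(sigma, 2))     # 19.8 7.73
```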
Example
The output distribution for a sample of 100 workers in BB Company
is shown below:
Output (units) Number of workers
21 – 25 10
26 – 30 35
31 – 35 16
36 – 40 14
41 – 45 12
46 – 50 10
51 – 55 3
Calculate the mean and the standard deviation.
Solution
Output (units)   Class mark, x   f   fx   fx²
21 – 25 23 10 230 5290
26 – 30 28 35 980 27440
31 – 35 33 16 528 17424
36 – 40 38 14 532 20216
41 – 45 43 12 516 22188
46 – 50 48 10 480 23040
51 – 55 53 3 159 8427
Total 100 3425 124025
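Sample mean,
\[ \bar{x} = \frac{\sum f x}{\sum f} = \frac{3425}{100} = 34.25 \text{ units} \]
Sample standard deviation,
\[ s = \sqrt{\frac{\sum f x^2 - \frac{(\sum f x)^2}{n}}{n - 1}} = \sqrt{\frac{124025 - \frac{3425^2}{100}}{100 - 1}} = \sqrt{67.8662} = 8.24 \text{ units} \]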
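For a sample given as a frequency distribution, the same calculation can be sketched in Python starting from the class limits. Class marks are taken as the midpoint of each pair of limits, and the n − 1 divisor is used because the 100 workers are a sample. Names are illustrative.

```python
import math

classes = [(21, 25), (26, 30), (31, 35), (36, 40), (41, 45), (46, 50), (51, 55)]
workers = [10, 35, 16, 14, 12, 10, 3]

# Class mark = midpoint of the class limits: 23, 28, 33, ...
marks = [(low + high) / 2 for low, high in classes]

n = sum(workers)
sum_fx = sum(f * x for f, x in zip(workers, marks))
sum_fx2 = sum(f * x * x for f, x in zip(workers, marks))

mean = sum_fx / n
s = math.sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))   # sample formula

print(mean, round(s, 2))   # 34.25 8.24
```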
Example
Typist A can type 40 words per minute with standard deviation of 5
while typist B can type 160 words per minute with standard
deviation of 10. Which typist is more consistent in her work?
Solution
The standard deviation of typist B is twice that of typist A, but B
also types four times as fast as A. To compare consistency while
taking both facts into account, the coefficient of variation (CV) is
used.
\[ CV = \frac{\text{standard deviation}}{\text{mean}} \times 100\% \]
\[ CV \text{ for A} = \frac{5}{40} \times 100\% = 12.5\% \]
\[ CV \text{ for B} = \frac{10}{160} \times 100\% = 6.25\% \]
The results show that typist B, who has the smaller CV, is more
consistent than typist A.
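The comparison of the two typists can be expressed as a small Python sketch. The function name `coefficient_of_variation` is illustrative; the smaller percentage indicates the more consistent typist.

```python
# Sketch: coefficient of variation as a relative measure of dispersion.

def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100

cv_a = coefficient_of_variation(5, 40)     # typist A
cv_b = coefficient_of_variation(10, 160)   # typist B
more_consistent = "A" if cv_a < cv_b else "B"

print(cv_a, cv_b, more_consistent)   # 12.5 6.25 B
```

Comparing the raw standard deviations alone (5 against 10) would have pointed to typist A, which is why a relative measure is needed when the means differ.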
Mean deviation
The mean deviation measures the average absolute difference between
the individual values and the arithmetic mean in a distribution.
In contrast to the range and the quartile deviation, the mean
deviation takes into account all the data and yet it is not greatly
affected by extreme values in the distribution due to the
averaging technique used.
Ungrouped data
\[ \text{Mean deviation} = \frac{\sum |x - \bar{x}|}{n} \]
where | | denotes the absolute value, i.e. each difference is taken as positive.
Example
The following data show the daily wages (in RM) of 5 factory workers:
20 40 58 60 32
Calculate the mean deviation.
Solution
Let x = daily wages of the workers (RM)
Mean = (20 + 40 + 58 + 60 + 32) / 5 = 42
x |x - mean|
20 |20-42|= |-22|=22
40 2
58 16
60 18
32 10
Total 68
So, mean deviation = 68 / 5 = RM 13.60
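The table above can be reproduced in Python as a quick check. This is a sketch with illustrative variable names: it takes the absolute deviation of each wage from the mean and averages them.

```python
wages = [20, 40, 58, 60, 32]   # daily wages in RM

mean = sum(wages) / len(wages)
abs_deviations = [abs(x - mean) for x in wages]   # signs ignored
mean_deviation = sum(abs_deviations) / len(wages)

print(mean, sum(abs_deviations), mean_deviation)   # 42.0 68.0 13.6
```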
Grouped data
\[ \text{Mean deviation} = \frac{\sum f\,|x - \bar{x}|}{\sum f} \]
(Note: for a frequency distribution with class intervals, x = the
class mark.)
Advantage
~It measures the degree of dispersion in terms of all the values in
the distribution.
Disadvantage
~Since the signs of the differences are ignored in the calculation,
it is not suitable for further statistical analysis.
Example
The following table shows the number of children in each of 50 families.
No. of children   No. of families, f
0 8
1 11
2 20
3 5
4 3
5 2
6 1
Total 50
Calculate the mean deviation.
Solution
Mean = ∑fx / ∑f = 94 / 50 = 1.88 children
No. of children, x   No. of families, f   fx   |x - mean|   f|x - mean|
0 8 0 1.88 15.04
1 11 11 0.88 9.68
2 20 40 0.12 2.40
3 5 15 1.12 5.60
4 3 12 2.12 6.36
5 2 10 3.12 6.24
6 1 6 4.12 4.12
Total 50 94 13.36 49.44
Mean deviation = ∑f|x - mean| / ∑f = 49.44 / 50 = 0.99 children
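The grouped calculation in this table can be sketched in Python: each absolute deviation from the mean is weighted by its frequency before averaging. Variable names are illustrative.

```python
children = [0, 1, 2, 3, 4, 5, 6]       # x
families = [8, 11, 20, 5, 3, 2, 1]     # f

n = sum(families)
mean = sum(f * x for f, x in zip(families, children)) / n

# Weight each absolute deviation by the number of families with that value.
total_abs_dev = sum(f * abs(x - mean) for f, x in zip(families, children))
mean_deviation = total_abs_dev / n

print(mean, round(total_abs_dev, 2), round(mean_deviation, 4))   # 1.88 49.44 0.9888
```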
Example
Calculate the mean deviation.
Length (cm)   f   Class mark, x   fx   |x - mean|   f|x - mean|
10-20 3 15 45 39.2 117.6
20-30 7 25 175 29.2 204.4
30-40 10 35 350 19.2 192
40-50 16 45 720 9.2 147.2
50-60 34 55 1870 0.8 27.2
60-70 13 65 845 10.8 140.4
70-80 7 75 525 20.8 145.6
80-90 6 85 510 30.8 184.8
90-100 4 95 380 40.8 163.2
Total 100 5420 1322.4
Mean = ∑fx / ∑f = 5420 / 100 = 54.2 cm
Mean deviation = ∑f|x - mean| / ∑f = 1322.4 / 100 = 13.22 cm
BOX-AND-WHISKER PLOT (OR BOX PLOT)
~ A box-and-whisker plot provides a graphical representation of
the data based on the five-number summary, namely Xsmallest, Q1,
median, Q3 and Xlargest.
~ The general form of a box-and-whisker plot is shown in the
diagram below:
[Figure: general form of a box-and-whisker plot]
Example
The following is a set of data from a sample of size n = 7:
12 7 4 9 0 7 3
(a) List the five-number summary.
(b) Form the box-and-whisker plot and describe the shape.
Solution
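(a) Rearranging the data in ascending order, we have 0 3 4 7 7 9 12
Xsmallest = 0, Xlargest = 12
Q1 location = (n + 1)/4 = (7 + 1)/4 = 2, so Q1 = 2nd item = 3
Q3 location = 3(n + 1)/4 = 3(7 + 1)/4 = 6, so Q3 = 6th item = 9
Median location = (n + 1)/2 = (7 + 1)/2 = 4, so median, Q2 = 4th item = 7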
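The five-number summary for this sample can be computed with a short Python sketch that uses the same (n + 1)-based positions as the quartile formulas earlier in the chapter. The helper names `item_at` and `five_number_summary` are invented for illustration, and at least three observations are assumed.

```python
# Sketch: five-number summary using (n + 1)-based positions.
# Helper names are illustrative, not part of the notes.

def item_at(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    lower = int(position)
    fraction = position - lower
    value = ordered[lower - 1]
    if fraction > 0:
        value += fraction * (ordered[lower] - ordered[lower - 1])
    return value

def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest)."""
    ordered = sorted(values)
    n = len(ordered)
    return (ordered[0],
            item_at(ordered, (n + 1) / 4),
            item_at(ordered, (n + 1) / 2),
            item_at(ordered, 3 * (n + 1) / 4),
            ordered[-1])

sample = [12, 7, 4, 9, 0, 7, 3]
smallest, q1, median, q3, largest = five_number_summary(sample)
print(smallest, q1, median, q3, largest)   # 0 3 7 9 12

# Lengths that decide the shape of the box plot.
print(q1 - smallest, largest - q3)         # whiskers: 3 3
print(median - q1, q3 - median)            # box halves: 4 2
```

Equal whisker lengths suggest overall symmetry, while the unequal halves of the box show that the median sits closer to Q3 than to Q1.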
(b) Box plot diagram
[Figure: box-and-whisker plot with Xsmallest = 0, Q1 = 3, Q2 = 7, Q3 = 9, Xlargest = 12]
From the box plot, the data set as a whole appears fairly
symmetrical, as the left and right whiskers are the same length.
However, the middle 50% of the values are left-skewed, as the
vertical median line is closer to the right side of the box.
COEFFICIENT OF SKEWNESS
~The term skewness is used to describe the shape of a frequency
distribution.
~If the histogram of a frequency distribution is drawn, the
distribution is said to be skewed if the peak of the histogram lies
to either side of the centre of the distribution. The terms positive
and negative skewness are used to describe the direction of the
skewness.
Positive skewness
The distribution is said to have a positive skewness if the peak of
the histogram lies to the left of the centre of the distribution.
Negative skewness
The distribution is said to have a negative skewness if the peak of
the histogram lies to the right of the centre of the distribution.
The coefficient of skewness is used to measure the degree of
skewness.
(a) Pearson measures of skewness (denoted by Sk(1) and Sk(2))
Pearson first coefficient of skewness,
\[ Sk_{(1)} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \]
Pearson second coefficient of skewness,
\[ Sk_{(2)} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} \]
Pearson measures can take values from -3 to +3. For a
symmetrical distribution, the measure is zero; for a positively or
negatively skewed curve, it takes the appropriate sign. For a
moderately skewed distribution, the absolute value of the measure
is less than 1.
(b) Quartile measure of skewness (denoted by SkQ)
\[ Sk_Q = \frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1} \]
where Q2 is the median.
The quartile measure of skewness takes values between -1 and +1. It
is convenient to use when the median and the quartiles are used to
describe the distribution.
Example
The lengths of stay by patients on the cancer floor of a local
hospital were organized into a frequency distribution. The mean
and median lengths of stay were 28 and 25 days respectively, and
the standard deviation was found to be 4.2 days. Calculate the
coefficient of skewness.
Solution
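Since the mean, median and standard deviation are given, use the
Pearson second coefficient of skewness:
\[ Sk_{(2)} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \frac{3(28 - 25)}{4.2} = 2.14 \]
The positive value indicates that the distribution is positively skewed.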
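The hospital example can be checked with a small Python sketch of the Pearson second coefficient, 3(mean − median) / standard deviation. The function name is illustrative.

```python
# Sketch: Pearson second coefficient of skewness.

def pearson_second_skewness(mean, median, std_dev):
    """3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev

sk2 = pearson_second_skewness(mean=28, median=25, std_dev=4.2)
direction = "positive" if sk2 > 0 else "negative" if sk2 < 0 else "none"

print(round(sk2, 2), direction)   # 2.14 positive
```

The sign follows directly from the ordering of the mean and the median: a mean above the median gives a positive coefficient.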
Relationship between the mean, median and mode in a
frequency distribution:
~In a positively skewed distribution, mean > median > mode
~In a negatively skewed distribution, mean < median < mode.
~In a symmetrical distribution, mean = median = mode.
2. The following array shows the amounts spent (in RM) by a random sample
of 15 students at a primary school canteen:
0.50, 0.50, 0.75, 0.75, 0.75, 0.85, 0.90, 1.50, 1.90, 1.90, 2.35,
2.45, 2.71, 3.00, 3.10.
Determine
(a) the quartiles,
(b) the inter-quartile range,
(c) the standard deviation.
4. A company owns two garages, A and B. In garage A, a representative
sample of 200 consumers’ purchases was taken. The results were as
follows:
237 245 283 296 253 249 236 254 305 242
Calculate the mean and the standard deviation of this department.
Calculate the relative dispersion.
19 17 12 30 38 27 14 16 17 0 20 45
5 25 11
Construct a box-and-whisker plot and comment on the skewness of the
distribution.
Answers:
1. (a) 5 min, 15.25 min (b) 5.125 min (c) 26.989 min²
5. (a) Males: 32.54 years, 21.85 years; Females: 35.89 years, 23.39
years
(b) Males: CV = 67.2%; Females: CV = 65.2%; The ages of males are
more variable.
6. (b) CV = 7.70% (c) RM260. RM23.9 (d) Admin Dept: CV = 9.19%; hence
earnings are more variable.