2 Analysing Data Distributions
51, 63, 84, 64, 55, 63, 70, 81, 73, 51, 82, 58, 62, 65, 69, 81, 73, 79
Spread: a single value that indicates how wide-ranging the data are
Skewness: whether data tend towards the lowest or highest value in the distribution
Outliers: whether data are so far from the average that you disregard them
Average (mean): $\mu = \frac{\sum x}{n}$ for data in a list, $\mu = \frac{\sum fx}{\sum f}$ for data in a frequency distribution
Dealing with large amounts of data
If data have a small set of possible values, it can be useful to display them in
a frequency table rather than as a list.
4,5,7,2,2,3,3,4,2,6,5,4,5,5,4,6,3,1,2,3,4,5,6,5,4,3,2,3,3,4,5,3,6,5,5,3,4,1
Displaying the data in this way has several benefits:
•Data calculations can be done quickly

Goals   Frequency (f)   fx
1       2               2
2       5               10
3       9               27
4       8               32
5       9               45
6       4               24
7       1               7
Total   38              147

Mean $= \frac{\sum fx}{\sum f} = \frac{147}{38} = 3.868... \approx 3.9$ goals (1dp)
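The frequency-table mean above can be checked with a few lines of Python. This is only a sketch: the names `goal_freq` and `frequency_mean` are made up for illustration, and the table is typed in by hand from the example.

```python
# Frequency table from the goals example: value -> frequency.
goal_freq = {1: 2, 2: 5, 3: 9, 4: 8, 5: 9, 6: 4, 7: 1}


def frequency_mean(table):
    """Return (sum of f*x) / (sum of f) for a {value: frequency} table."""
    total_f = sum(table.values())
    total_fx = sum(x * f for x, f in table.items())
    return total_fx / total_f


mean_goals = frequency_mean(goal_freq)
print(round(mean_goals, 1))
```

Working from the table means only 7 multiplications are needed instead of adding up 38 separate values.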
Time t (seconds)   Frequency f   Midpoint x   fx
12 ≤ t < 13        3             12.5         37.5
13 ≤ t < 14        8             13.5         108
14 ≤ t < 16        16            15           240
16 ≤ t < 18        7             17           119
18 ≤ t < 24        2             21           42
Total              Σf = 36                    Σfx = 546.5

Use class midpoints as x.
Estimated mean $= \frac{\sum fx}{\sum f} = \frac{546.5}{36} = 15.180... \approx 15.2$ seconds (1dp)
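A minimal Python sketch of the same estimate, assuming each class is stored as a (lower bound, upper bound, frequency) triple. The names `time_classes` and `estimated_mean` are illustrative only.

```python
# Grouped times from the example: (lower bound, upper bound, frequency).
time_classes = [(12, 13, 3), (13, 14, 8), (14, 16, 16), (16, 18, 7), (18, 24, 2)]


def estimated_mean(classes):
    """Estimate the mean of grouped data using class midpoints as x."""
    total_f = sum(f for _, _, f in classes)
    total_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return total_fx / total_f


print(round(estimated_mean(time_classes), 1))
```

The result is only an estimate because every value in a class is treated as if it sat at the class midpoint.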
Poorly defined groups
Eg The times taken by 26 pupils to run an 800m race are given in the table below.
Estimate the mean time taken to run the race.
Estimated mean $= \frac{4537}{26} = 174.5$ seconds
Statistical Calculations with grouped data
1. Mr Walker is analysing the January exam results in Maths:
   Stem-and-leaf rows shown: 7 | 2 3 6 6, 8 | 0 4 7 7 7, 9 | 2 3 8 9
   Σx = 1615
   Find: Mode, Mean μ, Lower quartile Q1, Median Q2, Standard deviation σ
   Mean $\mu = \frac{1615}{22} = 73.4$
2. Find: Mode, Mean μ, Lower quartile Q1, Median Q2, Standard deviation σ
   Mean $\mu = \frac{1458}{22} = 66.3$
3. Number of eggs   Frequency   fx
   1                3           3
   2                6           12
   3                5           15
   4                1           4
   5                6           30
   Σfx = 64
   Find: Mode, Mean μ, Lower quartile Q1, Upper quartile Q3, IQR, and the skew using $\frac{\text{mean} - \text{median}}{\text{standard deviation}}$
   Mean $\mu = \frac{64}{21} = 3.0$
4. Farmer Jones also wants to know which of his hens is the most prolific egg-layer.
   He records the number of eggs laid each day over a 4-week period by his best hen:
   Number of eggs   Frequency
   1                2
   2                5
   3                10
   4                8
   5                3
   You may use that Σfx = 89 and Σfx² = 315
   Find: Mode, Mean μ, Lower quartile Q1, IQR
   Mean $\mu = \frac{89}{28} = 3.2$
5. The owners of KFC want to analyse the weights of their target customers.
   They conduct a survey:
   Σf = 71
   Find: Mean μ, Lower quartile Q1, Upper quartile Q3, IQR, and the skew using $\frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$
   Mean $\mu = \frac{3320}{71} = 46.8$
Combined mean
Eg The mean percentage achieved in S1 was 58% for the 12 pupils that sat it in 2008
The mean percentage achieved in S1 was 76% for the 7 pupils that sat it in 2009
What is the overall mean for 2008-2009?
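One way to approach the combined mean is to rebuild each group's total (mean × number of pupils), add the totals, and divide by the total number of pupils. The sketch below assumes that approach; `combined_mean` is a made-up helper name.

```python
def combined_mean(groups):
    """groups: list of (mean, count) pairs. Returns the overall mean."""
    total = sum(mean * count for mean, count in groups)
    count = sum(count for _, count in groups)
    return total / count


# 2008: mean 58% for 12 pupils; 2009: mean 76% for 7 pupils.
overall = combined_mean([(58, 12), (76, 7)])
print(round(overall, 1))
```

Simply averaging 58 and 76 would give 67, which is too high because the larger 2008 group carries more weight.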
Starter: pick three or four numbers that have a given mean or median, for example four numbers chosen so that the mean is less than the median.
Median & quartiles for a list of data
You must be able to identify the median, upper and lower quartiles of data
Eg the number of runs scored in each innings by Kevin Pietersen during the
successful 2005 Ashes series were 64, 57, 20, 71, 0, 21, 23, 45, 158, 14
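10 data values, so n = 10. In order: 0, 14, 20, 21, 23, 45, 57, 64, 71, 158
Lower quartile: $\frac{n}{4} = \frac{10}{4} = 2.5$, so take the 3rd value = 20 runs
Median: $\frac{n}{2} = \frac{10}{2} = 5$, so average the 5th and 6th values: $\frac{23 + 45}{2} = 34$ runs
Upper quartile: $\frac{3n}{4} = 7.5$, so take the 8th value = 64 runs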
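The positional rule used above (round a fractional position up; for a whole-number position, average that value with the next one) can be sketched in Python. The function names are hypothetical, and this rule is the one these notes use; other textbooks and software use slightly different quartile conventions.

```python
import math


def value_at_position(ordered, position):
    """Apply the rule from these notes to a 1-based position in sorted data."""
    if position == int(position):
        k = int(position)
        # Whole-number position: average the k-th and (k+1)-th values.
        return (ordered[k - 1] + ordered[k]) / 2
    # Fractional position: round up to the next whole position.
    return ordered[math.ceil(position) - 1]


def s1_quartiles(data):
    """Return (Q1, Q2, Q3) using positions n/4, n/2 and 3n/4."""
    ordered = sorted(data)
    n = len(ordered)
    return tuple(value_at_position(ordered, n * k / 4) for k in (1, 2, 3))


runs = [64, 57, 20, 71, 0, 21, 23, 45, 158, 14]
print(s1_quartiles(runs))
```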
Median & quartiles for large amounts of data
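Eg the number of goals conceded by Everton each game in a season is given in the
frequency table below. Calculate the median number of goals conceded per game.
In order: 1,1,2,2,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,6,6,7
Running totals: the 1s end at the 2nd value, the 2s at the 7th, the 3s at the 16th, the 4s at the 24th and the 5s at the 33rd
With n = 38, $\frac{n}{2} = 19$, so the median is the average of the 19th and 20th values. Both are 4, so the median is 4 goals.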
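Time t (seconds)   Frequency   Running total
12 ≤ t < 13        3           3
13 ≤ t < 14        8           11
14 ≤ t < 16        16          27
16 ≤ t < 18        7           34
18 ≤ t < 24        2           36
Lower quartile: $\frac{n}{4}$ = 9th value, which is 6 into the 13 ≤ t < 14 group: $13 + \frac{6}{8} \times 1 = 13.75$
Median: $\frac{n}{2}$ = 18th value, which is 7 into the 14 ≤ t < 16 group: $14 + \frac{7}{16} \times 2 = 14.875$
Upper quartile: $\frac{3n}{4}$ = 27th value, which is 16 into the 14 ≤ t < 16 group: $14 + \frac{16}{16} \times 2 = 16$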
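Interpolation in a grouped table follows the same pattern each time: find the class containing the required position, then move the right fraction of the way through it. A hedged Python sketch, with illustrative names and the classes typed in as (lower, upper, frequency) triples:

```python
def interpolate_position(classes, position):
    """Estimate the value at a 1-based position by linear interpolation.

    classes: list of (lower, upper, frequency) triples in ascending order.
    """
    running_total = 0
    for lower, upper, freq in classes:
        if position <= running_total + freq:
            steps_into_class = position - running_total
            return lower + steps_into_class / freq * (upper - lower)
        running_total += freq
    raise ValueError("position is beyond the total frequency")


time_classes = [(12, 13, 3), (13, 14, 8), (14, 16, 16), (16, 18, 7), (18, 24, 2)]
n = sum(f for _, _, f in time_classes)

q1 = interpolate_position(time_classes, n / 4)
q2 = interpolate_position(time_classes, n / 2)
q3 = interpolate_position(time_classes, 3 * n / 4)
print(q1, q2, q3)
```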
Poorly defined groups
Eg The times taken by 26 pupils to run an 800m race are given in the table below.
Use interpolation to find the median and quartiles
Statistical Calculations with grouped data: median and quartiles
1. Lower quartile Q1 = 6th value = 62
   Median Q2 = 11.5th value = 74.5
2. Lower quartile Q1 = 6th value = 55
   Median Q2 = average of the 11th and 12th values = 64
3. Number of eggs   Frequency   fx   Running total
   1                3           3    3
   2                6           12   9
   3                5           15   14
   4                1           4    15
   5                6           30   21
   Lower quartile Q1 = 6th value = 2
   Median Q2 = 11th value = 3
   Upper quartile Q3 = 16th value = 5
   IQR = 3
4. Number of eggs   Frequency   Running total
   1                2           2
   2                5           7
   3                10          17
   4                8           25
   5                3           28
   Lower quartile Q1 = average of the 7th and 8th values = 2.5
   Median Q2 = average of the 14th and 15th values = 3
   Upper quartile Q3 = average of the 21st and 22nd values = 4
   IQR = 1.5
   Skew by comparing mean, median & mode
5. Lower quartile Q1 $= 30 + \frac{4.75}{24} \times 10 = 32.0$
   Median Q2 $= 30 + \frac{22.5}{24} \times 10 = 39.4$
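Data (on a number line from 0 to 15): 4, 5, 8, 13, 15
Mean $= \frac{4 + 5 + 8 + 13 + 15}{5} = 9$
Differences from the mean: 5, 4, 1, 4, 6
Average difference $= \frac{5 + 4 + 1 + 4 + 6}{5} = \frac{20}{5} = 4$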
The larger the average difference, the more spread out the data
If you weren’t careful the data values less than the mean would
contribute a negative difference and mess up the calculation…
Standard Deviation
In the GCSE you used range and inter-quartile range to measure the spread of data
In S1 you also use a measure called standard deviation, denoted by the symbol σ
$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{n}}$, where μ is the mean, calculated by $\mu = \frac{\sum x}{n}$
This considers the square of the difference between each piece of
data and the mean, and so avoids problems with negative differences
Eg the lengths of 5 worms are 4cm, 5cm, 8cm, 13cm and 15cm
$\sigma = \sqrt{\frac{(4-9)^2 + (5-9)^2 + (8-9)^2 + (13-9)^2 + (15-9)^2}{5}} = \sqrt{\frac{94}{5}} = 4.3$ (1dp)
NB: this calculation does not give the same answer as the method shown
previously, so do not be tempted to calculate standard deviation without squaring
Standard Deviation for a list of data
The calculation involves repeatedly subtracting the mean.
Using algebra beyond the scope of S1, the rule can be
manipulated into a form that is faster to calculate: $\sigma = \sqrt{\frac{\sum x^2}{n} - \mu^2}$
Eg the lengths of 5 worms are 4cm, 5cm, 8cm, 13cm and 15cm
Mean = 9 from previously, so $\sigma = \sqrt{\frac{4^2 + 5^2 + 8^2 + 13^2 + 15^2}{5} - 9^2} = 4.3$ (1dp)
Often, you will be told what Σx² and Σx are, and only need to calculate the
mean before substituting those values into the standard deviation formula:
Eg given that Σx² = 641.5 and Σx = 53.8 for 8 pieces of data, calculate σ
$\sigma = \sqrt{\frac{641.5}{8} - \left(\frac{53.8}{8}\right)^2} = 5.9$ (1dp)
NB: The rule for σ is not given to you on the formula sheet – you must memorise it
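Both forms of the rule can be checked against each other numerically. The sketch below is illustrative only (the function names are made up); it uses the worm data and the Σx², Σx example from above.

```python
import math


def sd_from_definition(data):
    """Square root of the mean squared difference from the mean."""
    n = len(data)
    mu = sum(data) / n
    return math.sqrt(sum((x - mu) ** 2 for x in data) / n)


def sd_from_sums(sum_x2, sum_x, n):
    """Faster form: sqrt(sum of x^2 / n - mean^2)."""
    mean = sum_x / n
    return math.sqrt(sum_x2 / n - mean ** 2)


worms = [4, 5, 8, 13, 15]
slow = sd_from_definition(worms)
fast = sd_from_sums(sum(x * x for x in worms), sum(worms), len(worms))
print(round(slow, 1), round(fast, 1))

# Example where only the sums are given.
print(round(sd_from_sums(641.5, 53.8, 8), 1))
```

Note that this divides by n, as in these notes. Many calculators and libraries also offer a version that divides by n − 1, which gives a different answer.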
Standard deviation for large amounts of data
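Eg the number of goals conceded by Everton each game in a season
is given in the frequency table below. Calculate the standard deviation
in the number of goals conceded per game.
Eg for the grouped times, use the class midpoints as x. You will often be given Σfx and Σfx²
Time t (seconds)   Frequency
12 ≤ t < 13        3
13 ≤ t < 14        8
14 ≤ t < 16        16
16 ≤ t < 18        7
18 ≤ t < 24        2
$\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2} = \sqrt{\frac{8431.75}{36} - \left(\frac{546.5}{36}\right)^2} \approx 1.9$ seconds (1dp)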
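For grouped data the sums Σf, Σfx and Σfx² can be built from the class midpoints before applying the rule. A minimal sketch, assuming the same (lower, upper, frequency) layout used for illustration elsewhere; `grouped_sums` and `grouped_sd` are hypothetical names.

```python
import math


def grouped_sums(classes):
    """Return (sum f, sum fx, sum fx^2) using class midpoints as x."""
    sum_f = sum_fx = sum_fx2 = 0
    for lower, upper, freq in classes:
        midpoint = (lower + upper) / 2
        sum_f += freq
        sum_fx += freq * midpoint
        sum_fx2 += freq * midpoint ** 2
    return sum_f, sum_fx, sum_fx2


def grouped_sd(classes):
    """Estimate the standard deviation of grouped data."""
    sum_f, sum_fx, sum_fx2 = grouped_sums(classes)
    return math.sqrt(sum_fx2 / sum_f - (sum_fx / sum_f) ** 2)


time_classes = [(12, 13, 3), (13, 14, 8), (14, 16, 16), (16, 18, 7), (18, 24, 2)]
print(grouped_sums(time_classes))
print(round(grouped_sd(time_classes), 1))
```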
Statistical Calculations with grouped data: standard deviation
1. Standard deviation $\sigma = \sqrt{\frac{124551}{22} - \left(\frac{1615}{22}\right)^2} = 16.5$
2. Standard deviation $\sigma = \sqrt{\frac{100608}{22} - \left(\frac{1458}{22}\right)^2} = 13.5$
3. Number of eggs   Frequency   fx   Running total   fx²
   1                3           3    3               3
   2                6           12   9               24
   3                5           15   14              45
   4                1           4    15              16
   5                6           30   21              150
   Σfx = 64, Σfx² = 238
   Standard deviation $\sigma = \sqrt{\frac{238}{21} - \left(\frac{64}{21}\right)^2} = 1.4$
4. Standard deviation $\sigma = \sqrt{\frac{315}{28} - \left(\frac{89}{28}\right)^2} = 1.1$
5. Standard deviation $\sigma = \sqrt{\frac{191750}{71} - \left(\frac{3320}{71}\right)^2} = 22.7$
   Upper quartile Q3 $= 40 + \frac{16.25}{17} \times 30 = 68.7$
   IQR = 36.7
   Skew using $\frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1}$
6. The owners of McDonald’s want to analyse the weights of their target customers.
   They conduct a survey:
   Weight (kg)   Frequency   Midpoint   fx       Running total   fx²
   0-20          24          10.25      246      24              2521.5
   21-40         31          30.5       945.5    55              28837.75
   41-60         47          50.5       2373.5   102             119861.75
   61-80         75          70.5       5287.5   177             372768.75
   81-100        23          90.5       2081.5   200             188375.75
   Σf = 200, Σfx = 10934, Σfx² = 712365.5
   Lower quartile Q1 $= 20.5 + \frac{26}{31} \times 20 = 37.3$
   Median Q2 $= 40.5 + \frac{45}{47} \times 20 = 59.6$
   Mean $\mu = \frac{10934}{200} = 54.7$
   Standard deviation $\sigma = \sqrt{\frac{712365.5}{200} - \left(\frac{10934}{200}\right)^2} \approx 23.9$
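Why the two forms of the standard deviation rule agree:
$\sigma^2 = \frac{\sum (x - \mu)^2}{n}$
$= \frac{\sum (x^2 - 2\mu x + \mu^2)}{n}$
$= \frac{\sum x^2}{n} - 2\mu \frac{\sum x}{n} + \mu^2 \frac{\sum 1}{n}$
$= \frac{\sum x^2}{n} - 2\mu^2 + \mu^2$
$= \frac{\sum x^2}{n} - \mu^2$
$= \frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2$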
Measuring skew
Symmetrical
Mode = Median = Mean
Q2-Q1 = Q3-Q2
Positively skewed
Mode < Median < Mean
Q2-Q1 < Q3-Q2
Negatively skewed
Mode > Median > Mean
Q2-Q1>Q3-Q2
Other measures of skew
If mean > mode indicates positive skew, then so does $\frac{\text{mean} - \text{mode}}{\sigma} > 0$
Dividing by σ ‘scales’ the value, but has no effect on its sign as σ is always positive
If Q3-Q2 > Q2-Q1 indicates positive skew, then so does $(Q_3 - Q_2) - (Q_2 - Q_1) > 0$, that is $Q_3 - 2Q_2 + Q_1 > 0$
Scaling this by dividing by the IQR $Q_3 - Q_1$:
$\frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1} > 0$ if data is positively skewed, and $< 0$ if data is negatively skewed
You will be told which of these measures to use in the exam – all you have to do is substitute
the values and remember that if the outcome is positive, this indicates positive skew!
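The two skew measures are simple substitutions, sketched here in Python. The function names are made up for illustration; the figures are the ones from the hotel stem-and-leaf exercise (WB1) later in these notes (mode 56, quartiles 35, 52 and 60, Σx = 1335, Σx² = 71801, n = 27).

```python
import math


def mean_mode_skew(mean, mode, sd):
    """(mean - mode) / standard deviation."""
    return (mean - mode) / sd


def quartile_skew(q1, q2, q3):
    """(Q3 - 2*Q2 + Q1) / (Q3 - Q1)."""
    return (q3 - 2 * q2 + q1) / (q3 - q1)


n, sum_x, sum_x2 = 27, 1335, 71801
mean = sum_x / n
sd = math.sqrt(sum_x2 / n - mean ** 2)

print(round(mean_mode_skew(mean, 56, sd), 4))
print(quartile_skew(35, 52, 60))
```

Both values come out negative, so both measures point to negative skew for that data set.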
Statistical Calculations with grouped data: mode and skew
1. Mode = 87
2. Mode = 63
3. Mode = 2 and 5
4. Mode = 3
   Skew by comparing mean, median & mode: mean > median & mode, so positive skew
5. Skew using $\frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1} = \frac{68.7 - 2 \times 39.4 + 32.0}{36.7} \approx 0.597 > 0$, so positive skew
Do any of his scores stand out as unusual or inconsistent with his other performances?
Clearly, the 158 was by far his best score – it could be considered as an outlier
A common rule treats a value as an outlier if it is more than 1½ × IQR above Q3 or below Q1.
The boundary of being 1½ times the IQR away is arbitrary, and
you may be given a different threshold for classifying outliers
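A short Python sketch of the 1½ × IQR check, applied to the Pietersen scores with the quartiles found earlier (Q1 = 20, Q3 = 64). The helper names are illustrative, and the multiplier `k` is left as a parameter because a different threshold may be specified.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) boundaries k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(data, q1, q3, k=1.5):
    """List the values lying outside the fences."""
    low, high = outlier_fences(q1, q3, k)
    return [x for x in data if x < low or x > high]


runs = [64, 57, 20, 71, 0, 21, 23, 45, 158, 14]
print(outlier_fences(20, 64))
print(find_outliers(runs, 20, 64))
```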
Now try Ex4G, p72, Q1,8
Interpreting and comparing data
Pupils lose marks in S1 because they are unable to interpret and compare data.
The number of marks available tells you how many different measures to analyse
If there are 2 marks, write one sentence each about average and spread
If there are 3 marks, write one sentence each about average, spread and skew
If there are 4 marks, write one sentence each about all four main features
Relating measures to the context
It is also critical that you relate these measures to the context of the data…
The classes have a very similar mean, suggesting that on average, results are similar
Organic farm
The median for the farm which uses pesticides is lower – on average there are fewer
worms in its soil.
The IQR and range for the farm which uses pesticides are also lower – there is less
variation in the number of worms in the soil.
The data supports the claim as the average and spread have decreased.
Now try: Ex4F, p71-72, Q1,4 and Ex4G, p73-74, Q3+4
7a) Compare the Maths exam marks with the Science exam marks
b) Mr Brown wants to give a bonus to a department on the basis of the exam results.
Use your answer to (a) to advise him.
8) Which farmer’s top hen would you want on your farm? Explain why!
9) McDonalds claim they cater to a healthier target audience than their rivals.
Comment on this claim with reference to your answers to questions 5 and 6
Which measures to use?
Usually, when analysing average and spread you choose either:
Which class’s results are more spread out?
                     Class A   Class B
Mean                 68%       73%
Standard deviation   5%        6%
Class B have a larger standard deviation, but a higher mean too. Is it a fair comparison?
Calculating $\frac{\text{mean}}{\sigma}$ will enable you to ‘fairly’ compare the dispersion of 2 sets of data.
Class A: 68 ÷ 5 = 13.6     Class B: 73 ÷ 6 = 12.2
So the mean is about 14 times the standard deviation for class A, whereas in class B the
mean is about 12 times the standard deviation.
By dividing by σ, you ‘scale down’ the numbers to make a fair comparison.
Class B’s results are more spread out, relative to the mean
Histograms
Frequency density = Frequency ÷ Class width
Eg Some data on height
Height (h) in cm   Frequency   Frequency density
130 ≤ h < 150      42          42 ÷ 20 = 2.1
150 ≤ h < 160      35          35 ÷ 10 = 3.5
160 ≤ h < 165      16          16 ÷ 5 = 3.2
165 ≤ h < 180      39          39 ÷ 15 = 2.6
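The frequency densities in the height table can be reproduced with a short Python sketch. The variable names are illustrative and the classes are typed in as (lower, upper, frequency) triples.

```python
# Height classes from the example: (lower bound, upper bound, frequency).
height_classes = [(130, 150, 42), (150, 160, 35), (160, 165, 16), (165, 180, 39)]


def frequency_density(lower, upper, freq):
    """Frequency divided by class width."""
    return freq / (upper - lower)


densities = [frequency_density(lo, hi, f) for lo, hi, f in height_classes]
print([round(d, 2) for d in densities])

# Going the other way: frequency = frequency density * class width.
recovered = [d * (hi - lo) for d, (lo, hi, _) in zip(densities, height_classes)]
```

The second calculation is the one needed when reading frequencies back off a histogram, since the area of each bar represents the frequency.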
Finding frequencies from histograms
Eg The histogram gives information about the books sold in a bookshop
one Saturday. Use the histogram to complete the table.
Frequency Area 4
80 4 20
Salary £1000s
When a histogram is constructed for this data, the 6-8 minutes bar has
width 2cm and height 3cm. Find the dimensions of the 9-14 minutes bar.
0.3 Now try Ex4G, p75, Q6
WB1 Over a period of time, the number of people x leaving a hotel each morning
was recorded. These data are summarised in the stem and leaf diagram below.
Number leaving        Totals
2 | 7 9 9             (3)
3 | 2 2 3 5 6         (5)
4 | 0 1 4 8 9         (5)
5 | 2 3 3 6 6 6 8     (7)
6 | 0 1 4 5           (4)
7 | 2 3               (2)
8 | 1                 (1)
For these data,
(a) write down the mode,
(b) find the values of the three quartiles.
a) Mode = 56
b) n = 27
Lower quartile: $\frac{n}{4} = 6.75$, so take the 7th value = 35
Median: $\frac{n}{2} = 13.5$, so take the 14th value = 52
Upper quartile: $\frac{3n}{4} = 20.25$, so take the 21st value = 60
Given that Σx = 1335 and Σx² = 71801 find
(c) the mean and the standard deviation of these data.
One measure of skewness is found using $\frac{\text{mean} - \text{mode}}{\text{standard deviation}}$.
(d) Evaluate this measure to show that these data are negatively skewed.
(e) Give two other reasons why these data are negatively skewed
c) $\mu = \frac{\sum x}{n} = \frac{1335}{27} = 49.444...$, so Mean = 49.4 (1dp)
$\sigma^2 = \frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2 = \frac{71801}{27} - \left(\frac{1335}{27}\right)^2 = \frac{17378}{81}$
$\sigma = \sqrt{\frac{17378}{81}} = 14.647...$, so Standard deviation = 14.6 (1dp)
d) $\frac{49.444... - 56}{14.647...} = -0.4475... < 0$, indicating negative skew
e) For negative skew: Mode > Median > Mean (56 > 52 > 49.4) and Q2-Q1 > Q3-Q2 (52 − 35 > 60 − 52)
WB2 The following table summarises the distances, to the nearest km, that 134
examiners travelled to attend a meeting in London.
Distance (km)   Number of examiners
41–45           4
46–50           19
51–60           53
61–70           37
71–90           15
91–150          6
(a) Give a reason to justify the use of a histogram to represent these data.
(b) Calculate the frequency densities needed to draw a histogram for these data.
(DO NOT DRAW THE HISTOGRAM)
a) Data is continuous and class widths vary
b) Frequency density = Frequency ÷ Class width
Effective class boundaries   Class width   Number of examiners   Frequency density
40.5 – 45.5                  5             4                     4 ÷ 5 = 0.8
45.5 – 50.5                  5             19                    3.8
50.5 – 60.5                  10            53                    5.3
60.5 – 70.5                  10            37                    3.7
70.5 – 90.5                  20            15                    0.75
90.5 – 150.5                 60            6                     0.1
(c) Use interpolation to estimate the median Q2, the lower quartile Q1, and the upper quartile Q3
The mid-point of each class is represented by x and the corresponding frequency by f.
Calculations then give the following values: Σfx = 8379.5 and Σfx² = 557489.75
(d) Calculate an estimate of the mean and an estimate of the standard deviation for these data.
Distance (km)   Number of examiners   Running total
40.5–45.5       4                     4
45.5–50.5       19                    23
50.5–60.5       53                    76
60.5–70.5       37                    113
70.5–90.5       15                    128
90.5–150.5      6                     134
Σf = 134
c) Median Q2 = 67th value $= 50.5 + \frac{67 - 23}{53} \times 10 = \frac{6233}{106} = 58.80$
Lower quartile Q1 = 33.5th value $= 50.5 + \frac{33.5 - 23}{53} \times 10 = \frac{5563}{106} = 52.48$
Upper quartile Q3 = 100.5th value $= 60.5 + \frac{100.5 - 76}{37} \times 10 = \frac{4967}{74} = 67.12$
d) Mean $\mu = \frac{8379.5}{134} = 62.53$
Standard deviation $\sigma = \sqrt{\frac{557489.75}{134} - \left(\frac{8379.5}{134}\right)^2} = 15.81$
One coefficient of skewness is given by $\frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1}$
(e) Evaluate this coefficient and comment on the skewness of these data.
(f) Give another justification of your comment in part (e).
Min = 5, Q1 = 12, Q2 = 17, Q3 = 28, Max = 63
IQR = 16, Q3 + 1.5 × IQR = 52, Q1 − 1.5 × IQR = −12, so 63 is an outlier; the next biggest value is 45
(b) Comment on the distribution of delays. Justify your answer.
(c) Suggest how the distribution might be interpreted by a passenger who
frequently flies from City A to City B.