4A Graphs and Statistics of Data Sets: Class Problems
(a) Draw a histogram of the grouped data. The units of the horizontal and vertical axes
should be years and %/year, respectively. You can assume the last bin ends at 110.
In (b)–(d), did you have to make additional assumptions? If yes, how valid do you think they
are?
Solution.
(a) The first bar should contain the Finns of ages 0–14 years (an interval that covers 15
years). The number of Finns in this group is 896023, and their relative frequency is
896023/5487308 ≈ 16.3%. So we should draw the bar with height 16.3/15 ≈ 1.09 (% per
year). We get the following histogram. Adding up the areas we get 100% as we should.
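The bar heights can be recomputed with a short script. This is a minimal sketch: the group counts are the ones that appear in the weighted average of part (d), the bin edges follow the histogram axis, and the variable names are illustrative only.

```python
# Bin edges (years) and group counts, taken from the exercise text.
edges = [0, 15, 25, 45, 65, 75, 110]
counts = [896023, 640387, 1363155, 1464640, 642428, 480675]

total = sum(counts)

# Bar height = relative frequency (in %) divided by bin width (in years).
heights = []
for left, right, count in zip(edges, edges[1:], counts):
    percent = 100 * count / total
    heights.append(percent / (right - left))

# Total area = sum of height * width; it should come out as 100 %.
area = sum(h * (r - l) for h, l, r in zip(heights, edges, edges[1:]))

for left, right, h in zip(edges, edges[1:], heights):
    print(f"{left:3d}-{right:3d}: {h:.2f} % per year")
print(f"total area: {area:.1f} %")
```

Dividing by the bin width is what makes bars of different widths comparable: the area of a bar, not its height, equals the relative frequency.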
MS-A0503 First course in probability and statistics J Kohonen
Department of mathematics and systems analysis Spring 2020
Aalto SCI Exercise 4A
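[Figure: histogram of the age distribution. Horizontal axis: age in years, with bin boundaries 0, 15, 25, 45, 65, 75, 110. Vertical axis: percent per year, from 0 to 1.6.]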
(b) The bar of 0–14 has height 1.09 % per year, and the bar of 65–74 has height 1.17 % per
year. From this data we do not really know the age distribution within each group, but
if we assume it is approximately uniform, we would say that 1-year-olds are about 1.09%
of the population and 66-year-olds about 1.17%. Under this assumption there are more
66-year-olds than 1-year-olds.
(c) The first two bars cover 28% of the population. The first three bars cover 52.8% of the
population. So we certainly know that the median is somewhere between 25 and 45 years.
To find the median, we seek a point m such that 50% of population lies below m.
We assume again that the age distribution between 25 and 45 years is uniform. Because
the first two bars already cover 28% of the population, we need to find 22% more (of
the whole population) from the third group; or 22/24.8 ≈ 0.8871 of the third age group.
Since the third age group is 20 years long, we take the lowest 0.8871 · 20 = 17.7 years of
the third group.
Our estimate is that the median is at 25+17.7 = 42.7 years.
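The interpolation can be sketched in code as follows. The function name `grouped_quantile` is hypothetical, and the uniformity within each bin is the same assumption as in the text, not a property of the data.

```python
# Bin edges (years) and group counts, taken from the exercise text.
edges = [0, 15, 25, 45, 65, 75, 110]
counts = [896023, 640387, 1363155, 1464640, 642428, 480675]


def grouped_quantile(edges, counts, p):
    """Estimate the p-quantile of grouped data, assuming (as in the text)
    a uniform distribution inside each bin."""
    target = p * sum(counts)
    cumulative = 0
    for left, right, count in zip(edges, edges[1:], counts):
        if count > 0 and cumulative + count >= target:
            # Fraction of this bin needed to reach the target count.
            fraction = (target - cumulative) / count
            return left + fraction * (right - left)
        cumulative += count
    return edges[-1]


median = grouped_quantile(edges, counts, 0.5)
print(f"estimated median: {median:.1f} years")
```

The script works with exact counts rather than rounded percentages, so intermediate values differ slightly from the hand calculation, but the estimate rounds to the same 42.7 years.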
(d) Again, assuming that the age distribution is uniform in each group, the average age within
the first group is (0+15)/2 = 7.5 years, within the second group (15+25)/2 = 20 years,
and so on. To find the average age of all Finns, we need to find the sum of their ages,
and then divide by the total count. Adding up the ages in each group (by our uniformity
assumption), we get the average
$$\frac{896023 \cdot 7.5 + 640387 \cdot 20 + 1363155 \cdot 35 + 1464640 \cdot 55 + 642428 \cdot 70 + 480675 \cdot 92.5}{5487308}$$
which is about 43.2 years. This is the weighted average of the group averages, where
weights are the frequencies of the groups. Alternatively, we could have used the relative
frequencies as weights.
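Both weightings can be checked numerically. The sketch below assumes, as in the text, that each group's average age is the midpoint of its bin; the variable names are illustrative.

```python
# Bin edges (years) and group counts, taken from the exercise text.
edges = [0, 15, 25, 45, 65, 75, 110]
counts = [896023, 640387, 1363155, 1464640, 642428, 480675]

# Group averages under the uniformity assumption: the bin midpoints.
midpoints = [(left + right) / 2 for left, right in zip(edges, edges[1:])]

total = sum(counts)

# Weights = frequencies: sum of (count * midpoint), divided by the total.
mean_from_counts = sum(c * m for c, m in zip(counts, midpoints)) / total

# Weights = relative frequencies: these already sum to 1.
relative = [c / total for c in counts]
mean_from_relative = sum(w * m for w, m in zip(relative, midpoints))

print(f"{mean_from_counts:.1f} years, {mean_from_relative:.1f} years")
```

The two results agree up to floating-point rounding, because dividing every weight by the same total does not change a weighted average.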
We assumed that the age distribution is approximately uniform within each group. Certainly
the true distribution is not exactly uniform, but without more data it is difficult to say how
good the approximation is. But one might guess that in the last two groups, the true density
probably decreases strongly towards the right end.
In fact, Tilastokeskus also gives more detailed statistics. We find that the number of 1-year-olds
was 58 008 (≈ 1.06 %) and the number of 66-year-olds was 76 975 (≈ 1.40 %). So the numbers from
the histogram were pretty good for 1-year-olds, but underestimated the proportion of 66-year-olds.
4A2 (Quantiles) The R software defines the quantile function of data $x = (x_1, \dots, x_n)$ as
follows. Let $x_{(1)}$ be the smallest number in the data, $x_{(2)}$ the second smallest, and so on. Thus we have
ordered data $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$. Then the horizontal unit interval $[0, 1]$ is divided into
$n - 1$ equal parts, at points $p_k = (k - 1)/(n - 1)$, $k = 1, \dots, n$. The quantile function is defined
by drawing the points $(p_k, x_{(k)})$ and connecting them with straight line segments.
Draw (on paper by hand) the quantile functions of the following data sets, and for each
data set, determine the lower quartile Q(0.25), median Q(0.50) and upper quartile Q(0.75):
(a) x = (1000, 2000, 5000, 9000),
Solution.
(a) Graph below.
Lower quartile Q(0.25) = 1750, median Q(0.50) = 3500, upper quartile Q(0.75) = 6000.
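The quartiles in (a) can be verified with a direct implementation of the definition. This is a minimal sketch in plain Python; the function name `quantile` is chosen here for illustration and is not R's own code.

```python
def quantile(data, p):
    """Piecewise linear quantile function through the points
    (p_k, x_(k)) with p_k = (k - 1)/(n - 1), for 0 <= p <= 1."""
    xs = sorted(data)
    n = len(xs)
    if n == 1:
        return float(xs[0])
    position = p * (n - 1)          # fractional 0-based index into xs
    k = min(int(position), n - 2)   # left end of the line segment
    weight = position - k           # how far along the segment p lies
    return xs[k] + weight * (xs[k + 1] - xs[k])


x = [1000, 2000, 5000, 9000]
quartiles = [quantile(x, p) for p in (0.25, 0.50, 0.75)]
print(quartiles)
```

For example, p = 0.25 lies three quarters of the way from p_1 = 0 to p_2 = 1/3, so Q(0.25) is three quarters of the way from 1000 to 2000.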
[Figure: quantile functions of the data sets, in three panels. Horizontal axes: p from 0.0 to 1.0. Vertical axes: data values.]
(e) True, because the quantile function is nondecreasing. (The kth point cannot be smaller
than the (k − 1)th point, because the data was sorted in increasing order.)
(f) False. One very small value is enough to make the average very small, below the lower
quartile. Consider, for example, the data x = (−1000, 1, 1, 5, 20). It has the same quartiles
as (c), but its average −194.6 is far below the lower quartile.
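The counterexample can be checked with Python's standard library. The sketch assumes that `statistics.quantiles` with `method="inclusive"` uses the same linear interpolation between the points (p_k, x_(k)) as the definition in 4A2, which holds for these data since the quartile positions fall exactly on data points.

```python
from statistics import mean, quantiles

x = [-1000, 1, 1, 5, 20]

# n=4 asks for the three cut points that split the data into quarters.
lower, median, upper = quantiles(x, n=4, method="inclusive")
average = mean(x)

print(f"quartiles: {lower}, {median}, {upper}")
print(f"average:   {average}")
print("average below lower quartile:", average < lower)
```

A single extreme value moves the average a long way but leaves the quartiles unchanged, which is why the claim in (f) fails.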
Home problems
4A3 (Apartment sizes.) In town A there are 361 apartments, and in town B 248 apartments.
The following histograms describe the size distributions (in square meters, “neliömetriä”).
[Figure: two histograms of apartment sizes, town A ("Taajama A") and town B ("Taajama B"). Horizontal axes: square meters. Vertical axes: percent per square meter, from 0.0 to 3.0.]
Answer the following questions by using the histograms. Assume, for simplicity, that no
apartment has area exactly at a bin boundary.
Grading. (a) 1 p. for correct frequency with reasonable precision (within 82 ± 8). Otherwise
0 p.
(b) 1 p. for correct answer with a somewhat reasonable argument. Bar areas must be taken into
account (heights not enough). Correct answer to the last question (“additional assumptions”)
not required for the point.
Solution.
(a) In town B, the bar heights on intervals 80–100 and 100–120 are approximately 1.4 and
0.25. Multiplying by the interval lengths, we get the bar areas. Adding them up we get
the relative frequency of apartments of 80 m² or more as 1.4 · 20 % + 0.25 · 20 % = 33 %,
which corresponds to about 0.33 · 248 ≈ 82 apartments.
(b) We can start from either end of the distribution, and try to find how many bars are
needed to cover 50%. In this case we start from the right end because the bars seem
bigger. (This is just a matter of convenience.) When we exceed 50%, we know that the
median is within the bar where that happened.
In town B, starting from the right, we find bars of heights approximately 0.25, 1.4, 2.75,
and widths 20, 20, 10. By multiplication, we get relative frequencies 5%, 28%, 27.5%. The
two rightmost bars totalled only 33%, but with the third bar we have over 50%, so the
median is certainly within [70, 80].
In town A, starting from the right, we find approximate heights 0.1, 0.75, 2.1 and thus
areas 2%, 15%, 21%. Their total is 38% < 50%, so the median must be lower than 70 m².
According to this calculation, the median is larger in town B. In this case, we did not need
to assume uniform distribution within the bars. Even if (and when) the distributions are
non-uniform, we know that the median of town B is somewhere within [70, 80].
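The bar-area bookkeeping for both towns can be sketched as below. The heights are the approximate values read off the histograms in the solution, the widths 20, 20, 10 are those used there, and the helper name `areas_from_right` is hypothetical.

```python
def areas_from_right(heights, widths):
    """Running total of bar areas (in %), adding bars from the right.
    Heights are in % per square meter, widths in square meters."""
    running = 0.0
    totals = []
    for height, width in zip(heights, widths):
        running += height * width
        totals.append(running)
    return totals


# Approximate bar heights, rightmost bar first.
town_b = areas_from_right([0.25, 1.4, 2.75], [20, 20, 10])
town_a = areas_from_right([0.1, 0.75, 2.1], [20, 20, 10])

# (a) Apartments of 80 square meters or more in town B (248 apartments).
count_b = town_b[1] / 100 * 248

print(f"town B cumulative: {[round(t, 1) for t in town_b]}")
print(f"town A cumulative: {[round(t, 1) for t in town_a]}")
print(f"town B, 80 or more: about {count_b:.0f} apartments")
```

In town B the running total first exceeds 50% at the third bar, the interval [70, 80], while in town A the same three bars reach only 38%, so the median of town A lies further to the left.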
4A4 (Brick machine) A machine makes bricks, whose mass varies randomly. The masses of a
sample of 25 bricks are given in the table below.
5.18 3.75 3.59 3.19 4.38
4.10 6.93 6.23 5.22 5.02
3.94 6.13 5.68 4.42 5.42
5.14 3.69 6.24 6.41 4.56
4.83 4.97 3.38 5.47 5.61
(a) Find the following statistics of the data set: minimum, median and maximum.
(b) Find the mean and standard deviation of the bricks in the shaded cells.
Grading. (a) 0.5 points for correct median; 0.5 points if minimum and maximum both correct.
(b) 0.5 points for each correct value. For standard deviation, we accept here either the “standard
deviation” (divisor n) or the “sample standard deviation” (divisor n − 1).
Total points rounded up.
Solution.
(a) One method is to arrange (sort) all of the data in increasing order. Then the first value
3.19 is the minimum, the centermost (13th) value 5.02 is the median, and the last value
6.93 is the maximum.
There are other methods that do not require you to arrange all of the data in order. To find
the minimum, you can just scan through the data once and keep track of the smallest
value you have seen so far; whenever you see something smaller, update your “minimum”. After
going through the data once, you have found the minimum. Finding the maximum is obviously
similar.
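The single-pass scan described above might look like this. It is a minimal sketch using the 25 masses from the table; in practice Python's built-in `min` and `max` do the same job.

```python
# Masses of the 25 bricks, row by row from the table.
masses = [
    5.18, 3.75, 3.59, 3.19, 4.38,
    4.10, 6.93, 6.23, 5.22, 5.02,
    3.94, 6.13, 5.68, 4.42, 5.42,
    5.14, 3.69, 6.24, 6.41, 4.56,
    4.83, 4.97, 3.38, 5.47, 5.61,
]

# Start with the first value as both the smallest and the largest so far.
smallest = largest = masses[0]

# One pass over the remaining values, updating the running extremes.
for mass in masses[1:]:
    if mass < smallest:
        smallest = mass
    if mass > largest:
        largest = mass

print(f"minimum {smallest}, maximum {largest}")
```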
The median is slightly harder. Borrowing ideas from the previous histogram exercises, you can first
count the data points by their integer part (bins of length 1). Because there are 6 points within
[3, 4) and 6 points within [4, 5), only 12 values are below 5, so the median (the 13th value) is not
among them: it is the smallest value within [5, 6).
For finding the median from larger data, there are algorithmic solutions that avoid the full
sorting. Look them up if you are interested in algorithms.
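The bin-counting idea can be sketched as follows: count the values by integer part, locate the bin that contains the 13th smallest value, and sort only that bin. This is one possible illustration for an odd number of positive values, not a general-purpose selection algorithm.

```python
from collections import Counter

# Masses of the 25 bricks, row by row from the table.
masses = [
    5.18, 3.75, 3.59, 3.19, 4.38,
    4.10, 6.93, 6.23, 5.22, 5.02,
    3.94, 6.13, 5.68, 4.42, 5.42,
    5.14, 3.69, 6.24, 6.41, 4.56,
    4.83, 4.97, 3.38, 5.47, 5.61,
]

rank = (len(masses) + 1) // 2   # the 13th smallest value when n = 25

# Count values by integer part; int() truncates, fine for positive data.
bin_counts = Counter(int(mass) for mass in masses)

# Walk through the bins in increasing order until the rank is reached.
below = 0
for median_bin in sorted(bin_counts):
    if below + bin_counts[median_bin] >= rank:
        break
    below += bin_counts[median_bin]

# Sort only the values in the bin that contains the median.
candidates = sorted(m for m in masses if int(m) == median_bin)
median = candidates[rank - below - 1]

print(f"bin counts: {dict(sorted(bin_counts.items()))}")
print(f"median: {median}")
```

Only the 8 values in [5, 6) are sorted here instead of all 25; for large data sets the saving can be substantial.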
The so-called sample standard deviation of the shaded values is $(1 - 1/8)^{-1/2}\,\mathrm{sd} \approx 1.11$.
Different software packages have different default conventions; Python (NumPy) calculates the
standard deviation, and R calculates the sample standard deviation. The ratio between these
quantities is $\sqrt{(n-1)/n} = \sqrt{1 - 1/n}$, which tends towards 1 as $n$ increases.
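The relation between the two conventions can be illustrated with Python's standard library, where `statistics.pstdev` uses divisor n and `statistics.stdev` uses divisor n − 1. Note that this sketch uses all 25 masses as illustrative input, because the shading of the table cells is not reproduced here; its output is therefore not the answer to part (b).

```python
from math import sqrt
from statistics import pstdev, stdev

# All 25 masses from the table (NOT the shaded subset of part (b)).
masses = [
    5.18, 3.75, 3.59, 3.19, 4.38,
    4.10, 6.93, 6.23, 5.22, 5.02,
    3.94, 6.13, 5.68, 4.42, 5.42,
    5.14, 3.69, 6.24, 6.41, 4.56,
    4.83, 4.97, 3.38, 5.47, 5.61,
]
n = len(masses)

sd_divisor_n = pstdev(masses)         # "standard deviation", divisor n
sd_divisor_n_minus_1 = stdev(masses)  # "sample standard deviation", divisor n - 1

ratio = sd_divisor_n / sd_divisor_n_minus_1
print(f"ratio:           {ratio:.6f}")
print(f"sqrt(1 - 1/n):   {sqrt(1 - 1 / n):.6f}")
```

With n = 25 the two values differ by about 2%, whereas with the n = 8 shaded values the difference is closer to 7%, so the choice of convention matters more for small samples.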