4A Graphs and Statistics of Data Sets: Class Problems

This document contains an exercise on analyzing grouped age distribution data from Finland. It asks students to: 1) Draw a histogram of the age distribution data and calculate median, mean ages. 2) Draw quantile plots and calculate quartiles for sample data sets. 3) Evaluate whether some claims about relationships between mean, median, quartiles are always true. The key assumptions made in analyzing the grouped data and the validity of these assumptions are also discussed.

MS-A0503 First course in probability and statistics J Kohonen

Department of mathematics and systems analysis Spring 2020


Aalto SCI Exercise 4A

4A Graphs and statistics of data sets


Class problems
4A1 (Grouped data) This table represents the age distribution of Finland on 31.12.2015. The
table lists age groups in whole years, but in this exercise we treat ages as real numbers, so
someone in the age group 0–14 might be 14.7 years old.

Age (years)    Frequency
0–14             896 023
15–24            640 387
25–44          1 363 155
45–64          1 464 640
65–74            642 428
75–              480 675
(Source: Tilastokeskus)

(a) Draw a histogram of the grouped data. The units of the horizontal and vertical axes
should be years and %/year, respectively. You can assume the last bin ends at 110.

Try to answer the following questions by using the grouped data.

(b) Which are more common in the population, 1-year-olds or 66-year-olds?

(c) What is the median age of the population?

(d) What is the average age of the population?

In (b)–(d), did you have to make additional assumptions? If yes, how valid do you think they
are?

Solution.

(a) The first bar should contain the Finns of ages 0–14 years (an interval that covers 15
years). The number of Finns in this group is 896023, and their relative frequency is
896023/5487308 ≈ 16.3%. So we should draw the bar with height 16.3/15 ≈ 1.09 (% per
year). We get the following histogram. Adding up the areas we get 100% as we should.


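[Figure: histogram of the age distribution. Horizontal axis: age in years, with bin edges at 0, 15, 25, 45, 65, 75 and 110. Vertical axis: percent per year, from 0 to 1.6. Bar areas from left to right: 16.3%, 11.7%, 24.8%, 26.7%, 11.7%, 8.8%.]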
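
The bar heights in (a) can be recomputed with a short script. This is a minimal sketch, not part of the original solution; the variable names are illustrative, and the last bin is assumed to end at 110 as the exercise allows.

```python
# Bin edges in years; the last bin is ASSUMED to end at 110 (allowed by the exercise).
edges = [0, 15, 25, 45, 65, 75, 110]
# Frequencies from the table (Tilastokeskus, 31.12.2015).
counts = [896023, 640387, 1363155, 1464640, 642428, 480675]

total = sum(counts)

# Relative frequency of each age group, in percent of the whole population.
rel_freq = [100 * c / total for c in counts]

# Bar height = relative frequency / bin width, in percent per year.
widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]
heights = [f / w for f, w in zip(rel_freq, widths)]

for lo, hi, f, h in zip(edges[:-1], edges[1:], rel_freq, heights):
    print(f"{lo:>3}-{hi:<3}  {f:5.1f} %  {h:4.2f} %/year")

# The bar areas (height times width) should add up to 100 %.
area = sum(h * w for h, w in zip(heights, widths))
```

Dividing by the bin width is what makes bars of different widths comparable: the area of a bar, not its height, represents the share of the population.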

(b) The bar of 0–14 has height 1.09 % per year, and the bar of 65–74 has height 1.17 % per
year. From this data we do not really know the age distribution within each group, but
if we assume it is approximately uniform, we would say that 1-year-olds are about 1.09%
of the population and 66-year-olds are about 1.17%. Under this assumption, 66-year-olds
are more common.

(c) The first two bars cover 28% of the population. The first three bars cover 52.8% of the
population. So we certainly know that the median is somewhere between 25 and 45 years.
To find the median, we seek a point m such that 50% of population lies below m.
We assume again that the age distribution between 25 and 45 years is uniform. Because
the first two bars already cover 28% of the population, we need to find 22% more (of
the whole population) from the third group; or 22/24.8 ≈ 0.8871 of the third age group.
Since the third age group is 20 years long, we take the lowest 0.8871 · 20 = 17.7 years of
the third group.
Our estimate is that the median is at 25+17.7 = 42.7 years.
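
The interpolation in (c) can be sketched in code as follows. The function name `grouped_quantile` is illustrative, and the uniform-within-bin assumption is the same one made in the text. Because the script uses unrounded percentages, its result may differ from the hand calculation in the second decimal.

```python
edges = [0, 15, 25, 45, 65, 75, 110]   # last edge 110 is an assumption
counts = [896023, 640387, 1363155, 1464640, 642428, 480675]

def grouped_quantile(edges, counts, p):
    """Estimate the p-quantile of grouped data, ASSUMING the values are
    uniformly distributed inside each bin."""
    target = p * sum(counts)   # how many people must lie below the quantile
    below = 0                  # people in the bins already passed
    for lo, hi, c in zip(edges[:-1], edges[1:], counts):
        if c > 0 and below + c >= target:
            # Take the needed fraction of this bin, measured from its left edge.
            return lo + (target - below) / c * (hi - lo)
        below += c
    return edges[-1]

median = grouped_quantile(edges, counts, 0.5)
print(f"estimated median: {median:.1f} years")
```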

(d) Again, assuming that the age distribution is uniform in each group, the average age within
the first group is (0+15)/2 = 7.5 years, within the second group (15+25)/2 = 20 years,


and so on. To find the average age of all Finns, we need to find the sum of their ages,
and then divide by the total count. Adding up the ages in each group (by our uniformity
assumption), we get the average
$$\frac{896023 \cdot 7.5 + 640387 \cdot 20 + 1363155 \cdot 35 + 1464640 \cdot 55 + 642428 \cdot 70 + 480675 \cdot 92.5}{5487308},$$
which is about 43.2 years. This is the weighted average of the group averages, where
weights are the frequencies of the groups. Alternatively, we could have used the relative
frequencies as weights.
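
As a check of (d), the weighted average of the bin midpoints can be computed directly. This is a sketch under the same uniformity assumption; it also confirms that using relative frequencies as weights gives the same value.

```python
edges = [0, 15, 25, 45, 65, 75, 110]   # last edge 110 is an assumption
counts = [896023, 640387, 1363155, 1464640, 642428, 480675]

# Under the uniformity assumption, the average age within a bin is its midpoint.
midpoints = [(lo + hi) / 2 for lo, hi in zip(edges[:-1], edges[1:])]

total = sum(counts)

# Weighted average with the frequencies as weights.
mean_age = sum(c * m for c, m in zip(counts, midpoints)) / total

# Same average with relative frequencies (which sum to 1) as weights.
rel_freq = [c / total for c in counts]
mean_age_rel = sum(f * m for f, m in zip(rel_freq, midpoints))

print(f"estimated mean age: {mean_age:.1f} years")
```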

We assumed that the age distribution is approximately uniform within each group. Certainly
the true distribution is not exactly uniform, but without more data it is difficult to say how
good the approximation is. But one might guess that in the last two groups, the true density
probably decreases strongly towards the right end.
In fact, Tilastokeskus also gives more detailed statistics. We find that the number of 1-year-olds
was 58 008 (≈ 1.06 %) and the number of 66-year-olds was 76 975 (≈ 1.40 %). So the numbers from
the histogram were pretty good for 1-year-olds, but underestimated the proportion of 66-year-olds.


4A2 (Quantiles) The R software defines the quantile function of data $x = (x_1, \dots, x_n)$ as
follows. Let $x_{(1)}$ = the smallest number in the data, $x_{(2)}$ = second smallest, etc. Thus we have
ordered data $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. Then the horizontal unit interval [0, 1] is divided into
n − 1 equal parts, at points $p_k = (k - 1)/(n - 1)$, k = 1, ..., n. The quantile function is defined
by drawing points $(p_k, x_{(k)})$ and connecting them with straight line segments.
Draw (on paper by hand) the quantile functions of the following data sets, and for each
data set, determine the lower quartile Q(0.25), median Q(0.50) and upper quartile Q(0.75):
(a) x = (1000, 2000, 5000, 9000),

(b) x = (1000, 2000, 2000, 8000, 9000),

(c) x = (1, 20, 1, 5, 1).


Then consider the following claims. For each claim, either argue why the claim is true (for all
data sets), or show it false by a counterexample.
(d) The mean and median of a data set are always equal.

(e) The lower quartile is always smaller or equal to the median.

(f) The lower quartile is always smaller or equal to the mean.

Solution.
(a) Graph below.
Lower quartile Q(0.25) = 1750, median Q(0.50) = 3500, upper quartile Q(0.75) = 6000.

(b) Graph below.
Lower quartile Q(0.25) = 2000, median Q(0.50) = 2000, upper quartile Q(0.75) = 8000.

(c) Graph below.
Lower quartile Q(0.25) = 1, median Q(0.50) = 1, upper quartile Q(0.75) = 5.

[Figure: three panels showing the quantile functions of data sets (a), (b) and (c) as points connected by straight line segments. Horizontal axis in each panel: p from 0.0 to 1.0. Vertical axis: data values.]

(d) False. Any of (a)–(c) is a counterexample.

(e) True, because the quantile function is nondecreasing. (The kth point cannot be smaller
than the (k − 1)th point, because the data was sorted in increasing order.)


(f) False. One very small value is enough to make the average very small, below the lower
quartile. Consider, for example, the data x = (−1000, 1, 1, 5, 20). It has the same quartiles
as (c), but its average −194.6 is far below the lower quartile.
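
The quantile function defined in this exercise can be sketched in a few lines of Python. The function name `quantile` is illustrative; the code follows the definition given in the problem statement (points at $p_k = (k-1)/(n-1)$ joined by straight segments) and can be used to check the answers to (a)–(c) and the counterexample in (f).

```python
def quantile(data, p):
    """Piecewise-linear quantile function: the k-th smallest value sits at
    p_k = (k - 1)/(n - 1), and neighbouring points are joined by straight lines."""
    xs = sorted(data)
    n = len(xs)
    if n == 1:
        return xs[0]
    pos = p * (n - 1)          # fractional 0-based position of p among the points
    k = min(int(pos), n - 2)   # index of the left end of the segment containing p
    frac = pos - k             # how far along that segment p lies
    return xs[k] + frac * (xs[k + 1] - xs[k])

def quartiles(data):
    return [quantile(data, p) for p in (0.25, 0.50, 0.75)]

a = [1000, 2000, 5000, 9000]
b = [1000, 2000, 2000, 8000, 9000]
c = [1, 20, 1, 5, 1]
f = [-1000, 1, 1, 5, 20]       # counterexample used in (f)

for name, data in (("a", a), ("b", b), ("c", c), ("f", f)):
    mean = sum(data) / len(data)
    print(name, quartiles(data), "mean:", mean)
```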


Home problems
4A3 (Apartment sizes.) In town A there are 361 apartments, and in town B 248 apartments.
The following histograms describe the size distributions (in square meters, “neliömetriä”).

[Figure: two histograms of apartment sizes, one for town A ("Taajama A") and one for town B ("Taajama B"). Horizontal axis: square meters ("Neliömetriä"), from 20 to 120. Vertical axis: percent per square meter ("Prosenttia/Neliömetri"), from 0.0 to 3.0.]

Answer the following questions by using the histograms. Assume, for simplicity, that no
apartment has area exactly at a bin boundary.

(a) How many apartments in town B have area at least 80 m²?


(b) In which town is the median area larger? Did you have to make additional assumptions
about the distribution to answer this question?

Grading. (a) 1 p. for correct frequency with reasonable precision (within 82 ± 8). Otherwise
0 p.
(b) 1 p. for correct answer with a somewhat reasonable argument. Bar areas must be taken into
account (heights not enough). Correct answer to the last question (“additional assumptions”)
not required for the point.

Solution.
(a) In town B, the bar heights on intervals 80–100 and 100–120 are approximately 1.4 and
0.25. Multiplying by the interval lengths, we get the bar areas. Adding them up we get
the relative frequency of apartments of 80 m² or more as

1.4 × 20 + 0.25 × 20 = 33%.

Because town B contains 248 apartments in total, the frequency of 80 m² or more is
approximately 248 × 0.33 ≈ 82.


(b) We can start from either end of the distribution, and try to find how many bars are
needed to cover 50%. In this case we start from the right end because the bars seem
bigger. (This is just a matter of convenience.) When we exceed 50%, we know that the
median is within the bar where that happened.
In town B, starting from the right, we find bars of heights approximately 0.25, 1.4, 2.75,
and widths 20, 20, 10. By multiplication, we get relative frequencies 5%, 28%, 27.5%. The
two rightmost bars totalled only 33%, but with the third bar we have over 50%, so the
median is certainly within [70, 80].
In town A, starting from the right, we find approximate heights 0.1, 0.75, 2.1 and thus
areas 2%, 15%, 21%. Their total is 38% < 50%, so the median must be lower than 70 m².
According to this calculation, the median is larger in town B. In this case, we did not need
to assume uniform distribution within the bars. Even if (and when) the distributions are
non-uniform, we know that the median of town B is somewhere within [70, 80].
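
The bar-area bookkeeping in (a) and (b) can be sketched as follows. The heights are the approximate values read off the histograms in the solution above, so the results are only as precise as those readings; the helper name `cumulative_areas` is illustrative.

```python
# (height in % per square meter, width in square meters), listed from the RIGHT end.
# Heights are approximate readings from the histograms, as in the solution text.
town_b = [(0.25, 20), (1.4, 20), (2.75, 10)]   # bins 100-120, 80-100, 70-80
town_a = [(0.1, 20), (0.75, 20), (2.1, 10)]    # same bins, town A

def cumulative_areas(bars):
    """Running total of bar areas (relative frequencies in percent)."""
    total = 0.0
    running = []
    for height, width in bars:
        total += height * width
        running.append(total)
    return running

cum_b = cumulative_areas(town_b)
cum_a = cumulative_areas(town_a)

# (a) Share of town B at 80 square meters or more = the two rightmost bars.
share_b_80_plus = cum_b[1]
count_b_80_plus = 248 * share_b_80_plus / 100

# (b) Town B passes 50 % inside the 70-80 bar; town A has not reached 50 % by 70.
print(cum_b, cum_a, round(count_b_80_plus))
```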

4A4 (Brick machine) A machine makes bricks, whose mass varies randomly. The masses of a
sample of 25 bricks are given in the table below.
5.18 3.75 3.59 3.19 4.38
4.10 6.93 6.23 5.22 5.02
3.94 6.13 5.68 4.42 5.42
5.14 3.69 6.24 6.41 4.56
4.83 4.97 3.38 5.47 5.61

(a) Find the following statistics of the data set: minimum, median and maximum.

(b) Find the mean and standard deviation of the bricks in the shaded cells (the first
column and the first three values of the second column).

Grading. (a) 0.5 points for correct median; 0.5 points if minimum and maximum both correct.
(b) 0.5 points for each correct value. For standard deviation, we accept here either the “standard
deviation” (divisor n) or the “sample standard deviation” (divisor n − 1).
Total points rounded up.

Solution.
(a) One method is to arrange (sort) all of the data in increasing order. Then the first value
3.19 is the minimum, the centermost (13th) value 5.02 is the median, and the last value
6.93 is the maximum.
There are other methods that do not require you to arrange all of the data in order. To find
the minimum, you can just scan through the data once and keep track of the smallest
value you have seen so far; whenever you see something smaller, update your “minimum”. After
going through the data once, you have found the minimum. Finding the maximum is
similar.


Median is slightly harder. Borrowing ideas from the previous histogram exercises, you can first
count the data points by their integer part (bins of length 1). Because there are 6 points within
[3, 4) and 6 points within [4, 5), the median is not yet among them. You can then find it easily.
For finding the median from larger data, there are algorithmic solutions that avoid the full
sorting. Look them up if you are interested in algorithms.
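
The two ideas described above (a single scan for the minimum and maximum, and counting by integer part to locate the median) can be sketched like this. It is an illustration of the approach in the text, not an optimized algorithm, and it assumes an odd number of data points.

```python
import math

masses = [
    5.18, 3.75, 3.59, 3.19, 4.38,
    4.10, 6.93, 6.23, 5.22, 5.02,
    3.94, 6.13, 5.68, 4.42, 5.42,
    5.14, 3.69, 6.24, 6.41, 4.56,
    4.83, 4.97, 3.38, 5.47, 5.61,
]

# Minimum and maximum in one pass, remembering the extremes seen so far.
smallest = largest = masses[0]
for m in masses[1:]:
    if m < smallest:
        smallest = m
    if m > largest:
        largest = m

# Group the values into bins of length 1 by their integer part.
bins = {}
for m in masses:
    bins.setdefault(math.floor(m), []).append(m)

# For odd n the median is the value of rank (n + 1) / 2, here the 13th smallest.
rank = len(masses) // 2 + 1
seen = 0
for key in sorted(bins):
    if seen + len(bins[key]) >= rank:
        # Only this one bin needs to be sorted.
        median = sorted(bins[key])[rank - seen - 1]
        break
    seen += len(bins[key])

print(smallest, median, largest)
```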

(b) The average of the shaded values is

$$m = \frac{1}{8}(5.18 + 4.10 + 3.94 + 5.14 + 4.83 + 3.75 + 6.93 + 6.13) = 5.00.$$

Their variance is

$$\mathrm{sd}^2 = \frac{1}{8}\left((5.18 - 5.00)^2 + \cdots + (6.13 - 5.00)^2\right) = 1.07235,$$

thus the standard deviation is

$$\mathrm{sd} = \sqrt{\mathrm{sd}^2} = \sqrt{1.07235} = 1.035543 \approx 1.04.$$

The so-called sample standard deviation of the shaded values is $(1 - 1/8)^{-1/2}\,\mathrm{sd} \approx 1.11$.
Different computer software have different conventions by default; Python (NumPy) calculates the
standard deviation, and R calculates the sample standard deviation. The ratio between these
quantities is $\sqrt{\frac{n-1}{n}} = \sqrt{1 - \frac{1}{n}}$, which tends towards 1 as n increases.
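
Both conventions are easy to compare in code. The sketch below uses Python's standard `statistics` module rather than NumPy: `pstdev` uses divisor n and `stdev` uses divisor n − 1. In NumPy, the same choice is made with the `ddof` argument of `np.std`, whose default `ddof=0` corresponds to divisor n.

```python
import math
import statistics

# The eight shaded values used in the solution above.
shaded = [5.18, 4.10, 3.94, 5.14, 4.83, 3.75, 6.93, 6.13]
n = len(shaded)

mean = statistics.mean(shaded)
sd = statistics.pstdev(shaded)         # "standard deviation", divisor n
sample_sd = statistics.stdev(shaded)   # "sample standard deviation", divisor n - 1

# The ratio of the two depends only on n and approaches 1 as n grows.
ratio = sd / sample_sd
expected_ratio = math.sqrt((n - 1) / n)

print(f"mean {mean:.2f}, sd {sd:.2f}, sample sd {sample_sd:.2f}, ratio {ratio:.4f}")
```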
