Descriptive Statistics Tutorial 1 Solutions
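(a) $\sum_{i=1}^{5} x_i = 2 + 2 + 5.5 + (-1) + 0 = 8.5$
(b) $\sum_{i=4}^{7} x_i = (-1) + 0 + 7 + 1 = 7$
(c) $\sum_{i=1}^{3} x_i^2 = 2^2 + 2^2 + (5.5)^2 = 38.25$
(d) $\left(\sum_{i=1}^{3} x_i\right)^2 = (2 + 2 + 5.5)^2 = (9.5)^2 = 90.25$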
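The four sums above can be checked numerically. The following is a minimal Python sketch, not part of the original solutions: the list `x` holds the values x_1 to x_7 as inferred from the worked sums (including x_4 = -1, which is an assumption), and the helper name `partial_sum` is hypothetical. Python lists are 0-indexed, so the helper converts from the 1-based index used in the notation.

```python
# Values x_1..x_7 as inferred from the worked sums (assumption).
x = [2, 2, 5.5, -1, 0, 7, 1]


def partial_sum(values, start, stop, power=1):
    """Sum values[i]**power for i = start..stop (1-based, inclusive)."""
    return sum(v ** power for v in values[start - 1:stop])


sum_a = partial_sum(x, 1, 5)            # (a) sum of x_i for i = 1..5
sum_b = partial_sum(x, 4, 7)            # (b) sum of x_i for i = 4..7
sum_c = partial_sum(x, 1, 3, power=2)   # (c) sum of the squares
sum_d = partial_sum(x, 1, 3) ** 2       # (d) square of the sum

print(sum_a, sum_b, sum_c, sum_d)
```

Parts (c) and (d) use the same three values but give different answers, because squaring each term before summing is not the same as squaring the total.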
To illustrate, we shall consider a study of proneness to Acute Respiratory Infection, where the following data
were obtained:
                      Sex
Prone                 male (column 1)   female (column 2)   Total
not prone (row 1)     107               124                 231
prone (row 2)         157               101                 258
Total                 264               225                 489
Note: It is common for i to index the rows and j to index the columns. As a result, x12 would represent the
number in row 1, column 2, i.e. not prone and female = 124.
Using data from the Acute Respiratory Infection study above, calculate the following:
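(a) $\sum_{j=1}^{2} x_{1j} = x_{11} + x_{12} = 107 + 124 = 231$
(b) $\sum_{i=1}^{2} \sum_{j=1}^{2} x_{ij} = \sum_{i=1}^{2} (x_{i1} + x_{i2}) = x_{11} + x_{12} + x_{21} + x_{22} = 107 + 124 + 157 + 101 = 489$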
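The row and double sums can be reproduced with a nested list, where the outer index plays the role of i (rows) and the inner index the role of j (columns). This is a minimal sketch in Python; the variable names are illustrative only.

```python
# Counts from the Acute Respiratory Infection table:
# rows i = not prone, prone; columns j = male, female.
x = [
    [107, 124],  # row 1: not prone
    [157, 101],  # row 2: prone
]

# (a) sum over j of x_1j: the total for row 1 (not prone).
row1_total = sum(x[0])

# (b) double sum over i and j: the grand total.
grand_total = sum(sum(row) for row in x)

# Column totals: sum over i for each fixed j.
col_totals = [sum(row[j] for row in x) for j in range(2)]

print(row1_total, grand_total, col_totals)
```

The results should agree with the margins of the table: the row 1 total, the overall total, and the male and female column totals.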
3. Using more than one set of data
(a) $\sum_{i=1}^{3} x_i y_i = x_1 y_1 + x_2 y_2 + x_3 y_3 = (2 \times 1) + (-3 \times 0) + (6 \times -1) = 2 + 0 - 6 = -4$
(b) $\left(\sum_{i=1}^{3} x_i\right)\left(\sum_{i=1}^{3} y_i\right) = (2 - 3 + 6)(1 + 0 - 1) = 5 \times 0 = 0$
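The difference between a sum of products and a product of sums can be sketched in a few lines of Python. The signs of the data values below are an assumption, chosen so that the arithmetic in the worked answers holds; the variable names are illustrative.

```python
# Assumed data: signs chosen so that the worked sums hold.
x = [2, -3, 6]
y = [1, 0, -1]

# (a) multiply pairwise first, then add.
sum_of_products = sum(xi * yi for xi, yi in zip(x, y))

# (b) add each list first, then multiply the two totals.
product_of_sums = sum(x) * sum(y)

print(sum_of_products, product_of_sums)
```

The two quantities generally differ, so the position of the summation sign relative to the product matters.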
Let’s say we have three recordings of systolic blood pressure (in mmHg) on the same individual, i.e. $x_1 = 120$, $x_2 = 130$, $x_3 = 125$.
(a) Calculate the arithmetic mean for these data and comment on why this may be an informative measure.
mean = $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3}{n} = \frac{120 + 130 + 125}{3} = 125$ mmHg
The arithmetic mean may be a good estimate for the true mean systolic blood pressure but we need to know
a bit more about the data (i.e. under what circumstances were they collected? At times that were close
together, or say in low, medium and high stress environments respectively? You can probably think of other
ways the data could have arisen...)
(b) Calculate the variance of the blood pressure recordings (in mmHg²) and comment on why this may be an
informative measure.
variance = $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(120 - 125)^2 + (130 - 125)^2 + (125 - 125)^2}{3 - 1} = \frac{25 + 25 + 0}{2} = 25$ mmHg²
The variance may be informative as it gives us an idea of how the data may vary about its mean. However,
its accuracy may be dubious with only three observations - and again, its relevance depends on the context
of the data. On its own, the value 25 means little to us (at this stage in the course) unless we compare it,
for instance, with the variance for another person whose recordings vary less or more (e.g. 10 mmHg²
or 50 mmHg²).
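The mean and variance calculations for the three recordings can be reproduced as follows. This is a minimal Python sketch that spells out the formulas by hand and then compares them with the standard library's `statistics.mean` and `statistics.variance` (the latter also uses the n - 1 divisor).

```python
from statistics import mean, variance

bp = [120, 130, 125]  # systolic blood pressure recordings in mmHg

n = len(bp)
x_bar = sum(bp) / n                               # arithmetic mean
s2 = sum((x - x_bar) ** 2 for x in bp) / (n - 1)  # sample variance, divisor n - 1

# The standard library functions give the same answers.
assert x_bar == mean(bp)
assert s2 == variance(bp)

print(x_bar, s2)
```

Dividing by n - 1 rather than n matches the worked answer above; with only three observations the choice of divisor changes the result noticeably.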
5. Classification of variables
Classify the following variables as either quantitative (then either discrete or continuous) or categorical (then
nominal or ordinal).
(a) A ‘Likert’ scale used in an opinion poll taking values 1 to 5, (where 1 = strongly disagree, ... ,
3 = agree, ... , 5 = strongly agree).
Definition: a widely used questionnaire format named after its developer, Rensis Likert. Respondents are
asked to choose one response from a range such as ‘strongly agree’, ‘agree’, ‘undecided’, ‘disagree’, and
‘strongly disagree’. Each response is assigned a number. The five-point Likert scale is the most common.
This is a qualitative variable since the actual numbers you use are not important; you could equally state
that A = strongly disagree, ... , C = agree, ... , E = strongly agree.
There is however a clear ordering apparent so the variable is ordinal.
This variable is truly numerical, and as a result, quantitative. In addition it is discrete as it will typically only
take whole numbers [although one could argue continuous].
This variable is also truly numerical, and as a result, quantitative. In addition it is continuous as its
accuracy is only limited by the measuring equipment.
This is a qualitative variable since colour of hair is clearly categorical. There is also no clear ordering of
colour so the variable is nominal.
6. Summary measures
Assuming that these babies are a random sample of those born in Australia between midnight and 7am on
18 December 1997, use the sample of birth weights for the nine babies to address the following:
(a) Determine the sample mean, sample median and sample mode.
Note: it is also perfectly acceptable to have one decimal place for your estimates but for the purposes of this
course we suggest you include one more decimal place than that of the original data.
(b) Discuss which quantity you believe is the most informative measure of central tendency for birth weight
in this example.
The sample median would be most appropriate here since the data are left skewed, i.e. a histogram shows a longer tail towards the lower birth weights.
If you weren’t sure about the skewness in the data, there could be some argument for the sample mean,
since the sample size is small and, under these circumstances, estimates of the mean would be less
variable than estimates of the median.
In other words, if we estimated the mean, median and mode for many samples of the same size from the
same population, we would expect the estimates of the mean to be closer together (less variable) than
- the estimates of the median, and
- the estimates of the mode.
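The repeated-sampling argument above can be explored with a small simulation. The sketch below is illustrative only: it assumes a hypothetical normal population with mean 3.3 kg and standard deviation 0.5 kg (these parameters are not taken from the tutorial data), draws many samples of size nine, and compares how much the sample means and sample medians vary. The mode is left out because it is rarely well defined for continuous measurements.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

N_SAMPLES = 5000   # number of repeated samples (arbitrary choice)
SAMPLE_SIZE = 9    # same size as the birth weight sample

means, medians = [], []
for _ in range(N_SAMPLES):
    # Hypothetical population: normal with mean 3.3 kg and SD 0.5 kg.
    sample = [random.gauss(3.3, 0.5) for _ in range(SAMPLE_SIZE)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

# Spread of each estimator across the repeated samples.
sd_of_means = statistics.stdev(means)
sd_of_medians = statistics.stdev(medians)

print(f"SD of sample means:   {sd_of_means:.3f}")
print(f"SD of sample medians: {sd_of_medians:.3f}")
```

Under this assumed normal population the means should vary less than the medians. For a strongly skewed or heavy-tailed population the comparison can come out differently, which is why the skewness of the data matters for the choice in part (b).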
(c) Calculate the 25th and 75th percentiles for birth weight (overall).
Following the method of Bland on page 49, the median (second quartile) is given by the value
corresponding to index 0.5 x (9+1) = index 5 and so the median is 3.30 (as we already saw in part a).
The 25th percentile (first quartile) is given by 0.25 x (9+1) = index 2.5 and so the 25th percentile is the
average of the index 2 and index 3 values, given by (2.2+2.8)/2 = 2.50. The 75th percentile is given by
0.75 x (9+1) = index 7.5, and so equal to (3.6+3.8)/2 = 3.70.
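The index rule used above, position = q x (n+1) in the sorted data, can be sketched as a small Python function. The function name `percentile_n_plus_1` is hypothetical, and resolving a fractional index by linear interpolation is an assumption that generalises the averaging step used above (for an index ending in .5 it reduces to the plain average of the two neighbouring values). The nine values in the example are made up for illustration and are not the tutorial's birth weights.

```python
import math


def percentile_n_plus_1(values, q):
    """Percentile at fraction q (0 < q < 1) using the q * (n + 1) index rule.

    Indices are 1-based positions in the sorted data. A fractional index is
    resolved by linear interpolation between the neighbouring values.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = q * (n + 1)
    if position < 1 or position > n:
        raise ValueError("index q * (n + 1) falls outside the data")
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


# Hypothetical sample of nine values (not the tutorial's birth weights).
sample = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0, 8.0, 4.0, 6.0]

q1 = percentile_n_plus_1(sample, 0.25)      # index 2.5
median = percentile_n_plus_1(sample, 0.50)  # index 5
q3 = percentile_n_plus_1(sample, 0.75)      # index 7.5

print(q1, median, q3)
```

Statistical packages use several different percentile conventions, so results from other software may differ slightly from this rule for small samples.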
(d) Calculate the sample mean birth weight for males and females separately and a measure of sample
variance for both.
(e) Discuss what the measures calculated in (d) indicate for differences in sex.
The mean weight for males is higher than that for females, but the variation is a lot smaller for males than
for females. So weights for male babies appear to be higher and fairly close to 3.4 kg. Weights for female
babies appear to be generally lower but show more variation, i.e. they vary more about the mean
value of 2.75 kg.
In later lectures we will learn methods for deciding whether such differences are significant or simply a
result of random variation.