MY464 - Lecture 3
Lecture 3
Descriptive statistics for continuous variables
Outline
Tables of grouped frequencies
Charts
Histograms
Stem and leaf plots
Measures of central tendency
Mode, median, mean
Skewed vs. symmetric distributions, outliers
Measures of dispersion
Range, inter-quartile range
Box plots
Variance and standard deviation
Today’s examples are mostly about describing one variable (‘univariate’)
In Week 5 we’ll look at bivariate scenarios (with inference!) where we compare groups or samples on a continuous variable
A data set about time use
Media time use in the UK
United Kingdom Time Use Survey, 2014-2015
Gershuny, J., Sullivan, O. (2017). United Kingdom Time Use Survey, 2014-2015. Centre for Time Use Research, University of Oxford. [data collection]. UK Data Service. SN: 8128, http://doi.org/10.5255/UKDA-SN-8128-1
Respondents fill out diaries of their activities for two days (ten minute intervals)
We will use data from the first diary day, for 7157 respondents aged 18 or over
Watching TV/Video (descriptive statistics)
Maximum = 20.33 hours
Minimum = 0.00 hours
Range = 20.33 hours
Pie chart
Frequency table...?
There are a lot of unique values
Produces a very long table!
Still somewhat informative, e.g. the cumulative percent column
We need different approaches to describe variables with this many observed values
Possibilities:
A table or a graph: first group the values in some way
Single-number summaries
Frequency table: continuous data grouped into intervals
Intervals or categories must be
Mutually exclusive: no case belongs to more than one interval
Exhaustive: categories cover all the values of the variable in the data set
Choice of group size is a trade-off between detail (many intervals) and readability (few intervals)
Note that the intervals here are of different widths
0 is only exactly 0 hours
0-2, 2-4, etc are 2 hours (note, these should really read 0.166-2, 2.166-4, etc)
More than 8 Hours includes values as high as 20
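The grouping rules above can be sketched in a few lines of Python. This is only an illustration: it uses the ten TV/Video values from the worked example later in the lecture rather than the full survey, and the helper name `interval_for` is hypothetical.

```python
from collections import Counter

# Ten TV/Video values (hours) from the worked example later in the lecture.
hours = [1.33, 6.33, 4.16, 4.50, 1.16, 3.00, 0.83, 1.00, 1.50, 0.00]

# Interval labels in display order.
LABELS = ["0", "0-2", "2-4", "4-6", "6-8", "More than 8"]


def interval_for(h):
    """Assign a value to exactly one interval (mutually exclusive, exhaustive)."""
    if h == 0:
        return "0"
    # Upper limits are inclusive, so exactly 2 hours falls in "0-2".
    for upper, label in zip((2, 4, 6, 8), LABELS[1:5]):
        if h <= upper:
            return label
    return "More than 8"


counts = Counter(interval_for(h) for h in hours)

# Build rows of (interval, frequency, percent, cumulative percent).
table = []
cumulative = 0
for label in LABELS:
    n = counts.get(label, 0)
    cumulative += n
    table.append((label, n, 100 * n / len(hours), 100 * cumulative / len(hours)))

for label, n, pct, cum_pct in table:
    print(f"{label:<12}{n:>4}{pct:>8.1f}{cum_pct:>8.1f}")
```

Every value lands in exactly one interval, so the frequencies add up to the number of cases and the cumulative percent ends at 100.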
Equal & larger number of intervals
Another approach:
Divide the variable into intervals of equal width, e.g. 1 hour
Called ‘bins’ or ‘bands’ in some software
Smaller intervals (e.g. here, 30 mins) give more detail, but perhaps at
the expense of readability
In our data set this produces more than 20 intervals to cover the full
range of observed values
A table with 20+ rows isn’t very easy to read
So a chart will probably still be better…
A histogram
Frequencies (or proportions/percentages) of responses on vertical (y) axis
Must start from zero
Values of the variable on the horizontal (x) axis
Each bar represents an interval
Missing data are excluded
Bars touch each other: continuous variable
Except where there are gaps in the data, e.g. no one watching 18-18.5 hours
Bars are of equal widths
More than 1200 respondents watch no TV (zero hours)
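As a rough illustration of how a histogram is built, the sketch below counts values in equal-width intervals and prints one text bar per interval. It assumes 1-hour bins that include their lower edge, and it uses the ten example values from later in the lecture rather than the survey data.

```python
from collections import Counter

# Ten TV/Video values (hours) from the worked example later in the lecture.
hours = [1.33, 6.33, 4.16, 4.50, 1.16, 3.00, 0.83, 1.00, 1.50, 0.00]

BIN_WIDTH = 1.0  # hours; an illustrative choice

# Assumed convention: each bin includes its lower edge and excludes its upper edge.
counts = Counter(int(h // BIN_WIDTH) for h in hours)

# Loop over every bin up to the highest one, so empty bins show up as gaps.
for b in range(max(counts) + 1):
    lower = b * BIN_WIDTH
    print(f"{lower:>4.1f}-{lower + BIN_WIDTH:<4.1f} {'#' * counts[b]}")
```

Halving `BIN_WIDTH` gives more detail at the cost of a longer, harder-to-read chart, which is the same trade-off as in the grouped frequency table.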
Examples of skewed distributions
Example of a (roughly) symmetrical distribution
Stem and leaf plot
A histogram made of digits
Here, TV & Video hours, rounded to 0.1 hours
Stem gives the number of hours
Leaves give the number of 0.1 hours
Stems and leaves are chosen for best fit to the data (hundreds, tens, units, decimals…)
Note that this stem and leaf plot shows the 0-0.5 and 0.5-1 range as separate stems
Here each digit corresponds to 14 cases (respondents) (“Each leaf: 14 case(s)”)
Stems
Leaves
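A stem and leaf plot can be sketched by splitting each rounded value into a stem (whole hours) and a leaf (tenths of an hour). The sketch below is a simplified version of what statistical software produces: it uses the ten example values from later in the lecture, one leaf per case, and one stem per whole hour (the plot in the lecture instead uses one leaf per 14 cases and splits some stems in two).

```python
from collections import defaultdict

# Ten TV/Video values (hours) from the worked example later in the lecture.
hours = [1.33, 6.33, 4.16, 4.50, 1.16, 3.00, 0.83, 1.00, 1.50, 0.00]

# Round to 0.1 hours and work in whole tenths to avoid floating-point issues.
tenths = sorted(round(h * 10) for h in hours)

# Stem = whole hours, leaf = the remaining tenths digit.
plot = defaultdict(str)
for t in tenths:
    stem, leaf = divmod(t, 10)
    plot[stem] += str(leaf)

# Print every stem up to the largest, including stems with no leaves.
for stem in range(max(plot) + 1):
    print(f"{stem:>2} | {plot.get(stem, '')}")
```

Reading across a row recovers the individual values: stem 1 with leaves 0, 2, 3 and 5 stands for 1.0, 1.2, 1.3 and 1.5 hours.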
Stem and leaf plot rotated
Here 0 on the left, higher values as you move to the right
Note the shape of the distribution and correspondence with
histogram
Note that you can read off individual values
Measures of central tendency
Different measures:
Mode: the most common value
Median: the central value
Mean: the average value
Mode
The value of the variable that occurs most often in the data is the mode
Describes the ‘centre’ of the distribution of a variable in the sense of the
‘bulk’ of cases
Here, most people (63%) use the internet every day
So ‘every day’ is the modal category
Some properties of the mode
The value which occurs most often (i.e. has highest frequency)
Median
The median is the value which falls in the middle of the ordering, which has half the cases below and half above
Middle case here has the value 8
For an odd number of n observations, the middle observation is case number $(n+1)/2$
Here $(13+1)/2 = 7$, i.e. the 7th case
For an even number of observations, the middle value falls in between two observations
E.g. if 14 cases, $(14+1)/2 = 7.5$, so median is the average of the 7th and 8th cases:
1 2 2 3 4 4 4 | 8 8 9 9 10 10 10 (7 cases on either side of the middle value)
Median = $(4+8)/2 = 6$
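The rule for locating the middle value can be written as a short function. This is a minimal sketch (the function name `median_of` is illustrative), applied to the 14 scores of the Group 2 example used later in the lecture.

```python
def median_of(values):
    """Return the middle value of the ordered data.

    For an odd number of cases this is the single middle case; for an even
    number it is the average of the two cases either side of the middle.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


group2 = [1, 2, 2, 3, 4, 4, 4, 8, 8, 9, 9, 10, 10, 10]
print(median_of(group2))  # average of the 7th and 8th cases: (4 + 8) / 2 = 6.0
```

Python's standard library offers `statistics.median`, which follows the same rule.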
Median for our example data
The TV/Video hours stem and leaf plot
Too many observations to find manually, but easy for the computer to calculate
Median is 2.16666…
Half of the observations are below, half are above
The median for a grouped variable
We can find the median from the cumulative frequency, by looking for where
the cumulative percentage passes 50%
Here, 0-2 hours shows up as 50.0 in the table, but that is actually rounded up from slightly
less than 50.0. This is usually not something you have to worry about!
The median is 2.16666, so 2-4 hours is the median of the grouped / recoded variable
The median is also called the 50th percentile
Some properties of the median
Can use the median for continuous and discrete interval/ratio
variables
Quiz: Central tendency
Which is a correct interpretation of a median of 0.00 for employment hours?
A. Respondents spent an average of 0 hours in employment
B. At least 50% of respondents did not work on the day
C. The most common number of hours in employment is 0
An example of using medians for comparing
groups
Annual median full time gross pay by occupation
UK, April 2022
https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/bulletins/annualsurveyofhoursandearnings/2022
Accessed 10th October 2023
Another example: gender pay gap
1. 1.33+6.33+4.16+4.50+1.16+3.00+0.83+1.00+1.50+0.00 = 23.83
2. 23.83/10 = 2.38
Mean hours across these ten people are 2.38
Rounded to the nearest 10 minutes, this is 2 hours, 20 minutes
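The two steps above can be checked with a short sketch. It assumes the displayed hours are diary times in ten-minute units cut to two decimals (so 1.33 is 1 hour 20 minutes and 4.16 is 4 hours 10 minutes), which is why the total comes to 23.83 rather than the 23.81 obtained by adding the two-decimal figures. Working in minutes keeps the arithmetic exact.

```python
# The ten cases expressed in minutes (assumption: each displayed value is a
# multiple of ten minutes, e.g. 1.33 hours = 80 minutes, 0.83 hours = 50 minutes).
minutes = [80, 380, 250, 270, 70, 180, 50, 60, 90, 0]

# Step 1: add up all the scores.
total_hours = sum(minutes) / 60

# Step 2: divide by the number of cases.
mean_hours = total_hours / len(minutes)

# Express the mean to the nearest 10 minutes.
mean_minutes = sum(minutes) / len(minutes)
nearest_ten = round(mean_minutes / 10) * 10
h, m = divmod(nearest_ten, 60)

print(f"Total: {total_hours:.2f} hours")
print(f"Mean: {mean_hours:.2f} hours, roughly {h} hours {m} minutes")
```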
Mean – using notation
To calculate the mean, add up all the scores, and then divide the result by the number of cases
$\bar{Y} = \frac{\sum Y_i}{n}$
Y = variable of interest
$\bar{Y}$, pronounced “Y-bar”, is the mean of Y
subscript $i$ = case number ($Y_1$ is the first case, $Y_2$ is the second case, etc.)
$\sum$ is ‘sigma’, means “add together everything after the sign”
Here, ‘add together all the $Y_i$s over cases $i = 1, \dots, n$’
$n$ = sample size; thus i goes from 1 to n
Case    TV/Video hours
1       1.33
2       6.33
3       4.16
4       4.50
5       1.16
6       3.00
7       0.83
8       1.00
9       1.50
10      0.00
Total or sum: 23.83
Sum divided by 10: 2.38
Caveats about using the mean
Because it is derived from arithmetical calculations, the mean is (strictly
speaking) only appropriate for interval/ratio data
In practice, it is sometimes used for ordinal variables (when the number
of categories/values is large)
If you use it in this way, do so with caution
E.g. don’t read too much into a small difference in means between
two groups
E.g. don’t report mean using a lot of decimal places
Cannot use the mean for nominal variables
Caveats about using the mean
The mean can give a distorted impression of central tendency when a
variable contains a small number of unusual scores/values at one end of the
distribution
If there is such a tail of values which are higher (or lower) than most, the
mean will be pulled in the direction of the extreme scores
Such extreme values are often also termed outliers
Example of effects of outliers on mean and median
In our starting data for ten observations:
Mean = 2.38
Median = about 1.4 (the average of 1.33 and 1.50)
Replace Case 2 with one of the more extreme observed values
Case 2: 6.33 hours
Case 93: 16.16 hours
In our new data for ten observations:
Mean = about 3.4
Median = about 1.4 (unchanged)
Starting data           New data
Case  TV/Video hours    Case  TV/Video hours
1     1.33              1     1.33
2     6.33              93    16.16
3     4.16              3     4.16
4     4.50              4     4.50
5     1.16              5     1.16
6     3.00              6     3.00
7     0.83              7     0.83
8     1.00              8     1.00
9     1.50              9     1.50
10    0.00              10    0.00
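The effect of the swap can be reproduced with Python's standard `statistics` module. This sketch uses the two-decimal values as displayed in the table, so results may differ in the last digit from figures computed on the unrounded data.

```python
from statistics import mean, median

# Starting data: the ten TV/Video values (hours) as displayed.
original = [1.33, 6.33, 4.16, 4.50, 1.16, 3.00, 0.83, 1.00, 1.50, 0.00]

# New data: Case 2 (6.33 hours) replaced by Case 93 (16.16 hours).
with_outlier = list(original)
with_outlier[1] = 16.16

for label, data in (("Starting data", original), ("New data", with_outlier)):
    print(f"{label}: mean = {mean(data):.2f}, median = {median(data):.2f}")
```

Only the mean moves: the extreme value changes the total being averaged, but the two middle cases in the ordering stay the same, so the median is unaffected.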
Mean, median, outliers and skew
Histogram of TV/Video Hours
Another example of mean and median for skewed data
Histogram of household income (before housing costs) in UK, year ending 2022
Source: Department for Work & Pensions, Households Below Average Income: an analysis of the UK income distribution: FYE 1995 to FYE 2022.
Accessed 10th October 2023
Mean & median for (ideal-type) skewed and symmetric distributions
Positive skew: Median < Mean
Negative skew: Mean < Median
Symmetric: Mean = Median
Relative size of mean and median is not a definitive indicator of skew – depends on precise shape of histogram
Histograms don’t necessarily have smooth shapes
When a ‘distorted’ mean conveys important information
Life expectancy: average number of years a new-born child would live if current
levels of mortality were to stay the same
Gapminder data (www.gapminder.org)
Shock factors can contribute to sudden drops in countries’ life expectancies
Quiz: Skewed or symmetric?
Judging from the mean and the median, the distribution of employment hours is…
A. Positively skewed (skewed to the right)
B. Negatively skewed (skewed to the left)
C. Symmetric
Measures of dispersion
Measures of central tendency do not provide a complete description of a
distribution
Example: consider the following distributions of scores on a 1–10 scale
[Figure: distributions of scores for Group 1, Group 2 and Group 3]
For all three groups, Mean = 6 and Median = 6
And yet they have very different distributions
Dispersion
Central tendency is the same for each distribution
But dispersion is different:
Group 1, range =
Group 2, range =
Group 3, range =
Interquartile range (IQR)
To calculate the IQR use Quartiles
Quartiles divide the distribution into quarters
Same process as the median, but into four instead of two segments:
1st quartile separates lowest 25% of distribution from upper 75% formula =
2nd quartile (=median) divides distribution in half
3rd quartile separates lowest 75% of distribution from upper 25%
Can also use percentiles, this is the value in a distribution below which a
certain percentage of values lie
1st quartile = 25th percentile
2nd quartile = 50th percentile = median
3rd quartile = 75th percentile
e.g. Group 2: 1 2 2 3 4 4 4 8 8 9 9 10 10 10
Interquartile range
To calculate the interquartile range (IQR):
Subtract value of 1st quartile from value of 3rd quartile
IQR = 3rd quartile – 1st quartile
That is, the IQR gives the range of the middle 50% of the distribution
e.g. Group 2: 1 2 2 3 4 4 4 8 8 9 9 10 10 10
Interquartile range = 6.5
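Quartiles can be computed in several slightly different ways, and software packages do not all agree. The sketch below uses `statistics.quantiles` from the Python standard library with its `"exclusive"` method, which places the quartiles at positions based on n + 1; this is an assumed convention, chosen here because it gives an IQR of 6.5 for the Group 2 scores.

```python
from statistics import quantiles

group2 = [1, 2, 2, 3, 4, 4, 4, 8, 8, 9, 9, 10, 10, 10]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(group2, n=4, method="exclusive")

iqr = q3 - q1

print(f"1st quartile = {q1}")
print(f"2nd quartile (median) = {q2}")
print(f"3rd quartile = {q3}")
print(f"IQR = {iqr}")
```

With a different convention (for example `method="inclusive"`) the quartiles, and therefore the IQR, can come out slightly differently, which is worth remembering when comparing output across packages.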
TV/Video hours example
Statistic value
Minimum 0.00
Maximum 20.33
Range 20.33
1st quartile (25th percentile) 0.83
2nd quartile (50th percentile): median 2.17
3rd quartile (75th percentile) 3.67
Inter-quartile range 2.83
[Box plot labels: 3rd quartile; 2nd quartile (median); 1st quartile; lowest observed value; inter-quartile range]
A word on extreme values or outliers
There is no formal definition of
extreme values or outliers or more
broadly, unusual values
Some (people, and software
designers) make a distinction between
extreme values and outliers and even
extreme outliers
Software generally applies its own
threshold in terms of how many IQRs
they are above the 75th percentile or
below the 25th percentile
E.g. common rule is that anything more
than is ‘extreme’
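One widely used convention (an assumption here, since software packages apply their own thresholds) flags values lying more than 1.5 IQRs above the 75th percentile or below the 25th percentile. A minimal sketch using the quartiles from the TV/Video hours table; the function name `iqr_fences` and the multiplier `k` are illustrative.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) cut-offs lying k IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of TV/Video hours from the table of descriptive statistics.
lower, upper = iqr_fences(0.83, 3.67)

# Two observed values from the lecture's examples.
for value in (6.33, 16.16):
    flagged = value < lower or value > upper
    print(f"{value} hours: {'unusual' if flagged else 'not unusual'} under this rule")
```

Changing `k` changes which cases are flagged, which is why different programs can disagree about what counts as an outlier in the same data.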
Comparing income percentiles in the UK over time
Net income among working age
families in Britain, 1979-2015
Source: Blundell, R, Joyce, R, Keiller, AN and Ziliak, JP (2018) Income inequality and the labour market in Britain and the US. Journal of Public Economics, Vol.162, pp48-62
https://tinyurl.com/y37qvbpy
Tracks values of income (in thousands
of pounds) at a selection of percentiles
(see key)
Where in the distribution of income have the greatest increases occurred?
Who was most dramatically affected by
the financial crisis of 2008?
See Simon Hughes MP and Margaret
Thatcher PM debating the rights and
wrongs of the 1979-1990 trends:
https://www.youtube.com/watch?v=okHGCz6xxiw
Wealth share of richest 1%, 1807-2021, selected countries
Interactive chart – can select other countries by visiting website
Example of using box plots for comparisons
European Social Survey (https://www.europeansocialsurvey.org/)
2018 wave, UK respondents
Asked to what extent they trust a range of institutions on a scale of 0-10
Source: ESS Round 9: European Social Survey Round 9 Data (2018). Data file edition 2.0. NSD - Norwegian Centre for Research Data, Norway – Data Archive and distributor of ESS data for ESS ERIC. doi:10.21338/NSD-ESS9-2018
Another way of describing dispersion
Deviations from the mean
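The deviations from the mean sum to zero by definition: $\sum (Y_i - \bar{Y}) = 0$
Case    TV/Video hours    Deviation from mean
1       1.33              -1.05
2       6.33              3.95
3       4.16              1.78
4       4.50              2.12
5       1.16              -1.22
6       3.00              0.62
7       0.83              -1.55
8       1.00              -1.38
9       1.50              -0.88
10      0.00              -2.38
Mean: 2.38    Total = 0
…so the sum of deviations is not a good summary of variation in the data!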
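A quick check of this property, as a sketch using the ten example values as displayed (two decimals). Because of floating-point arithmetic the total is compared with zero up to a tiny tolerance rather than exactly.

```python
# Ten TV/Video values (hours) as displayed in the lecture.
hours = [1.33, 6.33, 4.16, 4.50, 1.16, 3.00, 0.83, 1.00, 1.50, 0.00]

mean = sum(hours) / len(hours)

# Deviation of each case from the mean: positive above it, negative below it.
deviations = [h - mean for h in hours]
total = sum(deviations)

for case, (h, d) in enumerate(zip(hours, deviations), start=1):
    print(f"{case:>2}  {h:5.2f}  {d:6.2f}")

# Positive and negative deviations cancel out.
print(f"Total of deviations: {abs(total):.2f}")
```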
A solution?
We could take the absolute values of the deviations, and sum them
$\sum |Y_i - \bar{Y}|$
Bars either side mean ‘take absolute value’
But then, the more scores we have, the larger the sum will be
Not ideal – for example would make this a bad measure for comparing dispersion in samples of different sizes
Variance
Another way of turning negative numbers to positive is to square them, i.e. multiply them by themselves
Superscript 2 is used to denote a number squared
Sum the squared deviations, normalise again by dividing this sum by number of scores: the result is called the variance
Conventional symbol used to denote the sample variance is $s^2$
The variance is essentially the mean of the squared deviations – indicates the average variation in scores from the mean
$s^2 = \frac{\sum (Y_i - \bar{Y})^2}{n-1}$
Note we now have $n-1$ rather than $n$ in the denominator
(No need to worry or even know why, for this course!)
Standard deviation
Variance is expressed in the original units of analysis, squared. To return to the original units of analysis, or metric…
take the square root of the variance
$s = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{n-1}}$
Useful measure of average deviation, expressed in original units of analysis of variable in question
Variance and standard deviation for the example data
Divided by n:
Square root:
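The calculation can be traced step by step with a short sketch. It uses the ten example values as displayed (two decimals) and the $n-1$ denominator from the variance formula, then compares the result with the sample variance and standard deviation functions in Python's `statistics` module.

```python
import math
import statistics

# Ten TV/Video values (hours) as displayed in the lecture.
hours = [1.33, 6.33, 4.16, 4.50, 1.16, 3.00, 0.83, 1.00, 1.50, 0.00]

n = len(hours)
mean = sum(hours) / n

# Sum of squared deviations from the mean.
sum_sq = sum((h - mean) ** 2 for h in hours)

# Sample variance: divide by n - 1.
s2 = sum_sq / (n - 1)

# Standard deviation: square root, back in the original units (hours).
s = math.sqrt(s2)

print(f"Sum of squared deviations: {sum_sq:.2f}")
print(f"Variance: {s2:.2f}")
print(f"Standard deviation: {s:.2f}")

# The library functions use the same n - 1 definition.
print(math.isclose(s2, statistics.variance(hours)))
print(math.isclose(s, statistics.stdev(hours)))
```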
Illustration (fictional example)
Distributions of scores from , cases in each group:
Group 3
$s^2 = 1.69$, $s = 1.3$
Some comments on the standard deviation
For descriptive statistics, not usually very meaningful on its own
but useful in comparisons, e.g. dispersion of a variable between
different samples (groups)
Sensitive to extreme values/outliers and skewness, so as with all statistics,
use with caution
As with means, can be calculated for ordinal variables – strictly speaking
not appropriate, so use with caution if you do this
Where it is really crucial: in inferential statistics (generalising from sample
to population)
You will meet the standard deviation again later in the course!
Example of comparisons of standard deviations
ESS 2018, UK respondents: trust in institutions
Source: ESS Round 9: European Social Survey Round 9 Data (2018). Data file edition 2.0. NSD - Norwegian Centre for Research Data, Norway – Data Archive and distributor of ESS data for ESS ERIC. doi:10.21338/NSD-ESS9-2018
Another example of comparisons of standard deviations
UK ONS official statistics
Standard deviation around mean of mothers’ age at birth of first child
A selection of years:
Year Mean Standard deviation
1970 23.7 4.6
1975 24.2 4.7
1980 24.7 4.7
1985 25.1 4.9
1990 25.5 5.3
1995 26.1 5.7
2000 26.5 6.0
2005 27.2 6.1
2010 27.7 6.1
2015 28.6 5.9
A. Sleep hours
B. TV/Video hours
C. Employment hours
Quiz: Box plots
What we’ve covered so far on description
Statistical or graphical tool | Nominal | Ordinal | Interval
Frequency table | Yes | Yes | Yes, but if many observed values, likely to be larger than is useful – or, can group the values
Bar chart | Yes | Yes | Yes, but if many observed values, a histogram will probably give a clearer summary
Mode | Yes | Yes | Yes, but if many observed values, likely to get many modes
Median | No | Yes | Yes
Mean | No | (No)* | Yes
Variance/std. deviation | No | (No)* | Yes
Histogram | No | No | Yes
Stem and leaf plot | No | Yes (but not very useful) | Yes
Concluding remarks
We’ve covered a range of statistics and charts for variables of different types
Among the statistics and charts that are appropriate for a particular variable, no particular one is always ‘better’ than another
E.g. median isn’t always more informative than mean
In your own analyses, good practice to consider all appropriate statistics and charts –
they convey different information, highlight different features of your data. Consider
all, then make a reasoned decision on what to present
In reading other people’s analyses, bear in mind not only what’s given but also
what’s not conveyed (is it important/would it help you understand the data better?)
Please work through this week’s analysis homework to review in next week’s seminar:
Tables of descriptive statistics
Mean, median, min, max, range, quartiles
Charts for single continuous variables:
Histograms, stem and leaf plots, box plots
Next week’s lecture: inferential statistics for contingency tables (how we go from
sample to population: example of chi-squared test)