MY464 - Lecture 3

Uploaded by

jianxina
Copyright
© © All Rights Reserved
MY464
Introduction to Quantitative Methods for Media and Communications
Lecture 3
Descriptive statistics for continuous variables
Outline
 Tables of grouped frequencies
 Charts
 Histograms
 Stem and leaf plots
 Measures of central tendency
 Mode, median, mean
 Skewed vs. symmetric distributions, outliers
 Measures of dispersion
 Range, inter-quartile range
 box plots
 Variance and standard deviation
 Today’s examples are mostly about describing one variable
(‘univariate’)
 In Week 5 we’ll look at bivariate scenarios (with inference!) where we compare
groups or samples on a continuous variable
2
A data set about time use
 Media time use in the UK
 United Kingdom Time Use Survey, 2014-2015
 Gershuny, J., Sullivan, O. (2017). United Kingdom Time Use Survey, 2014-
2015. Centre for Time Use Research, University of Oxford. [data collection].
UK Data Service. SN: 8128, http://doi.org/10.5255/UKDA-SN-8128-1
 Respondents fill out diaries of their activities for two days (ten minute intervals)
 We will use data from the first diary day, for 7157 respondents aged 18 or over

 Today we will focus on hours spent watching TV/Video


 Derived from respondent activity diaries
 An interval/ratio level variable
 Not quite continuous (10 minute intervals) but has large number of possible
values

3
Watching TV/Video (descriptive statistics)

 Maximum = 20.33 hours
 Minimum = 0.00 hours
 Range = 20.33 − 0.00 = 20.33 hours

4
Pie chart

5
Frequency table...?
 There are a lot of unique values
 Produces a very long table!
 Still somewhat informative,
e.g. the cumulative percent
column
 We need different approaches to
describe variables with this many
observed values
 Possibilities:
 A table or a graph: first group
the values in some way
 Single-number summaries

6
Frequency table: continuous data grouped into intervals
 Intervals or categories must be
 Mutually exclusive: no case belongs to more than one interval
 Exhaustive: categories cover all the values of the variable in the data set
 Choice of group size is a trade-off between detail (many intervals) and readability (few intervals)
 Note that the intervals here are of different widths
 0 is only exactly 0 hours
 0-2, 2-4, etc are 2 hours (note, these should really read 0.166-2, 2.166-4, etc)
 More than 8 Hours includes values as high as 20
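The grouping rule can be sketched in a few lines of Python. This is only an illustration: the function name `group_hours` and the sample values are hypothetical (not the survey data), and the upper bound of each interval is treated as inclusive, following the note that 0-2 really means 0.166-2.

```python
from collections import Counter

def group_hours(hours):
    """Assign an hours value to exactly one interval (mutually exclusive, exhaustive)."""
    if hours == 0:
        return "0"
    if hours <= 2:
        return "0-2"
    if hours <= 4:
        return "2-4"
    if hours <= 6:
        return "4-6"
    if hours <= 8:
        return "6-8"
    return "More than 8"

# Hypothetical values in hours, not taken from the survey.
sample_hours = [0, 0, 0.5, 1.5, 2, 2.17, 3, 4.5, 6, 7.33, 9, 20]

labels = ["0", "0-2", "2-4", "4-6", "6-8", "More than 8"]
counts = Counter(group_hours(h) for h in sample_hours)
total = len(sample_hours)

# Print frequency, percent and cumulative percent for each interval.
cumulative = 0.0
for label in labels:
    percent = 100 * counts[label] / total
    cumulative += percent
    print(f"{label:<12}{counts[label]:>4}{percent:>8.1f}{cumulative:>8.1f}")
```

Because every value falls through the chain of conditions to exactly one label, the counts always add up to the number of cases.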
7
Equal & larger number of intervals
 Another approach:
 Divide the variable into intervals of equal width, e.g. 1 hour
 Called ‘bins’ or ‘bands’ in some software
 Smaller intervals (e.g. here, 30 mins) give more detail, but perhaps at
the expense of readability
 In our data set this produces more than 20 intervals to cover the full
range of observed values
 A table with 20+ rows isn’t very easy to read
 So a chart will probably still be better…

8
A histogram
Chart annotation: more than 1200 respondents watch no TV (zero hours)
 Frequencies (or proportions/percentages) of responses on vertical (y) axis
 Must start from zero
 Values of the variable on the
horizontal (x) axis
 Each bar represents an interval
 missing data are excluded
 Bars touch each other:
continuous variable
 Except where there are gaps in the
data, e.g. no one watching 18-
18.5 hours
 Bars are of equal widths
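As a rough sketch of what a histogram does behind the scenes, the Python below counts cases in equal-width bins and draws a text version of the bars. The data and the helper name `bin_index` are hypothetical; the 0.5-hour width matches the 30-minute intervals mentioned earlier.

```python
import math
from collections import Counter

def bin_index(value, width):
    """Index k of the equal-width bin [k*width, (k+1)*width) that contains value."""
    return math.floor(value / width)

# Hypothetical hours, not taken from the survey.
sample_hours = [0, 0, 0.5, 1.5, 2.0, 2.5, 3.0, 4.5, 6.0, 7.5]
width = 0.5
bin_counts = Counter(bin_index(h, width) for h in sample_hours)

# One row per bin, including empty bins, so gaps in the data stay visible.
for k in range(max(bin_counts) + 1):
    lower = k * width
    print(f"{lower:4.1f}-{lower + width:4.1f} | " + "#" * bin_counts[k])
```

Empty bins are printed as empty rows, which corresponds to the gaps between bars where no values were observed.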

9
Examples of skewed distributions

 Positively skewed distribution: skewed “to the right”
 Negatively skewed distribution: skewed “to the left”

10
Example of a (roughly) symmetrical distribution

11
Stem and leaf plot
 A histogram made of digits
 Here, TV & Video hours, rounded to 0.1 hours
 Stem gives the number of hours
 Leaves give the number of 0.1 hours
 Stems and leaves are chosen for best fit to the data (hundreds, tens, units, decimals…)
 Note that this stem and leaf plot shows the 0-0.5 and 0.5-1 range as separate stems
 Here each digit corresponds to 14 cases (respondents) (“Each leaf: 14 case(s)”)
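A stem and leaf plot is simple enough to build by hand or in code. The sketch below uses hypothetical data and a hypothetical helper `stem_and_leaf`; unlike the plot on the slide, each leaf here stands for one case and each whole hour gets a single stem.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (whole hours) to its sorted leaves (tenths of an hour)."""
    plot = defaultdict(list)
    for value in sorted(values):
        tenths = round(value * 10)        # e.g. 2.3 hours -> 23 tenths
        stem, leaf = divmod(tenths, 10)   # 23 -> stem 2, leaf 3
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical hours rounded to 0.1, not taken from the survey.
sample_hours = [0.0, 0.5, 1.2, 1.2, 1.8, 2.3, 2.5, 2.5, 3.0, 4.7]
plot = stem_and_leaf(sample_hours)

for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Reading a row back gives the individual values: stem 1 with leaves 2, 2 and 8 stands for 1.2, 1.2 and 1.8 hours.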

Stems

Leaves

12
Stem and leaf plot rotated
 Here 0 on the left, higher values as you move to the right
 Note the shape of the distribution and correspondence with
histogram
 Note that you can read off individual values

13
Measures of central tendency

 Next, some summary statistics which summarise some aspect of the distribution of a variable with a single number
 First: measures of central tendency
 For describing the centre, or the most typical value

 Different measures:
 Mode: the most common value
 Median: the central value
 Mean: the average value

14
Mode

Code   Value label          Frequency   Proportion   Percentage
1      Never                      299        0.153         15.3
2      Only occasionally          108        0.055          5.5
3      A few times a week          93        0.047          4.7
4      Most days                  217        0.111         11.1
5      Every day                 1242        0.634         63.4
Total                            1959        1.000        100.0

 The value of the variable that occurs most often in the data is the mode
 Describes the ‘centre’ of the distribution of a variable in the sense of the
‘bulk’ of cases
 Here, most people (63%) use the internet every day
 So ‘every day’ is the modal category
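Finding the mode from a frequency table amounts to picking the category with the highest count. A short Python sketch using the internet-use frequencies from the table above (the variable names are illustrative):

```python
# Frequencies copied from the internet-use table above.
frequencies = {
    "Never": 299,
    "Only occasionally": 108,
    "A few times a week": 93,
    "Most days": 217,
    "Every day": 1242,
}

total = sum(frequencies.values())
# The modal category is the key with the largest frequency.
modal_category = max(frequencies, key=frequencies.get)
modal_percent = 100 * frequencies[modal_category] / total

print(modal_category, round(modal_percent, 1))
```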
Some properties of the mode
 The value which occurs most often (i.e. has highest frequency)

 Appropriate for variables of all levels of measurement, in principle


 The only appropriate measure of ‘central tendency’ for a nominal
variable

 Most useful for discrete variables with small numbers of categories,


less useful for variables with a lot of different values
 Can be the case that the modal value is only just the most
commonly observed and that another value (or more than one!) is
almost equally common
 Might hear people talk about bimodal or multimodal distributions
The median
 Take a toy example: a set of 13 cases, values ordered lowest to highest:
2 2 3 4 4 4 [8] 8 9 9 10 10 10
(6 cases below the middle value, 6 cases above)
 The median is the value which falls in the middle of the ordering, which has half the cases below and half above
 Middle case here has the value 8
 For an odd number of n observations, middle observation is $(n+1)/2$
 Here $(13+1)/2 = 7$, i.e. the 7th case
 For an even number of observations, middle value falls in between two observations
 E.g. if 14 cases, $(14+1)/2 = 7.5$, so median is the average of the 7th and 8th cases:
1 2 2 3 4 4 4 | 8 8 9 9 10 10 10
(7 cases either side of the middle value)
 Median = $(4+8)/2 = 6$
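The two cases (odd and even n) can be sketched in Python, assuming the usual (n + 1) / 2 position rule for the middle observation. The function below is an illustration, not the routine any particular statistics package uses.

```python
def median(values):
    """Median via the (n + 1) / 2 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: (n + 1) / 2 is a whole number, so take that observation.
        return ordered[(n + 1) // 2 - 1]
    # Even n: the position falls between two observations, so average them.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

odd_cases = [2, 2, 3, 4, 4, 4, 8, 8, 9, 9, 10, 10, 10]
even_cases = [1, 2, 2, 3, 4, 4, 4, 8, 8, 9, 9, 10, 10, 10]

print(median(odd_cases))   # the 7th of 13 cases
print(median(even_cases))  # average of the 7th and 8th of 14 cases
```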
17
Median for our example data
 The TV/Video hours stem
and leaf plot
 Too many observations to
find manually, but easy for
the computer to calculate
 Median is 2.16666…
Half of the observations are
below, half are above

18
The median for a grouped variable

 We can find the median from the cumulative frequency, by looking for where
the cumulative percentage passes 50%
 Here, 0-2 hours shows up as 50.0 in the table, but that is actually rounded up from slightly
less than 50.0. This is usually not something you have to worry about!
 The median is 2.16666, so 2-4 hours is the median of the grouped / recoded variable
 The median is also called the 50th percentile
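Reading the median off a cumulative percentage column can be sketched as follows. The counts are hypothetical (chosen only so that the cumulative percentage passes 50% in the 2-4 group, as on the slide); the group labels follow the grouped table.

```python
from itertools import accumulate

# Hypothetical counts per interval, for illustration only (not the survey results).
groups = ["0", "0-2", "2-4", "4-6", "6-8", "More than 8"]
counts = [170, 320, 280, 140, 60, 30]

total = sum(counts)
cumulative_percent = [100 * c / total for c in accumulate(counts)]

# The median category is the first one whose cumulative percentage reaches 50%.
median_group = next(
    group for group, cum in zip(groups, cumulative_percent) if cum >= 50
)
print(median_group)
```

The same approach works for any ordinal variable, since it only needs the categories to be in order.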

19
Some properties of the median
 Can use the median for continuous and discrete interval/ratio
variables

 Can also use the median for ordinal variables

 Cannot use the median for nominal variables


 Median is found by ranking all the values or observations in order

20
Quiz: Central tendency
Which is a correct
interpretation of a median
of 0.00 for employment
hours?
A. Respondents spent an
average of 0 hours in
employment
B. At least 50% of
respondents did not
work on the day
C. The most common
number of hours in
employment is 0

21
An example of using medians for comparing
groups
 Annual median full
time gross pay by
occupation
 UK, April 2022

22 https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/bulletins/annualsurveyofhoursandearnings/2022
Accessed 10th October 2023
Another example: gender pay gap

 Interactive data viz from


https://informationisbeautiful.net/
23
The mean
 Using the TV/Video hours variable
 Let’s focus on the first ten people in the data set
 The mean is another measure of central tendency
 Typically when people talk about the average, they refer to the mean
 To calculate the mean:
1. Add up all the scores
2. Divide by the number of cases

Case   TV/Video hours
1      1.33
2      6.33
3      4.16
4      4.50
5      1.16
6      3.00
7      0.83
8      1.00
9      1.50
10     0.00
Total or sum: 23.83
Sum divided by 10: 2.38

1. 1.33+6.33+4.16+4.50+1.16+3.00+0.83+1.00+1.50+0.00 = 23.83 (the two-decimal values shown add up to 23.81; the total of 23.83 reflects the unrounded values)
2. 23.83/10 = 2.38
 Mean hours across these ten people are 2.38
 Rounded to the nearest 10 minutes, this is 2 hours, 20 minutes
24
Mean – using notation
 To calculate the mean, add up all the scores, and then divide the result by the number of cases:
$\bar{Y} = \frac{\sum Y_i}{n}$
 Y = variable of interest
 $\bar{Y}$, pronounced “Y-bar”, is the mean of Y
 subscript $i$ = case number ($Y_1$ is the first case, $Y_2$ is the second case, etc.)
 $\sum$ is ‘sigma’, means “add together everything after the sign”
 Here, ‘add together all the $Y_i$s over cases $i = 1, \dots, n$’
 $n$ = sample size; thus i goes from 1 to n

Case   TV/Video hours
1      1.33
2      6.33
3      4.16
4      4.50
5      1.16
6      3.00
7      0.83
8      1.00
9      1.50
10     0.00
Total or sum: 23.83
Sum divided by 10: 2.38

25
Caveats about using the mean
 Because it is derived from arithmetical calculations, the mean is (strictly
speaking) only appropriate for interval/ratio data
 In practice, it is sometimes used for ordinal variables (when the number
of categories/values is large)
 If you use it in this way, do so with caution
 E.g. don’t read too much into a small difference in means between
two groups
 E.g. don’t report mean using a lot of decimal places
 Cannot use the mean for nominal variables

26
Caveats about using the mean
 The mean can give a distorted impression of central tendency when a
variable contains a small number of unusual scores/values at one end of the
distribution
 If there is such a tail of values which are higher (or lower) than most, the
mean will be pulled in the direction of the extreme scores
 Such extreme values are often also termed outliers

27
Example of effects of outliers on mean and median
 In our starting data for ten observations:
 Mean = 2.38
 Median = 1.42
 Replace Case 2 with one of the more extreme observed values
 Case 2: 6.33 hours
 Case 93: 16.16 hours
 In our new data for ten observations:
 Mean = 3.37
 Median = 1.42
 Mean is much more sensitive to outliers than median
 If your data have outliers, use the median as measure of central tendency – or report both mean and median

Starting data              New data
Case   TV/Video hours      Case   TV/Video hours
1      1.33                1      1.33
2      6.33                93     16.16
3      4.16                3      4.16
4      4.50                4      4.50
5      1.16                5      1.16
6      3.00                6      3.00
7      0.83                7      0.83
8      1.00                8      1.00
9      1.50                9      1.50
10     0.00                10     0.00
Total or sum: 23.83        Total or sum: 33.67
Sum divided by 10: 2.38    Sum divided by 10: 3.37
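The comparison is easy to reproduce with Python's standard `statistics` module. The sketch uses the values as displayed to two decimals, so results can differ from the slide's figures in the last digit; the point is that the mean shifts by almost a full hour while the median does not move.

```python
from statistics import mean, median

original = [1.33, 6.33, 4.16, 4.50, 1.16, 3.00, 0.83, 1.00, 1.50, 0.00]

# Replace Case 2 (6.33 hours) with Case 93 (16.16 hours).
with_outlier = original.copy()
with_outlier[1] = 16.16

for label, data in [("original", original), ("with outlier", with_outlier)]:
    print(f"{label:<13} mean = {mean(data):.2f}  median = {median(data):.2f}")
```

The median stays put because the replaced value was already above the middle of the ordering and remains above it.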

28
Mean, median, outliers and skew
 Histogram of TV/Video
Hours

 Mean = 2.54 hours


 Median = 2.16 hours
 Distribution is positively
skewed, or skewed to the
right

 Mean is ‘pulled’ in the


direction of skew, by large or
extreme values

29
Another example of mean and median for skewed data
 Histogram of household income (before housing costs) in UK, year ending 2022

Poverty line set at


60% of median
income

Not shown in the chart

30 Source: Department for Work & Pensions, Households Below Average Income: an analysis of the UK income distribution: FYE 1995 to FYE 2022.
Accessed 10th October 2023
Mean & median for (ideal-type) skewed and symmetric distributions
 Relative size of mean and median not a definitive indicator of skew – depends on precise shape of histogram
Positive skew: Median < Mean
Symmetric: Median = Mean
Negative skew: Median > Mean

31
Histograms don’t necessarily have smooth shapes

32 Source: European Social Survey https://blogs.city.ac.uk/sociology/2019/11/07/new-european-social-survey-data/ Accessed 10th October 2023


Some comments on mean and median
 Often good practice to report the mean and the median, as they each give a
different perspective on your data
 Particularly informative when distribution is skewed
 When people say ‘average’ they usually mean the mean, but do check
 For data that are skewed, the median is often more stable
 The mean is more strongly affected by extreme values when
 the sample size is smaller (mean TV watching hours for 10 people is more sensitive to this than mean TV watching hours for the whole UK sample)
 the outlier values are very unusual (a trillionaire has more impact on mean income than a millionaire would)
 there are more extreme values (one or two billionaires don’t change UK mean income much, but many billionaires might)
 Too simplistic to say ‘the median is always better’ – it depends!

33
When a ‘distorted’ mean conveys important information
 Life expectancy: average number of years a new-born child would live if current
levels of mortality were to stay the same
 Gapminder data (www.gapminder.org)
 Shock factors can contribute to sudden drops in countries’ life expectancies

34 Source: Gapminder, https://www.gapminder.org/tools/#$chart-type=bubbles&url=v1 Accessed 10th October 2023


Measures of central tendency (summary)
 Measures of central tendency describe the ‘typical’, ‘central’ or ‘average’
value of the distribution of a variable
 Different measures:
 Mode
 Median
 Mean
 The measure(s) to use depends on the variable you want to describe:
 its level of measurement (nominal, ordinal, interval or ratio)
 the shape of its distribution
 In many instances it is good practice to use more than one measure of
central tendency
 Almost always a good idea to explore different ones, even if you don’t
report them all

35
Quiz: Skewed or symmetric?
Judging from the mean
and the median, the
distribution of
employment hours is…

A. Positively skewed
(skewed to the right)
B. Negatively skewed
(skewed to the left)
C. Symmetric

36
Measures of dispersion
 Measures of central tendency do not provide a complete description of a
distribution
 Example: consider the following distributions of scores on a 1–10 scale (fictional data, just for clarity of demonstration)…
Group 1
Group 2
Group 3
For all three groups, Mean = 6 and Median = 6, and yet they have very different distributions

37
Dispersion
 Central tendency is the same for each distribution
 But dispersion is different:

Homogeneous: Group 1
Very dispersed: Group 2
A bit less dispersed: Group 3
[Dot plots of the three groups on a 1–10 axis]

How can we describe these differences numerically?


38
Range
 Range = difference between the largest and smallest observed values:
 Group 1, range = 6 − 6 = 0
 Group 2, range = 10 − 1 = 9
 Group 3, range = 8 − 4 = 4

 Can be a misleading measure if there are extreme values

 A solution: exclude extreme scores, and calculate the range of the


middle section of values...

39
Interquartile range (IQR)
 To calculate the IQR use Quartiles
 Quartiles divide the distribution into quarters
 Same process as the median, but into four instead of two segments:
 1st quartile separates lowest 25% of distribution from upper 75%; its position in the ordered data is $(n+1)/4$
 2nd quartile (=median) divides distribution in half
 3rd quartile separates lowest 75% of distribution from upper 25%
 Can also use percentiles, this is the value in a distribution below which a
certain percentage of values lie
 1st quartile = 25th percentile
 2nd quartile = 50th percentile = median
 3rd quartile = 75th percentile

 e.g. Group 2: 1 2 2 3 4 4 4 8 8 9 9 10 10 10

1st quartile = 25th percentile = 2.75
2nd quartile = 50th percentile = 6
3rd quartile = 75th percentile = 9.25
Position of the 1st quartile: $(14+1)/4 = 3.75$, so the 1st quartile is three quarters of the way between the 3rd and 4th observations = 2.75

40
Interquartile range
 To calculate the interquartile range (IQR):
 Subtract value of 1st quartile from value of 3rd quartile
 IQR = 3rd quartile – 1st quartile

 That is, the IQR gives the range of the middle 50% of the distribution

Interquartile range
= 6.5

 e.g. Group 2: 1 2 2 3 4 4 4 8 8 9 9 10 10 10

1st quartile = 25th percentile = 2.75
2nd quartile = 50th percentile = 6
3rd quartile = 75th percentile = 9.25
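The quartiles for Group 2 can be reproduced with a small Python function. This is a sketch assuming the position rule (n + 1) × p with linear interpolation between neighbouring observations, which matches the 2.75 and 9.25 shown above; the function name `quantile` is illustrative, and statistical software often uses slightly different rules that give slightly different quartiles.

```python
def quantile(values, p):
    """Value at position (n + 1) * p in the ordered data, interpolating linearly."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p                # 1-based position, may be fractional
    position = min(max(position, 1), n)   # keep the position inside the data
    lower = int(position)                 # observation just below the position
    fraction = position - lower
    if lower == n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

group2 = [1, 2, 2, 3, 4, 4, 4, 8, 8, 9, 9, 10, 10, 10]
q1, q2, q3 = (quantile(group2, p) for p in (0.25, 0.50, 0.75))
iqr = q3 - q1                             # range of the middle 50%
value_range = max(group2) - min(group2)   # full range

print(q1, q2, q3, iqr, value_range)
```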

41
TV/Video hours example
Statistic value
Minimum 0.00
Maximum 20.33
Range 20.33
1st quartile (25th percentile) 0.83
2nd quartile (50th percentile): median 2.17
3rd quartile (75th percentile) 3.67
Inter-quartile range 2.83

 Full range of data is 20.33 hours


 Middle half of cases lie within a range of 2.83 hours
 Middle 50% are nearer the lower end of the scale than the top end
42
A box plot

Highest observed value

Extreme values, outliers

Highest non-outlier value

3rd quartile
2nd quartile: median
1st quartile
Lowest observed value
Inter-quartile range
A word on extreme values or outliers
 There is no formal definition of
extreme values or outliers or more
broadly, unusual values
 Some (people, and software
designers) make a distinction between
extreme values and outliers and even
extreme outliers
 Software generally applies its own
threshold in terms of how many IQRs
they are above the 75th percentile or
below the 25th percentile
 E.g. common rule is that anything more
than is ‘extreme’
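To make the idea of an IQR-based threshold concrete, here is a Python sketch using the TV/Video quartiles from the summary table earlier in the lecture. The multiplier of 1.5 is an assumption: it is Tukey's widely used convention for box plots, not necessarily the threshold used in this lecture's software, and the helper name `is_flagged` is hypothetical.

```python
# Quartiles of TV/Video hours, from the summary table earlier in the lecture.
q1, q3 = 0.83, 3.67
iqr = q3 - q1

# ASSUMPTION: 1.5 is Tukey's common multiplier; software packages differ.
multiplier = 1.5
lower_fence = q1 - multiplier * iqr
upper_fence = q3 + multiplier * iqr

def is_flagged(hours):
    """True if a value lies outside the fences and would be plotted as a separate point."""
    return hours < lower_fence or hours > upper_fence

print(f"fences: {lower_fence:.2f} to {upper_fence:.2f}")
print(is_flagged(2.17), is_flagged(16.16), is_flagged(20.33))
```

Since hours cannot be negative, only the upper fence matters for this variable.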

44
Comparing income percentiles in the UK over time
 Net income among working age
families in Britain, 1979-2015
 Source: Blundell, R, Joyce, R, Keiller, AN and Ziliak, JP (2018) Income inequality and the labour market in Britain and the US. Journal of Public Economics, Vol.162, pp48-62
https://tinyurl.com/y37qvbpy
 Tracks values of income (in thousands
of pounds) at a selection of percentiles
(see key)
 Where in the distribution of income have the greatest increases occurred?
 Who was most dramatically affected by
the financial crisis of 2008?
 See Simon Hughes MP and Margaret
Thatcher PM debating the rights and
wrongs of the 1979-1990 trends:
https://www.youtube.com/watch?v=okHGCz6xxiw

45
Wealth share of richest 1%, 1807-2021, selected countries
Interactive chart – can select other countries by visiting website

46 Source: Our World In Data,


https://ourworldindata.org/grapher/wealth-share-richest-1-percent?tab=chart&country=CHN~FRA~RUS~GBR~USA~ZAF~FIN~DEU~IND
Percentiles in political protest
 ‘We are the 99%’ used as a
key slogan in the Occupy
Wall Street movement in
2011, which expanded to a
broader, international
‘Occupy’ movement
 ‘We are the 99%’ remains
widely used – try an Internet
search!
 Key idea that the ‘top 1%’
have disproportionate wealth
and power in comparison to
the rest of society
 1% of what variable? Not always stated; sometimes income is implied, sometimes wealth (including accumulated assets)
Source: Paul Stein, 26th September 2011, via Wikimedia Commons, https://commons.wikimedia.org/wiki/File:We_Are_The_99%25.jpg
This file is licensed under the Creative Commons Attribution-Share Alike 2.0 Generic license.

47
Example of using box plots for comparisons
 European Social Survey (https://www.europeansocialsurvey.org/)
 2018 wave, UK respondents
 Asked to what extent they trust a range of institutions on a scale of 0-10

48 Source: ESS Round 9: European Social Survey Round 9 Data (2018). Data file edition 2.0. NSD - Norwegian Centre for Research Data, Norway – Data
Archive and distributor of ESS data for ESS ERIC. doi:10.21338/NSD-ESS9-2018
Another way of describing dispersion
 Deviations from the mean
 The difference between an observed value and the mean of a variable: $Y_i - \bar{Y}$
 A sample with little variation will have small deviations, and a sample with a lot of variation will have many large deviations
 So, a summary of the variation might be obtained by adding up all the deviations in the data… $\sum (Y_i - \bar{Y})$
49
But deviations from the mean add up to zero
 The sum of the differences between the mean and each of the scores is zero
 Not just for this example – this is true by definition: $\sum (Y_i - \bar{Y}) = 0$
 …so the sum of deviations is not a good summary of variation in the data!

Case   TV/Video hours   Deviation from the mean
1      1.33             -1.05
2      6.33              3.95
3      4.16              1.78
4      4.50              2.12
5      1.16             -1.22
6      3.00              0.62
7      0.83             -1.55
8      1.00             -1.38
9      1.50             -0.88
10     0.00             -2.38
Mean: 2.38              Total = 0
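Why the deviations always cancel follows in one line from the definition of the mean, $\bar{Y} = \sum Y_i / n$, which can be rearranged to $\sum Y_i = n\bar{Y}$:

```latex
\sum_{i=1}^{n} (Y_i - \bar{Y})
  = \sum_{i=1}^{n} Y_i - n\bar{Y}
  = n\bar{Y} - n\bar{Y}
  = 0
```

The positive deviations above the mean are exactly balanced by the negative deviations below it, whatever the data.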
50
A solution?
 We could take the absolute values of the deviations, and sum them: $\sum |Y_i - \bar{Y}|$ (bars either side mean ‘take absolute value’)
 But then, the more scores we have, the larger the sum will be
 not ideal – for example would make this a bad measure for comparing dispersion in samples of different sizes
 We can normalise the measure by dividing by the number of values in the distribution (sample size n)
 This is the mean absolute deviation (MAD): $\frac{\sum |Y_i - \bar{Y}|}{n}$
 In principle, MAD is fine as a measure of typical deviation, but almost never used in practice (largely for technical mathematical reasons)
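A short Python sketch of these steps for the ten example cases, using the values as displayed to two decimals (variable names are illustrative):

```python
hours = [1.33, 6.33, 4.16, 4.50, 1.16, 3.00, 0.83, 1.00, 1.50, 0.00]
n = len(hours)
mean = sum(hours) / n

deviations = [y - mean for y in hours]
sum_of_deviations = sum(deviations)          # zero, up to floating-point error
sum_abs = sum(abs(d) for d in deviations)    # grows as more cases are added
mad = sum_abs / n                            # normalised by the sample size

print(f"sum of deviations = {sum_of_deviations:.10f}")
print(f"sum of absolute deviations = {sum_abs:.2f}, MAD = {mad:.2f}")
```

Dividing by n is what makes the result comparable across samples of different sizes.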

51
Variance
 Another way of turning negative numbers to positive is to square them, i.e. multiply them by themselves
 Superscript 2 used to denote a number squared
 e.g. $5^2 = 25$; in other words, $5 \times 5$
 Sum the squared deviations, normalise again by dividing this sum by number of scores: the result is called the variance
 Conventional symbol used to denote the sample variance is $s^2$
 The variance is essentially the mean of the squared deviations – indicates the average variation in scores from the mean
 Note we now have $n-1$ rather than $n$ in the denominator
 (No need to worry or even know why, for this course!)
$s^2 = \frac{\sum (Y_i - \bar{Y})^2}{n-1}$

52
Standard deviation
 Variance is expressed in the original units of analysis, squared. To return to the original units of analysis, or metric…
 take the square root of the variance
 The square root of a number Y is a number whose square is Y
 e.g. $5^2 = 25$, so the square root of 25 is 5
 Symbol for square root is √
 e.g. $\sqrt{25} = 5$
 The standard deviation is the square root of the variance
 Conventionally denoted s or abbreviated s.d.
 The most commonly used measure of deviation
 Useful measure of average deviation, expressed in original units of analysis of variable in question
$s = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{n-1}}$
53
Variance and standard deviation for the example data

Case TV/Video hours Deviations Squared deviations

1 1.33 -1.05 1.105


2 6.33 3.95 15.595
3 4.16 1.78 3.165
4 4.50 2.12 4.490
5 1.16 -1.22 1.491
6 3.00 0.62 0.383
7 0.83 -1.55 2.406
8 1.00 -1.38 1.907
9 1.50 -0.88 0.776
10 0.00 -2.38 5.670

Totals (sums): (Sum = 0)

Divided by n:

Square root:
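The calculation in the table can be sketched in Python as follows, assuming the sample formulas with n − 1 in the denominator and using the ten values as displayed to two decimals. The hand calculation is cross-checked against the standard library's `variance` and `stdev`, which also divide by n − 1.

```python
import math
from statistics import stdev, variance

hours = [1.33, 6.33, 4.16, 4.50, 1.16, 3.00, 0.83, 1.00, 1.50, 0.00]
n = len(hours)
mean = sum(hours) / n

squared_deviations = [(y - mean) ** 2 for y in hours]
sum_of_squares = sum(squared_deviations)
s2 = sum_of_squares / (n - 1)   # sample variance, in hours squared
s = math.sqrt(s2)               # standard deviation, back in hours

print(f"sum of squares = {sum_of_squares:.3f}, s^2 = {s2:.3f}, s = {s:.3f}")

# Cross-check the hand calculation against the standard library.
assert math.isclose(s2, variance(hours))
assert math.isclose(s, stdev(hours))
```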

54
Illustration (fictional example)
 Distributions of scores from 1 to 10, 14 cases in each group:
(for all three groups, mean=6 and median=6)
Group 1: $s^2 = 0$, $s = 0$
Group 2: $s^2 = 11.69$, $s = 3.4$
Group 3: $s^2 = 1.69$, $s = 1.3$
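The Group 2 figures can be checked with Python's standard `statistics` module, using the 14 Group 2 values listed in the interquartile range example. Only Group 2 is reproduced here because its individual values are the only ones listed in full.

```python
from statistics import mean, median, stdev, variance

# Group 2 values as listed in the interquartile range example.
group2 = [1, 2, 2, 3, 4, 4, 4, 8, 8, 9, 9, 10, 10, 10]

print(mean(group2), median(group2))
print(round(variance(group2), 2), round(stdev(group2), 1))
```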

55
Some comments on the standard deviation
 For descriptive statistics, not usually very meaningful on its own
 but useful in comparisons, e.g. dispersion of a variable between
different samples (groups)
 Sensitive to extreme values/outliers and skewness, so as with all statistics,
use with caution
 As with means, can be calculated for ordinal variables – strictly speaking
not appropriate, so use with caution if you do this
 Where it is really crucial: in inferential statistics (generalising from sample
to population)
 You will meet the standard deviation again later in the course!

56
Example of comparisons of standard deviations
 ESS 2018, UK respondents: trust in institutions

 Std. devs: 2.7 2.7 2.5 2.5 2.4 2.6 2.6

57 Source: ESS Round 9: European Social Survey Round 9 Data (2018). Data file edition 2.0. NSD - Norwegian Centre for Research Data, Norway – Data
Archive and distributor of ESS data for ESS ERIC. doi:10.21338/NSD-ESS9-2018
Another example of comparisons of standard deviations
 UK ONS official statistics
 Standard deviation around mean of mothers’ age at birth of first child
 A selection of years:
Year Mean Standard deviation
1970 23.7 4.6
1975 24.2 4.7
1980 24.7 4.7
1985 25.1 4.9
1990 25.5 5.3
1995 26.1 5.7
2000 26.5 6.0
2005 27.2 6.1
2010 27.7 6.1
2015 28.6 5.9

58 Source: Office for National Statistics,
https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/adhocs/009572standarddeviationofthemeanageofmotherat1st2nd3rd4thand5thbirth1969to2017englandandwales
Quiz: Dispersion
Which variable has the
greatest dispersion?

A. Sleep hours
B. TV/Video hours
C. Employment hours

59
Quiz: Box plots

Which of the following


statements about the box
plot are true?

A. Sleep hours has the


highest median
B. Sleep hours has the
largest interquartile
range
C. Sleep hours has the
largest range

60
What we’ve covered so far on description
Statistical or graphical tool   Nominal   Ordinal                      Interval
Frequency table                 Yes       Yes                          Yes, but if many observed values, likely to be larger than is useful – or, can group the values
Bar chart                       Yes       Yes                          Yes, but if many observed values, a histogram will probably give a clearer summary
Mode                            Yes       Yes                          Yes, but if many observed values, likely to get many modes
Median                          No        Yes                          Yes
Mean                            No        (No)*                        Yes
Variance/std. deviation         No        (No)*                        Yes
Histogram                       No        No                           Yes
Stem and leaf plot              No        Yes (but not very useful)    Yes
Box plot                        No        Yes (but not very useful)    Yes

61 * Not strictly appropriate, but often used – with caution!


Same table transposed
Level of measurement: Nominal
Descriptive statistics:
• Frequency table, with proportions/percentages
• Mode
Graphical tools:
• Bar chart
• Pie chart

Level of measurement: Ordinal
Descriptive statistics:
• Frequency table, with proportions/percentages
• Mode
• Median
• Range (min and max values), Inter-quartile range
• [Mean? Sometimes, with caution]
• [Standard deviation or variance? Sometimes, with caution]
Graphical tools:
• Bar chart
• Pie chart
• [Box plot, Stem and leaf plot – can be used, but not very helpful]

Level of measurement: Interval/ratio
Descriptive statistics:
• Frequency table, but only if few actual values
• Mode, but only if reasonable
• Median
• Range (min and max values), Inter-quartile range
• Mean
• Standard deviation or variance
Graphical tools:
• Histogram
• Box plot
• Stem and leaf plot

62
Concluding remarks
 We’ve covered a range of statistics and charts for variables of different types
 Among the ones that are appropriate for a particular variable, no particular one is always ‘better’ than another
 E.g. median isn’t always more informative than mean
 In your own analyses, good practice to consider all appropriate statistics and charts –
they convey different information, highlight different features of your data. Consider
all, then make a reasoned decision on what to present
 In reading other people’s analyses, bear in mind not only what’s given but also
what’s not conveyed (is it important/would it help you understand the data better?)
 Please work through this week’s analysis homework to review in next week’s seminar:
 Tables of descriptive statistics
 Mean, median, min, max, range, quartiles
 Charts for single continuous variables:
 Histograms, stem and leaf plots, box plots

 Next week’s lecture: inferential statistics for contingency tables (how we go from
sample to population: example of chi-squared test)
63
