ST2187 Block 2
statistics
There is an introduction video on the VLE, which you can access here:
https://emfssvideo.s3.amazonaws.com/MT%26ST/ST2187/ST2187_Block2_Intro.mp4
Starting with univariate situations, we consider ways to describe the distribution of variables. Graphs
and charts have little intrinsic value per se; their main function is to bring out interesting
features of a dataset. For this reason, simple descriptions should be preferred to complicated graphics.
Although data visualisation is useful as a preliminary form of data analysis to get a ‘feel’ for the data,
in practice we also need to be able to summarise data numerically. We review descriptive statistics
and distinguish between measures of location, measures of dispersion and skewness. All these
statistics provide useful summaries of raw datasets.
After completing this block, you should be able to:
interpret and summarise raw data on variables graphically
interpret and summarise raw data on variables numerically
calculate basic measures of location and dispersion.
Readings
Albright, S. and W.L. Winston, Business Analytics: Data Analysis & Decision Making. (Cengage
Learning, 2017) 6th edition [ISBN 9781305947542] Chapter 2.
Types of variables
Different variables may have different properties. These determine which kinds of statistical methods
are suitable for the variables.
Continuous and discrete numeric variables
A continuous variable can, in principle, take any real values within some interval.
In Example 1.1, GDP per capita is continuous, taking any non-negative value.
A variable is discrete if it is not continuous, i.e. if it can only take certain values, but not any
others.
In Example 1.1, region and the level of democracy are discrete, with possible values of
1, 2, …, 6 and 0, 1, 2, …, 10, respectively.
Many discrete variables have only a finite number of possible values. In Example 1.1, the region
variable has 6 possible values, and the level of democracy has 11 possible values. The simplest
possibility is a binary, or dichotomous, variable, with just two possible values. For example, a
person's gender could be recorded as 1 = female and 2 = male (see footnote 1).
A discrete variable can also have an unlimited number of possible values.
For example, the number of visitors to a website in a day: 0, 1, 2, 3, 4, … (see footnote 2).
Example 1.2
In Example 1.1, the levels of democracy have a meaningful ordering, from less democratic to more
democratic countries. The numbers assigned to the different levels must also be in this order, i.e. a
larger number = more democratic.
In contrast, different regions (Africa, Asia, Europe, Latin America, Northern America and Oceania)
do not have such an ordering. The numbers used for the region variable are just labels for different
regions. A different numbering (such as 6 = Africa, 5 = Asia, 1 = Europe, 3 = Latin America, 2 =
Northern America and 4 = Oceania) would be just as acceptable as the one we originally used.
Some statistical methods are appropriate for variables with both ordered and unordered values,
some only in the ordered case. Unordered categories are nominal data; ordered categories
are ordinal data.
A categorical variable is ordinal if there is a natural ordering of its possible values. If there is no
natural ordering, it is nominal. Categorical variables can be coded numerically or left uncoded.
A dummy variable is a 0-1 coded variable for a specific category. It is coded as 1 for all observations
in that category and 0 for all observations not in that category. Converting a numerical variable into a
categorical one is called binning (putting the data into discrete bins) or discretising.
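As a minimal sketch of dummy coding and binning, the following Python snippet builds a 0-1 variable for one category and assigns numerical values to left-closed bins. The data values and the helper name bin_label are hypothetical and chosen only for illustration; the bin edges are an assumption.

```python
# Hypothetical region codes for six observations (labels only, no ordering).
regions = [3, 1, 1, 4, 2, 6]

# Dummy (0-1) variable for the category "region 1":
# 1 if the observation is in that category, 0 otherwise.
is_region_1 = [1 if r == 1 else 0 for r in regions]

# Hypothetical values of a continuous variable, to be binned into the
# intervals [0, 2), [2, 5) and [5, infinity).
values = [0.8, 1.9, 2.0, 4.7, 12.3, 6.1]
bin_edges = [0, 2, 5]


def bin_label(x, edges):
    """Return the index of the last edge that is <= x (left-closed bins)."""
    index = 0
    for k, edge in enumerate(edges):
        if x >= edge:
            index = k
    return index


bins = [bin_label(v, bin_edges) for v in values]

print(is_region_1)
print(bins)
```

The binned variable is ordinal (the bins have a natural order), whereas the dummy variable simply flags membership of one nominal category.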
Cross-sectional data are data on a cross section of a population at a distinct point in time. Time series
(longitudinal) data are data collected over time.
1 Note that because gender is a nominal variable, the coding is arbitrary. We could also have, for example, 0 =
male and 1 = female, or 0 = female and 1 = male. However, it is important to remember which coding has been
used!
2 In practice, of course, there is a finite number of internet users in the world. However, it is reasonable to treat
this variable as taking an unlimited number of possible values.
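The sample distribution of a variable consists of the values of the variable which are observed in the sample, and
the number of times each value occurs (the counts or frequencies of the observed values).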
When the number of different observed values is small, we can show the whole sample distribution as
a frequency table of all the values and their frequencies.
Continuing with Example 1.1, the observations of the region variable in the sample are:
3 1 1 4 2 6 3 2 2 2 3 3 1 2 4
1 4 3 1 2 1 1 2 1 5 1 4 2 4 1
1 4 1 3 4 2 3 3 1 4 2 4 1 4 1
1 3 1 6 3 3 1 1 2 3 1 3 4 1 1
4 4 4 3 2 2 2 2 3 2 3 4 2 2 2
1 2 2 2 3 1 1 1 3 3 1 1 2 1 1
1 4 3 2 1 1 2 1 2 3 4 1 1 3 6
2 2 4 4 4 2 6 3 3 2 3 3 1 1 2
2 1 3 1 2 3 3 3 2 1 1 3 3 2 2
2 1 2 1 4 1 2 2 2 1 3 3 4 5 2
4 2 2 1 1
Here '%' is the percentage of countries in a region, out of the 155 countries in the sample. This is a
measure of proportion (that is, relative frequency).
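The frequencies and percentages for the region variable can be tabulated directly from the 155 observations listed above. The following Python sketch does this with collections.Counter; the variable names are illustrative, and only the numeric region codes are used.

```python
from collections import Counter

# Region codes for the 155 countries, copied from the listing above.
region_text = """
3 1 1 4 2 6 3 2 2 2 3 3 1 2 4
1 4 3 1 2 1 1 2 1 5 1 4 2 4 1
1 4 1 3 4 2 3 3 1 4 2 4 1 4 1
1 3 1 6 3 3 1 1 2 3 1 3 4 1 1
4 4 4 3 2 2 2 2 3 2 3 4 2 2 2
1 2 2 2 3 1 1 1 3 3 1 1 2 1 1
1 4 3 2 1 1 2 1 2 3 4 1 1 3 6
2 2 4 4 4 2 6 3 3 2 3 3 1 1 2
2 1 3 1 2 3 3 3 2 1 1 3 3 2 2
2 1 2 1 4 1 2 2 2 1 3 3 4 5 2
4 2 2 1 1
"""

regions = [int(code) for code in region_text.split()]
n = len(regions)

# Count how many times each region code occurs.
counts = Counter(regions)

print("Region  Frequency      %")
for code in sorted(counts):
    percent = 100 * counts[code] / n  # relative frequency as a percentage
    print(f"{code:>6}  {counts[code]:>9}  {percent:>5.1f}")
```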
Similarly, for the level of democracy, the frequency table is:
Level of democracy   Frequency      %   Cumulative %
                 0          35   22.6           22.6
                 1          12    7.7           30.3
                 2           4    2.6           32.9
                 3           6    3.9           36.8
                 4           5    3.2           40.0
                 5           5    3.2           43.2
                 6          12    7.7           50.9
                 7          13    8.4           59.3
                 8          16   10.3           69.6
                 9          15    9.7           79.3
                10          32   20.6          100
'Cumulative %' for a value of the variable is the sum of the percentages for that value and all lower-
numbered values.
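The percentage and cumulative percentage columns can be reproduced from the frequencies alone. The sketch below is one way to do this in Python. It computes each cumulative percentage from the running count rather than by adding rounded percentages, so a few entries may differ from the table by 0.1 because of rounding.

```python
# Frequencies for levels of democracy 0, 1, ..., 10, taken from the table above.
frequencies = [35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32]
n = sum(frequencies)

running_count = 0
rows = []
for level, frequency in enumerate(frequencies):
    running_count += frequency
    percent = round(100 * frequency / n, 1)
    cumulative_percent = round(100 * running_count / n, 1)
    rows.append((level, frequency, percent, cumulative_percent))

for row in rows:
    print(row)
```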
A bar chart is the graphical equivalent of the table of frequencies. Figure 1.2 displays the region
variable data as a bar chart. The relative frequencies of each region are clearly visible.
[0, 2) 49 31.6
[2, 5) 32 20.6
Figure 1.4: Diastolic blood pressure of 4,489 respondents aged 25 or over, Health Survey for
England, 2002.
In Excel, a measure of skewness can be calculated with the SKEW function.
Kurtosis has to do with the ‘fatness’ of the tails of the distribution relative to the tails of a normal
distribution. A distribution with high kurtosis has many extreme observations.
Kurtosis can be calculated in Excel with the KURT function.
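For readers working outside Excel, the following Python sketch implements the bias-adjusted sample skewness and excess kurtosis formulas that Excel documents for SKEW and KURT. The function names are illustrative, and the agreement with Excel is an assumption to check against your own spreadsheet.

```python
import math


def sample_sd(values):
    """Sample standard deviation with divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def sample_skewness(values):
    """Bias-adjusted skewness (needs at least 3 observations)."""
    n = len(values)
    mean = sum(values) / n
    s = sample_sd(values)
    cubes = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubes


def sample_excess_kurtosis(values):
    """Bias-adjusted excess kurtosis (needs at least 4 observations)."""
    n = len(values)
    mean = sum(values) / n
    s = sample_sd(values)
    fourths = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourths - correction


symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 2, 4, 5, 8, 100]

print(sample_skewness(symmetric))        # zero up to rounding error
print(sample_skewness(long_right_tail))  # positive: long right tail
print(sample_excess_kurtosis(symmetric))
```

A positive skewness indicates a longer right tail and a negative value a longer left tail. The kurtosis here is measured relative to the normal distribution, for which it is approximately zero.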
Introduction
Frequency tables, bar charts and histograms aim to summarise the whole sample distribution of a
variable. Next we consider descriptive statistics, which summarise one feature of the sample
distribution in a single number: summary statistics.
Summation notation
Let X1, X2, …, Xn (i.e. Xi, for i = 1, …, n) be a set of n numbers. The sum of the numbers is written as:
\[ \sum_{i=1}^{n} X_i = X_1 + X_2 + \dots + X_n. \]
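This may be written as Σ𝑖 𝑋𝑖 , or just Σ 𝑋𝑖 . Other versions of the same idea are:
infinite sums:
\[ \sum_{i=1}^{\infty} X_i = X_1 + X_2 + \cdots \]
sums over a subset of the observations, for example:
\[ \sum_{i=2}^{n/2} X_i = X_2 + X_3 + \cdots + X_{n/2}. \]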
The sample mean
The sample mean ('arithmetic mean', 'mean' or 'average') is the most common measure of central
tendency. The sample mean of a variable X is denoted 𝑋̅. It is the ‘sum of the observations’ divided by
the ‘number of observations’ (sample size), expressed as:
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}. \]
For Excel datasets, the mean can be calculated with the AVERAGE function.
The mean 𝑋̅ = ∑𝑖 𝑋𝑖 /𝑛 of the numbers 1, 4 and 7 is:
\[ \bar{X} = \frac{1 + 4 + 7}{3} = \frac{12}{3} = 4. \]
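If the data are summarised in a frequency table with K distinct values X_1, …, X_K and corresponding
frequencies f_1, …, f_K, the mean can be calculated as:
\[ \bar{X} = \frac{\sum_{j=1}^{K} f_j X_j}{\sum_{j=1}^{K} f_j}. \]
In our example, using the frequency table of the level of democracy given earlier (where K = 11), the mean is:
\[ \bar{X} = \frac{35 \times 0 + 12 \times 1 + \dots + 32 \times 10}{35 + 12 + \dots + 32} = \frac{829}{155} \approx 5.3. \]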
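The mean of the level of democracy can be checked numerically from its frequency table. The following Python sketch weights each level by its frequency and divides by the total count; the variable names are illustrative.

```python
# Levels of democracy 0, 1, ..., 10 and their frequencies from the table.
levels = list(range(11))
frequencies = [35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32]

n = sum(frequencies)  # total number of observations
weighted_sum = sum(f * x for f, x in zip(frequencies, levels))
mean_democracy = weighted_sum / n

print(n, weighted_sum, round(mean_democracy, 1))

# The same idea for raw (ungrouped) data: the mean of 1, 4 and 7.
raw = [1, 4, 7]
mean_raw = sum(raw) / len(raw)
print(mean_raw)
```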
Deviations from the mean (𝑋̅ = 4) and from the constant 3:
  i   X_i   X_i − 𝑋̅   (X_i − 𝑋̅)²   X_i − 3   (X_i − 3)²
  1    1      -3          9          -2          4
  2    2      -2          4          -1          1
  3    3      -1          1           0          0
  4    5      +1          1          +2          4
  5    9      +5         25          +6         36
Sum   20       0         40          +5         45
𝑋̅ = 4
We see that the sum of deviations from the mean is 0, i.e. we have:
\[ \sum_{i=1}^{n} (X_i - \bar{X}) = 0. \]
The mean is 'in the middle' of the observations 𝑋1 , … , 𝑋𝑛 , in the sense that positive and negative
values of the deviations 𝑋𝑖 − 𝑋̅ cancel out when summed over all the observations.
Also, the smallest possible value of the sum of squared deviations ∑𝑖 (𝑋𝑖 − 𝐶)² for any
constant C is obtained when 𝐶 = 𝑋̅.
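Both properties can be checked numerically for the small dataset 1, 2, 3, 5, 9. The Python sketch below sums the deviations from the mean and searches a grid of candidate constants C for the one with the smallest sum of squared deviations. The grid of candidates is an illustrative assumption, not a proof.

```python
observations = [1, 2, 3, 5, 9]
mean = sum(observations) / len(observations)


def sum_of_squares_about(c):
    """Sum of squared deviations of the observations from the constant c."""
    return sum((x - c) ** 2 for x in observations)


# Deviations from the mean cancel out.
sum_of_deviations = sum(x - mean for x in observations)

# Try C = 0.0, 0.1, ..., 10.0 and keep the value with the smallest sum of squares.
candidates = [c / 10 for c in range(0, 101)]
best_c = min(candidates, key=sum_of_squares_about)

print(sum_of_deviations)
print(sum_of_squares_about(mean), sum_of_squares_about(3))
print(best_c)
```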
The (sample) median
Let X(1), X(2), … , X(n) denote the sample values of X when ordered from the smallest to the largest,
known as the order statistics, such that:
X(1) is the smallest observed value (the minimum) of X
X(n) is the largest observed value (the maximum) of X.
Median
The (sample) median, q50, of a variable X is the value which is 'in the middle' of the ordered
sample.
If n is odd, then q50 = X((n + 1)/2).
For example, if n = 3, q50 = X(2), the second of the ordered values (1) (2) (3).
If n is even, q50 = (X(n/2) + X(n/2 + 1))/2.
For example, if n = 4, q50 = (X(2) + X(3))/2, the average of the two middle values of (1) (2) (3) (4).
Continuing with Example 1.1, n = 155, so q50 = X(78). For the level of democracy, the median is 6.
From a table of frequencies, the median is the value for which the cumulative percentage first reaches
50% (or, if a cumulative % is exactly 50%, the average of the corresponding value of X and the next
highest value).
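The rule for odd and even n can be written as a short function. The following Python sketch applies it to the level of democracy, rebuilding the 155 ordered values from the frequencies given in the frequency table; the function and variable names are illustrative.

```python
def sample_median(values):
    """Median via the order statistics: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: X((n + 1)/2) in 1-based notation is index n // 2 in 0-based Python.
        return ordered[middle]
    # Even n: average of X(n/2) and X(n/2 + 1).
    return (ordered[middle - 1] + ordered[middle]) / 2


# Frequencies of levels of democracy 0, 1, ..., 10.
frequencies = [35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32]

# Expand the frequency table into the ordered sample X(1), ..., X(155).
ordered_levels = [level for level, f in enumerate(frequencies) for _ in range(f)]

print(len(ordered_levels))
print(sample_median(ordered_levels))  # X(78), since n = 155 is odd
print(sample_median([1, 2, 3, 4]))    # even n: average of the 2nd and 3rd values
```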
The ordered values of the level of democracy are shown below. The row label gives the tens and the column
label gives the units of the position in the ordered sample, so X(78) is found in row (7.), column (.8):
       (.0) (.1) (.2) (.3) (.4) (.5) (.6) (.7) (.8) (.9)
(0.)           0    0    0    0    0    0    0    0    0
(1.)      0    0    0    0    0    0    0    0    0    0
(2.)      0    0    0    0    0    0    0    0    0    0
(3.)      0    0    0    0    0    0    1    1    1    1
(4.)      1    1    1    1    1    1    1    1    2    2
(5.)      2    2    3    3    3    3    3    3    4    4
(6.)      4    4    4    5    5    5    5    5    6    6
(7.)      6    6    6    6    6    6    6    6    6    6
(8.)      7    7    7    7    7    7    7    7    7    7
(9.)      7    7    7    8    8    8    8    8    8    8
(10.)     8    8    8    8    8    8    8    8    8    9
(11.)     9    9    9    9    9    9    9    9    9    9
(12.)     9    9    9    9   10   10   10   10   10   10
(13.)    10   10   10   10   10   10   10   10   10   10
(14.)    10   10   10   10   10   10   10   10   10   10
(15.)    10   10   10   10   10   10
The median can be determined from the frequency table of the level of democracy given earlier: the cumulative
percentage is 43.2% at level 5 and first reaches 50% at level 6 (50.9%), so the median is 6.
Sensitivity to outliers
For the following small ordered dataset, the mean and median are both 4:
1, 2, 4, 5, 8.
Suppose we add one observation to get the ordered sample:
1, 2, 4, 5, 8, 100.
The median is now 4.5, and the mean is 20. In general, the mean is affected much more than the
median by outliers, i.e. unusually small or large observations. Therefore, you should identify outliers
early on and investigate them; perhaps there has been a data entry error, which can simply be
corrected. If the outliers are deemed genuine, a decision has to be made about whether or not to remove them.
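The effect of the single outlier can be verified with Python's standard statistics module. This is a minimal sketch using the two small samples above.

```python
from statistics import mean, median

original = [1, 2, 4, 5, 8]
with_outlier = original + [100]  # add one unusually large observation

print(mean(original), median(original))          # both equal 4
print(mean(with_outlier), median(with_outlier))  # mean jumps to 20, median only to 4.5
```

The mean moves from 4 to 20, while the median moves only from 4 to 4.5.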
Skewness, means and medians
Due to its sensitivity to outliers, the mean, more than the median, is pulled toward the longer tail of
the sample distribution.
For a positively-skewed distribution, the mean is larger than the median.
For a negatively-skewed distribution, the mean is smaller than the median.
For an exactly symmetric distribution, the mean and median are equal.
When summarising variables with skewed distributions, it is useful to report both the mean and the
median.
For the datasets considered previously:
Mean Median
Mode
The (sample) mode of a variable is the value which has the highest frequency (i.e. appears most
often) in the data.
Example 1.12
For Example 1.1, the modal region is 1 (Africa) and the mode of the level of democracy is 0.
The mode is not very useful for continuous variables which have many different values, such as GDP
per capita in Example 1.1. A variable can have several modes (i.e. be multimodal). For example, GDP
per capita has modes 0.8 and 1.9, both with 5 countries out of the 155.
The mode is the only measure of central tendency which can be used even when the values of a
variable have no ordering, such as for the (nominal) region variable in Example 1.1.
In most cases where a variable is essentially continuous, the mode is not very interesting because it is
often the result of a few lucky ties.
In Excel, the mode can be calculated with the MODE function.
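Outside Excel, the mode can be found from a frequency table by picking the value with the largest frequency, and ties can be reported with statistics.multimode. The sketch below uses the frequencies of the level of democracy, plus a short hypothetical list of values (not the actual GDP per capita data) to illustrate a tie.

```python
from statistics import multimode

# Frequencies of levels of democracy 0, 1, ..., 10.
frequencies = [35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32]

# The mode is the level with the highest frequency.
mode_level = max(range(len(frequencies)), key=lambda level: frequencies[level])
print(mode_level)

# Hypothetical values in which two values tie for the highest frequency.
tied_values = [0.8, 1.9, 0.8, 3.4, 1.9]
modes = multimode(tied_values)  # all values sharing the highest frequency
print(modes)
```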
Measures of dispersion
Introduction
Central tendency is not the whole story. The two sample distributions in Figure 1.6 have the same
mean, but they are clearly not the same. In one (red) the values have more dispersion (variation) than
in the other.
Deviations from 𝑋̅:
  i   X_i   X_i²   X_i − 𝑋̅   (X_i − 𝑋̅)²
  1    1      1      -3          9
  2    2      4      -2          4
  3    3      9      -1          1
  4    5     25      +1          1
  5    9     81      +5         25
Sum   20    120       0         40
𝑋̅ = 4
The sample standard deviation of X, denoted S (or S_X), is the positive square root of the
sample variance, S^2:
\[ S = \sqrt{S^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}. \]
These are the most commonly-used measures of dispersion. The standard deviation is more
understandable than the variance, because the standard deviation is expressed in the same units as X
(rather than the variance, which is expressed in squared units).
A useful rule-of-thumb for interpretation is that for many symmetric distributions, such as the
'normal' distribution:
about 2/3 of the observations are between 𝑋̅ − 𝑆 and 𝑋̅ + 𝑆, that is, within one (sample)
standard deviation about the (sample) mean
about 95% of the observations are between 𝑋̅ − 2 × 𝑆 and 𝑋̅ + 2 × 𝑆, that is, within two
(sample) standard deviations about the (sample) mean
about 99.7% of the observations are between 𝑋̅ − 3 × 𝑆 and 𝑋̅ + 3 × 𝑆, that is, within three
(sample) standard deviations about the (sample) mean.
Remember that standard deviations (and variances) are never negative, and they are zero only if all
the 𝑋𝑖 observations are the same (that is, there is no variation in the data).
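The sample variance and standard deviation of the small dataset 1, 2, 3, 5, 9 can be checked with Python's statistics module. This sketch assumes the sample variance uses the divisor n − 1, which is what statistics.variance and statistics.stdev do, and also evaluates the equivalent shortcut based on the sum of squares.

```python
import math
from statistics import stdev, variance

observations = [1, 2, 3, 5, 9]
n = len(observations)
mean = sum(observations) / n

# Sample variance and standard deviation (divisor n - 1).
s2 = variance(observations)
s = stdev(observations)

# Equivalent shortcut: (sum of squares - n * mean^2) / (n - 1).
sum_of_squares = sum(x ** 2 for x in observations)
s2_shortcut = (sum_of_squares - n * mean ** 2) / (n - 1)

print(s2, round(s, 2), s2_shortcut)

# With no variation in the data, the variance is zero.
print(variance([7, 7, 7, 7]))
```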
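If we are using a frequency table with K distinct values X_1, …, X_K and frequencies f_1, …, f_K, where
n = f_1 + … + f_K, we can also calculate:
\[ S^2 = \frac{1}{n-1} \left( \sum_{j=1}^{K} f_j X_j^2 - n \bar{X}^2 \right). \]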
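Deviations from 𝑋̅ (= 4):
  i   X_i   X_i²   X_i − 𝑋̅   (X_i − 𝑋̅)²
  1    1      1      -3          9
  2    2      4      -2          4
  3    3      9      -1          1
  4    5     25      +1          1
  5    9     81      +5         25
Sum   20    120       0         40
We have:
\[ S^2 = \frac{40}{5 - 1} = \frac{120 - 5 \times 4^2}{5 - 1} = 10, \qquad S = \sqrt{10} \approx 3.16. \]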
The range is, clearly, extremely sensitive to outliers, since it depends on nothing but the extremes of
the distribution, i.e. the minimum and maximum observations. The IQR focuses on the middle 50% of
the distribution, so it is far less sensitive to outliers.
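The contrast can be illustrated with the small samples used earlier for the mean and median. In the Python sketch below, the range is taken as the maximum minus the minimum and the IQR as the upper quartile minus the lower quartile. The quartile convention (method="inclusive" in statistics.quantiles) is an assumption, since several conventions exist and they give slightly different quartiles for small samples.

```python
from statistics import quantiles


def value_range(values):
    """Range: largest observation minus smallest observation."""
    return max(values) - min(values)


def interquartile_range(values):
    """IQR: upper quartile minus lower quartile (inclusive convention assumed)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


original = [1, 2, 4, 5, 8]
with_outlier = original + [100]

print(value_range(original), value_range(with_outlier))
print(interquartile_range(original), interquartile_range(with_outlier))
```

Adding the observation 100 increases the range from 7 to 99, whereas the IQR changes only from 3 to 4.75.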
Figure 1.7
Excel self-study examples
Supermarket transactions
The data file Supermarket_transactions.xlsx contains over 14,000 (hypothetical) transactions of a
supermarket.
Our goal is to summarise categorical variables in a large dataset. To achieve this, each of the counts in
column S can be obtained with Excel’s COUNTIF function.
The function takes two arguments, the data range and a criterion, and works well for counting
observations in a category. To get the percentages in column T, each count is divided by the total
number of observations.
If you use a chart, be careful to use appropriate scales.
Another efficient way to find the counts and percentages for a categorical variable is to use dummy
(0-1) variables. Recode each variable so that one category is replaced by 1 and all others by 0. The
count for that category is then the sum of the resulting 0s and 1s.
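The same two approaches can be sketched outside Excel. The following Python snippet mimics a simple COUNTIF with an equality criterion and then obtains the same count by summing a dummy variable. The column values and the helper name count_if are hypothetical and are not taken from Supermarket_transactions.xlsx.

```python
# Hypothetical values of one categorical column (not taken from the data file).
genders = ["F", "M", "F", "F", "M", "F"]


def count_if(values, criterion):
    """Count the values equal to the criterion, like a simple COUNTIF."""
    return sum(1 for v in values if v == criterion)


# Approach 1: count matches, then divide by the number of observations.
count_f = count_if(genders, "F")
percent_f = 100 * count_f / len(genders)

# Approach 2: recode as a dummy (0-1) variable and sum it.
dummy_f = [1 if v == "F" else 0 for v in genders]
count_f_from_dummy = sum(dummy_f)

print(count_f, round(percent_f, 1), count_f_from_dummy)
```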
Baseball salaries
The file Baseball_salaries.xlsx contains data on 818 Major League Baseball players from 2009.
Various summary statistics are calculated.
Variability
The example contained in Variability.xlsx indicates why variability (i.e. a measure of dispersion), along
with measures of central tendency, is important.
On average, both Supplier 1 and Supplier 2 produce parts close to the target of 100. However, the
increased variability of Supplier 2 makes this supplier much less attractive.
With a standard deviation slightly larger than 25, the second empirical rule implies that about 1 out of
every 20 of this supplier’s parts will be below 50 or above 150.
Catalogue marketing
The file Catalogue_marketing.xlsx contains data on 1,000 customers of a fictional company.
With these data, it is possible to illustrate Excel tables for analysing the data.
The Table button is contained within the Insert ribbon. A number of options can be applied to tables:
a number of table styles are available for making the table attractive
in the Tools group, you can click on Convert to Range. This undesignates the range as a table
(and the drop-down arrows disappear).
in the Properties group, you can change the name of the table; you can also click on
the Resize Table button to expand or contract the table range
a particularly useful option is the Total Row in the Table Style Options group. If you check
this, a new row is appended to the bottom of the table.
58 54 33 29 73 69
6 2 14 10 26 22
59 55 48 44 64 70
71 57 20 16 59 55
30 26 24 20 11 7
38 34 82 78 70 66
36 32 95 97 31 27
33 29 12 8 92 88
72 68 93 89 115 111
1 0 51 47 23 19
27 23 22 18 34 30
22 47 50 75 36 61
79 104 96 121
Count 50.000
Mean 0.873
Median 0.885
Minimum 0.077
Maximum 1.608
Variance 0.187
Skewness -0.003