ST2187 Block 2

This document discusses univariate data visualization and descriptive statistics. It introduces graphs and charts as a way to describe the distribution of variables in a dataset and highlight interesting features. Both graphical and numerical summaries are useful for analyzing raw data. The document then discusses a sample dataset containing information on 155 countries, including their region, democracy level, and GDP per capita. It provides frequency tables that summarize the distribution of each variable in the sample.

Uploaded by

Joseph Matthew
Copyright
© All Rights Reserved

Block 2: Univariate data visualisation and descriptive statistics
An introduction video is available on the VLE:
https://emfssvideo.s3.amazonaws.com/MT%26ST/ST2187/ST2187_Block2_Intro.mp4
Starting with univariate situations, we consider ways to describe the distribution of variables. Graphs
and charts have little intrinsic value per se; their main function is to bring out interesting
features of a dataset. For this reason, simple descriptions should be preferred to complicated graphics.
Although data visualisation is useful as a preliminary form of data analysis to get a ‘feel’ for the data,
in practice we also need to be able to summarise data numerically. We review descriptive statistics
and distinguish between measures of location, measures of dispersion and skewness. All these
statistics provide useful summaries of raw datasets.
After completing this block, you should be able to:
 interpret and summarise raw data on variables graphically
 interpret and summarise raw data on variables numerically
 calculate basic measures of location and dispersion.

Readings
Albright, S. and W.L. Winston, Business Analytics: Data Analysis & Decision Making. (Cengage
Learning, 2017) 6th edition [ISBN 9781305947542] Chapter 2.

The sample distribution


Starting point
A collection of numerical data (a sample) has been collected in order to answer some
questions. Statistical analysis may have two broad aims.
1. Descriptive statistics: summarise the data which were collected, in order to make them more
understandable.
2. Statistical inference: use the observed data to draw conclusions about some broader
population.
Sometimes aim 1 is the only aim. Even when aim 2 is the main aim, aim 1 is still an essential first step.
Data do not just speak for themselves. There are usually simply too many numbers to make sense of
just by staring at them. Descriptive statistics attempt to summarise some key features of the data to
make them understandable and easy to communicate. These summaries may
be graphical or numerical (tables or individual summary statistics).
Example 1.1
We consider data for 155 countries on three variables from around 2002. The data can be found in the
file Countries.xlsx. The variables are the following.
 Region of the country.
o This is a nominal variable coded (in alphabetical order) as follows: 1 = Africa, 2 =
Asia, 3 = Europe, 4 = Latin America, 5 = Northern America, 6 = Oceania.
 The level of democracy, i.e. a democracy index, in the country.
o This is an 11-point ordinal scale from 0 (lowest level of democracy) to 10 (highest
level of democracy).
 Gross domestic product (GDP) per capita (i.e. per person, in $000s), which is measured on a
ratio scale.
The statistical data in a sample are typically stored in a data matrix, as shown in Figure 1.1.
Rows of the data matrix correspond to different units (subjects/observations). An observation (or case
or record) is a list of all variable values for a single member of a population.
 Here, each unit is a country.
The number of units in a dataset is the sample size, typically denoted by n.
 Here, n = 155 countries.
Columns of the data matrix correspond to variables, i.e. different characteristics of the units.
 Here, region, the level of democracy, and GDP per capita are the variables.

Types of variables
Different variables may have different properties. These determine which kinds of statistical methods
are suitable for the variables.
Continuous and discrete numeric variables
A continuous variable can, in principle, take any real value within some interval.
 In Example 1.1, GDP per capita is continuous, taking any non-negative value.
A variable is discrete if it is not continuous, i.e. if it can only take certain values, but not any
others.
 In Example 1.1, region and the level of democracy are discrete, with possible values of
1, 2, …, 6 and 0, 1, 2, …, 10, respectively.

Many discrete variables have only a finite number of possible values. In Example 1.1, the region
variable has 6 possible values, and the level of democracy has 11 possible values. The simplest
possibility is a binary, or dichotomous, variable, with just two possible values. For example, a
person's gender could be recorded as 1 = female and 2 = male.¹
A discrete variable can also have an unlimited number of possible values.
 For example, the number of visitors to a website in a day: 0, 1, 2, 3, 4, ….²

Example 1.2
In Example 1.1, the levels of democracy have a meaningful ordering, from less democratic to more
democratic countries. The numbers assigned to the different levels must also be in this order, i.e. a
larger number = more democratic.
In contrast, different regions (Africa, Asia, Europe, Latin America, Northern America and Oceania)
do not have such an ordering. The numbers used for the region variable are just labels for different
regions. A different numbering (such as 6 = Africa, 5 = Asia, 1 = Europe, 3 = Latin America, 2 =
Northern America and 4 = Oceania) would be just as acceptable as the one we originally used.
Some statistical methods are appropriate for variables with both ordered and unordered values,
some only in the ordered case. Unordered categories are nominal data; ordered categories
are ordinal data.

A categorical variable is ordinal if there is a natural ordering of its possible values. If there is no
natural ordering, it is nominal. Categorical variables can be coded numerically or left uncoded.
A dummy variable is a 0-1 coded variable for a specific category. It is coded as 1 for all observations
in that category and 0 for all observations not in that category. Converting a numerical variable into a
categorical one is called binning (putting the data into discrete bins) or discretising.
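As a minimal sketch of these two ideas, the Python snippet below builds a dummy variable for one category and bins a numerical variable into half-open intervals. The region codes and GDP values are hypothetical illustrations, not taken from Countries.xlsx, and the interval boundaries are an assumption chosen for the example.

```python
from bisect import bisect_right

# Hypothetical region codes for six countries (1 = Africa, 2 = Asia, 3 = Europe).
regions = [1, 3, 2, 1, 1, 2]

# Dummy variable for Africa: 1 if the country is in Africa, 0 otherwise.
africa_dummy = [1 if r == 1 else 0 for r in regions]

# Summing a dummy variable counts the observations in its category.
africa_count = sum(africa_dummy)

# Binning: hypothetical GDP per capita values (in $000s) are placed into
# half-open intervals [a, b), so a value equal to b falls in the next bin.
edges = [0, 2, 5, 10, 20, 30, 50]
labels = ["[0, 2)", "[2, 5)", "[5, 10)", "[10, 20)", "[20, 30)", "[30, 50)"]
gdp = [0.8, 1.9, 2.0, 7.5, 24.1, 37.3]

def bin_label(value):
    # bisect_right gives the index of the first edge greater than value,
    # so subtracting 1 selects the interval whose left endpoint is <= value.
    return labels[bisect_right(edges, value) - 1]

gdp_binned = [bin_label(v) for v in gdp]

print(africa_dummy, africa_count)
print(gdp_binned)
```

Note that the value 2.0 lands in [2, 5) rather than [0, 2), which is what non-overlapping half-open intervals require.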

Cross-sectional data are data on a cross section of a population at a distinct point in time. Time series
(longitudinal) data are data collected over time.

Examples of sample distributions


¹ Note that because gender is a nominal variable, the coding is arbitrary. We could also have, for example, 0 =
male and 1 = female, or 0 = female and 1 = male. However, it is important to remember which coding has been
used!
² In practice, of course, there is a finite number of internet users in the world. However, it is reasonable to treat
this variable as taking an unlimited number of possible values.

The sample distribution of a variable consists of:
 a list of the values of the variable which are observed in the sample
 the number of times each value occurs (the counts or frequencies of the observed values).
When the number of different observed values is small, we can show the whole sample distribution as
a frequency table of all the values and their frequencies.
Continuing with Example 1.1, the observations of the region variable in the sample are:

3 1 1 4 2 6 3 2 2 2 3 3 1 2 4

1 4 3 1 2 1 1 2 1 5 1 4 2 4 1

1 4 1 3 4 2 3 3 1 4 2 4 1 4 1

1 3 1 6 3 3 1 1 2 3 1 3 4 1 1

4 4 4 3 2 2 2 2 3 2 3 4 2 2 2

1 2 2 2 3 1 1 1 3 3 1 1 2 1 1

1 4 3 2 1 1 2 1 2 3 4 1 1 3 6

2 2 4 4 4 2 6 3 3 2 3 3 1 1 2

2 1 3 1 2 3 3 3 2 1 1 3 3 2 2

2 1 2 1 4 1 2 2 2 1 3 3 4 5 2

4 2 2 1 1

We may construct a frequency table for the region variable as follows:

Region                  Frequency (count)   Relative frequency (%)

(1) Africa                     48           100 × (48/155) = 31.0
(2) Asia                       45           29.0
(3) Europe                     34           21.9
(4) Latin America              23           14.8
(5) Northern America            2            1.3
(6) Oceania                     3            1.9

Total                         155           100

Here '%' is the percentage of countries in a region, out of the 155 countries in the sample. This is a
measure of proportion (that is, relative frequency).
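The relative frequencies can be reproduced from the counts alone. The sketch below uses Python (an assumption of these examples; the course itself works in Excel) with the counts copied from the frequency table for region.

```python
# Counts copied from the frequency table for the region variable.
counts = {
    "Africa": 48,
    "Asia": 45,
    "Europe": 34,
    "Latin America": 23,
    "Northern America": 2,
    "Oceania": 3,
}

n = sum(counts.values())  # sample size

# Relative frequency (%) = 100 * count / n, rounded to one decimal place.
relative = {region: round(100 * f / n, 1) for region, f in counts.items()}

for region, f in counts.items():
    print(f"{region:<17}{f:>5}{relative[region]:>7.1f}")
print(f"{'Total':<17}{n:>5}")
```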
Similarly, for the level of democracy, the frequency table is:
Level of democracy   Frequency   %   Cumulative %

0 35 22.6 22.6

1 12 7.7 30.3

2 4 2.6 32.9

3 6 3.9 36.8

4 5 3.2 40.0

5 5 3.2 43.2

6 12 7.7 50.9

7 13 8.4 59.3

8 16 10.3 69.6

9 15 9.7 79.3

10 32 20.6 100

Total 155 100

'Cumulative %' for a value of the variable is the sum of the percentages for that value and all lower-
numbered values.
A bar chart is the graphical equivalent of the table of frequencies. Figure 1.2 displays the region
variable data as a bar chart. The relative frequencies of each region are clearly visible.

Figure 1.2: Example of a bar chart showing the region variable.


If a variable has many distinct values, listing frequencies of all of them is not very practical.
A solution is to group the values into non-overlapping intervals, and produce a table or graph of the
frequencies within the intervals. The most common graph used for showing the distribution of a
numerical variable is a histogram. It is based on binning the variable, i.e. dividing it up into discrete
categories. In general, a histogram is great for showing the shape of a distribution.
A histogram is like a bar chart, but without gaps between bars, and often uses more bars (intervals of
values) than is sensible in a table. Histograms are usually drawn using statistical software, such as
Minitab, R or SPSS. You can let the software choose the intervals and the number of bars.
Continuing with Example 1.1, a table of frequencies for GDP per capita where values have been
grouped into non-overlapping intervals is shown below. Figure 1.3 shows a histogram of GDP per
capita with a greater number of intervals to better display the sample distribution.

GDP per capita (in $000s)   Frequency   %

[0, 2)                         49       31.6
[2, 5)                         32       20.6
[5, 10)                        29       18.7
[10, 20)                       21       13.5
[20, 30)                       19       12.3
[30, 50)                        5        3.2

Total                         155       100

Figure 1.3: Histogram of GDP per capita


Skewness and kurtosis of distributions
Skewness and symmetry are terms used to describe the general shape of a sample distribution.
From Figure 1.3, it is clear that a small number of countries have much larger values of GDP per
capita than the majority of countries in the sample. The distribution of GDP per capita has a 'long
right tail'. Such a distribution is called positively skewed (or skewed to the right).
A distribution with a longer left tail (i.e. toward small values) is negatively skewed (or skewed to the
left). A distribution is symmetric if it is not skewed in either direction.
Figure 1.4 shows a (more-or-less) symmetric sample distribution for diastolic blood pressure.

Figure 1.4: Diastolic blood pressure of 4,489 respondents aged 25 or over, Health Survey for
England, 2002.
In Excel, a measure of skewness can be calculated with the SKEW function.
Kurtosis has to do with the ‘fatness’ of the tails of the distribution relative to the tails of a normal
distribution. A distribution with high kurtosis has many extreme observations.
Kurtosis can be calculated in Excel with the KURT function.
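To make the idea of a skewness measure concrete, here is a sketch in Python of the adjusted sample skewness, n/((n − 1)(n − 2)) × Σ((Xi − X̄)/S)³, which is the formula Excel documents for SKEW. The function name and the two small datasets are illustrative assumptions.

```python
from math import sqrt

def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (divisor n - 1).
    s = sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)

right_skewed = [1, 2, 3, 5, 9]   # long right tail
symmetric = [1, 2, 3, 4, 5]      # no skew in either direction

print(sample_skewness(right_skewed))  # positive
print(sample_skewness(symmetric))     # zero, up to rounding error
```

A positive value indicates a longer right tail, a negative value a longer left tail, and a value near zero a roughly symmetric distribution.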

Outliers and missing values


An outlier is literally a value or an entire observation that lies well outside of the norm. Statisticians
(unfortunately!) disagree on an exact definition of an outlier. You might define an outlier as any value
more than three standard deviations from the mean, but this is only a rule of thumb.
Sometimes an outlier is easy to detect and deal with. Sometimes a careful check of the variable
values, one variable at a time, will not reveal any outliers, but there still might be unusual
combinations of values. Probably the best advice for dealing with outliers is to run the analyses two
ways: with the outliers and without them.
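The three-standard-deviations rule of thumb and the advice to run the analysis both ways can be sketched as follows. The data are hypothetical measurements invented for the illustration, with one deliberately extreme value.

```python
from statistics import mean, stdev

# Hypothetical measurements: twenty values near 50 and one extreme value.
values = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
          50, 52, 48, 50, 51, 49, 50, 52, 48, 50, 120]

centre = mean(values)
spread = stdev(values)  # sample standard deviation

# Rule of thumb: flag values more than three standard deviations from the mean.
outliers = [x for x in values if abs(x - centre) > 3 * spread]
kept = [x for x in values if abs(x - centre) <= 3 * spread]

# Run the analysis two ways: with the outliers and without them.
print("outliers:", outliers)
print("mean with outliers:", round(mean(values), 1))
print("mean without outliers:", mean(kept))
```

Because the outlier itself inflates both the mean and the standard deviation, this rule can miss outliers in small samples, which is one reason it is only a rule of thumb.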
For missing values, as with outliers, there are two issues: how to detect missing values and what to
do about them.
Missing data are coded in a variety of ways. If you know the code used, you can perform a global
search-and-replace in Excel to replace all of the missing value codes with blanks.
The more important issue is what to do about missing values.
 One option is to simply ignore them, in which case you will have to be aware of how the
software deals with missing values.
 Another option is to fill in missing values with averages of existing values. This option may
not be the best.
 Yet another option is to examine the existing values in the row of a missing value; they may
provide information on what a missing value should be.
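The first two options can be compared in a short Python sketch. The data are hypothetical, and `None` is an assumed missing-value code for the illustration.

```python
from statistics import mean, variance

# Hypothetical variable with two missing values, coded here as None.
raw = [4.0, None, 7.0, 1.0, None]

# Option 1: ignore the missing values and analyse what was observed.
observed = [x for x in raw if x is not None]

# Option 2: fill in each missing value with the average of the existing values.
fill = mean(observed)
imputed = [fill if x is None else x for x in raw]

print(mean(observed), variance(observed))
print(mean(imputed), variance(imputed))
```

Filling in with the average leaves the mean unchanged but shrinks the sample variance, because the filled-in values add no deviation from the mean. This is one reason why the option may not be the best.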

Measures of central tendency

Introduction
Frequency tables, bar charts and histograms aim to summarise the whole sample distribution of a
variable. Next we consider descriptive statistics, which summarise one feature of the sample
distribution in a single number: summary statistics.

Figure 1.5: Final examination marks of a first-year statistics course.


We begin with measures of central tendency. These answer the question: where is the 'centre' or
'average' of the distribution?
We consider the following measures of central tendency:
 mean (i.e. the average, sample mean or arithmetic mean)
 median
 mode.
If we refer back to skewness of distributions we can see that Figure 1.5 shows a (slightly) negatively-
skewed distribution of marks in an examination. Note the data relate to all candidates sitting the
examination. Therefore, the histogram shows the population distribution, not a sample distribution.

Notation for variables


In formulae, a generic variable is denoted by a single letter. In these course notes, this is usually X.
However, any other letter (Y, W etc.) can also be used, as long as it is used consistently. A letter with
a subscript denotes a single observation of a variable.
We use Xi to denote the value of X for unit i, where i can take values 1, 2, 3, …, n, and n is the sample
size.
Therefore, the n observations of X in the dataset (the sample) are X1, X2, X3, …, Xn. These can also be
written as Xi, for i = 1, …, n.

Summation notation
Let X1, X2, …, Xn (i.e. Xi, for i = 1, …, n) be a set of n numbers. The sum of the numbers is written as:

$$\sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n.$$

This may be written as Σi Xi, or just ΣXi. Other versions of the same idea are:
 infinite sums:

$$\sum_{i=1}^{\infty} X_i = X_1 + X_2 + \cdots$$

 sums of sets of observations other than 1 to n, for example:

$$\sum_{i=2}^{n/2} X_i = X_2 + X_3 + \cdots + X_{n/2}.$$
The sample mean
The sample mean ('arithmetic mean', 'mean' or 'average') is the most common measure of central
tendency. The sample mean of a variable X is denoted 𝑋̅. It is the 'sum of the observations' divided by
the 'number of observations' (sample size), expressed as:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}.$$

For Excel datasets, the mean can be calculated with the AVERAGE function.
The mean 𝑋̅ = ∑𝑖 𝑋𝑖 /𝑛 of the numbers 1, 4 and 7 is:

$$\bar{X} = \frac{1 + 4 + 7}{3} = \frac{12}{3} = 4.$$
For the variables in Example 1.1:


 the level of democracy has 𝑋̅ = 5.3
 GDP per capita has 𝑋̅ = 8.6 (in $000s)
 for region the mean is not meaningful(!), because the values of the variable do not have
a meaningful ordering.
The frequency table of the level of democracy is:

Level of democracy (Xj)   Frequency (fj)   %   Cumulative %

0 35 22.6 22.6

1 12 7.7 30.3

2 4 2.6 32.9

3 6 3.9 36.8

4 5 3.2 40.0

5 5 3.2 43.2

6 12 7.7 50.9

7 13 8.4 59.3

8 16 10.3 69.6

9 15 9.7 79.3

10 32 20.6 100

Total 155 100


If a variable has a small number of distinct values, 𝑋̅ is easy to calculate from the frequency table. For
example, the level of democracy has just 11 different values which occur in the sample 35, 12, 4, … ,
32 times each, respectively.
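Suppose X has K different values 𝑋1, 𝑋2, …, 𝑋𝐾, with corresponding frequencies 𝑓1, 𝑓2, …, 𝑓𝐾.
Therefore, $\sum_{j=1}^{K} f_j = n$ and:

$$\bar{X} = \frac{\sum_{j=1}^{K} f_j X_j}{\sum_{j=1}^{K} f_j} = \frac{f_1 X_1 + \cdots + f_K X_K}{f_1 + \cdots + f_K} = \frac{f_1 X_1 + \cdots + f_K X_K}{n}.$$

In our example, the mean of the level of democracy (where K = 11) is:

$$\bar{X} = \frac{35 \times 0 + 12 \times 1 + \cdots + 32 \times 10}{35 + 12 + \cdots + 32} = \frac{829}{155} \approx 5.3.$$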
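The frequency-table calculation of the mean can be checked with a few lines of Python. This is a sketch using the frequencies from the level-of-democracy table; the variable names are chosen for the illustration.

```python
# Level of democracy (value -> frequency), copied from the frequency table.
freq = {0: 35, 1: 12, 2: 4, 3: 6, 4: 5, 5: 5, 6: 12, 7: 13, 8: 16, 9: 15, 10: 32}

n = sum(freq.values())                        # sum of the frequencies = sample size
total = sum(x * f for x, f in freq.items())   # sum of (value * frequency)
mean = total / n

print(n, total, round(mean, 1))
```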

Why is the mean a good summary of the central tendency?


Consider the following small dataset:

Deviations:

From 𝑋̅ (= 4) From the median (= 3)

i 𝑋𝑖 𝑋𝑖 − 𝑋̅ (𝑋𝑖 − 𝑋̅)2 𝑋𝑖 − 3 (𝑋𝑖 − 3)2

1 1 -3 9 -2 4
2 2 -2 4 -1 1
3 3 -1 1 0 0
4 5 +1 1 +2 4
5 9 +5 25 +6 36

Sum 20 0 40 +5 45
𝑋̅ = 4

We see that the sum of deviations from the mean is 0, i.e. we have:

$$\sum_{i=1}^{n} (X_i - \bar{X}) = 0.$$

The mean is 'in the middle' of the observations 𝑋1, …, 𝑋𝑛, in the sense that positive and negative
values of the deviations 𝑋𝑖 − 𝑋̅ cancel out, when summed over all the observations.
Also, the smallest possible value of the sum of squared deviations $\sum_{i=1}^{n} (X_i - C)^2$ for any
constant C is obtained when 𝐶 = 𝑋̅.
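Both properties can be verified numerically for the small dataset 1, 2, 3, 5, 9. The sketch below searches a grid of candidate constants, which is an illustrative check rather than a proof.

```python
values = [1, 2, 3, 5, 9]
mean = sum(values) / len(values)

def sum_sq_dev(c):
    """Sum of squared deviations of the observations from the constant c."""
    return sum((x - c) ** 2 for x in values)

# Property 1: deviations from the mean sum to zero.
deviation_sum = sum(x - mean for x in values)

# Property 2: the sum of squared deviations is smallest at the mean.
# Search the candidate constants 0.0, 0.1, ..., 10.0.
candidates = [c / 10 for c in range(0, 101)]
best = min(candidates, key=sum_sq_dev)

print(deviation_sum)                  # 0.0
print(sum_sq_dev(mean), sum_sq_dev(3))  # 40.0 at the mean, 45 at the median
print(best)                           # 4.0
```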
The (sample) median
Let X(1), X(2), … , X(n) denote the sample values of X when ordered from the smallest to the largest,
known as the order statistics, such that:
 X(1) is the smallest observed value (the minimum) of X
 X(n) is the largest observed value (the maximum) of X.

Median

The (sample) median, q50, of a variable X is the value which is 'in the middle' of the ordered
sample.
If n is odd, then q50 = X((n + 1)/2).
 For example, if n = 3, q50 = X(2), the middle of the ordered positions (1), (2), (3).
If n is even, then q50 = (X(n/2) + X(n/2 + 1))/2.
 For example, if n = 4, q50 = (X(2) + X(3))/2, the average of the two middle positions among
(1), (2), (3), (4).

In Excel, the median can be calculated with the MEDIAN function.
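The definition translates directly into code. The function below is a sketch (Python's built-in `statistics.median` does the same job); the only subtlety is that Python lists are indexed from 0, whereas the order statistics X(1), …, X(n) are indexed from 1.

```python
def sample_median(values):
    """Median via the order statistics X(1) <= ... <= X(n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # n odd: X((n + 1)/2); subtract 1 for 0-based indexing.
        return ordered[(n + 1) // 2 - 1]
    # n even: average of X(n/2) and X(n/2 + 1).
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(sample_median([9, 1, 3, 2, 5]))   # n = 5, middle value
print(sample_median([4, 1, 3, 2]))      # n = 4, average of the two middle values
```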

Continuing with Example 1.1, n = 155, so q50 = X(78). For the level of democracy, the median is 6.
From a table of frequencies, the median is the value for which the cumulative percentage first reaches
50% (or, if a cumulative % is exactly 50%, the average of the corresponding value of X and the next
highest value).
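The ordered values of the level of democracy are shown below. Each row label gives the tens and each
column label the units of the position in the ordered sample, so that, for example, X(78) is in row (7.),
column (.8).

(.0) (.1) (.2) (.3) (.4) (.5) (.6) (.7) (.8) (.9)

(0.) - 0 0 0 0 0 0 0 0 0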

(1.) 0 0 0 0 0 0 0 0 0 0

(2.) 0 0 0 0 0 0 0 0 0 0

(3.) 0 0 0 0 0 0 1 1 1 1

(4.) 1 1 1 1 1 1 1 1 2 2

(5.) 2 2 3 3 3 3 3 3 4 4

(6.) 4 4 4 5 5 5 5 5 6 6

(7.) 6 6 6 6 6 6 6 6 6 6

(8.) 7 7 7 7 7 7 7 7 7 7

(9.) 7 7 7 8 8 8 8 8 8 8

(10.) 8 8 8 8 8 8 8 8 8 9
(11.) 9 9 9 9 9 9 9 9 9 9

(12.) 9 9 9 9 10 10 10 10 10 10

(13.) 10 10 10 10 10 10 10 10 10 10

(14.) 10 10 10 10 10 10 10 10 10 10

(15.) 10 10 10 10 10 10

The median can be determined from the frequency table of the level of democracy:

Level of democracy (Xj)   Frequency (fj)   %   Cumulative %

0 35 22.6 22.6

1 12 7.7 30.3

2 4 2.6 32.9

3 6 3.9 36.8

4 5 3.2 40.0

5 5 3.2 43.2

6 12 7.7 50.9

7 13 8.4 59.3

8 16 10.3 69.6

9 15 9.7 79.3

10 32 20.6 100

Total 155 100
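The rule for reading the median off a frequency table can be sketched in Python using cumulative counts, which avoids rounding in the percentages. Percentages computed this way may differ in the last digit from the table, which adds up percentages that were already rounded.

```python
from itertools import accumulate

# Level of democracy (value -> frequency), copied from the frequency table.
freq = {0: 35, 1: 12, 2: 4, 3: 6, 4: 5, 5: 5, 6: 12, 7: 13, 8: 16, 9: 15, 10: 32}

values = list(freq)
n = sum(freq.values())
cum_counts = list(accumulate(freq.values()))   # running totals of the frequencies
cum_pct = [100 * c / n for c in cum_counts]    # cumulative percentages

def median_from_table():
    for j, c in enumerate(cum_counts):
        if 2 * c == n:
            # Cumulative % is exactly 50: average this value and the next one.
            return (values[j] + values[j + 1]) / 2
        if 2 * c > n:
            # First value at which the cumulative % exceeds 50.
            return values[j]

median = median_from_table()
print(median)
```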

Sensitivity to outliers
For the following small ordered dataset, the mean and median are both 4:
1, 2, 4, 5, 8.
Suppose we add one observation to get the ordered sample:
1, 2, 4, 5, 8, 100.
The median is now 4.5, and the mean is 20. In general, the mean is affected much more than the
median by outliers, i.e. unusually small or large observations. Therefore, you should identify outliers
early on and investigate them; perhaps there has been a data entry error, which can simply be
corrected. If they are deemed genuine outliers, a decision has to be made about whether or not to remove them.
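The effect of the extra observation can be confirmed with Python's `statistics` module, as a quick check of the numbers quoted above.

```python
from statistics import mean, median

before = [1, 2, 4, 5, 8]
after = before + [100]   # add one unusually large observation

print(mean(before), median(before))   # both 4
print(mean(after), median(after))     # mean jumps to 20, median moves to 4.5
```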
Skewness, means and medians
Due to its sensitivity to outliers, the mean, more than the median, is pulled toward the longer tail of
the sample distribution.
 For a positively-skewed distribution, the mean is typically larger than the median.
 For a negatively-skewed distribution, the mean is typically smaller than the median.
 For an exactly symmetric distribution, the mean and median are equal.
When summarising variables with skewed distributions, it is useful to report both the mean and the
median.
For the datasets considered previously:

Mean Median

Level of democracy 5.3 6

GDP per capita 8.6 4.7

Diastolic blood pressure 74.2 73.5

Examination marks 59.7 60.5

Mode
The (sample) mode of a variable is the value which has the highest frequency (i.e. appears most
often) in the data.

Example 1.12
For Example 1.1, the modal region is 1 (Africa) and the mode of the level of democracy is 0.

The mode is not very useful for continuous variables which have many different values, such as GDP
per capita in Example 1.1. A variable can have several modes (i.e. be multimodal). For example, GDP
per capita has modes 0.8 and 1.9, both with 5 countries out of the 155.
The mode is the only measure of central tendency which can be used even when the values of a
variable have no ordering, such as for the (nominal) region variable in Example 1.1.
In most cases where a variable is essentially continuous, the mode is not very interesting because it is
often the result of a few lucky ties.
In Excel, the mode can be calculated with the MODE function.
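A short Python sketch shows both the single-mode and the multimodal case. The region counts come from the frequency table; the list with tied values is hypothetical. `statistics.multimode` requires Python 3.8 or later.

```python
from collections import Counter
from statistics import multimode

# Region counts copied from the frequency table.
region_counts = Counter({
    "Africa": 48, "Asia": 45, "Europe": 34,
    "Latin America": 23, "Northern America": 2, "Oceania": 3,
})

# The mode is the category with the highest frequency; no ordering is needed.
modal_region, modal_count = region_counts.most_common(1)[0]

# Hypothetical continuous-style data in which two values tie for most frequent.
tied = [0.8, 1.9, 0.8, 3.2, 1.9]
modes = multimode(tied)   # all most-frequent values, in order of first appearance

print(modal_region, modal_count)
print(modes)
```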

Minimum, maximum, percentiles and quartiles


For any percentage p, the pth percentile is the value such that a percentage p of all values are less
than it.
The quartiles divide the data into four groups, each with (approximately) a quarter of all observations.
Naturally, the first, second and third quartiles are the percentiles corresponding to p = 25%, p = 50%
and p = 75%.
By definition, the second quartile (p = 50%) is equal to the median.
The minimum and maximum values can be calculated with Excel’s MIN and MAX functions, and the
percentiles and quartiles with Excel’s PERCENTILE and QUARTILE functions.
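Outside Excel, the same summaries can be sketched in Python (3.8 or later). Software packages differ in how they interpolate between observations, so quartiles can vary slightly between tools. The `method="inclusive"` option used here treats the minimum and maximum as the 0th and 100th percentiles; that it agrees with Excel's PERCENTILE and QUARTILE is an assumption of this example.

```python
from statistics import quantiles

values = [1, 2, 3, 5, 9]

minimum, maximum = min(values), max(values)

# n=4 splits the data into four groups, returning the three cut points
# (first quartile, median, third quartile).
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

print(minimum, q1, q2, q3, maximum)
```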

Measures of dispersion

Introduction
Central tendency is not the whole story. The two sample distributions in Figure 1.6 have the same
mean, but they are clearly not the same. In one (red) the values have more dispersion (variation) than
in the other.

Figure 1.6: Two sample distributions.


A small example determining the sum of the squared deviations from the (sample) mean, used to
calculate common measures of dispersion.

Deviations from 𝑋̅

i 𝑋𝑖 𝑋𝑖2 𝑋𝑖 − 𝑋̅ (𝑋𝑖 − 𝑋̅)2

1 1 1 -3 9
2 2 4 -2 4
3 3 9 -1 1
4 5 25 +1 1
5 9 81 +5 25

Sum 20 120 0 40
𝑋̅ = 4

Variance and standard deviation


The first measures of dispersion, the sample variance and its square root, the sample standard
deviation, are based on (𝑋𝑖 − 𝑋̅)2 i.e. the squared deviations from the mean.

Sample variance and standard deviation


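The sample variance of a variable X, denoted $S^2$ (or $S_X^2$), is defined as:

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.$$

The sample standard deviation of X, denoted $S$ (or $S_X$), is the positive square root of the
sample variance:

$$S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}.$$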
These are the most commonly-used measures of dispersion. The standard deviation is more
understandable than the variance, because the standard deviation is expressed in the same units as X
(rather than the variance, which is expressed in squared units).
A useful rule-of-thumb for interpretation is that for many symmetric distributions, such as the
'normal' distribution:
 about 2/3 of the observations are between 𝑋̅ − 𝑆 and 𝑋̅ + 𝑆, that is, within one (sample)
standard deviation about the (sample) mean
 about 95% of the observations are between 𝑋̅ − 2 × 𝑆 and 𝑋̅ + 2 × 𝑆, that is, within two
(sample) standard deviations about the (sample) mean
 about 99.7% of the observations are between 𝑋̅ − 3 × 𝑆 and 𝑋̅ + 3 × 𝑆, that is, within three
(sample) standard deviations about the (sample) mean.
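For an exactly normal distribution, these proportions can be computed rather than memorised. The sketch below uses `statistics.NormalDist` (Python 3.8 or later) and applies to the theoretical normal curve, so real samples will only match approximately.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal distribution: mean 0, standard deviation 1

# Proportion of a normal distribution within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} standard deviation(s): {100 * p:.1f}%")
```

The results, roughly 68.3%, 95.4% and 99.7%, correspond to the 'about 2/3', 'about 95%' and 'about 99.7%' in the rule of thumb.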
Remember that standard deviations (and variances) are never negative, and they are zero only if all
the 𝑋𝑖 observations are the same (that is, there is no variation in the data).
If we are using a frequency table, we can also calculate:

$$S^2 = \frac{1}{n-1} \left( \sum_{j=1}^{K} f_j X_j^2 - n\bar{X}^2 \right).$$
Consider the following simple dataset:

                  Deviations from 𝑋̅

i     𝑋𝑖    𝑋𝑖²    𝑋𝑖 − 𝑋̅    (𝑋𝑖 − 𝑋̅)²

1      1     1      -3        9
2      2     4      -2        4
3      3     9      -1        1
4      5    25      +1        1
5      9    81      +5       25

Sum   20   120       0       40

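We have:

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{40}{4} = 10 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right) = \frac{120 - 5 \times 4^2}{4}$$

and 𝑆 = √𝑆² = √10 = 3.16.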


To calculate the variance in Excel, the VAR (sample) or VARP (population) function is used. To
calculate the standard deviation, the STDEV (sample) or STDEVP (population) function is used.
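The distinction between the sample and population versions is the divisor: n − 1 for the sample variance and n for the population variance. A sketch using Python's `statistics` module on the dataset 1, 2, 3, 5, 9:

```python
from statistics import pstdev, pvariance, stdev, variance

values = [1, 2, 3, 5, 9]

s2 = variance(values)        # sample variance, divisor n - 1 (compare VAR)
sigma2 = pvariance(values)   # population variance, divisor n (compare VARP)
s = stdev(values)            # sample standard deviation (compare STDEV)
sigma = pstdev(values)       # population standard deviation (compare STDEVP)

print(s2, round(s, 2))
print(sigma2, round(sigma, 2))
```

The sum of squared deviations is 40 in both cases; dividing by 4 gives 10, and dividing by 5 gives 8.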
Sample quantiles
The median, q50, is the value which divides the sample into the smallest 50% of
observations and the largest 50%. If we consider other percentage splits, we get other
(sample) quantiles (percentiles), qc.
Some special quantiles are given below.
 The first quartile, q25 or Q1, is the value which divides the sample into the smallest 25% of
observations and the largest 75%.
 The third quartile, q75 or Q3, gives the 75%–25% split.
 The extremes in this spirit are the minimum, X(1) (the '0% quantile', so to speak), and
the maximum, X(n) (the '100% quantile').
These are no longer 'in the middle' of the sample, but they are more general measures of location of
the sample distribution.

Quantile-based measures of dispersion


Range and interquartile range
Two measures based on quantile-type statistics are the:
 range: X(n) − X(1) = maximum − minimum
 interquartile range (IQR): IQR = 𝑞75 − 𝑞25 = 𝑄3 − 𝑄1

The range is, clearly, extremely sensitive to outliers, since it depends on nothing but the extremes of
the distribution, i.e. the minimum and maximum observations. The IQR focuses on the middle 50% of
the distribution, so it is largely insensitive to outliers.

Mean absolute deviation


The mean absolute deviation (MAD) is another measure of variability, although some might think it
'mad' to use it! (The MAD is technically problematic as it is not differentiable.)
For many variables, the standard deviation is approximately 25% larger than the MAD:
𝑠 ≈ 1.25 × MAD
The formula for the MAD is:

$$\text{MAD} = \frac{1}{n} \sum_{i=1}^{n} |X_i - \bar{X}|.$$
To calculate the MAD in Excel, the AVEDEV function is used.
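As a sketch, the MAD of the dataset 1, 2, 3, 5, 9 can be computed in Python as the average of the absolute deviations from the mean.

```python
from statistics import stdev

values = [1, 2, 3, 5, 9]
n = len(values)
mean = sum(values) / n

# Mean absolute deviation: average distance of the observations from the mean.
mad = sum(abs(x - mean) for x in values) / n

print(mad)                            # (3 + 2 + 1 + 1 + 5) / 5
print(round(stdev(values) / mad, 2))  # ratio of standard deviation to MAD
```

For this small skewed dataset the ratio is about 1.32, close to but not exactly 1.25, which is why the relationship is only approximate.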


Boxplots
A boxplot (in full, a box-and-whiskers plot) summarises some key features of a sample distribution
using quantiles. The plot comprises the following.
 The line inside the box, which is the median.
 The box, whose edges are the first and third quartiles (Q1 and Q3). Hence the box captures the
middle 50% of the data. Therefore, the length of the box is the interquartile range.
 The bottom whisker extends either to the minimum or up to a length of 1.5 times the
interquartile range below the first quartile, whichever is closer to the first quartile.
 The top whisker extends either to the maximum or up to a length of 1.5 times the interquartile
range above the third quartile, whichever is closer to the third quartile.
 Points beyond 1.5 times the interquartile range below the first quartile or above the third
quartile are regarded as outliers, and plotted as individual points.
A much longer whisker (and/or outliers) in one direction relative to the other indicates a skewed
distribution, as does a median line not in the middle of the box.
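The ingredients of a boxplot, as described above, can be computed without drawing anything. The following Python sketch follows the whisker convention given in the list; some software instead draws each whisker to the most extreme observation that is not an outlier. The function name, the dataset and the `method="inclusive"` quartile rule are assumptions of the example.

```python
from statistics import quantiles

def boxplot_summary(values):
    """Quantities needed to draw a boxplot under the 1.5 x IQR convention."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Each whisker stops at the extreme value or the fence, whichever is
        # closer to the box.
        "lower_whisker": max(min(values), lower_fence),
        "upper_whisker": min(max(values), upper_fence),
        # Points beyond the fences are plotted individually as outliers.
        "outliers": [x for x in values if x < lower_fence or x > upper_fence],
    }

summary = boxplot_summary([1, 2, 3, 5, 9, 20])
print(summary)
```

Here the upper whisker is much longer than the lower one and there is an outlier on the high side, both of which point to a positively-skewed distribution.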
Figure 1.7 displays a boxplot of GDP per capita using the sample of 155 countries introduced
in Example 1.1. Some summary statistics for this variable are reported below.

                  Mean   Median   Standard deviation   IQR   Range

GDP per capita     8.6     4.7           9.5           9.7   37.3

Figure 1.7: Boxplot of GDP per capita.
Excel self-study examples

Data from an environmental survey


We consider US environmental survey data in the file Questionnaire_data.xlsx.
The variables are each person's age, gender, state of residence, number of children, annual salary, and
opinion of the president's environmental policies. Variable names should be concise but
meaningful. An index of the observations is often included in column A.

Supermarket transactions
The data file Supermarket_transactions.xlsx contains over 14,000 (hypothetical) transactions of a
supermarket.
Our goal is to summarise categorical variables in a large dataset. To achieve this, each of the counts in
column S can be obtained with Excel’s COUNTIF function.
The function takes two arguments: the data range and a criterion, and works well for counting
observations in a category. To get the percentages in column T, each count is divided by the total
number of observations.
If you use a chart, be careful to use appropriate scales.
Another efficient way to find the counts and percentages for a categorical variable is to use dummy
(0-1) variables. Recode each variable so that one category is replaced by 1 and all others by 0. Now,
find the count of one category by summing 0s and 1s.

Baseball salaries
The file Baseball_salaries.xlsx contains data on 818 Major League Baseball players from 2009.
Various summary statistics are calculated.

Variability
The example contained in Variability.xlsx indicates why variability (i.e. a measure of dispersion), along
with measures of central tendency, is important.
On average, both Supplier 1 and Supplier 2 produce parts close to the target of 100. However, the
increased variability of Supplier 2 makes this supplier much less attractive.
With a standard deviation slightly larger than 25, the second empirical rule implies that about 1 out of
every 20 of this supplier’s parts will be below 50 or above 150.

Catalogue marketing
The file Catalogue_marketing.xlsx contains data on 1,000 customers of a fictional company.
With these data, it is possible to illustrate Excel tables for analysing the data.
The Table button is contained within the Insert ribbon. A number of options can be applied to tables:
 a number of table styles are available for making the table attractive
 in the Tools group, you can click on Convert to Range. This undesignates the range as a table
(and the drop-down arrows disappear).
 in the Properties group, you can change the name of the table; you can also click on
the Resize Table button to expand or contract the table range
 a particularly useful option is the Total Row in the Table Style Options group. If you check
this, a new row is appended to the bottom of the table.

Block 2 learning activities

Conceptual questions on the block’s topics


Solutions can be viewed on the VLE through the link below:
https://emfss.elearning.london.ac.uk/mod/book/view.php?id=28421&chapterid=7561
1. The number of children living in each of a large number of randomly-selected households is
an example of which data type?
2. Does it make sense to construct a histogram for the city of residence of randomly-selected
individuals in a sample? Explain why or why not.
3. Characterise the likely shape of a histogram of the distribution of examination scores in a
statistics course.
4. A researcher is interested in determining whether there is a relationship between the number
of room air-conditioning units sold each week and the time of year. What type of descriptive
chart would be most useful in performing this analysis?
5. Suppose that the histogram of a given income distribution is positively-skewed. What does
this fact imply about the relationship between the mean and median of this distribution?
6. Explain why the standard deviation would likely not be a reliable measure of variability for a
distribution of data which includes at least one extreme outlier.
7. Explain how a boxplot can be used to determine whether the associated distribution of values
is essentially symmetric.
8. Suppose that you collect a random sample of 250 salaries for the salespersons employed by a
large PC manufacturer. Furthermore, assume that you find that two of these salaries are
considerably higher than the others in the sample. Before analysing this dataset, should you
delete the unusual observations? Explain why or why not.

Spreadsheet solutions to the end-of-chapter cases


Solutions can be downloaded from the VLE at the link below:
https://emfss.elearning.london.ac.uk/mod/book/view.php?id=28421&chapterid=7590

Practice problems – Boxplots


A manager for Marko Manufacturing has recently been hearing some complaints that women are
being paid less than men for the same type of work in one of their manufacturing plants. The boxplots
shown below represent the annual salaries for all salaried workers in that facility (40 men and 34
women).
Solutions can be viewed on the VLE through the link below:
https://emfss.elearning.london.ac.uk/mod/book/view.php?id=28421&chapterid=7608
a. Would you conclude that there is a difference between the salaries of women and men in this
plant? Justify your answer.
b. How large must a person’s salary be to qualify as an outlier on the high side? How
many outliers are there in these data?
c. What can you say about the shape of the distributions given the accompanying boxplots?
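
For part b, a common boxplot convention (assumed here; check which rule your software uses) flags a value as an outlier when it lies more than 1.5 interquartile ranges beyond the nearer quartile. A minimal sketch of that rule follows; the quartile values are hypothetical and are not read from the boxplots in this problem.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical quartiles (in $000s), for illustration only.
lower, upper = outlier_fences(q1=40.0, q3=60.0)
print(f"Values below {lower} or above {upper} would be flagged as outliers")
```

With the quartiles read from the actual boxplots, the upper fence gives the salary threshold asked for in part b.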

Practice problem – Income comparisons


The data shown below contain family incomes (in $000s) for a set of 50 families sampled in 2007 and
2017. Assume that these families are good representatives of the entire population.

2007  2017    2007  2017    2007  2017
  58    54      33    29      73    69
   6     2      14    10      26    22
  59    55      48    44      64    70
  71    57      20    16      59    55
  30    26      24    20      11     7
  38    34      82    78      70    66
  36    32      95    97      31    27
  33    29      12     8      92    88
  72    68      93    89     115   111
 100    96     100   102      62    58
   1     0      51    47      23    19
  27    23      22    18      34    30
  22    47      50    75      36    61
 141   166     124   149     125   150
  72    97     113   138     121   146
 165   190     118   143      88   113
  79   104      96   121

Solutions can be viewed on the VLE through the link below:


https://emfss.elearning.london.ac.uk/mod/book/view.php?id=28421&chapterid=7609
a. Find the mean, median, standard deviation, first and third quartiles, and the 95th percentile for
family incomes in both years.
b. A political figure running for re-election claimed that the country was better off in 2017 than
in 2007, because the average income increased. Do you agree? Explain your answer.
c. Generate a boxplot to summarise the data. What does the boxplot indicate?
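
The summary measures in part a can also be computed outside Excel. The sketch below uses Python's standard `statistics` module on the 50 income pairs from the table above. It assumes the sample standard deviation (divisor n − 1) and the "inclusive" interpolation method for quartiles and percentiles; other conventions give slightly different quartile values, so small differences from the spreadsheet solution are possible.

```python
import statistics

# Family incomes in $000s, transcribed from the table (same family order in both lists).
income_2007 = [
    58, 33, 73, 6, 14, 26, 59, 48, 64, 71, 20, 59, 30, 24, 11, 38, 82, 70,
    36, 95, 31, 33, 12, 92, 72, 93, 115, 100, 100, 62, 1, 51, 23, 27, 22, 34,
    22, 50, 36, 141, 124, 125, 72, 113, 121, 165, 118, 88, 79, 96,
]
income_2017 = [
    54, 29, 69, 2, 10, 22, 55, 44, 70, 57, 16, 55, 26, 20, 7, 34, 78, 66,
    32, 97, 27, 29, 8, 88, 68, 89, 111, 96, 102, 58, 0, 47, 19, 23, 18, 30,
    47, 75, 61, 166, 149, 150, 97, 138, 146, 190, 143, 113, 104, 121,
]


def summarise(values):
    """Return the summary measures requested in part a."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # With n=20 the cut points are at 5%, 10%, ..., 95%; the last is the 95th percentile.
    p95 = statistics.quantiles(values, n=20, method="inclusive")[-1]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "q1": q1,
        "q3": q3,
        "p95": p95,
    }


for year, values in (("2007", income_2007), ("2017", income_2017)):
    print(year, {name: round(value, 2) for name, value in summarise(values).items()})
```

Comparing how the mean and the median each change between the two years is a useful starting point for part b.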

Practice problem – Customer services performance


In an effort to provide more consistent customer service, the manager of a local fast food restaurant
would like to know the dispersion of customer service times in relation to their average value for the
facility’s drive-up window. The table below provides summary measures for the customer service
times (in minutes) for a sample of 50 customers collected over the past week.

Count                50.000
Mean                  0.873
Median                0.885
Standard deviation    0.432
Minimum               0.077
Maximum               1.608
Variance              0.187
Skewness             -0.003

Solutions can be viewed on the VLE through the link below:


https://emfss.elearning.london.ac.uk/mod/book/view.php?id=28421&chapterid=7610
a. Interpret the variance and standard deviation of this sample.
b. Are the empirical rules applicable in this case? If so, apply them and interpret your results. If
not, explain why the empirical rules are not applicable here.
c. Explain why the mean is slightly lower than the median in this case.
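
As a rough aid for parts a and b, the sketch below works only from the summary measures in the table. It checks that the variance is the square of the standard deviation, expresses the dispersion relative to the mean (the coefficient of variation), and lists the intervals that the empirical rules would associate with roughly 68%, 95% and 99.7% of observations. Whether those rules apply is a judgement about the shape of the distribution that the code does not make.

```python
# Summary measures copied from the table (service times in minutes).
mean, sd, variance = 0.873, 0.432, 0.187
minimum, maximum = 0.077, 1.608

print("sd squared:", round(sd ** 2, 3), "reported variance:", variance)
print("coefficient of variation:", round(sd / mean, 3))


def empirical_interval(mean, sd, k):
    """Return the interval mean ± k standard deviations."""
    return mean - k * sd, mean + k * sd


for k, share in ((1, "68%"), (2, "95%"), (3, "99.7%")):
    low, high = empirical_interval(mean, sd, k)
    print(f"k={k}: about {share} of times expected in [{low:.3f}, {high:.3f}]")

print("observed range:", (minimum, maximum))
```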

Practice problem – Retail sales

The data below represent monthly sales for two years of beanbag animals at a local retail store
(Month 1 represents January and Month 12 represents December). Given the time series plot below,
do you see any obvious patterns in the data? Explain.

Solutions can be viewed on the VLE through the link below:


https://emfss.elearning.london.ac.uk/mod/book/view.php

Practice problem – Job applicants


The histogram below represents scores achieved by 250 job applicants on a personality profile.

a. What percentage of the job applicants scored between 30 and 40?
b. What percentage of the job applicants scored below 60?
c. How many job applicants scored between 10 and 30?
d. How many job applicants scored above 50?
e. Seventy percent of the job applicants scored above what value?
f. Half of the job applicants scored below what value?
Solutions can be viewed on the VLE through the link below:
https://emfss.elearning.london.ac.uk/mod/book/view.php?id=28421&chapterid=7612

Textbook problem – Descriptive measures for categorical variables


The file DJIA Monthly Close.xlsx contains monthly values of the Dow Jones Industrial Average from
1950 through to July 2015. It also contains the percentage changes from month to month. Create a
new column for recoding the percentage changes into six categories:
o Large negative (< −3%)
o Medium negative (< −1%, ≥ −3%)
o Small negative (< 0%, ≥ −1%)
o Small positive (< 1%, ≥ 0%)
o Medium positive (< 3%, ≥ 1%)
o Large positive (≥ 3%).
Then create a column chart of the counts of this categorical variable. Comment on its shape.

A solution file is available here: Chapter 2_Problem_5_solutions.xlsx
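
The recoding step can be sketched in Python as a chain of threshold checks that mirrors the six categories listed above. The sample values are hypothetical and are not taken from DJIA Monthly Close.xlsx. The sketch also assumes that changes are expressed in percentage points (so 3 means 3%); if the file stores them as fractions, the thresholds would need to be divided by 100.

```python
from collections import Counter

CATEGORIES = [
    "Large negative", "Medium negative", "Small negative",
    "Small positive", "Medium positive", "Large positive",
]


def recode(pct_change):
    """Map a percentage change (in percentage points) to one of six categories."""
    if pct_change < -3:
        return "Large negative"
    if pct_change < -1:
        return "Medium negative"
    if pct_change < 0:
        return "Small negative"
    if pct_change < 1:
        return "Small positive"
    if pct_change < 3:
        return "Medium positive"
    return "Large positive"


# Hypothetical monthly percentage changes, for illustration only.
changes = [-4.2, -1.5, -0.3, 0.0, 0.8, 2.1, 3.0, 5.6]
counts = Counter(recode(change) for change in changes)

# Print counts in category order, as they would appear along a column chart's axis.
for label in CATEGORIES:
    print(f"{label:<16}{counts[label]}")
```

Because each check is only reached when the earlier ones fail, every value falls into exactly one category, including the boundary values −3, −1, 0, 1 and 3.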

Textbook problem – Descriptive measures for numerical variables


The file P02_18.xlsx contains daily values of the Standard & Poor’s 500 Index from 1970 to mid-
2015. It also contains percentage changes in the index from each day to the next.
a. Create a histogram of the percentage changes and describe its shape.
b. Check the percentage of these percentage changes that are more than k standard deviations
from the mean for k = 1, 2, 3, 4 and 5. Are these approximately what the empirical rules
indicate or are there ‘fat’ tails? Do you think this has any real implications for the financial
markets?
A solution file is available here: Chapter_2_Problem_18_solutions.xlsx
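
One way to approach part b, sketched here with hypothetical data rather than the contents of P02_18.xlsx, is to compute the share of observations lying more than k standard deviations from the mean and to set it beside the share a normal distribution would imply, P(|Z| > k) = erfc(k/√2). Observed shares well above the normal ones at large k would suggest ‘fat’ tails.

```python
import math
import statistics


def share_beyond(values, k):
    """Share of observations more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return sum(abs(value - mean) > k * sd for value in values) / len(values)


def normal_share_beyond(k):
    """Two-sided tail probability P(|Z| > k) for a standard normal variable."""
    return math.erfc(k / math.sqrt(2))


# Hypothetical daily percentage changes with one extreme value, for illustration only.
changes = [0.1, -0.2, 0.3, -0.1, 0.0, 0.2, -0.3, 0.1, -0.1, 6.0]

for k in range(1, 6):
    observed = share_beyond(changes, k)
    expected = normal_share_beyond(k)
    print(f"k={k}: observed {observed:.4f}, normal {expected:.6f}")
```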

Textbook problem – Time series data


The file P02_28.xlsx contains total monthly US retail sales data for a number of years. There are
really two series, one of actual sales and one of seasonally adjusted sales. The latter adjusts for any
possible seasonality, such as higher sales in December and lower sales in February, so that any trends
are more apparent.
a. Create a graph of both time series and comment on any observable trends, including a
possible seasonal pattern, in the data. Does seasonal adjustment make a difference? How?
b. Based on your time series graph of actual sales, make a qualitative projection about the total
retail sales levels for the next 12 months. Specifically, in which months of the subsequent
year do you expect retail sales levels to be highest? In which months of the subsequent year
do you expect retail sales levels to be lowest?
A solution file is available here: Chapter_2_Problem_28_solutions.xlsx

Textbook problem – Outliers and missing values

Sometimes it is possible that missing data are predictive in the sense that rows with missing data are
somehow different from rows without missing data. Check this with the file P02_32.xlsx, which
contains blood pressures for 1000 (fictional) people, along with variables that can be related to blood
pressure. These other variables have a number of missing values, presumably because the people did
not want to report certain information.
a. For each of these other variables, find the mean and standard deviation of blood pressure for
all people without missing values and for all people with missing values. Can you conclude
that the presence or absence of data for any of these other variables has anything to do with
blood pressure?
b. Some analysts suggest filling in missing data for a variable with the mean of the non-missing
values for that variable. Do this for the missing data in the blood pressure data. In general, do
you think this is a valid way of filling in missing data? Why or why not?
A solution file is available here: Chapter_2_Problem_32_solutions.xlsx
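
The two steps in this problem can be sketched as follows. The records are hypothetical (they are not the contents of P02_32.xlsx), and `None` is assumed to mark a missing value. The first function splits blood pressures by whether another variable is missing. The second fills missing entries with the mean of the observed ones.

```python
import statistics

# Hypothetical (blood_pressure, other_variable) pairs; None marks a missing value.
records = [(120, 2), (135, None), (128, 0), (142, None), (118, 1), (131, 3)]


def split_by_missing(records):
    """Return blood pressures for rows with and without the other variable."""
    present = [bp for bp, other in records if other is not None]
    missing = [bp for bp, other in records if other is None]
    return present, missing


def impute_mean(values):
    """Replace each None with the mean of the non-missing values."""
    observed = [value for value in values if value is not None]
    fill = statistics.mean(observed)
    return [fill if value is None else value for value in values]


present, missing = split_by_missing(records)
for label, group in (("present", present), ("missing", missing)):
    print(label, round(statistics.mean(group), 2), round(statistics.stdev(group), 2))

filled = impute_mean([other for _, other in records])
print(filled)
```

Note that mean imputation leaves the mean of the variable unchanged but reduces its standard deviation, because every filled-in value sits exactly at the mean.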

Textbook problem – Excel tables for filtering, sorting and summarising


The file P02_35.xlsx contains data from a survey of 500 randomly selected households. Use Excel
filters to answer the following questions.
a. What are the average monthly home mortgage payment, average monthly utility bill, and
average total debt (excluding the home mortgage) of all homeowners residing in the southeast
part of the city?
b. What are the average monthly home mortgage payment, average monthly utility bill, and
average total debt (excluding the home mortgage) of all homeowners residing in the
northwest part of the city? How do these results compare to those found in part a?
c. What is the average annual income of the first household wage earners who rent their home
(house or apartment)? How does this compare to the average annual income of the first
household wage earners who own their home?
d. What proportion of the surveyed households contains a single person who owns his or her
home?
A solution file is available here: Chapter_2_Problem_35_solutions.xlsx

Textbook problem – Consolidation exercise


The file P02_51.xlsx contains data on US home-ownership rates.
a. Employ numerical summary measures to characterise the changes in home-ownership rates
across the country during this period.
b. Do the trends appear to be uniform across the United States or are they unique to certain
regions of the country? Explain.
A solution file is available here: Chapter_2_Problem_51_solutions.xlsx

Block 2: Test your understanding


This activity is a multiple-choice quiz. Please complete it on the VLE using the link below:
https://emfss.elearning.london.ac.uk/mod/quiz/view.php?id=28446

Learning Outcomes Checklist

Use this to assess your own understanding of the chapter. You can always go back and amend the
checklist when it comes to revision!
 Interpret and summarise raw data on variables graphically
 Interpret and summarise raw data on variables numerically
 Calculate basic measures of location and dispersion
