LECTURE 12
Today is mainly about outliers
1) Definitions
What do we mean by an outlier, exactly?
2) Motivation
Do outliers really matter?
3) Detection
How to detect outliers?
4) Treatment
How to deal with outliers?
Definitions
What is an outlier?
Other classical definitions
What causes outliers?
A dilemma
Do outliers matter?
Theory first
▪ Three papers:
I. Frank Cowell and Maria-Pia Victoria-Feser (1996a)
II. Frank Cowell and Emmanuel Flachaire (2007) (*)
III. Frank Cowell and Maria-Pia Victoria-Feser (1996b)
Outliers and inequality measures – I
Cowell and Victoria-Feser (1996a)
Sensitivity of the Gini index to extreme values
[Figure: Gini index under iterative trimming; values shown: 84%, 74%, 72%, 70%, 68%]
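To make the sensitivity concrete, here is a minimal sketch in Python. The data are hypothetical, and reading "iterative trimming" as "drop the largest value, recompute, repeat" is an assumption, not necessarily the procedure behind the figure. The function `gini` uses the standard rank-weighted formula.

```python
def gini(values):
    """Gini index of non-negative values, from the rank-weighted sum formula."""
    s = sorted(values)
    n = len(s)
    # G = 2 * sum_i(i * x_(i)) / (n * sum_i x_(i)) - (n + 1) / n, ranks i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(s, start=1))
    return 2 * weighted / (n * sum(s)) - (n + 1) / n

# Hypothetical data: 99 moderate incomes plus one extreme value.
clean = [1000 + 10 * i for i in range(99)]
contaminated = clean + [1_000_000]

print(f"Gini, clean sample:     {gini(clean):.3f}")
print(f"Gini, with one extreme: {gini(contaminated):.3f}")

# Assumed reading of "iterative trimming": drop the largest value, recompute, repeat.
sample = sorted(contaminated)
for dropped in range(4):
    kept = sample[:len(sample) - dropped]
    print(f"top {dropped} dropped -> Gini = {gini(kept):.3f}")
```

A single extreme observation out of 100 is enough to move the index substantially, which is the point of the slide.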
Outliers and poverty measures
Cowell and Victoria-Feser (1996b)
Visual inspection
▪ Our procedures are part graphical, and part automatic. For each
commodity, we draw histograms and one-way plots of the logarithms
of the unit values, using each to detect the presence of gross outliers
for further investigations. […] [Automatic method] does not remove
the need for the graphical inspection
(Deaton and Tarozzi 2005)
Visual inspection
Malawi IHS3, Cassava tuber expenditure
▪ Example 1: look at descriptive statistics
Statistical methods
▪ The literature is rich with methods to identify outliers; in practice,
most methods used in empirical work hinge on the underlying
distribution of the data.
Why 2.5, 3, or any other number?
Deaton and Tarozzi (2005)
In the case of India, Deaton and Tarozzi (2005) flagged as outliers prices whose logarithms
exceeded the mean of logarithms by more than 2.5 standard deviations:
\[ \frac{\ln x - E[\ln x]}{\operatorname{sd}(\ln x)} > 2.5 \]
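The rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function name `flag_log_outliers` and the price data are hypothetical, and the sample standard deviation is used for sd.

```python
import math
import statistics

def flag_log_outliers(values, threshold=2.5):
    """Flag values whose log exceeds the mean of logs by more than `threshold` SDs."""
    logs = [math.log(v) for v in values]   # requires strictly positive values
    mean_log = statistics.mean(logs)
    sd_log = statistics.stdev(logs)        # sample standard deviation (assumption)
    # One-sided rule, as in the text; a two-sided rule would compare abs(...) instead.
    return [(lx - mean_log) / sd_log > threshold for lx in logs]

# Hypothetical unit values: 19 ordinary prices and one gross outlier.
prices = [10, 11, 12, 9, 10, 11, 10, 12, 9, 10,
          11, 10, 9, 12, 11, 10, 10, 11, 9, 5000]
flags = flag_log_outliers(prices)
print([p for p, flagged in zip(prices, flags) if flagged])
```

Note that the mean and the standard deviation are themselves computed from the contaminated sample, so the outlier inflates the very yardstick used to detect it.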
Transformation and thresholds
[Figure: kernel density of pcexp (kdensity pcexp). Left panel: raw untransformed data. Right panel: transformed data.]
\[ \frac{\ln x - E[\ln x]}{\operatorname{sd}(\ln x)} > 2.5 \]
▪ We can do better
A popular strategy
robustification
The median absolute deviation (MAD)
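Classical z-score (left) and its robust counterpart (right):
\[ z_h = \frac{x_h - \bar{x}}{s} \qquad\qquad z_h = \frac{x_h - \operatorname{med}(x)}{MAD} \]
where \( MAD = b \times \operatorname{med}_i \, |x_i - \operatorname{med}_j x_j| \) and b = 1.4826 makes the MAD consistent for the standard deviation at the Gaussian model.
▪ An alternative scale estimate:
\[ S = c \times \operatorname{med}_i \{ \operatorname{med}_j \, |x_j - x_i| \} \]
\[ z_h = \frac{x_h - \operatorname{med}(x)}{S} \]
with c = 1.1926 at the Gaussian model.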
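A minimal Python sketch comparing the classical z-score with the two robust versions. The data and function names are hypothetical, and the S estimate here uses ordinary medians for simplicity, which is an assumption that may differ in detail from the exact definition in the literature.

```python
import statistics

B = 1.4826  # MAD consistency constant at the Gaussian model
C = 1.1926  # S consistency constant at the Gaussian model

def mad_scale(x):
    """b * median of absolute deviations from the median."""
    med = statistics.median(x)
    return B * statistics.median([abs(v - med) for v in x])

def s_scale(x):
    """c * med_i { med_j |x_j - x_i| }, computed naively in O(n^2)."""
    inner = [statistics.median([abs(xj - xi) for xj in x]) for xi in x]
    return C * statistics.median(inner)

def z_scores(x, centre, scale):
    """Standardise x with a given centre and scale."""
    return [(v - centre) / scale for v in x]

# Hypothetical log unit values with one gross outlier in last position.
logs = [2.1, 2.3, 2.2, 2.4, 2.0, 2.3, 2.2, 9.0]

z_classical = z_scores(logs, statistics.mean(logs), statistics.stdev(logs))
z_mad = z_scores(logs, statistics.median(logs), mad_scale(logs))
z_s = z_scores(logs, statistics.median(logs), s_scale(logs))

for name, z in [("classical", z_classical), ("MAD", z_mad), ("S", z_s)]:
    print(f"{name:>9}: z of the outlier = {z[-1]:.2f}, flagged at 2.5: {z[-1] > 2.5}")
```

In this small sample the outlier inflates the mean and standard deviation enough that its classical z-score stays below 2.5, while both robust scores flag it clearly.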
Recap
How to deal with outliers?
(in one slide)
Treatment of outliers
Three main methods of dealing with outliers, apart from removing them
from the dataset:
1) reducing the weights of outliers (trimming weight)
2) changing the values of outliers (Winsorisation, trimming, imputation)
3) using robust estimation techniques (M-estimation).
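As an illustration of method 2, the sketch below contrasts Winsorisation (extreme values are pulled in to a bound) with trimming (extreme values are dropped). The data, the function names, and the crude order-statistic rule for the bounds are all assumptions made for the example.

```python
import statistics

def percentile_bounds(values, lower, upper):
    """Crude percentile bounds taken directly from the order statistics."""
    s = sorted(values)
    n = len(s)
    return s[int(lower * (n - 1))], s[int(upper * (n - 1))]

def winsorise(values, lower=0.1, upper=0.9):
    """Replace values outside the bounds by the nearest bound; sample size is kept."""
    lo, hi = percentile_bounds(values, lower, upper)
    return [min(max(v, lo), hi) for v in values]

def trim(values, lower=0.1, upper=0.9):
    """Drop values outside the bounds; sample size shrinks."""
    lo, hi = percentile_bounds(values, lower, upper)
    return [v for v in values if lo <= v <= hi]

# Hypothetical expenditures with one extreme value.
expenditure = [120, 150, 160, 170, 180, 190, 200, 210, 230, 9000]

winsorised = winsorise(expenditure)
trimmed = trim(expenditure)

print("raw mean:       ", statistics.mean(expenditure))
print("winsorised mean:", statistics.mean(winsorised))
print("trimmed mean:   ", round(statistics.mean(trimmed), 2))
```

Winsorisation keeps all ten observations and only changes the extreme one, whereas trimming changes the sample size, which matters when survey weights are involved.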
▪ Outliers can be genuine observations… be gentle to the data and document each
and every step of the data processing
▪ As far as inequality measurement is concerned, outliers are the worst enemy: the influence function (IF) of inequality measures is unbounded
▪ Outlier detection:
▪ Go beyond the “take the log and run” strategy. It works well only if you can describe the data
with a Gaussian distribution. Typically, however, distributions are skewed.
▪ Use a “take the log, robustify the z-score and run” strategy.
Homework
Exercise 1 - Engaging with the literature
Exercise 2 - Do-it-yourself….
English Stata/R/SPSS/Excel/…
1) Generate a log-normal-looking wealth distribution
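For readers working outside the packages listed, step 1 could be sketched in Python as follows. The parameters (mu = 10, sigma = 1), the sample size, and the seed are arbitrary assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(12)  # fixed seed so the draw is reproducible

# Draw 5,000 hypothetical wealth values; their logarithms are Normal(mu=10, sigma=1).
wealth = [random.lognormvariate(10.0, 1.0) for _ in range(5000)]

# A right-skewed distribution has a mean well above its median.
print("mean:  ", round(statistics.mean(wealth)))
print("median:", round(statistics.median(wealth)))
```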
Exercise 3 – Inequality measures
OECD (2013)