Lecture 12 1

The document discusses outlier detection and treatment. It defines outliers and discusses their impact on measures of inequality and poverty. Common methods to detect outliers include visual inspection of distributions and setting thresholds based on a variable's transformed distribution, such as more than 2.5 standard deviations from the mean of the logarithmically transformed variable.

Outlier detection and treatment

LECTURE 12

Today is mainly about outliers

1) Definitions
What do we mean by an outlier, exactly?

2) Motivation
Do outliers really matter?

3) Detection
How to detect outliers?
4) Treatment
How to deal with outliers?

Definitions

What is an outlier?

▪ An outlier is an observation “that appears to deviate markedly from other members of the sample in which it occurs” (Grubbs, 1969)
▪ Note: we focus on univariate outliers, those found when looking at a distribution of values in a single dimension (e.g. income).
Highest sea-levels in Venice

Other classical definitions

▪ An outlier is “an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism” (Hawkins 1980)

▪ Aguinis et al. (2013) provide 14 definitions of outliers based on a literature review of 28 papers.

What causes outliers?

▪ Human errors, e.g. data entry errors

▪ Instrument errors, e.g. measurement errors

▪ Data processing errors, e.g. data manipulation

▪ Sampling errors, e.g. extracting data from wrong sources

▪ Not an error, the value is extreme, just a ‘novelty’ in the data

A dilemma

▪ Outliers can be genuine values

▪ The trade-off is between the loss of accuracy if we throw away “good” observations, and the bias of our estimates if we keep “bad” ones

▪ The challenge is twofold:

1. to figure out whether an extreme value is good (genuine) or bad (error)

2. to assess its impact on the statistics of interest

Do outliers matter?

Theory first

▪ Three papers:

I. Frank Cowell and Maria-Pia Victoria-Feser (1996a)
II. Frank Cowell and Emmanuel Flachaire (2007) (*)
III. Frank Cowell and Maria-Pia Victoria-Feser (1996b)
Outliers and inequality measures – I
Cowell and Victoria-Feser (1996a)

▪ This is a beautiful paper
▪ Explains why outliers (contaminants) are a serious threat to most inequality measures.
▪ “if the mean has to be estimated from the sample then all scale independent or translation independent and decomposable measures have an unbounded influence function” (p. 89)
▪ An unbounded influence function (IF) is a catastrophe.
The catastrophe

▪ Suppose the shape of the income distribution is represented by the continuous frequency distribution in part A

▪ Suppose that in the sample there are some rogue observations represented by the point mass labelled “contamination”.

▪ Then, according to inequality statistics that are sensitive to the top end of the distribution, the income distribution in A will be indistinguishable from that represented in B (that is, the IF is unbounded).
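The effect of contamination on a top-sensitive statistic can be illustrated numerically. The sketch below is not taken from the cited paper: it assumes a hypothetical log-normal sample (the seed, sample size and shape parameter are arbitrary choices) and adds a single rogue value of growing magnitude to show how far one observation can drag the Gini index.

```python
import random

def gini(values):
    """Gini index of non-negative values, using the sorted-rank formula."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with ranks i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

rng = random.Random(12)  # fixed seed, arbitrary choice
# Hypothetical "clean" income sample (assumption: log-normal shape)
incomes = [rng.lognormvariate(0.0, 0.8) for _ in range(5000)]

gini_clean = gini(incomes)
print("clean sample:", round(gini_clean, 3))

# Add ONE rogue observation of increasing size and recompute the index
contaminated = {rogue: gini(incomes + [rogue]) for rogue in (1e3, 1e5, 1e7)}
for rogue, g in contaminated.items():
    print("rogue value", rogue, "->", round(g, 3))
```

As the single rogue value grows, the estimate moves towards 1 regardless of the other 5,000 observations, which is the practical meaning of an unbounded influence function.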
In practice
Hlasny and Verme (2018: 191)

▪ Many researchers routinely trim outliers or problematic observations or apply top coding with little consideration of the implications for the measurement of inequality

▪ One example to illustrate

Sensitivity of the Gini index to extreme values: iterative trimming
[Figure: Gini index under iterative trimming, with percentage labels ranging from 68% to 84%]
Outliers and poverty measures
Cowell and Victoria-Feser (1996b)

▪ Explains why outliers are only rarely a serious threat to most poverty measures.

▪ Poverty measures are not sensitive to the values (real or contaminated) of the incomes of the rich
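A small sketch makes the point concrete. It assumes a fixed (absolute) poverty line and uses the headcount ratio as the poverty measure; the income values and the line are hypothetical. A poverty line derived from the sample mean would not enjoy the same protection.

```python
def headcount_ratio(incomes, poverty_line):
    """Share of observations strictly below the poverty line."""
    poor = sum(1 for y in incomes if y < poverty_line)
    return poor / len(incomes)

incomes = [40, 55, 70, 90, 120, 150, 200, 260, 400, 900]  # hypothetical values
line = 100  # hypothetical fixed poverty line

base = headcount_ratio(incomes, line)

# Replace the richest income with an absurdly large rogue value
contaminated = incomes[:-1] + [9_000_000]
after = headcount_ratio(contaminated, line)

print(base, after)  # the two ratios coincide: the top of the distribution is ignored
```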
Recap

▪ The answer to the question of whether outliers matter depends on the statistic of interest
▪ Inequality: both theory (unbounded IF) and practice (incremental truncation) suggest that they matter (tremendously). Not taking this issue into proper account puts inequality comparisons at risk.
▪ Poverty: not so much
How to detect outliers?

Visual inspection

▪ Our procedures are part graphical, and part automatic. For each
commodity, we draw histograms and one-way plots of the logarithms
of the unit values, using each to detect the presence of gross outliers
for further investigations. […] [Automatic method] does not remove
the need for the graphical inspection
(Deaton and Tarozzi 2005)

Visual inspection
Malawi IHS3, Cassava tuber expenditure

Visual inspection
Malawi IHS3, Cassava tuber expenditure
▪ Example 1: look at descriptive statistics

Visual inspection
Malawi IHS3, Cassava tuber expenditure

▪ Example 2: graph the distribution of the data

Visual inspection
Malawi IHS3, Cassava tuber expenditure

▪ Example 3: use graphical diagnostic tools, e.g. the boxplot graph

Statistical methods
▪ The literature is rich with methods to identify outliers; in practice,
most methods used in empirical work hinge on the underlying
distribution of the data.

▪ The idea is simple:


▪ transform the variable to induce normality
▪ set thresholds to identify extreme values
Transform the variable to induce normality

▪ The easiest transformation relies on taking the logarithm of the variable of interest
▪ The log “squeezes” large values more, so that skewed distributions become more symmetrical and closer to a Normal distribution.
Set a threshold

▪ We must specify a threshold for deciding whether each observation is ‘too extreme’ (outlier or not?)

▪ Common rule-of-thumb thresholds: an observation is considered an outlier if it is more than 2.5, 3, or 3.5 standard deviations away from the mean of the distribution

▪ In formulas: $x$ is an outlier if $x > \bar{x} + z_\alpha s$, where $z_\alpha$ equals, say, 2.5.

▪ We can express the same criterion as $\frac{x - \bar{x}}{s} > z_\alpha$, where the left-hand side is called a z-score (a variable with mean = 0 and var = 1)
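The two steps (transform, then threshold) can be sketched in a few lines. This is a minimal illustration, assuming strictly positive data and the log transformation discussed earlier; the function names and the unit values are hypothetical.

```python
import math
import statistics

def z_scores(values):
    """Classical z-scores: (x - mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def flag_upper_outliers(values, z_alpha=2.5):
    """Return the values whose z-score on the log scale exceeds z_alpha."""
    logs = [math.log(v) for v in values]  # requires strictly positive values
    return [v for v, z in zip(values, z_scores(logs)) if z > z_alpha]

# Hypothetical unit values with one gross outlier
unit_values = [10, 11, 12, 9, 10, 13, 11, 10, 12, 9, 11, 10, 5000]
print(flag_upper_outliers(unit_values))
```

Only the upper tail is checked here, mirroring the one-sided criterion in the formula; a two-sided rule would compare the absolute value of the z-score with the threshold.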

Why 2.5, 3, or any other number?

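▪ Under the assumption of normality:

▪ $z_\alpha = 2.5$ implies that outliers fall in the upper-tail region where only about 0.6 percent of observations would normally lie ($\alpha \approx 0.006$). A tail of exactly 0.5 percent corresponds to $z_\alpha \approx 2.58$.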
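The link between a threshold and its tail probability can be checked directly. The sketch below computes the exact one-sided tail area of a standard Normal for the three common thresholds; the helper name is an assumption of this example.

```python
import math

def upper_tail_prob(z):
    """P(Z > z) for a standard Normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

for z in (2.5, 3.0, 3.5):
    print(z, f"{100 * upper_tail_prob(z):.3f}%")
```

Under normality the one-sided tails are roughly 0.62, 0.13 and 0.02 percent respectively, so a higher threshold flags far fewer observations.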

Deaton and Tarozzi (2005)
In the case of India, D&T (2005) flagged as outliers prices whose logarithms exceeded the mean of logarithms by more than 2.5 standard deviations:

$\frac{\ln x - E[\ln x]}{\operatorname{sd}(\ln x)} > 2.5$
Transformation and thresholds
[Figure, two panels. Left: “Raw untransformed data”, kernel density of pcexp for x between 0 and 25,000. Right: “Transformed data”, standardized Box-Cox transform compared with N(0,1) for x between -5 and 5.]

Two questions

1) How good is such an approach?

2) What to do after flagging outliers?


How good is such an approach?

▪ Log-transformation is very basic – how to deal with negative values?
▪ Not recommended when the log-distribution cannot be assumed to be a Normal distribution
▪ Why should we set the threshold using the mean and standard deviation, which are sensitive to extreme values, if this is exactly what we are worried about?

$\frac{\ln x - E[\ln x]}{\operatorname{sd}(\ln x)} > 2.5$

▪ We can do better
A popular strategy
robustification

▪ While there is no agreement on the best method, a common solution is to use robust measures of scale and location to set the threshold for flagging outliers

▪ The idea is to replace the sample average $\bar{x}$ with a robust estimator (e.g. the median), and the standard deviation $s$ with a robust estimator. A popular option is the median absolute deviation (MAD).

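The median absolute deviation (MAD)

Replace the classical z-score $z_h = \frac{x_h - \bar{x}}{s}$ with the robust z-score $z_h = \frac{x_h - \operatorname{med}(x)}{MAD}$

$MAD = b \times \operatorname{med}\,|x - \operatorname{med}(x)|$

$b = 1.4826$ if the distribution is Gaussian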
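A short sketch shows why the robust version matters. It assumes data already on the log scale (the values are hypothetical) and compares the classical and MAD-based z-scores of one gross outlier. With only eight observations the classical z-score cannot exceed $(n-1)/\sqrt{n} \approx 2.47$, so the outlier hides itself by inflating the mean and standard deviation.

```python
import statistics

B = 1.4826  # consistency factor at the Gaussian model

def mad(values):
    """Median absolute deviation, scaled by B. Zero if over half the values tie."""
    med = statistics.median(values)
    return B * statistics.median([abs(v - med) for v in values])

def robust_z(values):
    """Robust z-scores: (x - median) / MAD."""
    med = statistics.median(values)
    scale = mad(values)
    return [(v - med) / scale for v in values]

def classical_z(values):
    """Classical z-scores: (x - mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical log unit values; the last one is a gross outlier
log_values = [2.30, 2.40, 2.48, 2.20, 2.30, 2.56, 2.40, 8.52]
classic = classical_z(log_values)
robust = robust_z(log_values)
print("classical z of last value:", round(classic[-1], 2))
print("robust z of last value:", round(robust[-1], 2))
```

In this example the classical z-score of the extreme value stays just below the 2.5 threshold, while the robust z-score is far above it.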


We can do better
Rousseeuw and Croux (1993, JASA)
Rousseeuw and Croux (1993)

▪ Rousseeuw and Croux (1993) propose to substitute the MAD with a different estimator:

▪ $S = c \times \operatorname{med}_i \{ \operatorname{med}_j |x_i - x_j| \}$

▪ For each $i$ we compute the median of $|x_i - x_j|$ ($j = 1, \dots, n$). This yields $n$ numbers, the median of which gives our final estimate $S$.

$z_h = \frac{x_h - \operatorname{med}(x)}{S}$, with $c = 1.1926$ at the Gaussian model.
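The nested-median definition translates almost literally into code. The sketch below is a naive O(n²) version that uses the ordinary median at both levels, which is a simplification assumed for this example (the original paper's finite-sample definition and its faster algorithm are not reproduced here). Function names and data are hypothetical.

```python
import statistics

C = 1.1926  # consistency factor at the Gaussian model

def s_scale(values):
    """Naive S estimator: C * med_i( med_j |x_i - x_j| )."""
    inner = [
        statistics.median([abs(xi - xj) for xj in values])  # median over j, for fixed i
        for xi in values
    ]
    return C * statistics.median(inner)  # median over i

def robust_z_s(values):
    """Robust z-scores: (x - median) / S."""
    med = statistics.median(values)
    scale = s_scale(values)
    return [(v - med) / scale for v in values]

# Hypothetical log unit values; the last one is a gross outlier
log_values = [2.30, 2.40, 2.48, 2.20, 2.30, 2.56, 2.40, 8.52]
z = robust_z_s(log_values)
print([round(v, 2) for v in z])
```

Unlike the MAD, this estimator does not measure distances from a central value, so it does not implicitly assume a symmetric distribution.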
Recap

▪ “take the log and run” is not a recommended practice


▪ taking the log and robustifying the z-score is a better practice
▪ Belotti and Vecchi (2019) provide outdetect.ado
Malawi, 2013

▪ ‘take the log and run’: 2.08% of outliers (most of which in the right tail)
▪ ‘take the log, robustify the z-score, and run’: 3.00% (most of which in the right tail)

How to deal with outliers?
(in one slide)

Treatment of outliers

Three main methods of dealing with outliers, apart from removing them
from the dataset:
1) reducing the weights of outliers (trimming weight)
2) changing the values of outliers (Winsorisation, trimming, imputation)
3) using robust estimation techniques (M-estimation).
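Two of the value-changing options can be contrasted on a toy example. This is a minimal sketch, assuming the outliers have already been flagged and that a cap (here the largest non-flagged value) has been chosen; the data, the cap and the function names are hypothetical.

```python
import statistics

def winsorise_upper(values, cap):
    """Replace values above `cap` with `cap` (upper-tail Winsorisation)."""
    return [min(v, cap) for v in values]

def trim_upper(values, cap):
    """Drop values above `cap` (upper-tail trimming)."""
    return [v for v in values if v <= cap]

data = [3, 5, 6, 7, 8, 9, 250]  # hypothetical values; 250 is the flagged outlier
cap = 9  # hypothetical threshold: the largest non-flagged value

winsorised = winsorise_upper(data, cap)
trimmed = trim_upper(data, cap)

print("raw mean:", round(statistics.fmean(data), 2))
print("winsorised:", winsorised, "mean:", round(statistics.fmean(winsorised), 2))
print("trimmed:", trimmed, "mean:", round(statistics.fmean(trimmed), 2))
```

Winsorisation keeps the sample size unchanged, whereas trimming shrinks it, which matters when survey weights are involved.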

▪ Documentation, transparency & reproducibility


Lessons learned

▪ Outliers can be genuine observations… be gentle to the data and document each
and every step of the data processing

▪ As far as inequality is concerned, outliers are the worst enemy (unbounded IF)

▪ Outlier detection:

▪ go beyond the “take the log and run” strategy. It works well only if you can describe the data
with a Gaussian distribution. Typically, however, distributions are skewed.

▪ Use a “take the log, robustify the z-score and run” strategy.

▪ Outlier treatment: it depends. Quantile regression is a good candidate.


References

Required readings
Barnett, V., & Lewis, T. (1994). Outliers in Statistical Data. 3rd edition. J. Wiley & Sons (Chapter 1 & 2)

Suggested readings
Alvarez, E., García-Fernández, R. M., Blanco-Encomienda, F. J., & Munoz, J. F. (2014). The effect of outliers on the economic and social survey on income and living conditions. World Acad. Sci., Eng. Technol., Int. J. Soc., Behav., Educ., Econ., Bus. Ind. Eng, 8, 3276-3280.
Belotti, F., & Vecchi, G. (2019). Take the Log and Run: Outliers and Welfare Measurement, mimeo.
Cowell, F. A., & Flachaire, E. (2007). Income distribution and inequality measurement: The problem of extreme values. Journal of Econometrics, 141(2), 1044-1072.
Cowell, F., & Victoria-Feser, M. (1996a). Robustness Properties of Inequality Measures. Econometrica, 64(1), 77-101.
Cowell, F. A., & Victoria-Feser, M. P. (1996b). Poverty measurement with contaminated data: A robust approach. European Economic Review, 40(9), 1761-1771.
Deaton, A., & Tarozzi, A. (2005). “Prices and Poverty in India.” The Great Indian Poverty Debate. New Delhi: MacMillan.
Dupriez, O. (2007). Building a household consumption database for the calculation of poverty PPPs. Technical note. Available at: http://go.worldbank.org/4YG7I5RGT0.
Grubbs, F. E. (1969). Procedures for detecting outlying observations in samples. Technometrics, 11(1), 1-21.
Hlasny, V., & Verme, P. (2018). Top Incomes and Inequality Measurement: A Comparative Analysis of Correction Methods Using the EU SILC Data. Econometrics, 6(2), 30.
Mancini, G., & Vecchi, G. (2019). On the Construction of a Welfare Indicator for Inequality and Poverty Analysis, mimeo.
OECD (2013). OECD Guidelines for Micro Statistics on Household Wealth.
Rousseeuw, P. J., & Croux, C. (1993). Alternatives to the median absolute deviation. Journal of the American Statistical Association, 88(424), 1273-1283.
Thank you for your attention

Homework

Exercise 1 - Engaging with the literature

Summarize the main conclusions of the paper: do outliers matter? Why or why not?

Exercise 2 - Do-it-yourself….

English Stata/R/SPSS/Excel/…
1) Generate a log-normal looking wealth distribution

2) Estimate the Gini index

3) Contaminate the distribution with a few extreme values

4) Re-estimate the Gini index


Exercise 3 – Inequality measures

▪ Comment on table 7.3 from OECD (2013) p.172 (see next slide).
▪ What can you say about the sensitivity of estimates to the treatment of outliers?

Exercise 3 – Inequality measures
OECD (2013)

