Handling Outliers

What are Outliers?


In simple terms, an outlier refers to a data point that appears unusual when compared to the
rest of the data. Let's take the example of height as a single characteristic. If there's a boy who
is much taller than all the other kids in his kindergarten class, he would be considered an
outlier.
However, outliers aren't always extremely distant from the other values. For instance, if an
average-height woman entered a room full of only basketball players and jockeys, she would
also be an outlier because her height would noticeably differ from that of the others[1].
Barnett and Lewis characterize an outlier as an observation (or group of observations) that
seems to deviate from the rest of the dataset in a way that appears inconsistent[2].
Such a data point frequently holds valuable insights into unusual patterns within the system
represented by the data[3]. Conversely, many data mining algorithms in the literature treat
outliers as an incidental by-product of clustering: from the perspective of a clustering
algorithm, outliers are objects that do not fall within any of the dataset's clusters and are often
referred to as noise.
How do outliers get included in datasets?
The following are some common causes that lead to the presence of outliers:
1. Data Entry Errors: These outliers occur due to inaccuracies and mistakes made while
collecting, recording, or entering data. Such errors can stem from human fallibility during the
initial stages of data management, leading to irregularities within the dataset.
2. Measurement Error (Instrument Errors): Outliers attributed to measurement errors
arise from faulty or imprecise measuring instruments. Inaccuracies in the tools used for data
collection can result in values that significantly deviate from the actual measurements.
3. Experimental Errors: Experiment-related outliers emerge from errors in the experimental
process, including flaws in design, execution, and data extraction. These deviations from the
expected data pattern can distort the overall analysis.
4. Intentional Outliers: Intentionally introduced outliers serve the purpose of testing the
effectiveness of outlier detection methods. These outliers, often inserted during the data
collection phase, allow researchers to evaluate the robustness of their analysis techniques.
5. Data Processing Errors: Outliers arising from data processing errors result from mistakes
made during data manipulation, transformation, or cleaning. Inaccurate data handling
procedures can introduce values that don't align with the rest of the dataset.
6. Sampling Errors: Sampling errors lead to outliers when data is extracted from incorrect
sources or when a mix of data from various sources is improperly combined. These outliers
can misrepresent the true nature of the dataset.
7. Natural Outliers: Natural outliers are genuine data points that deviate significantly from
the norm but aren't the result of errors. These outliers, characteristic of real-world datasets,
offer unique insights into the underlying processes generating the data.
How to detect them?
Outlier detection is a prominent challenge in data mining: identifying exceptional data points
within a set of patterns. The problem is of considerable importance across diverse application
domains. Once regarded merely as an indication of noisy data, outliers have gradually come to
be treated as a substantive issue in their own right across multiple research areas[4].
Outliers have been a subject of statistical investigation for centuries. Over the past two decades
in particular, the database and data mining communities have shown growing interest in
constructing scalable techniques for identifying outliers. While these techniques initially
stemmed from statistical foundations, they gradually relinquished their direct probabilistic
interpretation, which altered the interpretability of the resulting outlier scores[5].
In the realm of outlier detection methods, a distinction can be drawn between univariate
approaches, which were introduced in earlier studies, and multivariate techniques that
currently constitute the majority of ongoing research efforts. Furthermore, a fundamental
categorization of these methods lies in the differentiation between parametric (statistical)
approaches and nonparametric methods that operate independently of specific models[6].
The following are some methods to detect outliers:
Z Score:
This approach presupposes that the variable follows a Gaussian distribution. It quantifies the
extent to which an observation deviates from the mean in terms of the number of standard
deviations.
z = (x − μ) / σ
i.e., z = (data point − mean) / standard deviation
Outliers are then typically defined as points whose absolute z-score exceeds a chosen
threshold (3 is a common choice).
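As a rough illustration, the z-score rule can be implemented in a few lines of Python; the synthetic data and the use of NumPy and pandas below are assumptions made for this sketch, not part of the original description.

import numpy as np
import pandas as pd

def zscore_outliers(series, threshold=3.0):
    # Flag values whose absolute z-score exceeds the threshold
    z = (series - series.mean()) / series.std()
    return series[np.abs(z) > threshold]

# Hypothetical example: normally distributed values with one injected extreme point
rng = np.random.default_rng(0)
data = pd.Series(np.append(rng.normal(30, 2, 100), 55))
print(zscore_outliers(data))  # the injected value 55 should be flagged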
IQR method:
The IQR method detects potential outliers in a dataset using the interquartile range. The
interquartile range (IQR) is a measure of statistical dispersion that describes the spread of the
middle 50% of a dataset.
Here's how the IQR method works:
• Arrange your dataset in ascending order.
• Calculate the first quartile (Q1), which is the median of the lower half of the dataset.
• Calculate the third quartile (Q3), which is the median of the upper half of the dataset.
• Calculate the interquartile range (IQR) by subtracting Q1 from Q3: IQR = Q3 - Q1.
Determine the lower bound and upper bound for potential outliers:
Lower Bound: Q1 - 1.5 * IQR
Upper Bound: Q3 + 1.5 * IQR
Any data points that fall below the lower bound or above the upper bound are considered
potential outliers.
Here's the formula for the IQR method:
IQR = Q3 - Q1
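A minimal sketch of this calculation in Python, assuming the data is a plain numeric sequence; the example values are hypothetical.

import numpy as np

def iqr_outliers(values):
    # Return values that fall outside the Q1 - 1.5*IQR and Q3 + 1.5*IQR fences
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

print(iqr_outliers([20, 22, 23, 24, 25, 26, 27, 95]))  # prints [95.]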

Visualize the data:


Data visualization serves several purposes in data analysis, including aiding in data cleansing,
facilitating data exploration, pinpointing outliers and exceptional groups, recognizing patterns
and clusters, and more. The following is a compilation of data visualization plots that are
employed to identify outliers effectively.
a) Box and whisker plot (box plot)
b) Scatter plot
c) Histogram
d) Distribution Plot
e) QQ plot (quantile-quantile plot)
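As one example, a box plot (item a above) can be produced with matplotlib; the temperature values below are made up for illustration.

import matplotlib.pyplot as plt

# Hypothetical daily temperature readings containing one extreme value
temps = [24, 26, 27, 25, 28, 30, 29, 27, 26, -10]

plt.boxplot(temps)
plt.ylabel("Temperature (°C)")
plt.title("Box plot for outlier inspection")
plt.show()  # points beyond the whiskers are drawn as individual markers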
Univariate Methods:
Univariate methods focus on analyzing individual variables in isolation. These techniques
assess the distribution and characteristics of a single variable to identify outliers. Common
univariate methods include the Z-score method and the IQR method.
The Z-score method measures how many standard deviations a data point is from the mean of
the variable. The IQR method, as previously mentioned, uses the interquartile range to
identify outliers.
Multivariate Methods:
Multivariate methods take into account relationships between multiple variables
simultaneously. These techniques consider the interactions and dependencies among variables
to identify outliers that might not be evident when examining variables in isolation.
Multivariate outlier detection methods often involve techniques from statistics, machine
learning, and data visualization.
For example, Mahalanobis distance is a common multivariate method that considers the
correlations between variables to identify outliers.
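A rough sketch of Mahalanobis-distance screening, assuming two correlated numeric features in a NumPy array; the chi-square cutoff used below is a common convention, not something specified in the text.

import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.01):
    # Flag rows whose squared Mahalanobis distance exceeds a chi-square cutoff
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared distance of each row
    return d2 > chi2.ppf(1 - alpha, df=X.shape[1])

# Hypothetical data: correlated points plus one point that breaks the correlation
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=200)
X = np.vstack([X, [4.0, -4.0]])
print(np.where(mahalanobis_outliers(X))[0])  # the appended row (index 200) should be flagged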
Why is it important to identify the outliers?
The characteristics of individual outliers within panel data can significantly impact parameter
estimates and the outcomes of statistical tests. Detecting and identifying outliers in such data
can offer valuable insights that enhance the resilience of statistical models and the accuracy
of their predictions[7].
Outliers are sometimes removed from a dataset because they can distort the apparent overall
pattern and the resulting statistical conclusions. Removal is reasonable when the outliers are
caused by mistakes such as incorrect measurements or data-processing errors. Sometimes,
however, the reason an outlier is present is unknown. How to treat outliers therefore depends
on the research goals and context, and the choice should be explained in any description of the
methodology.
Types of Outliers
Global outliers: Often referred to as data anomalies, these are observations that stand out
significantly compared to the majority of data points within a specific feature. In simpler
terms, a data point is classified as a global outlier when its value lies far outside the entire
range of values in the dataset it belongs to.
Let's consider a dataset that records the daily temperatures in a particular city over a year.
Most of the temperatures fall within the range of 20°C to 35°C. However, there is an
exceptionally low temperature recorded at -10°C. This particular data point qualifies as a
global outlier due to its stark contrast with the temperature values of the other days
throughout the year.
Contextual outliers: Also known as conditional outliers, these are observations that stand out
within a specific context. A data point becomes a contextual outlier when its value
significantly differs from the majority of data points in the same scenario. Unlike global
outliers, contextual anomalies don't necessarily fall outside the typical overall range; rather,
they deviate from the pattern anticipated for that context.
In a time-series of monthly energy consumption in a household, there's a consistent increase
in energy usage during winter months. However, if there's an unexpectedly high energy
consumption in the middle of summer, it would be identified as a contextual outlier since it
contrasts with the usual seasonal consumption pattern.
Collective outliers: These are a cluster of observations with similar unusual values. While
each data point might not be anomalous individually, when taken as a group, their values
significantly deviate from the rest of the dataset.
In a dataset of company sales, a specific product category's sales figures for a few
consecutive months are relatively consistent but unexpectedly low compared to the overall
trend. Individually, these monthly sales might not be outliers, but collectively, they form a
group of collective outliers due to their shared deviation from the larger sales pattern.
Univariate Outliers: These are individual data points that deviate significantly from the rest
of the data in a single variable or feature. In other words, they are extreme values within a
specific attribute while not considering other attributes.
Multivariate Outliers: These outliers involve multiple variables or features and are
identified by examining their collective behavior. In multivariate analysis, these outliers can't
be detected by looking at each variable in isolation. They are points that stand out when
considering the relationships between different variables.
How to Handle them?
Upon identifying outliers, addressing them becomes imperative due to their potentially
detrimental impact. Indeed, outliers can be likened to covert agents of disruption within the
data.
• The presence of outliers can significantly distort both the mean and the standard deviation
of the dataset, resulting in misleading statistical interpretations.
• The amplification of error variance induced by outliers can undermine the robustness of
statistical tests, diminishing their ability to detect significant effects accurately.
• Non-uniformly distributed outliers have the capacity to undermine the assumption of
normality, thereby impeding proper statistical analyses.
• Many machine learning algorithms falter when confronted with outliers. Thus, it becomes
crucial to detect and eliminate these aberrations to ensure model reliability.
Given these reasons, it is important to treat outliers with care before constructing statistical or
machine learning models. Various methods have been developed to address them effectively.
1. Exclusion of Observations: One strategy involves eliminating outlier observations from
the dataset. While this can restore statistical metrics, it's crucial to approach this with care as
it might result in losing potentially valuable insights.
2. Transformation of Values: An alternate approach is to transform the values within the
dataset. Techniques such as logarithmic or Box-Cox transformations can help bring extreme
values closer to the rest of the distribution, reducing their impact on the analysis.
3. Imputation: Imputation entails substituting outliers with estimated values based on the
distribution of the remaining data. This helps preserve the dataset's integrity while
minimizing the disruptive effect of outliers.
4. Distinct Treatment: In situations where outliers can be linked to specific groups or
phenomena, a separate treatment can be employed. This recognizes the distinct nature of
these observations and prevents them from unduly skewing the overall analysis.
Adopting one or a combination of these techniques aligns with the objective of enhancing the
dependability and precision of statistical and machine learning models when outliers are
present.
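To make the first three strategies above concrete, here is a brief Python sketch; the IQR fences, the log transform, and the median imputation are illustrative choices, not prescriptions from the text.

import numpy as np
import pandas as pd

s = pd.Series([12, 14, 13, 15, 14, 13, 90])  # hypothetical measurements with one extreme value
q1, q3 = s.quantile([0.25, 0.75])
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
mask = (s < lower) | (s > upper)

excluded = s[~mask]                         # 1. exclusion: drop the flagged observations
transformed = np.log1p(s)                   # 2. transformation: compress extreme values
imputed = s.mask(mask, s[~mask].median())   # 3. imputation: replace outliers with the median of the rest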

References
1. Knorr, E.M., Outliers and data mining: finding exceptions in data. 2002, University of British
Columbia.
2. Barnett, V. and T. Lewis, Outliers in statistical data. Vol. 3. 1994: Wiley New York.
3. Bakar, Z.A., et al. A comparative study for outlier detection techniques in data mining. in
2006 IEEE conference on cybernetics and intelligent systems. 2006. IEEE.
4. Bansal, R., N. Gaur, and S.N. Singh. Outlier detection: applications and techniques in data
mining. in 2016 6th International conference-cloud system and big data engineering
(Confluence). 2016. IEEE.
5. Zimek, A. and P. Filzmoser, There and back again: Outlier detection between statistical
reasoning and data mining algorithms. Wiley Interdisciplinary Reviews: Data Mining and
Knowledge Discovery, 2018. 8(6): p. e1280.
6. Ben-Gal, I., Outlier detection, in Data Mining and Knowledge Discovery Handbook. 2005. p. 131-146.
7. Lyu, Y., Detection of outliers in panel data of intervention effects model based on variance of
remainder disturbance. Mathematical Problems in Engineering, 2015.
