Handling Outliers

What are Outliers?


In simple terms, an outlier refers to a data point that appears unusual when compared to the
rest of the data. Let's take the example of height as a single characteristic. If there's a boy who
is much taller than all the other kids in his kindergarten class, he would be considered an
outlier.
However, outliers aren't always extremely distant from the other values. For instance, if an
average-height woman entered a room full of only basketball players and jockeys, she would
also be an outlier because her height would noticeably differ from that of the others[1].
Barnett and Lewis characterize an outlier as an observation (or group of observations) that
seems to deviate from the rest of the dataset in a way that appears inconsistent[2].
Such a data point frequently holds valuable insights into unusual patterns within the system
represented by the data[3]. Conversely, many data mining algorithms in the literature treat
outliers as an incidental by-product of clustering: from the perspective of a clustering
algorithm, outliers are objects that do not fall within any of the dataset's clusters and are often
referred to as noise.
How do outliers get included in datasets?
The following are some common causes that lead to the presence of outliers:
1. Data Entry Errors: These outliers occur due to inaccuracies and mistakes made while
collecting, recording, or entering data. Such errors can stem from human fallibility during the
initial stages of data management, leading to irregularities within the dataset.
2. Measurement Error (Instrument Errors): Outliers attributed to measurement errors
arise from faulty or imprecise measuring instruments. Inaccuracies in the tools used for data
collection can result in values that significantly deviate from the actual measurements.
3. Experimental Errors: Experiment-related outliers emerge from errors in the experimental
process, including flaws in design, execution, and data extraction. These deviations from the
expected data pattern can distort the overall analysis.
4. Intentional Outliers: Intentionally introduced outliers serve the purpose of testing the
effectiveness of outlier detection methods. These outliers, often inserted during the data
collection phase, allow researchers to evaluate the robustness of their analysis techniques.
5. Data Processing Errors: Outliers arising from data processing errors result from mistakes
made during data manipulation, transformation, or cleaning. Inaccurate data handling
procedures can introduce values that don't align with the rest of the dataset.
6. Sampling Errors: Sampling errors lead to outliers when data is extracted from incorrect
sources or when a mix of data from various sources is improperly combined. These outliers
can misrepresent the true nature of the dataset.
7. Natural Outliers: Natural outliers are genuine data points that deviate significantly from
the norm but aren't the result of errors. These outliers, characteristic of real-world datasets,
offer unique insights into the underlying processes generating the data.
How to detect them?
Outlier detection is a prominent challenge in data mining: identifying exceptional data points
within a set of patterns. The problem is of considerable importance across diverse application
domains. Once regarded merely as an indication of noisy data, outliers have gradually come to
be treated as a substantive issue in their own right across multiple research areas[4].
Outliers have been a subject of statistical investigation for centuries. Over the past two decades
in particular, the database and data mining communities have shown growing interest in
constructing scalable techniques for identifying outliers. While these techniques initially
stemmed from statistical foundations, they gradually relinquished their direct probabilistic
interpretation, which altered the interpretability of the resulting outlier scores[5].
In the realm of outlier detection methods, a distinction can be drawn between univariate
approaches, which were introduced in earlier studies, and multivariate techniques that
currently constitute the majority of ongoing research efforts. Furthermore, a fundamental
categorization of these methods lies in the differentiation between parametric (statistical)
approaches and nonparametric methods that operate independently of specific models[6].
The following are some methods to detect outliers:
Z Score:
This approach presupposes that the variable follows a Gaussian distribution. It quantifies the
extent to which an observation deviates from the mean in terms of the number of standard
deviations.
z = (x − μ) / σ
i.e., z = (data point − mean) / standard deviation
Outliers are then typically defined as points whose absolute z-score exceeds a chosen
threshold (3 is a common choice).
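As a rough illustration, the z-score rule can be implemented in a few lines of Python; the synthetic data and the use of NumPy and pandas below are assumptions made for this sketch, not part of the original description.

import numpy as np
import pandas as pd

def zscore_outliers(series, threshold=3.0):
    # Flag values whose absolute z-score exceeds the threshold
    z = (series - series.mean()) / series.std()
    return series[np.abs(z) > threshold]

# Hypothetical example: normally distributed values with one injected extreme point
rng = np.random.default_rng(0)
data = pd.Series(np.append(rng.normal(30, 2, 100), 55))
print(zscore_outliers(data))  # the injected value 55 should be flagged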
IQR method:
The IQR method detects potential outliers in a dataset using the interquartile range. The
interquartile range (IQR) is a measure of statistical dispersion that describes the spread of the
middle 50% of a dataset.
Here's how the IQR method works:
• Arrange your dataset in ascending order.
• Calculate the first quartile (Q1), which is the median of the lower half of the dataset.
• Calculate the third quartile (Q3), which is the median of the upper half of the dataset.
• Calculate the interquartile range (IQR) by subtracting Q1 from Q3: IQR = Q3 - Q1.
Determine the lower bound and upper bound for potential outliers:
Lower Bound: Q1 - 1.5 * IQR
Upper Bound: Q3 + 1.5 * IQR
Any data points that fall below the lower bound or above the upper bound are considered
potential outliers.
Here's the formula for the IQR method:
IQR = Q3 - Q1
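A minimal sketch of this calculation in Python, assuming the data is a plain numeric sequence; the example values are hypothetical.

import numpy as np

def iqr_outliers(values):
    # Return values that fall outside the Q1 - 1.5*IQR and Q3 + 1.5*IQR fences
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

print(iqr_outliers([20, 22, 23, 24, 25, 26, 27, 95]))  # prints [95.]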

Visualize the data:


Data visualization serves several purposes in data analysis, including aiding in data cleansing,
facilitating data exploration, pinpointing outliers and exceptional groups, recognizing patterns
and clusters, and more. The following is a compilation of data visualization plots that are
employed to identify outliers effectively.
a) Box and whisker plot (box plot)
b) Scatter plot
c) Histogram
d) Distribution Plot
e) QQ plot (quantile-quantile plot)
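As one example, a box plot (item a above) can be produced with matplotlib; the temperature values below are made up for illustration.

import matplotlib.pyplot as plt

# Hypothetical daily temperature readings containing one extreme value
temps = [24, 26, 27, 25, 28, 30, 29, 27, 26, -10]

plt.boxplot(temps)
plt.ylabel("Temperature (°C)")
plt.title("Box plot for outlier inspection")
plt.show()  # points beyond the whiskers are drawn as individual markers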
Univariate Methods:
Univariate methods focus on analyzing individual variables in isolation. These techniques
assess the distribution and characteristics of a single variable to identify outliers. Common
univariate methods include the Z-score method and the IQR method.
The Z-score method measures how many standard deviations a data point is from the mean of
the variable. The IQR method, as previously mentioned, uses the interquartile range to
identify outliers.
Multivariate Methods:
Multivariate methods take into account relationships between multiple variables
simultaneously. These techniques consider the interactions and dependencies among variables
to identify outliers that might not be evident when examining variables in isolation.
Multivariate outlier detection methods often involve techniques from statistics, machine
learning, and data visualization.
For example, Mahalanobis distance is a common multivariate method that considers the
correlations between variables to identify outliers.
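A rough sketch of Mahalanobis-distance screening, assuming two correlated numeric features in a NumPy array; the chi-square cutoff used below is a common convention, not something specified in the text.

import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.01):
    # Flag rows whose squared Mahalanobis distance exceeds a chi-square cutoff
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared distance of each row
    return d2 > chi2.ppf(1 - alpha, df=X.shape[1])

# Hypothetical data: correlated points plus one point that breaks the correlation
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=200)
X = np.vstack([X, [4.0, -4.0]])
print(np.where(mahalanobis_outliers(X))[0])  # the appended row (index 200) should be flagged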
Why is it important to identify the outliers?
The characteristics of individual outliers within panel data can significantly impact parameter
estimates and the outcomes of statistical tests. Detecting and identifying outliers in such data
can offer valuable insights that enhance the resilience of statistical models and the accuracy
of their predictions[7].
Outliers are sometimes removed from a dataset because they can distort the apparent overall
pattern and the resulting statistical conclusions. Removal is reasonable when the outliers are
caused by mistakes such as incorrect measurements or data-processing errors. Sometimes,
however, the reason an outlier is present is unknown. How to treat outliers therefore depends
on the research goals and context, and the choice should be explained in any description of the
methodology.
Types of Outliers
Global outliers: Often referred to as data anomalies, these are observations that stand out
significantly compared to the majority of data points within a specific feature. In simpler
terms, a data point is classified as a global outlier when its value lies far outside the entire
range of values in the dataset it belongs to.
Let's consider a dataset that records the daily temperatures in a particular city over a year.
Most of the temperatures fall within the range of 20°C to 35°C. However, there is an
exceptionally low temperature recorded at -10°C. This particular data point qualifies as a
global outlier due to its stark contrast with the temperature values of the other days
throughout the year.
Contextual outliers: Also known as conditional outliers, these are observations that stand out
within a specific context. A data point becomes a contextual outlier when its value
significantly differs from the majority of data points in the same scenario. Unlike global
outliers, contextual anomalies don't necessarily fall outside the typical overall range; rather,
they deviate from the pattern anticipated for that context.
In a time-series of monthly energy consumption in a household, there's a consistent increase
in energy usage during winter months. However, if there's an unexpectedly high energy
consumption in the middle of summer, it would be identified as a contextual outlier since it
contrasts with the usual seasonal consumption pattern.
Collective outliers: These are a cluster of observations with similar unusual values. While
each data point might not be anomalous individually, when taken as a group, their values
significantly deviate from the rest of the dataset.
In a dataset of company sales, a specific product category's sales figures for a few
consecutive months are relatively consistent but unexpectedly low compared to the overall
trend. Individually, these monthly sales might not be outliers, but collectively, they form a
group of collective outliers due to their shared deviation from the larger sales pattern.
Univariate Outliers: These are individual data points that deviate significantly from the rest
of the data in a single variable or feature. In other words, they are extreme values within a
specific attribute while not considering other attributes.
Multivariate Outliers: These outliers involve multiple variables or features and are
identified by examining their collective behavior. In multivariate analysis, these outliers can't
be detected by looking at each variable in isolation. They are points that stand out when
considering the relationships between different variables.
How to Handle them?
Upon identifying outliers, addressing them becomes imperative due to their potentially
detrimental impact. Indeed, outliers can be likened to covert agents of disruption within the
data.
• The presence of outliers can significantly distort both the mean and the standard deviation
of the dataset, resulting in misleading statistical interpretations.
• The amplification of error variance induced by outliers can undermine the robustness of
statistical tests, diminishing their ability to detect significant effects accurately.
• Non-uniformly distributed outliers have the capacity to undermine the assumption of
normality, thereby impeding proper statistical analyses.
• Many machine learning algorithms falter when confronted with outliers. Thus, it becomes
crucial to detect and eliminate these aberrations to ensure model reliability.
Given these reasons, it is important to treat outliers with care before constructing statistical or
machine learning models. Various methods have been developed to address them effectively.
1. Exclusion of Observations: One strategy involves eliminating outlier observations from
the dataset. While this can restore statistical metrics, it's crucial to approach this with care as
it might result in losing potentially valuable insights.
2. Transformation of Values: An alternate approach is to transform the values within the
dataset. Techniques such as logarithmic or Box-Cox transformations can help bring extreme
values closer to the rest of the distribution, reducing their impact on the analysis.
3. Imputation: Imputation entails substituting outliers with estimated values based on the
distribution of the remaining data. This helps preserve the dataset's integrity while
minimizing the disruptive effect of outliers.
4. Distinct Treatment: In situations where outliers can be linked to specific groups or
phenomena, a separate treatment can be employed. This recognizes the distinct nature of
these observations and prevents them from unduly skewing the overall analysis.
Adopting one or a combination of these techniques aligns with the objective of enhancing the
dependability and precision of statistical and machine learning models when outliers are
present.
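To make the first three strategies above concrete, here is a brief Python sketch; the IQR fences, the log transform, and the median imputation are illustrative choices, not prescriptions from the text.

import numpy as np
import pandas as pd

s = pd.Series([12, 14, 13, 15, 14, 13, 90])  # hypothetical measurements with one extreme value
q1, q3 = s.quantile([0.25, 0.75])
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
mask = (s < lower) | (s > upper)

excluded = s[~mask]                         # 1. exclusion: drop the flagged observations
transformed = np.log1p(s)                   # 2. transformation: compress extreme values
imputed = s.mask(mask, s[~mask].median())   # 3. imputation: replace outliers with the median of the rest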

References
1. Knorr, E.M., Outliers and data mining: finding exceptions in data. 2002, University of British
Columbia.
2. Barnett, V. and T. Lewis, Outliers in statistical data. Vol. 3. 1994: Wiley New York.
3. Bakar, Z.A., et al. A comparative study for outlier detection techniques in data mining. in
2006 IEEE conference on cybernetics and intelligent systems. 2006. IEEE.
4. Bansal, R., N. Gaur, and S.N. Singh. Outlier detection: applications and techniques in data
mining. in 2016 6th International conference-cloud system and big data engineering
(Confluence). 2016. IEEE.
5. Zimek, A. and P. Filzmoser, There and back again: Outlier detection between statistical
reasoning and data mining algorithms. Wiley Interdisciplinary Reviews: Data Mining and
Knowledge Discovery, 2018. 8(6): p. e1280.
6. Ben-Gal, I., Outlier detection, in Data Mining and Knowledge Discovery Handbook. 2005. p. 131-146.
7. Lyu, Y., Detection of outliers in panel data of intervention effects model based on variance of
remainder disturbance. Mathematical Problems in Engineering, 2015.
