Big Data Analytics

Data Preprocessing and Transformation

Data Collection

Issues with Data

Data Cleaning, dealing with missing values, noise and outliers

Data Integration, removing inconsistencies, and deduplication

Data Reduction - Sampling and Feature Selection

Data Transformation - Scaling and Standardization, Numeric Transformation

Imdad ullah Khan

Imdad ullah Khan (LUMS) Data Preprocessing and Transformation 1 / 66
Data Collection


Data Collection

Data collection is the first step in the data analysis pipeline

▷ Often from multiple sources

Importance: The quality and quantity of collected data directly influence the insights derived from big data analytics

Challenges: Ensuring data accuracy, dealing with large volumes, and integrating diverse data formats


Issues in Data Collection and Techniques

Identifying and addressing common issues in data collection is essential for ensuring the integrity of data

Incomplete data collection

Biases in data due to collection methods
Collection of irrelevant or redundant data

To overcome common issues, several techniques can be employed:

Automation: Use scripts and APIs to collect data systematically

Validation: Implement real-time data validation to catch errors early
Sampling: Employ statistical sampling techniques to manage large
volumes


Data Preprocessing

Data preprocessing is a very important step

It helps improve quality of data
Makes the data ready and more suitable for analytics
Should be followed and guided by a thorough EDA
EDA helps identify quality issues in data that are dealt with in this step


Issues with data

Bad Formatting: Grade ’A’ vs. ’a’

Trailing Space: Extra spaces in commentary, white font ’,’ to avoid
plagiarism detection
Duplicates and Redundant Data: A repeated ball could be confused
with a wide/no-ball, a repeated grade could be confused with a repetition
Empty Rows: Could cause a lot of trouble during programming
Synonyms, Abbreviations: rhb, right hand batsman
Skewed Distribution and Outliers: Outliers could be points of interest
or could be just noise, errors, extremities
Missing Values: Missing grades, missing score
Different norms, units, and standards: miles vs. kilometers
1999: NASA lost equipment worth $125m because of an engineering
mistake of not converting English units to metric units
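
Several of the issues listed above can be handled with a few lines of normalization. The following is a minimal sketch, not a general-purpose cleaner: the record layout, the `SYNONYMS` table, and the helper `clean_row` are hypothetical names introduced only for illustration.

```python
# Hypothetical raw records: (batting style, grade, distance, unit)
raw_rows = [
    ("rhb ", "a", "10", "miles"),
    ("Right Hand Batsman", "A", "16.09", "km"),
    ("", "", "", ""),                      # empty row
    (" RHB", "B ", "5", "km"),
]

SYNONYMS = {"rhb": "right hand batsman"}   # assumed abbreviation table
KM_PER_MILE = 1.609344

def clean_row(row):
    """Return a normalized record, or None for an empty row."""
    # Trailing/leading spaces: strip every field
    style, grade, distance, unit = (field.strip() for field in row)
    if not any((style, grade, distance, unit)):
        return None                        # drop empty rows
    # Bad formatting and synonyms: one case, one spelling
    style = style.lower()
    style = SYNONYMS.get(style, style)
    grade = grade.upper()
    # Different units: store every distance in kilometers
    factor = KM_PER_MILE if unit.lower() == "miles" else 1.0
    return (style, grade, round(float(distance) * factor, 2))

cleaned = [r for r in map(clean_row, raw_rows) if r is not None]
```

After cleaning, the first two records become identical, which is what makes a later duplicate check meaningful.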

Steps in Preprocessing

Steps and processes are performed when necessary

[Figure: Data Preprocessing at the center, linked to Data Integration, Data Cleaning, Data Reduction, and Data Transformation]


Data Cleaning


Data Cleaning

Data cleaning is a critical process that ensures the accuracy and completeness of data in analytics
It involves correcting or removing incorrect, corrupted, incorrectly
formatted, duplicate, or incomplete data within a dataset

Objective: Enhance data quality to produce reliable analytics

Common Issues: Inconsistencies, missing values, noise, and outliers.

Also called data scrubbing, data munging, data wrangling

Dealing with Missing values

Noise Smoothing
Correcting Inconsistencies
Identifying Outliers


Data Cleaning: Missing Values

Missing data is very common and generally significantly consequential

Causes:
Changes in experiments
human/data entry error
measurement impossible
hardware failure
human bias
combined datasets
source: Azure AI Gallery

Missing values can have a meaning, e.g. absence of a medical test could mean that it was not conducted for a reason
Knowing why and how data is missing could help in data imputation


Data Cleaning: Missing Values

Knowing why and how data is missing could help in data imputation

Missing Completely at Random (MCAR)

Missingness independent of any observed or unobserved variables

Missing at Random (MAR)

Missingness independent of missing values or unobserved variables
Missingness depends on observed variables with complete info

Missing Not at Random (MNAR)

Missingness depends on the missing values or unobserved variables


Data Cleaning: Missing Values - MCAR

Missing Completely at Random (MCAR)

Missingness independent of any observed or unobserved variables
Values of a variable being missing is completely unsystematic
This assumption can somewhat be verified by examining complete
and incomplete cases
Data is likely representative sample and analysis will be unbiased

Age 25 26 29 30 30 31 44 46 48 51 52 54
IQ 121 91 110 118 93 116

Note that values of the age variable are roughly the “same” when the IQ value is
missing and when it is not


Data Cleaning: Missing Values - MAR

Missing at Random (MAR)

Missingness independent of missing values or unobserved variables
Missingness depends on observed variables with complete info
The event that a value for Variable 1 is missing depends only on
other observed variables with no missing values
Not statistically verifiable (rely on subjective judgment)

Age 25 26 29 30 30 31 44 46 48 51 52 54
IQ 118 93 116 141 97 104

Note that only young people have missing values for IQ

Shouldn’t be the case that only high IQ people have missing values
Or that only males have IQ values missing (unobserved variable)


Data Cleaning: Missing Values - MNAR

Missing Not at Random (MNAR)

Missingness depends on the missing values or unobserved variable(s)
Pattern is non-random, non-ignorable, and typically arises due to the
variable on which the data is missing
Generally very hard to ascertain the assumption
e.g. only low IQ people have missing values
Or only males have missing IQ values

Age 25 26 29 30 30 31 44 46 48 51 52 54
IQ 133 121 110 118 116 141 104


Data Cleaning: Dealing with missing values

Ignore the objects with missing attributes

May lose many objects
Ignore the attribute which has “many” missing values
May lose many meaningful attributes
What if the class label is missing?
Impute Data
Domain knowledge and understanding of missing values help

source: towards data science


Data Cleaning: Data Imputation

Manually fill in, works for small data and few missing values
Use a global constant, e.g. MGMT Major, or Unknown, or ∞
Substitute a measure of central tendency, e.g. mode, mean or median
Missed Quiz: student mean, class mean, class mean in this or all
quizzes, the student mean in remaining quizzes
Cricket DLS system
Use class-wise mean or median
for a missing player’s score in a match, use the player’s average, average of
Pak batsmen, average of Pak batsmen against India, average of middle
order Pak batsmen against India in Summer in Sharjah

Use average of top k similar objects ▷ based on non-missing attributes

can be weighted by similarity, or use a similarity-weighted average of all other data objects
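
The imputation strategies above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the quiz records, the `hours_studied` attribute used as the similarity measure, and the helper names are all hypothetical, and the k-nearest variant uses an unweighted average.

```python
from statistics import mean

# Hypothetical quiz records: (section, hours_studied, score); None = missing score
records = [
    ("A", 2.0, 60.0), ("A", 3.0, 70.0), ("A", 2.5, None),
    ("B", 6.0, 90.0), ("B", 5.0, 80.0), ("B", 5.5, None),
]

observed = [r for r in records if r[2] is not None]

def global_mean():
    """Central tendency over all observed scores."""
    return mean(score for _, _, score in observed)

def class_mean(section):
    """Class-wise mean: only objects from the same section."""
    return mean(score for sec, _, score in observed if sec == section)

def knn_mean(hours, k=2):
    """Average score of the k objects most similar on the non-missing attribute."""
    nearest = sorted(observed, key=lambda r: abs(r[1] - hours))[:k]
    return mean(score for _, _, score in nearest)

imputed_global = [s if s is not None else global_mean() for _, _, s in records]
imputed_class = [s if s is not None else class_mean(sec) for sec, _, s in records]
imputed_knn = [s if s is not None else knn_mean(h) for _, h, s in records]
```

Here the global mean fills both gaps with 75, while the class-wise and nearest-neighbor variants fill them with 65 and 85, which respects the difference between the two sections.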


Data Cleaning: Data Imputation

Advanced techniques for imputing missing values

Expectation Maximization Imputation

Regression based Imputation


Data Cleaning: Noise

Noise: Random error or variation in measured data

Elimination is generally difficult
Analytics should be robust to have acceptable quality despite
presence of noise


Data Cleaning: Handling Noise and Outliers

Noise and outliers can distort the true picture of data insights and must
be managed carefully

Age Salary
25 50,000
30 55,000
35 60,000
40 650,000
Table: Data with Outlier in Salary


Data Cleaning: Noise

Dealing with noise

Smoothing by Binning
Essentially replace each value by the average of values in the bin
Could be mean, median, midrange etc. of values in the bin
Could use equal width or equal depth (sized) bins

Smoothing by local neighborhoods

k-nearest neighbors, blurring, boundaries
Smoothing is also used for data reduction and discretization

Smoothing Time Series

Moving Average
Divide by variance of each period/cycle
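
Two of the smoothing ideas above, binning and the moving average, can be sketched as follows. This is a minimal illustration assuming equal-depth bins and bin means; the data values and function names are made up for the example.

```python
from statistics import mean

def smooth_by_bin_means(values, bin_size):
    """Equal-depth binning: sort, cut into bins of bin_size values,
    and replace each value by the mean of its bin.
    The result is returned in sorted order, not in the original order."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

def moving_average(series, window):
    """Simple moving average over a sliding window;
    the output has len(series) - window + 1 points."""
    return [mean(series[i:i + window])
            for i in range(len(series) - window + 1)]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]      # hypothetical values
binned = smooth_by_bin_means(prices, 3)

readings = [10, 12, 11, 30, 12, 11, 10]          # one noisy spike at 30
trend = moving_average(readings, 3)
```

Replacing `mean` with `median` in the binning function gives smoothing by bin medians; the moving average spreads the spike over its neighbors instead of removing it.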


Data Cleaning: Correcting Inconsistencies

Inconsistencies in data can arise from various sources such as human error,
data migration, or integration of multiple datasets

ID Product Name Price

1 Product-A 20
2 product-a 20
3 PRODUCT-A 19
Table: Inconsistent Data Entries


Data Cleaning: Correcting Inconsistencies

Data can contain inconsistent values

e.g. an address with both ZIP code and city, but they don’t match

source: medium.com

Some are easy to detect, e.g. negative age of a person

Some require consulting an external source
Correcting inconsistencies may require additional information


Data Cleaning: Identifying Outliers

Outliers are either

Objects that have characteristics substantially different from most other data
▷ the object is an outlier
Value of a variable that is substantially different than the variable’s typical
values
▷ the feature value is an outlier

Unlike noise, outliers can be legitimate data or values

Outliers could be points of interest
Consider student records in Zambeel: what values of age could be
noise
inconsistency
outlier


Data Integration


Data Integration

Data integration involves combining data from different sources to provide a unified view. This process is crucial for comprehensive analysis but comes with challenges

Objective: To merge diverse datasets into a coherent whole

Common Issues: Inconsistencies, entity resolution, duplication

Inconsistencies arise when data from different sources conflict in format, scale, or interpretation

Date (Source 1) Date (Source 2)

2024-04-14 14/04/2024
2024-04-15 15/04/2024
Table: Format inconsistencies in date fields from two sources.
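
The date conflict in the table above can be resolved by parsing each source with its own format and emitting one canonical form. The sketch below assumes Source 2 is written day/month/year; the helper `to_iso` is a hypothetical name.

```python
from datetime import datetime

source_1 = ["2024-04-14", "2024-04-15"]   # year-month-day
source_2 = ["14/04/2024", "15/04/2024"]   # assumed day/month/year

def to_iso(text, fmt):
    """Parse a date string with the given format and return ISO 8601 text."""
    return datetime.strptime(text, fmt).date().isoformat()

unified_1 = [to_iso(d, "%Y-%m-%d") for d in source_1]
unified_2 = [to_iso(d, "%d/%m/%Y") for d in source_2]
```

Once both columns are in the same format, they can be compared or joined directly.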


Data Integration

Merging data from multiple sources

e.g. RO and Admissions data; Cricinfo and PCB data

Entity identification problem

Data merging causes or requires
Data duplication and redundancy
Data conflict & inconsistencies


Data Integration

Entity Identification Problem: Objects do not have same IDs in all sources
e.g. Sentiment analysis on cricket match tweets to assess player contribution
Network Reconciliation Project

Schema Integration
Object Matching
Make sure that player ID in cricinfo dataset is the same as player code
in PCB data (source of domestic games)

Check metadata, names of attributes, range, data types and formats


Data Integration

Object Duplication: instance/object etc. may be duplicated

Occasionally two or more objects can have all feature values identical,
yet they could be different instances
e.g. two students with the same grades in all courses


Data Integration

Redundancy and Correlation Analyses

Redundant (not necessarily duplicate) features

Sometimes caused by data integration ▷ Data duplication
An attribute is redundant if it can be derived from one or more others
e.g. if runs scored and balls faced are given, then no need to store
strike rate
If aggregate score in course is given in absolute grading, then no need
to store letter grade

Covariance/Correlation and χ²-statistics are used for pairs of numerical or ordinal/categorical attributes
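
For numerical attributes, a correlation coefficient close to ±1 is one signal of redundancy. The sketch below computes the Pearson correlation by hand; the attribute values are hypothetical, and the miles/kilometers pair is chosen because one is an exact linear function of the other.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

miles = [1.0, 2.0, 3.5, 5.0, 8.0]
kilometers = [m * 1.609344 for m in miles]   # derived attribute, hence redundant
balls_faced = [30, 12, 45, 20, 60]           # unrelated hypothetical attribute

r_redundant = pearson(miles, kilometers)     # essentially 1
r_other = pearson(miles, balls_faced)        # clearly below 1
```

Correlation only detects linear dependence; an attribute such as strike rate, which is a ratio of two others, is redundant without necessarily being highly correlated with either of them.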


Data Integration

Data Value Conflict Detection and Resolution

Sometimes there are two conflicting values in different sources

e.g. name is spelled differently in educational and NADRA’s record
This might require expert knowledge


Entity Resolution

Entity resolution is the process of linking and merging records that correspond to the same entities from different databases.

Name (Source 1) Email (Source 1) Email (Source 2)

John Doe [email protected] [email protected]
Jane Smith [email protected] [email protected]
Table: Different email formats for the same individuals across sources.


Data Integration: Data Duplication

Duplication occurs when identical or nearly identical records exist across datasets, leading to redundancy and possible errors in analysis.

Customer ID Name
1 John Doe
1 John Doe
Table: Duplicate records in customer data.


Data Reduction


Data Reduction

Sometimes we do not need all the data

We reduce the data in either direction
Reduce instances
Reduce dimensions

Helps reduce computational complexity

Reduces storage requirements
Makes data visualization more effective
Get a representative sample of data
Potentially enhanced model performance

[Figure: Four Classes Dataset and a Random Sample drawn from it]


Data Reduction: Sampling

Equal probability sampling of k out of n objects

select objects from an ordered sampling window
first select an object, then every (n/k)th element (going circular)
If there is some peculiar regularity in how the objects are ordered,
there is a risk of getting a very bad sample

Random Sampling of k out of n objects

Randomly permute objects (shuffle)
Select the first k in this order
Deals with the above regularity issue, but if there is big imbalance
among classes or groups, we can get very bad sample
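
The two schemes above can be sketched as follows, assuming k ≤ n. The function names are illustrative, and the fixed seed is there only to make the example reproducible.

```python
import random

def systematic_sample(objects, k, rng):
    """Pick a random start, then every (n/k)-th object, wrapping around."""
    n = len(objects)
    step = n // k
    start = rng.randrange(n)
    return [objects[(start + i * step) % n] for i in range(k)]

def random_sample(objects, k, rng):
    """Shuffle a copy of the objects and keep the first k."""
    shuffled = list(objects)
    rng.shuffle(shuffled)
    return shuffled[:k]

rng = random.Random(0)            # fixed seed only for reproducibility
population = list(range(100))     # hypothetical ordered objects
sys_sample = systematic_sample(population, 10, rng)
rand_sample = random_sample(population, 10, rng)
```

With a step of 10, every object in the systematic sample has the same last digit. That is exactly the kind of regularity that produces a bad sample when the ordering of the data has a period of 10.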


Data Reduction: Sampling

Stratified Sampling of k out of n objects

Suppose data is grouped into groups (strata)
Randomly sample k/n fraction from each stratum
New sample will exhibit the distribution of the population
Works for imbalanced classes but is computationally expensive

Clustered Sampling of k out of n objects

Cluster data items based on some ‘similarity’ (details later)
Randomly sample k/n fraction from each cluster
Efficient but not necessarily optimal, similarity definition is crucial
Underlying assumption is that similarity captures the classes
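
Stratified sampling can be sketched as sampling the same fraction from each group. In this sketch the data, the `stratum_of` callback, and the choice to keep at least one object per stratum are assumptions made for illustration; the same function gives clustered sampling if `stratum_of` returns a cluster label.

```python
import random
from collections import defaultdict

def stratified_sample(objects, stratum_of, fraction, rng):
    """Sample roughly `fraction` of the objects from every stratum."""
    strata = defaultdict(list)
    for obj in objects:
        strata[stratum_of(obj)].append(obj)
    sample = []
    for members in strata.values():
        # Assumed design choice: never drop a stratum entirely
        size = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, size))
    return sample

rng = random.Random(0)            # fixed seed only for reproducibility
# Hypothetical imbalanced data: 97 "stay" vs. 3 "attrite"
data = [("stay", i) for i in range(97)] + [("attrite", i) for i in range(3)]
sample = stratified_sample(data, lambda obj: obj[0], 0.2, rng)
```

A plain random sample of 20 objects from this data could easily contain no "attrite" object at all; the stratified sample always contains one.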


Data Reduction: Sampling

Imbalanced Classes: Classes or groups have huge differences in frequencies and the target class is rare
Class imbalance is a common issue where some classes are significantly
underrepresented in the data, potentially leading to biased models.

Attrition prediction: 97% stay, 3% attrite (in a month)

Medical diagnosis: 95% healthy, 5% diseased
eCommerce: 99% do not buy, 1% buy
Security: > 99.99% of people are not terrorists
Similar situation with multiple classes
Predictions can be 97% correct, but useless
Requires special sampling methods, oversampling, undersampling


Data Reduction: Feature Selection

More importantly, one does dimensionality reduction

We will study in some detail the Curse of Dimensionality (problems associated with high dimensions and difficulties in dealing with higher dimensional vectors)

We will discuss these techniques for dimensionality reduction (time permitting)
Locality Sensitive Hashing
Johnson-Lindenstrauss Transform
AMS Sketch
PCA and SVD


Data Reduction: Feature Selection & Extraction

Represent data by fewer (and “better”) attributes

The new features should be such that the probability distribution of the class is roughly the same as the one obtained from the original features

[Figure: Feature Selection vs. Feature Extraction, each mapping the original data to a new representation]


Data Reduction: Feature Selection and Correlation Analysis

Feature selection reduces the number of input variables by selecting only the relevant features, often using statistical tests for association like correlation coefficients or chi-square tests.

High correlation between two features might mean redundancy.

Chi-square tests are used to determine the independence of two
categorical variables.
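
The chi-square statistic for two categorical variables compares observed counts in a contingency table with the counts expected under independence. The sketch below computes it by hand; both tables are made-up examples, and 3.841 is the standard 5% critical value for one degree of freedom (a 2×2 table).

```python
def chi_square_statistic(table):
    """Chi-square statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in cell (i, j) if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

independent = [[20, 30], [40, 60]]   # rows are proportional: no association
dependent = [[45, 5], [10, 40]]      # strong association

chi2_independent = chi_square_statistic(independent)
chi2_dependent = chi_square_statistic(dependent)
CRITICAL_5_PERCENT_DF1 = 3.841
```

A statistic above the critical value, as for the second table, suggests the two variables are not independent, so one of them may carry information that the other already provides.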


Data Transformation


Data Transformation

Data transformation involves converting raw data into a format that is more appropriate for analysis.
Values in the original data are transformed via a mathematical function so that

Data is compatible with machine learning algorithms

Analytics is more efficient - improved data consistency
Analytics is more meaningful - Enhanced model accuracy
Visualization is more meaningful and easier

Data Transformation

source: 7B Software


Data Transformation

Values in the original data are transformed via a mathematical function

Depending on the given data and the requirements of analytics, this includes

Ordinal to Numeric ▷ We will discuss it later

Smoothing ▷ e.g. by binning (see dealing with noise)
Aggregation (e.g. GPA from grades)
Discretization and Quantization ▷ needed e.g. for decision trees

source: www.audiolabs-erlangen.de

Standardization, scaling and normalization


Standardization and Scaling

The goal is to make an entire set of values have a particular property

e.g. variables to have the same range, same unit (or lack thereof)
to shift the data to a manageable range e.g. shifting to positive

Variety of possibilities for different applications


Standardization and Scaling

Scaling data so it falls in a smaller, comparable or manageable range

Data could be in different units e.g. kilometers and miles

Units might not be known
Small units mean larger values and larger ranges
In values of “norms” and many distance measures, attributes of
smaller units get more weight than attributes with larger units
After scaling, all attributes get the same weight
Huge implications in distance values (see clustering & recommenders)


max-min Scaling

Transform the data (values of an attribute X) so that every value is ≤ 1

$$x_i' = \frac{x_i}{X_{\max}}$$

[Figure: attribute X on a scale from X_min = 70 to X_max = 100, mapped to X′ on a scale from 0 to 1]

new max is 1 ▷ new min could be negative

Preserves relationships among original objects
max, min, median and all quantiles are the same objects
May get very narrow range within [0, 1]

Original Value Scaled Value
10 0
20 0.5
30 1
Table: Data before and after Min-Max Scaling

max-min Scaling

Transform the data (values of an attribute X ) to the interval [0, 1]

$$x_i' = \frac{x_i - X_{\min}}{X_{\max} - X_{\min}}$$

[Figure: attribute X on a scale from X_min to X_max, mapped to X′ on a scale from 0 to 1]

First shift everything to [0, X_max − X_min] by subtracting X_min, then divide by X_max − X_min

We get different (scaled) std-dev, can suppress effect of outliers
If attribute Y is also scaled similarly, then X and Y are comparable
e.g. two sections, one with harsh and one with lenient grading; GIKI and LUMS GPA
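
Both scalings can be written in a few lines. This is a minimal sketch; the function names and the convention of mapping a constant attribute to 0 are assumptions, not part of the slides.

```python
def max_scale(values):
    """Divide by the maximum: the new maximum is 1."""
    hi = max(values)
    return [v / hi for v in values]

def min_max_scale(values):
    """Shift by the minimum, then divide by the range: result lies in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # assumed convention for a constant attribute
    return [(v - lo) / (hi - lo) for v in values]

by_max = max_scale([70, 85, 100])      # stays in the narrow range [0.7, 1]
scaled = min_max_scale([10, 20, 30])   # uses the full range [0, 1]
```

The first call illustrates the narrow-range issue of dividing by the maximum only; the second spreads the values over the whole interval.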


z-score Normalization

Transform the data to a scale with mean 0 and std-dev 1

$$x_i' = \frac{x_i - \bar{x}}{\sigma_x}$$
Good, if we don’t know min/max (no full data) or outliers are dominant
in such cases max-min scaled data is harder to interpret
Stable data, common scale, all variables are unit-less and scalar
Resulting data have properties of standard normal ▷ µ = 0, σ = 1
Again the relative order of points is maintained
It makes no difference to the shape of a distribution

Sec1 90 10 50 30 40 80 74 68 61
Sec2 63 40 35 38 21 18 28 19 30
Sec1 1.4 −1.9 −.24 −1.07 −.65 .99 0.75 .5 .21
Sec2 2.3 .3 −.14 .13 .3 −1.6 −.74 .04 −.57
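
The z-scores can be computed as sketched below for the Sec1 marks. The sketch assumes the population standard deviation (dividing by n); the function name is illustrative.

```python
from statistics import fmean, pstdev

def z_scores(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)       # population std-dev (divide by n)
    return [(v - mu) / sigma for v in values]

sec1 = [90, 10, 50, 30, 40, 80, 74, 68, 61]
sec1_z = z_scores(sec1)
```

The transformed values have mean 0 and standard deviation 1, and their relative order is the same as in the raw marks.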

Other families of transformation

In statistical analysis we often transform a variable X by a function f(X) of that variable

It changes the distribution of X or the relationship of X with another variable Y
“Transformations are needed because there is no guarantee that the
world works on the scales it happens to be measured on”
Often it helps and is needed to transform the results back to the
original scale by taking the inverse transform
Mathematical transformations are applied to data to improve its
properties for analysis, which includes enhancing normality, linear
relationships, and uniformity across features
Objectives: Address skewness, improve model performance, and
simplify relationships between variables

Reasons for Transformation

In statistical analysis we often transform a variable X by a function f(X) of that variable

Convenience
Improve the statistical properties of the data
Reduced skew
Equal Spreads - homogeneity of variance

Linear relationship: Normalize relationships between features for better correlation analysis
Additive relations
Enhance algorithm convergence speeds and accuracy
For one variable the first three reasons apply


Reasons for Transformation

In statistical analysis we often transform a variable X by a function f(X)

Convenience

The transformed scale may be as natural as the original and more convenient for a specific purpose
Since transformations often change units, one can transform the data
to a unit that is easier to think about
z-score normalization is extremely useful for comparing variables
expressed in different units
Rather than 101/120, 130/140, and 10/73, easier to work with
percentages. We might want to work with sines rather than degrees


Reasons for Transformation

In statistical analysis we often transform a variable X by a function f(X)

Reducing Skew

Many statistical models assume data is from a certain distribution with fixed parameters ▷ Generally the (easiest) normal distribution
Needed to say something like the probability to get a max/mean etc.
Assumption doesn’t have to be true ▷ Data might have skew


Reasons for Transformation

In statistical analysis we often transform a variable X by a function f(X)

Equal Spread, Homoskedasticity

Data is transformed to achieve approximately equal spread around
the regression line (marginals)
Homoskedasticity: Subsets of data having roughly equal spread
Its opposite property is heteroskedasticity


Common Transformations

In statistical analysis we often transform a variable X by a function f(X)

All the following transformations improve normality

Some reduce the relative distance among values while still preserving
the relative order
They reduce the relative distance of values on the right sides (larger
values) more than the values on the left side
They are used to reduce right skew of data
Issue of dealing with left skew of data is discussed afterwards


Transformations to Reduce Right Skew

Right skew in data can be handled effectively using transformations that compress large values more than smaller ones

Logarithmic Transformation: Reduces multiplicative relationships to additive.
Square Root Transformation: Mildly reduces skew and is useful for
count data.


Common Transformations: Logarithms

x ′ = log x

It has major effect on the shape of the distribution

Commonly used to reduce right skewness
Often appropriate for measured variables (real numbers)
Since the log of zero and of negative numbers is not defined and that of numbers
0 < x < 1 is negative, we must shift values to a minimum of 1.00
Can use different bases (commonly used: natural log, base 2, base 10)
One often tries several bases before settling on one

Higher bases pull larger values drastically
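
The shift-then-log recipe can be sketched as follows. The data values are made up, `shifted_log` is a hypothetical helper, and skewness is measured here with the standard third-moment formula, which is an assumption about how one would quantify the effect.

```python
from math import log
from statistics import fmean, pstdev

def skewness(values):
    """Third standardized moment: positive for right skew, negative for left skew."""
    mu, sigma = fmean(values), pstdev(values)
    return fmean([((v - mu) / sigma) ** 3 for v in values])

def shifted_log(values, base=None):
    """Shift the data so that its minimum is 1, then take logarithms."""
    shift = 1.0 - min(values)
    if base is None:
        return [log(v + shift) for v in values]      # natural log
    return [log(v + shift, base) for v in values]

raw = [-2, 0, 1, 3, 4, 97]        # hypothetical right-skewed data with a negative value
logged = shifted_log(raw)

skew_before = skewness(raw)
skew_after = skewness(logged)
```

The transformed data keeps the original order of the points but has a noticeably smaller right skew, because the large value 97 is pulled in far more than the small ones.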


Common Transformations: Cube-root

$$x' = x^{1/3}$$

Has significant effect on shape of distribution ▷ weaker than log

Reduces right skew
Can be applied to 0 and negative numbers
Cube root of a volume has the units of a length


Common Transformations: Square-root
$$x' = \sqrt{x}$$

Reduces right skew

Square root of an area has the unit of a length
Commonly applied to counted data
Negative values must first be shifted to positives
Important consideration: the root of x ∈ (0, 1) is ≥ x, while the root of
x ∈ [1, ∞) is ≤ x, so we must be careful
Might not be desirable to treat some numbers differently than others,
though the relative order of values will be maintained
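
The different treatment of values below and above 1 is easy to see on a small, made-up list:

```python
from math import sqrt

values = [0.04, 0.25, 1.0, 4.0, 25.0]   # hypothetical non-negative data
roots = [sqrt(v) for v in values]

# Values below 1 grow under the square root, values of at least 1 shrink
grows = [r >= v for v, r in zip(values, roots) if v < 1]
shrinks = [r <= v for v, r in zip(values, roots) if v >= 1]
```

Even though small values move up and large values move down, the transformed list is still in increasing order.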


Reciprocal and Negative Reciprocal Transformations

$$x' = \frac{1}{x} \quad \text{or} \quad x' = -\frac{1}{x}$$
Cannot be applied to 0 ▷ used when all data is positive or negative
population density (people per unit area) becomes area/person
persons per doctor becomes doctors per person
rates of erosion become time to erode a unit depth

Reciprocal reverses order among values of the same sign

Makes very large numbers very small and very small numbers very large
Negative reciprocal preserves order among values of the same sign,
this is commonly used
This has the strongest effect
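
The effect of the two variants on the order of positive values can be checked directly; the data below is made up for illustration.

```python
values = [0.5, 1.0, 2.0, 4.0, 10.0]          # hypothetical positive data, increasing

reciprocal = [1 / v for v in values]         # order is reversed
neg_reciprocal = [-1 / v for v in values]    # order is preserved
```

This is why the negative reciprocal is the more commonly used form: the largest original value is still the largest after the transformation.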


Left Skewed Data: Squares and higher powers

All the above transformations essentially deal with right skew

Left skew (or negative skew) can be reduced by applying transformations
that expand larger values more than smaller ones.
Left skew can also be handled by first reflecting the data (multiply by −1) and then
applying the right-skew transformations above
Generally one needs to shift the data to a new minimum of 1.0 after
reflection and then apply the transform

Squaring: Amplifies larger values disproportionately compared to smaller ones, suitable for data with negative values after adjustment.
Cubing: Stronger effect than squaring, can also handle zero and
negative values.

$$x' = x^2$$

moderate effect on shape of distribution
can be used to reduce left skew
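
The reflect, shift, and transform recipe for left-skewed data can be sketched as follows. The data is made up, the helper names are hypothetical, and skewness is measured with the standard third-moment formula.

```python
from math import log
from statistics import fmean, pstdev

def skewness(values):
    """Third standardized moment: negative for left skew, positive for right skew."""
    mu, sigma = fmean(values), pstdev(values)
    return fmean([((v - mu) / sigma) ** 3 for v in values])

def reflect_and_shift(values):
    """Multiply by -1 and shift so that the new minimum is 1."""
    top = max(values)
    return [top + 1 - v for v in values]

raw = [1, 90, 95, 96, 98, 99, 100]     # hypothetical left-skewed marks
reflected = reflect_and_shift(raw)     # now right-skewed, minimum 1
transformed = [log(v) for v in reflected]

skew_raw = skewness(raw)
skew_reflected = skewness(reflected)
skew_transformed = skewness(transformed)
```

Reflection turns the left skew into a right skew of the same size, which the logarithm then reduces. Note that reflection reverses the order of the values, so results must be read accordingly.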

Transformation to make linear relationship

Suppose we want to describe a variable Y in terms of X

We want to express it as linear relationship

Y = aX + b

Transformation in many cases helps us fit a good line


Transformation to make linear relationship

Y = aX + b


Transformation to make linear relationship

Y = aX + b

Instead, express as Y = aX^2 + b


Transformation to make linear relationship

Y = aX + b

Can also do log Y = aX + b
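
Fitting a line to log Y instead of Y can be sketched with ordinary least squares. The data below is synthetic and generated from an exact exponential relationship, so the fit recovers the coefficients; `fit_line` is a hypothetical helper.

```python
from math import exp, log

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [exp(0.5 * x + 1.0) for x in xs]        # hypothetical exponential relationship

# Fit log Y = aX + b, then map predictions back with the inverse transform
a, b = fit_line(xs, [log(y) for y in ys])
predicted = [exp(a * x + b) for x in xs]
```

A straight line fitted to the raw (X, Y) pairs would miss the curvature; on the log scale the relationship is exactly linear, and the inverse transform returns predictions on the original scale.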

No ratings yet
Cv pro YAG - Anglais
5 pages
Lec_16_PCA
No ratings yet
Lec_16_PCA
64 pages
Software Engineering Questions
No ratings yet
Software Engineering Questions
38 pages
Group_2_Assignment_2
No ratings yet
Group_2_Assignment_2
20 pages
Keys to successful inkjet printing for labels and packaging
No ratings yet
Keys to successful inkjet printing for labels and packaging
24 pages
BPCM0507 - PowerGem Plus RT Brochure Issue 3 1
No ratings yet
BPCM0507 - PowerGem Plus RT Brochure Issue 3 1
6 pages
003-KNN Complete Updated
No ratings yet
003-KNN Complete Updated
72 pages
Java Abstract Window Toolkit
No ratings yet
Java Abstract Window Toolkit
24 pages
002-Supervised Learning Setup 00 W2L1
No ratings yet
002-Supervised Learning Setup 00 W2L1
18 pages
12c On Oracle Linux 5 - Red Hat
No ratings yet
12c On Oracle Linux 5 - Red Hat
98 pages
Csuite Magazine - V7 2022
No ratings yet
Csuite Magazine - V7 2022
48 pages
Transaction
No ratings yet
Transaction
43 pages
Xam Idea English Class 12 Writing Portion + Summary
No ratings yet
Xam Idea English Class 12 Writing Portion + Summary
174 pages
Chapter 2 - Communication: Distributed Systems (IT 441)
No ratings yet
Chapter 2 - Communication: Distributed Systems (IT 441)
59 pages
eureka_i5
No ratings yet
eureka_i5
14 pages
CS436_CS5310_EE513_L05_CNN2
No ratings yet
CS436_CS5310_EE513_L05_CNN2
27 pages
03-Multimedia Software Tools
No ratings yet
03-Multimedia Software Tools
29 pages
June 2023 QP - Paper 2 OCR Computer Science a-level
No ratings yet
June 2023 QP - Paper 2 OCR Computer Science a-level
32 pages
Company Profile Al Nafi
100% (1)
Company Profile Al Nafi
13 pages
11+ Related Numbers (Year 5) Black & White - V11WAE14719
No ratings yet
11+ Related Numbers (Year 5) Black & White - V11WAE14719
4 pages
Step Into RPA
No ratings yet
Step Into RPA
22 pages
WickedRose andNCPH PDF
No ratings yet
WickedRose andNCPH PDF
10 pages
Battery Model Parameter Estimation Using A Layered Technique: An Example Using A Lithium Iron Phosphate Cell
No ratings yet
Battery Model Parameter Estimation Using A Layered Technique: An Example Using A Lithium Iron Phosphate Cell
15 pages
A Survey On Wireless Mesh Network and Its Challenges at The Transport Layer
No ratings yet
A Survey On Wireless Mesh Network and Its Challenges at The Transport Layer
10 pages
1mathematics SM025 (Integration) by Azmil Hafez (Student)
No ratings yet
1mathematics SM025 (Integration) by Azmil Hafez (Student)
32 pages
Rubric Group Assignment March 2020
No ratings yet
Rubric Group Assignment March 2020
3 pages
Bringing The Power of AI and RPA Together With AI Fabric
No ratings yet
Bringing The Power of AI and RPA Together With AI Fabric
9 pages
Project Based Learning
No ratings yet
Project Based Learning
13 pages
Certified Lean Six Sigma Green Belt (ICGB) Practice Questions And Exam Tests ICGB Exam Guidebook And Updated Questions
From Everand
Certified Lean Six Sigma Green Belt (ICGB) Practice Questions And Exam Tests ICGB Exam Guidebook And Updated Questions
Idea Link
No ratings yet