Lec 3 Data Preprocessing and Transformation
Figure: data pipeline with the stages Data Collection, Data Integration, Data Preprocessing, Data Cleaning, Data Transformation, Data Reduction
Causes of missing data:
Changes in experiments
human/data entry error
measurement impossible
hardware failure
human bias
combined datasets
source: Azure AI Gallery
Knowing why and how data is missing could help in data imputation
Age 25 26 29 30 30 31 44 46 48 51 52 54
IQ 121 91 110 118 93 116
Note that the values of the age variable are roughly the "same" whether the IQ value is
missing or not
Age 25 26 29 30 30 31 44 46 48 51 52 54
IQ 118 93 116 141 97 104
Age 25 26 29 30 30 31 44 46 48 51 52 54
IQ 133 121 110 118 116 141 104
Manually fill in, works for small data and few missing values
Use a global constant, e.g. MGMT Major, or Unknown, or ∞
Substitute a measure of central tendency, e.g. mode, mean or median
Missed Quiz: student mean, class mean, class mean in this or all
quizzes, the student mean in remaining quizzes
Cricket DLS system
Use class-wise mean or median
for a missing player's score in a match, use the player's average, the average of
Pak batsmen, the average of Pak batsmen against India, or the average of middle
order Pak batsmen against India in summer in Sharjah
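The filling strategies above can be sketched in a few lines. This is only an illustration: the quiz scores, the section labels, and the function names are made up for the example, and `None` is assumed to mark a missing value.

```python
from statistics import mean, median


def impute_constant(values, fill):
    """Replace every missing value (None) with a global constant."""
    return [fill if v is None else v for v in values]


def impute_central(values, measure=mean):
    """Replace missing values with a measure of central tendency
    (mean by default) computed over the observed values only."""
    observed = [v for v in values if v is not None]
    fill = measure(observed)
    return [fill if v is None else v for v in values]


def impute_classwise(values, classes, measure=mean):
    """Replace missing values with the measure computed within the
    same class. Assumes every class has at least one observed value."""
    by_class = {}
    for v, c in zip(values, classes):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    fills = {c: measure(vs) for c, vs in by_class.items()}
    return [fills[c] if v is None else v for v, c in zip(values, classes)]


# Hypothetical quiz scores; None marks a missed quiz.
scores = [8, None, 6, 10, None, 2]
sections = ["A", "A", "A", "B", "B", "B"]

constant_filled = impute_constant(scores, 0)
mean_filled = impute_central(scores)              # overall mean of 8, 6, 10, 2
median_filled = impute_central(scores, median)    # overall median
class_filled = impute_classwise(scores, sections) # mean within each section
```

The class-wise version is the quiz and cricket idea in miniature: the narrower the group the fill value is computed from, the more specific the estimate, but the fewer observations support it.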
Noise and outliers can distort the true picture of data insights and must
be managed carefully
Age Salary
25 50,000
30 55,000
35 60,000
40 650,000
Table: Data with Outlier in Salary
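One possible way to flag the suspicious salary in the table is a robust score based on the median and the median absolute deviation (MAD). The slides do not prescribe a detection method, so the technique, the function name, and the conventional cutoff of 3.5 are assumptions made for this sketch.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD. Median and MAD
    are used because, unlike mean and standard deviation, they are
    barely affected by the outlier itself.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # more than half the values are identical; score undefined
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Salary column from the table above.
salaries = [50_000, 55_000, 60_000, 650_000]
flagged = mad_outliers(salaries)
```

Whether a flagged value is an error (an extra zero typed in) or a genuine extreme still has to be decided by looking at the data source.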
Inconsistencies in data can arise from various sources such as human error,
data migration, or integration of multiple datasets
source: medium.com
Entity Identification Problem: Objects do not have same IDs in all sources
e.g. Sentiment analysis on cricket match tweets to assess player contribution
Network Reconciliation Project
Schema Integration
Object Matching
Make sure that player ID in cricinfo dataset is the same as player code
in PCB data (source of domestic games)
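A minimal sketch of object matching between two sources, assuming the only shared attribute is the player name. The IDs, names, and the `normalize` and `match_ids` helpers are hypothetical; real cricinfo or PCB data would need a more careful matching rule.

```python
def normalize(name):
    """Crude name normalisation: lower-case, drop dots, squeeze spaces."""
    return " ".join(name.lower().replace(".", " ").split())


def match_ids(source_a, source_b):
    """Map IDs in source_a to IDs in source_b via the normalised name.

    Returns (matched, unmatched): a dict from a-ID to b-ID, and the list
    of a-IDs with no counterpart, which need manual review.
    """
    b_by_name = {normalize(name): pid for pid, name in source_b.items()}
    matched, unmatched = {}, []
    for pid, name in source_a.items():
        key = normalize(name)
        if key in b_by_name:
            matched[pid] = b_by_name[key]
        else:
            unmatched.append(pid)
    return matched, unmatched


# Hypothetical records: numeric IDs in one source, string codes in the other.
cricinfo = {101: "A. Player", 102: "B. Batsman", 103: "C. Bowler"}
pcb = {"PK-7": "a player", "PK-9": "B Batsman"}

matched, unmatched = match_ids(cricinfo, pcb)
```

Matching on names alone is fragile, since two different players can share a name; in practice extra attributes such as date of birth or team would be compared as well.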
Occasionally two or more objects can have all feature values identical,
yet they could be different instances
e.g. two students with the same grades in all courses
Customer ID Name
1 John Doe
1 John Doe
Table: Duplicate records in customer data.
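A short sketch of how exact duplicates such as the ones in the table could be detected and dropped. The third record and the function names are added purely for illustration.

```python
from collections import Counter


def find_duplicates(rows):
    """Return each record that occurs more than once, with its count."""
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}


def drop_duplicates(rows):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# (customer_id, name); the first two rows mirror the table above,
# the third is a made-up extra record.
records = [(1, "John Doe"), (1, "John Doe"), (2, "Jane Roe")]

duplicates = find_duplicates(records)
deduplicated = drop_duplicates(records)
```

Dropping is only safe when identical rows really describe the same object. For cases like two students with identical grades, the flagged rows should be reviewed rather than removed automatically.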
Data Transformation
source: 7B Software
source: www.audiolabs-erlangen.de
$x_i' = \frac{x_i}{X_{\max}}$
(Figure: values of X between X_min and X_max, marked 70 and 100, mapped to X′ on an axis from 0 to 1.)
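Scaling by the maximum, as in the formula above, can be sketched as follows. The function name and the sample values are illustrative, and the values are assumed to be non-negative with a positive maximum.

```python
def max_scale(values):
    """Divide every value by the maximum: x' = x / X_max.

    Assumes non-negative data with a positive maximum, so the result
    lies in [X_min / X_max, 1].
    """
    x_max = max(values)
    if x_max <= 0:
        raise ValueError("max scaling needs a positive maximum")
    return [v / x_max for v in values]


# Illustrative scores between 70 and 100.
scores = [70, 85, 100]
scaled = max_scale(scores)
```

Note that the smallest value maps to X_min / X_max rather than to 0, so the scaled data need not cover the whole unit interval.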
$x_i' = \frac{x_i - X_{\min}}{X_{\max} - X_{\min}}$
(Figure: values of X between X_min and X_max mapped to X′ on an axis from 0 to 1.)
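A minimal sketch of min-max scaling as given by the formula above. The function name and sample values are chosen for the example only.

```python
def min_max_scale(values):
    """Map values linearly onto [0, 1]:
    x' = (x - X_min) / (X_max - X_min)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("min-max scaling is undefined for constant data")
    span = x_max - x_min
    return [(v - x_min) / span for v in values]


# Illustrative scores between 70 and 100.
scores = [70, 85, 100]
scaled = min_max_scale(scores)
```

Because the mapping is linear and increasing, the relative order of the points is unchanged; a single extreme value, however, stretches the span and squeezes all other points together.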
$x_i' = \frac{x_i - \bar{x}}{\sigma_x}$
Good if we do not know the min/max (no full data) or if outliers are dominant;
in such cases min-max scaled data is harder to interpret
Stable data, common scale, all variables are unit-less and scalar
Resulting data have the mean and standard deviation of the standard normal ▷ µ = 0, σ = 1
Again the relative order of points is maintained
It makes no difference to the shape of a distribution
Raw scores:
Sec1 90 10 50 30 40 80 74 68 61
Sec2 63 40 35 38 21 18 28 19 30
z-scores:
Sec1 1.40 −1.89 −0.24 −1.07 −0.65 0.99 0.75 0.50 0.21
Sec2 2.31 0.57 0.19 0.42 −0.86 −1.09 −0.34 −1.02 −0.18
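The z-score computation can be checked on the Sec1 scores with a short sketch. Using the population standard deviation (dividing by n) is an assumption; it is the choice that agrees with the Sec1 z-scores listed above up to rounding. The function name is illustrative.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Standardise values: x' = (x - mean) / sigma.

    sigma is the population standard deviation (divide by n),
    an assumption made for this sketch.
    """
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-scores are undefined for constant data")
    return [(v - mu) / sigma for v in values]


sec1 = [90, 10, 50, 30, 40, 80, 74, 68, 61]
z1 = z_scores(sec1)
z1_rounded = [round(z, 2) for z in z1]
```

After the transformation the scores of different sections are on a common, unit-less scale, so a score can be compared across sections by how many standard deviations it lies from its own section mean.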
Imdad ullah Khan (LUMS) Data Preprocessing and Transformation
Other families of transformation
Convenience
Improve the statistical properties of the data
Reduced skew
Equal Spreads - homogeneity of variance
Reducing Skew
$x' = \log x$
$x' = x^{1/3}$
$x' = \frac{1}{x}$ or $x' = -\frac{1}{x}$
The reciprocal cannot be applied to 0 ▷ used when all data are positive or all negative
population density (people per unit area) becomes area/person
persons per doctor becomes doctors per person
rates of erosion become time to erode a unit depth
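The transformations above can be sketched as below. Reading the second formula as a cube root is an assumption, and the function names and the persons-per-doctor figures are made up for the example.

```python
import math


def log_transform(values):
    """x' = log x; defined only for strictly positive values."""
    return [math.log(v) for v in values]


def cube_root(values):
    """x' = x^(1/3); the sign is kept so negative values are allowed."""
    return [math.copysign(abs(v) ** (1 / 3), v) for v in values]


def reciprocal(values, negative=False):
    """x' = 1/x, or x' = -1/x when negative=True. Undefined at 0."""
    if any(v == 0 for v in values):
        raise ValueError("the reciprocal cannot be applied to 0")
    sign = -1.0 if negative else 1.0
    return [sign / v for v in values]


# Hypothetical figures: persons per doctor in three regions.
persons_per_doctor = [250, 500, 1000]
doctors_per_person = reciprocal(persons_per_doctor)
neg_reciprocal = reciprocal(persons_per_doctor, negative=True)
```

For positive data the plain reciprocal reverses the order of the values (the largest persons-per-doctor figure becomes the smallest doctors-per-person figure), while the negative reciprocal keeps the original order.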
Y = aX + b
Instead, express as $Y = aX^2 + b$
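One way to read the last step is that a curved relationship becomes a straight-line fit once X is replaced by X². The sketch below fits a least-squares line to the transformed variable; the data are synthetic (generated from a = 2, b = 1) and the function name is illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Synthetic data that follow Y = 2 X^2 + 1 exactly.
xs = [1, 2, 3, 4, 5]
ys = [2 * x ** 2 + 1 for x in xs]

# A straight line in X leaves systematic residuals.
a_raw, b_raw = fit_line(xs, ys)
raw_error = sum((a_raw * x + b_raw - y) ** 2 for x, y in zip(xs, ys))

# A straight line in the transformed variable X^2 fits exactly.
x_squared = [x ** 2 for x in xs]
a, b = fit_line(x_squared, ys)
sq_error = sum((a * u + b - y) ** 2 for u, y in zip(x_squared, ys))
```

The model is still linear in its parameters a and b; only the input variable has been transformed.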