Week 3
Data Preprocessing
Data Quality
● Completeness: values not recorded, unavailable, or missing; important variables not included
● Consistency: dangling references, or some features modified while related features are not
[email protected]
DLZNK464L9
● Interpretability: how easily the data can be understood (e.g., cryptic variable names or coded values)
● Believability: how much the data is trusted, as perceived by the end user
● Evaluate all of the above to assess data’s fitness for the task
Data Formats: Tidy Data
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
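To make rule 1 concrete, here is a minimal pandas sketch: the "wide" table below spreads the year variable across column headers, and melt reshapes it so each variable forms a column. The data and column names are illustrative, not from the course materials.

```python
import pandas as pd

# Hypothetical "wide" table: one row per subject, one column per year.
wide = pd.DataFrame({
    "subject": ["a", "b"],
    "2023": [10, 20],
    "2024": [15, 25],
})

# Tidy it: each variable (subject, year, value) forms a column,
# and each observation forms a row.
tidy = wide.melt(id_vars="subject", var_name="year", value_name="value")
print(tidy)
```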
Tasks and Methods
● Missing values: Regression (imputation)
● Noisy data: Binning, Regression, Clustering
● Outliers: Clustering, Classification, Box plots
● Redundancy: Correlation/covariance analysis
● Dimensionality reduction: PCA, Feature selection
● Numerosity reduction: Histogram analysis, Sampling, Data compression
● Data discretization: Binning, Histogram analysis, Concept hierarchy transformation
● Scale differences: Data Normalization
● Data cleaning handles:
○ Missing values
○ Noisy data
● Replace empty cells with 'NA', "Missing", etc. For more, see https://support.datacite.org/docs/schema-values-unknown-information-v42
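A minimal pandas sketch of this step; the placeholder tokens and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": ["25", "", "n/a", "41"],
                   "city": ["Tucson", "Missing", "Phoenix", ""]})

# Map ad-hoc placeholder tokens (illustrative list) to a single NA marker.
na_tokens = ["", "n/a", "Missing", "NA", "unknown"]
df = df.replace(na_tokens, np.nan)
print(df.isna().sum())
```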
Adjusted R squared: a criterion for comparing regression models that penalizes predictors which do not improve the fit.
Simple Linear Regression
[Figure: scatter plot of Weight (lbs) vs. Height (inches) with a fitted line y = bx + a. At the observed point (55, 100), the fitted line predicts 150, so the residual is r = 100 − 150 = −50.]
Here r denotes a residual: the difference between the true value and the value predicted by the fitted line.
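A small numpy sketch of fitting y = bx + a by least squares and computing a residual; the height/weight arrays are made-up stand-ins for the figure's data.

```python
import numpy as np

# Illustrative height/weight data; the figure's exact data is not given.
height = np.array([50, 52, 55, 58, 60, 63])
weight = np.array([120, 135, 100, 160, 170, 185])

# Fit y = b*x + a by least squares (polyfit returns [b, a]).
b, a = np.polyfit(height, weight, deg=1)

pred = b * 55 + a      # predicted weight at height 55
resid = 100 - pred     # residual r = true value - predicted value
print(f"prediction: {pred:.1f}, residual: {resid:.1f}")
```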
Multiple Linear Regression
● Multiple linear regression: more than one independent variable, so X and β are vectors (y = Xβ + ε).
● Tips on choosing the best model:
○ http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-choose-the-best-regression-model
● Use for:
○ Missing values: replace them with the model's predicted values (see the sketch below).
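A sketch of regression-based imputation with scikit-learn, assuming one numeric predictor column; the column names and values are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"height": [50, 55, 60, 63, 58],
                   "weight": [120, 100, 170, 185, np.nan]})

# Fit the model on rows where the target is known.
known = df["weight"].notna()
model = LinearRegression().fit(df.loc[known, ["height"]],
                               df.loc[known, "weight"])

# Replace missing weights with the model's predictions.
df.loc[~known, "weight"] = model.predict(df.loc[~known, ["height"]])
print(df)
```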
[email protected]
DLZNK464L9
[email protected]
DLZNK464L9
● Clustering
○ Smooth data by replacing values with cluster centres (as sketched below)
○ Detect and remove outliers/errors
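A minimal KMeans sketch of smoothing by cluster centres; the value list and the choice of k = 3 are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

values = np.array([[4.0], [4.2], [4.1], [9.8], [10.1], [10.0], [25.0]])

# k is a modelling choice; 3 is assumed here for this toy data.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)

# Smooth each value by replacing it with its cluster centre.
smoothed = km.cluster_centers_[km.labels_]
print(smoothed.ravel())
```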
[email protected]
DLZNK464L9
● Empty cells or cells filled with “NA”-like tokens are referred to as missing data.
● Noisy data may contain implicit errors introduced by measurement tools, such as different types of sensors, or random errors.
● There are different ways to handle missing data and noisy data, including various imputation and smoothing methods.
[email protected]
DLZNK464L9
[email protected]
DLZNK464L9
● Clustering
○ Outliers form small, distant clusters or are not included in any cluster (see the sketch below).
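A sketch using DBSCAN, which labels points that join no cluster as -1; the eps and min_samples values are assumptions tuned to this toy data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9],
              [20.0, 20.0]])        # a far-away point

# eps and min_samples are illustrative choices for this data.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)                        # noise points get the label -1
outliers = X[labels == -1]
print(outliers)
```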
● Attributes that are correlated but not redundant should often be kept.
● Careful integration of data from various sources may aid in the reduction/avoidance of redundancies and inconsistencies, and improve mining speed and quality.
● Using the χ2 table, we find the critical value = 10.828 for α = 0.001 and d.f. = 1.
● Since χ2 > 10.828, we reject H0: A and B are correlated.
● Most tests report a p-value; if p-value < α, reject H0 (see the sketch below).
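A sketch of the test with scipy's chi2_contingency; the 2×2 contingency counts are hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of attributes A and B.
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p, dof, expected = chi2_contingency(observed)
alpha = 0.001
print(f"chi2={chi2:.1f}, p={p:.3g}, dof={dof}")
if p < alpha:
    print("Reject H0: A and B are correlated.")
```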
[email protected]
DLZNK464L9
● Correlation coefficient (Pearson):
$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B}$$
○ where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, and $\sigma_A$ and $\sigma_B$ are their standard deviations; the coefficient ranges from −1 to 1.
● Covariance:
$$\mathrm{Cov}(A,B) = E\big[(A - \bar{A})(B - \bar{B})\big] = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$$
○ where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values (E) of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, so that $r_{A,B} = \mathrm{Cov}(A,B)/(\sigma_A \sigma_B)$.
● Negative covariance: Cov(A, B) < 0 indicates the two variables change in different directions: one is larger and the other is smaller than its expected value.
● Independence: if A and B are independent, Cov(A, B) = 0, but the reverse is not true:
○ Some random variable pairs have a covariance of zero yet are not independent. A covariance of 0 implies independence only under certain additional conditions (for example, when the data follow multivariate normal distributions).
● Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14). A worked computation follows below.
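A worked computation (my arithmetic), using the covariance formula above together with the identity $\mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\bar{B}$:

$$\bar{A} = \frac{2+3+5+4+6}{5} = 4, \qquad \bar{B} = \frac{5+8+10+11+14}{5} = 9.6$$

$$\mathrm{Cov}(A,B) = \frac{2 \cdot 5 + 3 \cdot 8 + 5 \cdot 10 + 4 \cdot 11 + 6 \cdot 14}{5} - 4 \times 9.6 = \frac{212}{5} - 38.4 = 4$$

Since Cov(A, B) = 4 > 0, the two stocks tend to rise and fall together.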
[email protected]
DLZNK464L9
● Why data reduction? A database or data warehouse may store terabytes of data, and complex data analysis may take a long time on the complete data set.
[email protected]
DLZNK464L9
[email protected]
DLZNK464L9
[email protected]
DLZNK464L9
[email protected]
DLZNK464L9
● Allow mining algorithms to run at a complexity that is possibly sub-linear to data size.
● Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
● Data reduction can be done by:
○ Dimensionality reduction: the process of removing unimportant attributes (see the PCA sketch after this list).
○ Numerosity reduction: reduces data volume by using smaller forms of data representation.
○ Data compression
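A dimensionality-reduction sketch with scikit-learn's PCA; the random data and the 95% variance target are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 100 tuples, 10 attributes

# Keep enough components to explain ~95% of the variance (an assumption).
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```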
● Sampling is about obtaining a small sample s to represent the whole data set N (a sketch follows below).
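A simple random sampling sketch with pandas; the 1% fraction is an arbitrary choice.

```python
import pandas as pd

N = pd.DataFrame({"value": range(10_000)})

# Simple random sample without replacement; 1% is arbitrary here.
s = N.sample(frac=0.01, random_state=42)
print(len(s))
```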
● Binning and histogram analysis as discretization methods:
○ Top-down split
○ Unsupervised
● If two adjacent intervals have low χ2 values (less correlated with the class labels), merge them to form a larger interval (keeping them separate offers no additional information for classifying objects).
● Interval/class contingency tables:

Sample | K=1 | K=2 | Total
2      |  0  |  1  |  1
3      |  1  |  0  |  1
Total  |  1  |  1  |  2

Sample | K=1 | K=2 | Total
3      |  1  |  0  |  1
4      |  1  |  0  |  1
Total  |  2  |  0  |  2

● Data (Sample, F, K):
4: 8, 1
5: 9, 1
6: 11, 2
7: 23, 2
8: 37, 1
9: 39, 2
10: 45, 1
11: 46, 1
12: 59, 1
Chi-Merge Discretization Example

Sample | K=1 | K=2 | Total
2      |  0  |  1  |  1
3      |  1  |  0  |  1
Total  |  1  |  1  |  2

Expected counts: E11 = (1/2)·1 = 0.5; E12 = (1/2)·1 = 0.5; E21 = (1/2)·1 = 0.5; E22 = (1/2)·1 = 0.5

χ2 = (0 − 0.5)²/0.5 + (1 − 0.5)²/0.5 + (1 − 0.5)²/0.5 + (0 − 0.5)²/0.5 = 2

Sample | K=1 | K=2 | Total
3      |  1  |  0  |  1
4      |  1  |  0  |  1
Total  |  2  |  0  |  2

Expected counts: E11 = (1/2)·2 = 1; E12 = (0/2)·2 = 0; E21 = (1/2)·2 = 1; E22 = (0/2)·2 = 0

χ2 = (1 − 1)²/1 + 0 + (1 − 1)²/1 + 0 = 0 (terms with an expected count of 0 contribute 0)

At significance level 0.1 with df = 1, the χ2 critical value is 2.7024. Both χ2 values fall below it: the intervals are not correlated with the class labels and can be merged.
Chi-Merge Discretization Example (continued)
● After repeated merging, samples 10 (F = 45, K = 1), 11 (F = 46, K = 1), and 12 (F = 59, K = 1) end up in the final interval {42, 60}. A code sketch follows below.
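A minimal Python sketch of the ChiMerge procedure above: greedily merge the adjacent interval pair with the lowest χ2 until every pair exceeds the slide's 2.7024 threshold. The helper functions and the restriction to samples 4-12 are my own simplifications, not the course's reference implementation.

```python
import numpy as np

def chi2_stat(counts_a, counts_b):
    # Chi-square statistic for the class counts of two adjacent intervals.
    # As in the slide's example, terms with an expected count of 0 contribute 0.
    obs = np.array([counts_a, counts_b], dtype=float)
    row = obs.sum(axis=1, keepdims=True)
    col = obs.sum(axis=0, keepdims=True)
    expected = row * col / obs.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(expected > 0, (obs - expected) ** 2 / expected, 0.0)
    return float(terms.sum())

def chimerge(values, classes, threshold=2.7024):
    # Greedy bottom-up merging of adjacent intervals (a sketch).
    labels = sorted(set(classes))
    # Start with one interval per distinct value.
    intervals = []
    for v in sorted(set(values)):
        counts = [sum(1 for x, k in zip(values, classes) if x == v and k == c)
                  for c in labels]
        intervals.append(([v, v], counts))
    while len(intervals) > 1:
        chis = [chi2_stat(intervals[i][1], intervals[i + 1][1])
                for i in range(len(intervals) - 1)]
        i = int(np.argmin(chis))
        if chis[i] >= threshold:
            break
        (lo, _), ca = intervals[i]
        (_, hi), cb = intervals[i + 1]
        intervals[i:i + 2] = [([lo, hi], [a + b for a, b in zip(ca, cb)])]
    return intervals

# Samples 4-12 from the slides (F value, class K).
F = [8, 9, 11, 23, 37, 39, 45, 46, 59]
K = [1, 1, 2, 2, 1, 2, 1, 1, 1]
for bounds, counts in chimerge(F, K):
    print(bounds, counts)
```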
Concept Hierarchy Generation
● Concept hierarchy organises concepts (attribute values) hierarchically and is typically associated with
each dimension in a data warehouse.
● In data warehouses, concept hierarchies enable drill-down and roll-up operations to view data at various granularities.
● Concept hierarchy generation
[email protected]
DLZNK464L9
● Normalization – The data is scaled to fall within a smaller, specified range for more meaningful
comparison.
● Discretization divides the range of a continuous attribute into intervals.
● Chi-Merge Discretization example
[email protected]
DLZNK464L9
● Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated
with each dimension in a data warehouse.
● Concept hierarchy generation for nominal data (a combined transformation sketch follows below).
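A combined sketch of the three transformations just summarized (normalization, discretization, concept hierarchy climbing); the data, bin count, and city-to-state mapping are all illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"income": [200, 300, 400, 600, 1000],
                   "city": ["Tucson", "Phoenix", "Reno", "Tucson", "Vegas"]})

# Normalization: min-max scaling to [0, 1].
inc = df["income"]
df["income_norm"] = (inc - inc.min()) / (inc.max() - inc.min())

# Discretization: equal-width binning into 3 intervals (bin count arbitrary).
df["income_bin"] = pd.cut(df["income"], bins=3, labels=["low", "mid", "high"])

# Concept hierarchy: roll city up to state (mapping is illustrative).
city_to_state = {"Tucson": "AZ", "Phoenix": "AZ", "Reno": "NV", "Vegas": "NV"}
df["state"] = df["city"].map(city_to_state)
print(df)
```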
● Apply data pre-processing tasks and methods to prepare data for a data mining task.
● Summarize the importance of outlier removal and redundant data removal from data sets.
● Explain the methods for dimensionality reduction and numerosity reduction.
● Implement data transformation strategies, such as normalization, discretization, and concept hierarchy generation.
● Perform typical data pre-processing tasks in Python.