
DATA MINING: LECTURE 4
Chapter 2: Data Preprocessing

Let's prepare data for mining!


Agenda

• Data Preprocessing
• Data Cleaning
• Data Integration

Major Tasks in Data Preprocessing

• Data cleaning
  • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  • Integration of multiple databases, data cubes, or files
• Data reduction
  • Obtains a reduced representation that is smaller in volume but produces the same or similar analytical results
• Data transformation
  • Normalization and aggregation

DATA PREPROCESSING

Data Cleaning

• Importance
  • "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
  • "Data cleaning is the number one problem in data warehousing" (DCI survey)
• Data cleaning tasks
  • Fill in missing values
  • Identify outliers and smooth out noisy data
  • Correct inconsistent data
  • Resolve redundancy caused by data integration

Missing Data

• Data is not always available
  • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  • equipment malfunction
  • inconsistency with other recorded data, leading to deletion
  • data not entered due to misunderstanding
  • certain data not being considered important at the time of entry
  • failure to register the history or changes of the data
• Missing data may need to be inferred.
How to Handle Missing Data?

• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Fill in automatically with
  • a global constant: e.g., "unknown", a new class?!
  • the attribute mean
  • the attribute mean for all samples belonging to the same class: smarter
  • the most probable value: inference-based, such as regression, a Bayesian formula, or a decision tree (see the sketch after this list)
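
A minimal pandas sketch of the automatic fill-in strategies above; the DataFrame, the income attribute, and the class labels are made up for illustration.

```python
# Illustrative fill-in strategies for missing values (made-up data).
import pandas as pd

df = pd.DataFrame({
    "income": [50_000, None, 42_000, None, 61_000, 58_000],
    "class":  ["low", "low", "low", "high", "high", "high"],
})

# Attribute mean: replace missing values with the overall column mean.
filled_mean = df["income"].fillna(df["income"].mean())

# Class-conditional mean: mean of the samples belonging to the same class.
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)

# Global constant: replace every missing value with a placeholder category.
filled_const = df["income"].fillna("unknown")
```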
Noisy Data

• Noise: random error or variance in a measured variable
• Incorrect attribute values may occur due to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitations
  • inconsistency in naming conventions
• Other data problems which require data cleaning
  • duplicate records
  • incomplete data
  • inconsistent data
How to Handle Noisy Data?

• Binning
  • first sort the data and partition it into (equal-frequency) bins
  • then smooth by bin means, bin medians, bin boundaries, etc.
• Regression
  • smooth by fitting the data to regression functions
• Outlier analysis
  • Clustering may be used to detect and remove outliers.
• Combined computer and human inspection
  • detect suspicious values and check by human (e.g., deal with possible outliers)
Simple Discretization Methods: Binning

• Equal-width (distance) partitioning
  • Divides the range into N intervals of equal size: uniform grid
  • If A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B − A)/N
  • The most straightforward, but outliers may dominate the presentation
  • Skewed data is not handled well

• Equal-depth (frequency) partitioning
  • Divides the range into N intervals, each containing approximately the same number of samples
  • Good data scaling
  • Managing categorical attributes can be tricky
Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
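
A minimal sketch of the equal-frequency binning and smoothing shown above; the prices and bin count come from the slide, and rounding the bin means to whole dollars is an assumption made to match the slide's output.

```python
# Equal-frequency binning and two smoothing strategies (prices from the slide).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value is replaced by its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of min and max.
by_bounds = [
    [b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
    for b in bins
]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```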
Binning (Equal Width)

■ Data: [5, 7, 10, 15, 18, 21, 22, 25, 27, 30]
■ We want to divide it into 3 bins (n = 3).
■ Step 1: Find the min and max
  a = 5 (minimum value), b = 30 (maximum value)
■ Step 2: Calculate the bin width
  Bin width = (b − a) / n = (30 − 5) / 3 = 25/3 ≈ 8.33
■ Step 3: Assign values to bins
  Bin 1 (5 – 13.33): 5, 7, 10
  Bin 2 (13.33 – 21.66): 15, 18, 21
  Bin 3 (21.66 – 30): 22, 25, 27, 30
Now these bins can be smoothed by mean, median or boundary (a small code sketch follows).
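
A minimal sketch of the same equal-width computation; the data and the number of bins come from the slide, and the rule that the maximum value falls into the last bin is an assumption.

```python
# Equal-width binning of the slide's data into n = 3 bins.
data = [5, 7, 10, 15, 18, 21, 22, 25, 27, 30]
n = 3

a, b = min(data), max(data)          # a = 5, b = 30
width = (b - a) / n                  # 25 / 3 ≈ 8.33

# Assign each value to a bin index 0..n-1 (the max value is clamped into the last bin).
bins = [[] for _ in range(n)]
for v in data:
    idx = min(int((v - a) // width), n - 1)
    bins[idx].append(v)

print(bins)  # [[5, 7, 10], [15, 18, 21], [22, 25, 27, 30]]
```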
Regression

[Figure: scatter of two attributes with a fitted regression line y = x + 1; a known value x1 is used to predict y1']

• Data is fitted to a function
• Linear regression finds the line that best fits two attributes
• One attribute is used to predict the other
• Multiple linear regression is used where more than two attributes are involved
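
A minimal sketch of smoothing by regression: fit y as a linear function of x and replace each noisy y with the fitted value. The data is made up to roughly follow the slide's y = x + 1 line, and NumPy's least-squares polyfit stands in for whatever regression routine is actually used.

```python
# Smooth one attribute by regressing it on another (made-up data).
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.3, 6.9])   # roughly y = x + 1 plus noise

slope, intercept = np.polyfit(x, y, deg=1)      # least-squares fitted line
y_smooth = slope * x + intercept                # smoothed (predicted) values
```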
Cluster Analysis

• Data is clustered
• Values that fall outside the clusters are considered noise
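
One way to realize this idea is density-based clustering; the sketch below uses scikit-learn's DBSCAN, which labels points that belong to no cluster as -1. The data, eps, and min_samples values are made up for illustration.

```python
# Clustering-based outlier detection: points outside any cluster are noise.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # cluster 1
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],   # cluster 2
              [4.5, 0.2]])                          # isolated point

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
outliers = X[labels == -1]
print(outliers)   # [[4.5 0.2]]
```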
Data Cleaning and Data Reduction

• Binning techniques reduce the number of distinct values per attribute
  • Binning may reduce prices for products into inexpensive, moderate and expensive
  • Useful for decision tree induction
  • May be bad in some circumstances, name one?
Data Cleaning as a Process

• Data discrepancy detection is the first task.
• Types of discrepancies
  • Errors in data collection
  • Deliberate errors (data providers concealing data)
  • Data decay (outdated data, such as changed addresses)
  • Inconsistent data representation and use of codes
  • Data integration errors
• Use metadata (e.g., domain, range, dependency, distribution)
  • Metadata is data about data
• Identify and remove outliers
• Check field overloading
  • Field overloading is squeezing extra data into parts of a field not originally intended for that purpose (e.g., if only 31 of a field's 32 bits are used, extra information is packed into the spare bit)
• Check the uniqueness rule, consecutive rule and null rule
Data Cleaning as a Process

• Using commercial tools for data cleaning
  • Data scrubbing:
    • use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections
    • Techniques used: parsing and fuzzy matching
  • Data auditing:
    • analyze the data to discover rules and relationships and to detect violators
    • Techniques used: correlation, clustering and descriptive data summaries to find outliers
• Data transformation: migration and integration
  • Data migration tools: allow transformations to be specified
    • e.g., computing age from birthdate
  • ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
    • Only specific transforms are allowed, which sometimes requires custom scripts
Data Cleaning as a Process

• The two-step process involves discrepancy detection and transformation
  • The process is error-prone and time-consuming
  • It is iterative, and some problems may only be removed after several iterations
    • An incorrect year entry such as 20004 may only be fixed after correcting all date entries
  • Recent techniques emphasize interactivity
    • e.g., Potter's Wheel – http://control.cs.berkeley.edu/abc
  • Declarative languages have been developed for specifying data transformations as extensions to SQL
  • Metadata must be updated to speed up future cleaning
DATA INTEGRATION

Data Integration

• Data integration:
  • Combines data from multiple sources (data cubes, databases and flat files) into a coherent store, such as a data warehouse
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  • Integrate data from different sources
• Entity identification problem:
  • Identify real-world entities from multiple data sources
• Detecting and resolving data value conflicts
  • For the same real-world entity, attribute values from different sources may differ
  • Possible reasons: different representations, different scales, e.g., metric vs. British units
  • Metadata may be used to resolve the problem
Handling Redundancy in Data Integration

• Redundant data often occur when integrating multiple databases
  • Object identification: The same attribute or object may have different names in different databases
  • Derivable data: One attribute may be a "derived" attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis
• Careful integration of the data from multiple sources may help
  • reduce/avoid redundancies and inconsistencies
  • improve mining speed and quality
Correlation Analysis (Numerical Data)

■ Correlation coefficient (also called Pearson's product moment coefficient)

$$ r_{A,B} = \frac{\sum (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (a_i b_i) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B} $$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum (a_i b_i)$ is the sum of the AB cross-product.

■ If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
■ r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated
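
A minimal sketch of computing r_{A,B} directly from the first form of the formula above; the two attributes are made-up data, and NumPy's built-in corrcoef is shown only as a cross-check.

```python
# Pearson correlation coefficient from the definition (made-up attributes).
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.5, 3.1, 4.4, 6.2, 7.4])

n = len(A)
r = ((A - A.mean()) * (B - B.mean())).sum() / (
    (n - 1) * A.std(ddof=1) * B.std(ddof=1)   # sample standard deviations
)

print(round(r, 4))                 # close to +1: strong positive correlation
print(np.corrcoef(A, B)[0, 1])     # same value from NumPy's built-in
```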
Correlation Analysis (Categorical Data)

■ χ² (chi-square) test, or Pearson's chi-square statistic

$$ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} $$

■ The larger the χ² value, the more likely the variables are related
■ The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
■ Correlation does not imply causality
  – The number of hospitals and the number of car thefts in a city are correlated
  – Both are causally linked to a third variable: population
Chi-Square Calculation: An Example

                             Male       Female      Sum (row)
  Like science fiction       250 (90)   200 (360)     450
  Not like science fiction    50 (210)  1000 (840)   1050
  Sum (col.)                  300       1200         1500

■ χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

$$ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93 $$

■ It shows that gender and preferred reading are correlated in the group
■ The degrees of freedom are (r−1)×(c−1) = (2−1)×(2−1) = 1, and the corresponding critical value from the χ² distribution (at the 0.001 significance level) is 10.828; since 507.93 far exceeds it, the two attributes are strongly correlated
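
A minimal sketch reproducing this calculation with scipy.stats; the observed counts come from the table above, and Yates' continuity correction is disabled so the result matches the hand calculation.

```python
# Chi-square test on the science-fiction contingency table from the slide.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250,  200],    # like science fiction
                     [ 50, 1000]])   # do not like science fiction

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ≈ 507.93
print(dof)       # 1
print(expected)  # [[ 90. 360.] [210. 840.]]
```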
