
Data Warehousing and Data Mining

with R-Programming

—UNIT 1 —
Recommended Book:
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
April 22, 2025 UNIT 1 1
Introduction
 Motivation (for Data Mining)
 Data Mining: Definition & Functionalities
 Data Preprocessing
 Forms of Data Preprocessing
 Data Cleaning: Missing Values, Noisy Data
(Binning, Clustering, Regression, Combined Computer and
Human Inspection), Inconsistent Data
 Data Integration and Transformation
 Data Reduction: Data Cube Aggregation,
Dimensionality Reduction, Data Compression,
Numerosity Reduction, Clustering
 Discretization and Concept Hierarchy Generation
April 22, 2025 UNIT 1
Motivation
 In real-world applications data can be
inconsistent, incomplete, and/or noisy.
Errors can happen due to:
 Faulty data collection instruments
 Data entry problems
 Human misjudgment during data entry
 Data transmission problems
 Technology limitations
 Discrepancies in naming conventions
Results:
 Duplicated records
 Incomplete data
 Contradictions in data

April 22, 2025 UNIT 1


Why Data Mining?

 The Explosive Growth of Data: from terabytes to petabytes


 Data collection and data availability

Automated data collection tools, database systems, Web,
computerized society
 Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific
simulation, …

Society and everyone: news, digital cameras
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets

April 22, 2025 UNIT 1


What Is Data Mining?
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown, and
potentially useful) patterns or knowledge from huge amounts of data
 Data mining: a misnomer?
 The exploration and analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover meaningful patterns.
 The extraction of implicit, previously unknown, and potentially useful
information from data, or the process of discovering useful patterns in
data.
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
 Watch out: Is everything “data mining”?
 Simple search and query processing
 (Deductive) expert systems

April 22, 2025 UNIT 1


Data Mining Algorithm

 Objective: Fit Data to a Model


 Descriptive (characterize the general

properties of the data in the database)


 Predictive (perform inference on the

current data in order to make prediction)


 Preference – Technique to choose the best
model
 Search – Technique to search the data
 “Query”

April 22, 2025 UNIT 1


Data Mining Process

Define & Understanding the Problem.


Data Warehousing

Collect / Extract data

Clean Data

Data Engineering

Algorithm selection / Engineering

Run Mining Algorithm

Analyze the Results

April 22, 2025 UNIT 1


Database Processing vs. Data Mining
Processing
 Query
– Database: well defined, expressed in SQL
– Data mining: poorly defined, no precise query language
 Data
– Database: operational data
– Data mining: not operational data
 Output
– Database: precise, a subset of the database
– Data mining: fuzzy, not a subset of the database

April 22, 2025 UNIT 1


Data Warehousing and Data Mining
Statistics and Data Mining
 Data Warehousing
 Provides the enterprise with a memory
 Data Mining
 Provides the enterprise with intelligence

 Statistics
 Confirmatory
 Small samples
 In-sample performance
 Data Mining
 Exploratory
 Large samples
 Out-of-sample performance

April 22, 2025 UNIT 1


Query Examples

 Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than
$10,000 in the last month.
– Find all customers who have purchased milk
 Data Mining
– Find all credit applicants who are poor credit risks.
(classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased
with milk. (association rules)

April 22, 2025 UNIT 1


Data Mining Models and Tasks

April 22, 2025 UNIT 1


Basic Data Mining Tasks

 Classification maps data into predefined groups


or classes
 Supervised learning

 Pattern recognition

 Prediction

 Regression is used to map a data item to a real


valued prediction variable.
 Clustering groups similar data together into
clusters.
 Unsupervised learning

 Segmentation

 Partitioning

April 22, 2025 UNIT 1


Basic Data Mining Tasks (cont’d)

 Summarization maps data into subsets with


associated simple descriptions.
 Characterization

 Generalization

 Link Analysis uncovers relationships among data.


 Affinity Analysis

 Association Rules (Finds rule of the form: X=>Y Or

“ If X then Y”)
 Sequential Analysis determines sequential

patterns.
 (Artificial) Neural Networks
 Genetic algorithms
 Hypothesis Testing.
April 22, 2025 UNIT 1
Data Mining and Business Intelligence

Increasing potential to support business decisions (top of the pyramid):
 Decision Making — End User
 Data Presentation, Visualization Techniques — Business Analyst
 Data Mining, Information Discovery — Data Analyst
 Data Exploration: Statistical Summary, Querying, and Reporting
 Data Preprocessing/Integration, Data Warehouses — DBA
 Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
April 22, 2025 UNIT 1
Data Mining vs. KDD

 Knowledge Discovery in Databases


(KDD): process of finding useful
information and patterns in data.
 Data Mining: Use of algorithms to extract
the information and patterns derived by the
KDD process.

April 22, 2025 UNIT 1


Knowledge Discovery (KDD) Process

 Data mining is the core of the knowledge discovery process
 Typical flow:
Databases → Data Cleaning → Data Integration → Data Warehouse →
Selection (task-relevant data) → Data Mining → Pattern Evaluation
April 22, 2025 UNIT 1
KDD Process

Selection: Obtain data from various sources.


Preprocessing: Cleanse data.
Transformation: Convert to common format.
Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to
user in meaningful manner.

April 22, 2025 UNIT 1


KDD Process: Several Key Steps
 Learning the application domain
 relevant prior knowledge and goals of application
 Creating a target data set: data selection
 Data cleaning and preprocessing: (may take 60% of effort!)
 Data reduction and transformation
 Find useful features, dimensionality/variable reduction, invariant
representation
 Choosing functions of data mining
 summarization, classification, regression, association, clustering
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation
 visualization, transformation, removing redundant patterns, etc.
 Use of discovered knowledge

April 22, 2025 UNIT 1


KDD Process Ex: Web Log
 Selection:

Select log data (dates and locations) to use
 Preprocessing:

Remove identifying URLs

Remove error logs
 Transformation:

Sessionize (sort and group)
 Data Mining:

Identify and count patterns

Construct data structure
 Interpretation/Evaluation:

Identify and display frequently accessed sequences.
 Potential User Applications:

Cache prediction

Personalization

April 22, 2025 UNIT 1


Are All the “Discovered” Patterns Interesting?

 Data mining may generate thousands of patterns: Not all of them are
interesting

Suggested approach: Human-centered, query-based, focused
mining
 Interestingness measures

A pattern is interesting if it is easily understood by humans, valid
on new or test data with some degree of certainty, potentially
useful, novel, or validates some hypothesis that a user seeks to
confirm
 Objective vs. subjective interestingness measures

Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.

Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, actionability, etc.
April 22, 2025 UNIT 1
Data Mining Development
Data mining development draws on techniques from databases, information
retrieval, statistics, algorithms, and machine learning, including:
•Similarity Measures  •Hierarchical Clustering  •Relational Data Model  •IR Systems
•SQL  •Imprecise Queries  •Association Rule Algorithms  •Textual Data
•Data Warehousing  •Scalability Techniques  •Web Search Engines
•Bayes Theorem  •Regression Analysis  •EM Algorithm  •K-Means Clustering
•Time Series Analysis  •Algorithm Design Techniques  •Algorithm Analysis
•Data Structures  •Neural Networks  •Decision Tree Algorithms

April 22, 2025 UNIT 1


Why Not Traditional Data Analysis?
 Tremendous amount of data
 Algorithms must be highly scalable to handle terabytes
of data
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks and multi-linked
data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
April 22, 2025 UNIT 1
Data Mining Functionalities
( Kind of Patterns To Be Found)

 Multidimensional concept description:


Characterization( Generalization or summarization) and
discrimination ( Comparison)

Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet regions
 Frequent patterns, association, correlation vs. causality
 Diaper → Beer [0.5%, 75%] (correlation or causality?)
 Classification and prediction

Construct models (functions) that describe and distinguish
classes or concepts for future prediction

E.g., classify countries based on (climate), or classify
cars based on (gas mileage)

Predict some unknown or missing numerical values
April 22, 2025 UNIT 1
Major Issues in Data Mining
 Mining methodology
 Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
 Performance: efficiency, effectiveness, and scalability
 Pattern evaluation: the interestingness problem
 Incorporation of background knowledge
 Handling noise and incomplete data
 Parallel, distributed and incremental mining methods
 Integration of the discovered knowledge with existing one: knowledge fusion
 User interaction
 Data mining query languages and ad-hoc mining
 Expression and visualization of data mining results
 Interactive mining of knowledge at multiple levels of abstraction
 Applications and social impacts
 Domain-specific data mining & invisible data mining
 Protection of data security, integrity, and privacy

April 22, 2025 UNIT 1


Data Preprocessing

 Why preprocess the data?


 Descriptive data summarization
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
April 22, 2025 UNIT 1
Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values,

lacking certain attributes of interest, or


containing only aggregate data

e.g., occupation=“ ”
 noisy: containing errors or outliers

e.g., Salary=“-10”
 inconsistent: containing discrepancies in
codes or names

e.g., Age=“42” Birthday=“03/07/1997”

e.g., Was rating “1,2,3”, now rating “A, B, C”

e.g., discrepancy between duplicate records
April 22, 2025 UNIT 1
Why Is Data Dirty?
 Incomplete data may come from

“Not applicable” data value when collected

Different considerations between the time when the data
was collected and when it is analyzed.

Human/hardware/software problems
 Noisy data (incorrect values) may come from

Faulty data collection instruments

Human or computer error at data entry

Errors in data transmission
 Inconsistent data may come from

Different data sources

Functional dependency violation (e.g., modify some linked
data)
 Duplicate records also need data cleaning
April 22, 2025 UNIT 1
Why Is Data Preprocessing
Important?

 No quality data, no quality mining results!


 Quality decisions must be based on quality data

e.g., duplicate or missing data may cause incorrect or
even misleading statistics.
 Data warehouse needs consistent integration of
quality data
 Data extraction, cleaning, and transformation
comprises the majority of the work of building a
data warehouse

April 22, 2025 UNIT 1


Multi-Dimensional Measure of Data
Quality

 A well-accepted multidimensional view:


 Accuracy

 Completeness

 Consistency

 Timeliness

 Believability

 Value added

 Interpretability

 Accessibility

 Broad categories:
 Intrinsic, contextual, representational, and

accessibility

April 22, 2025 UNIT 1


Major Tasks in Data
Preprocessing
 Data cleaning

Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
 Data integration

Integration of multiple databases, data cubes, or files
 Data transformation

Normalization and aggregation
 Data reduction

Obtains reduced representation in volume but produces
the same or similar analytical results
 Data discretization

Part of data reduction but with particular importance,
especially for numerical data

Data discretization is defined as a process of converting
continuous data attribute values into a finite set of
intervals with minimal loss of information.

April 22, 2025 UNIT 1


Forms of Data Preprocessing

April 22, 2025 UNIT 1


Data preprocessing

 Why preprocess the data?


 Descriptive data summarization
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
April 22, 2025 UNIT 1
Mining Data Descriptive
Characteristics
 Motivation
 For data preprocessing to be successful it is essential to better understand the data (overall
picture of your data)
 To identify the typical properties of your data and highlight which data values should be treated
as noise or outliers.
 Two Approaches
 Measure of Central Tendency - effective measure to find out the degree to which
numerical data tend to occur at the center of the data set. (Mean, Median, Mode , midrange)
 Measures of Data Dispersion - effective measure to find out the degree to which
numerical data tend to spread in the data set. (Range, Quartiles, interquartile range (IQR),
outliers, Box plot, variance, Standard Deviation)
 Kinds of Measure
 Distributive Measure:- can be computed for a given data set by partitioning the data
into smaller subsets, computing the measure for each subset, and then merging the results in
order to arrive at the measure’s value for the original data set (e.g., count, sum)
 Algebraic Measure:- can be computed by applying an algebraic function to one or more
distributive measures (e.g., mean)
 Holistic Measure:- must be computed on the entire data set as a whole (e.g., median)

April 22, 2025 UNIT 1


Measuring the Central Tendency

 Mean (algebraic measure), sample vs. population:
   $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$        $\mu = \frac{\sum x}{N}$
 Weighted arithmetic mean:
   $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
 Trimmed mean: chopping extreme values
 Median: a holistic measure
 Middle value if odd number of values, or average of the
middle two values otherwise
 Estimated by interpolation (for grouped data):
   $median = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{median}}\right) c$
 Mode
 Value that occurs most frequently in the data
 Unimodal, bimodal, trimodal
 Empirical formula: $mean - mode \approx 3 \times (mean - median)$
April 22, 2025 UNIT 1
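These measures are one-liners in R. A small illustrative sketch on a hypothetical numeric vector (stat_mode is a helper written here for illustration; base R has no mode function for data):

# Central tendency of a hypothetical sample in R
x <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)

mean(x)                     # arithmetic mean
mean(x, trim = 0.1)         # trimmed mean: drop 10% of values at each end
w <- c(2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)   # hypothetical weights
weighted.mean(x, w)         # weighted arithmetic mean
median(x)                   # holistic measure: middle value(s)

stat_mode <- function(v) {  # most frequent value(s); may return several (multimodal)
  tab <- table(v)
  as.numeric(names(tab)[tab == max(tab)])
}
stat_mode(x)                # 52 and 70 both occur twice here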
Symmetric vs. Skewed
Data

 Median, mean, and mode of symmetric, positively skewed, and
negatively skewed data
 In a symmetric distribution the three coincide; positive skew pulls the
mean above the median, negative skew pulls it below

April 22, 2025 UNIT 1


Measuring the Dispersion of Data
 Range, Quartiles, outliers and boxplots
 Range : The range of the set is the difference between the largest & smallest values
 Percentile :The value of a variable below which a certain percent of observations fall
 Quartiles: Quartile means separating the given set of data into 4 equal parts by 3
divisions. The three separations are lower quartile, median and upper quartile. The
lower quartile is the mid data between the first number and its median and the upper
quartile is the mid data between the median and last number of a given set. The
outlier of a given set of data can be identified with the help of interquartile range. Q 1
(25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, M, Q3, max
 Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot
outlier individually
 Outlier: usually, a value higher/lower than 1.5 × IQR beyond the quartiles
 Variance and standard deviation (sample: s, population: σ)
 Variance (algebraic, scalable computation):
   $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
   $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
 Standard deviation s (or σ) is the square root of the variance s² (or σ²)

April 22, 2025 UNIT 1
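A matching R sketch for the dispersion measures, reusing the same hypothetical vector as above:

# Dispersion measures in R
x <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)

diff(range(x))                    # range = max - min
quantile(x, c(0.25, 0.50, 0.75))  # Q1, median, Q3
IQR(x)                            # interquartile range Q3 - Q1
fivenum(x)                        # five-number summary: min, Q1, M, Q3, max
var(x)                            # sample variance (divides by n - 1)
sd(x)                             # sample standard deviation = sqrt(var(x))
boxplot(x)                        # box = quartiles; points beyond 1.5 * IQR drawn as outliers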


Properties of Normal Distribution
Curve
 The normal (distribution) curve
 From μ–σ to μ+σ: contains about 68% of the

measurements (μ: mean, σ: standard deviation)


 From μ–2σ to μ+2σ: contains about 95% of it
 From μ–3σ to μ+3σ: contains about 99.7% of it

April 22, 2025 UNIT 1


Graphic Displays of Basic Statistical Descriptions

 Histogram:
 Boxplot:
 Quantile plot: each value x_i is paired with f_i,
indicating that approximately 100·f_i % of the data are ≤
x_i
 Quantile-quantile (q-q) plot: graphs the quantiles of
one univariant distribution against the corresponding
quantiles of another
 Scatter plot: each pair of values is a pair of
coordinates and plotted as points in the plane
 Loess (local regression) curve: add a smooth curve
to a scatter plot to provide better perception of the
pattern of dependence
April 22, 2025 UNIT 1
Boxplot Analysis

 Five-number summary of a distribution:


Minimum, Q1, M, Q3, Maximum
 Boxplot

Data is represented with a box

The ends of the box are at the first and third
quartiles, i.e., the height of the box is the IQR

The median is marked by a line within the box
 Whiskers: two lines outside the box extend to
Minimum and Maximum

April 22, 2025 UNIT 1


Visualization of Data Dispersion: Boxplot
Analysis

April 22, 2025 UNIT 1


OUTLIER Detection

 Quartile means separating the given set of data into 4 equal


parts by 3 divisions. The three separations are lower quartile,
median and upper quartile. The lower quartile is the mid data
between the first number and its median and the upper
quartile is the mid data between the median and last number
of a given set. The outlier of a given set of data can be
identified with the help of interquartile range.
 Formulae involved – study outliers in a set of data:
 n – the total number of elements in the set
 Median = the (n + 1)/2 -th value = Q2
 Lower quartile = the (n + 1)/4 -th value = Q1
 Upper quartile = the 3(n + 1)/4 -th value = Q3
 Interquartile range (IQR) = Q3 – Q1
 The outliers are below Q1 – 1.5·IQR and above Q3 + 1.5·IQR
April 22, 2025 UNIT 1
Example Problems – Study Outliers in a Set of
Data:

 Example 1 Calculate the outlier for the given set of data 31, 64, 69,
65, 62, 63, 62.
 Solution:
 The given set of data is 31, 64, 69, 65, 62, 63, 62.
 Organize the given set of data in ascending order. 31, 62, 62, 63, 64, 65, 69.
 The median position is (7+1)/2 = 4.
 The 4th value is the median. Thus Median = 63 = Q2.
 Lower quartile position = (7+1)/4 = 8/4 = 2.
 The element in the 2nd position is the lower quartile. Thus Q1 = 62.
 Upper quartile position = (3*(7+1))/4 = 6.
 The element in the 6th position is the upper quartile. Thus Q3 = 65.
 Interquartile range (IQR) = Q3 – Q1 = 65 – 62 = 3.
 To find the outliers, calculate Q1 – 1.5·IQR and Q3 + 1.5·IQR:
 Q1 – 1.5·IQR = 62 – 1.5*3 = 57.5
 Q3 + 1.5·IQR = 65 + 1.5*3 = 69.5
 The outliers are below 57.5 and above 69.5.
 So 31 is the outlier of the given set of data.
April 22, 2025 UNIT 1
 Example 2 – Calculate the outlier for the given set of data
50, 61, 65,64, 67, 85, 70.

 Solution: The given set of data is 50, 61, 65, 64, 67, 85, 70.
 Organize the given set of data in ascending order. 50, 61, 64, 65, 67, 70,
85.

 The median position is (7+1)/2 = 8/2 = 4. Thus the 4th element is the median.
Median = 65 = Q2.
 Lower quartile position = (7+1)/4 = 2. The 2nd element is the lower quartile.
Lower quartile = 61 = Q1.
 Upper quartile position = 3*(7+1)/4 = 6. The 6th element is the upper quartile.
Upper quartile = 70 = Q3.
 Interquartile range (IQR) = Q3 – Q1 = 70 – 61 = 9.
 To find the outliers, calculate Q1 – 1.5·IQR and Q3 + 1.5·IQR:
 Q1 – 1.5·IQR = 61 – 1.5*9 = 47.5
 Q3 + 1.5·IQR = 70 + 1.5*9 = 83.5
 The outliers are below 47.5 and above 83.5.
 So 85 is an outlier of the given set of data.
April 22, 2025 UNIT 1
 Example 3 – Calculate the outliers, using the interquartile range, for the given set of data
60, 61, 62, 55, 58, 59, 64, 65, 67, 90, 100.
 Solution: The given set of data is 60, 61, 62, 55, 58, 59, 64, 65, 67, 90, 100.
Organize the given set of data in ascending order:
55, 58, 59, 60, 61, 62, 64, 65, 67, 90, 100.
 The median position is (11+1)/2 = 6. The 6th value is the median = 62 = Q2.
 Lower quartile position = (11+1)/4 = 12/4 = 3. The element in the 3rd position is the
lower quartile = 59 = Q1.
 Upper quartile position = 3*(11+1)/4 = 9. The element in the 9th position is the
upper quartile = 67 = Q3.
 Interquartile range (IQR) = Q3 – Q1 = 67 – 59 = 8.
 To find the outliers, calculate Q1 – 1.5·IQR and Q3 + 1.5·IQR:
 Q1 – 1.5·IQR = 59 – 1.5*8 = 47
 Q3 + 1.5·IQR = 67 + 1.5*8 = 79
 The outliers are below 47 and above 79.
 So 90 and 100 are outliers of the given set of data.
April 22, 2025 UNIT 1
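The worked examples can be checked in R. A sketch that follows the slides’ (n+1)/4 positional rule for Example 1 (note that R’s quantile() uses a different interpolation rule by default, so its quartiles may differ slightly from the values above):

# Outlier detection with the IQR rule, reproducing Example 1
x <- sort(c(31, 64, 69, 65, 62, 63, 62))
n <- length(x)

q1  <- x[(n + 1) / 4]        # 2nd value -> 62
q3  <- x[3 * (n + 1) / 4]    # 6th value -> 65
iqr <- q3 - q1               # 3

lower_fence <- q1 - 1.5 * iqr            # 57.5
upper_fence <- q3 + 1.5 * iqr            # 69.5
x[x < lower_fence | x > upper_fence]     # 31 is the only outlier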
Histogram Analysis

 Graph displays of basic statistical class


descriptions

Frequency histograms

A univariate graphical method

Consists of a set of rectangles that reflect the counts
or frequencies of the classes present in the given data

April 22, 2025 UNIT 1


Quantile Plot
 Displays all of the data (allowing the user to assess
both the overall behavior and unusual occurrences)
 Plots quantile information

For data x_i sorted in increasing order, f_i
indicates that approximately 100·f_i % of the data
are below or equal to the value x_i

April 22, 2025 UNIT 1


Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution
against the corresponding quantiles of another
 Allows the user to view whether there is a shift in
going from one distribution to another

April 22, 2025 UNIT 1


Scatter plot
 Provides a first look at bivariate data to see
clusters of points, outliers, etc
 Each pair of values is treated as a pair of
coordinates and plotted as points in the plane

April 22, 2025 UNIT 1


Scatter plot

April 22, 2025 UNIT 1


Loess Curve
 Adds a smooth curve to a scatter plot in order to
provide better perception of the pattern of dependence
 Loess curve is fitted by setting two parameters: a
smoothing parameter, and the degree of the
polynomials that are fitted by the regression

April 22, 2025 UNIT 1
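All of these displays are short base-R calls. A sketch on hypothetical data:

# Basic graphic displays of numeric attributes in R
x <- rnorm(200, mean = 50, sd = 10)       # hypothetical attribute
y <- 2 * x + rnorm(200, sd = 8)           # a related attribute

hist(x)                                   # frequency histogram
boxplot(x)                                # five-number summary + outliers
plot(((1:length(x)) - 0.5) / length(x), sort(x))   # quantile plot: f_i vs. x_i
qqplot(x, y)                              # quantile-quantile plot of two distributions
plot(x, y)                                # scatter plot of bivariate data
lines(lowess(x, y), col = "red")          # loess/lowess smooth over the scatter plot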


Positively and Negatively Correlated
Data

April 22, 2025 UNIT 1


Not Correlated Data

April 22, 2025 UNIT 1


Data Preprocessing

 Why preprocess the data?


 Descriptive data summarization
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
April 22, 2025 UNIT 1
Data Cleaning
 Importance
 “Data cleaning is one of the three biggest

problems in data warehousing”—Ralph Kimball


 “Data cleaning is the number one problem in

data warehousing”—DCI survey


 Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration

April 22, 2025 UNIT 1


Missing Data

 Data is not always available


 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the
time of entry
 not register history or changes of the data
 Missing data may need to be inferred.

April 22, 2025 UNIT 1


How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing
(assuming the task is classification); not effective when the
percentage of missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same
class: smarter
 the most probable value: inference-based such as Bayesian
formula or decision tree

April 22, 2025 UNIT 1
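A minimal R sketch of these options on a hypothetical data frame (column and class names are invented for illustration):

# Handling missing values in R
df <- data.frame(income = c(25000, NA, 43000, 52000, NA, 61000),
                 class  = c("low", "low", "mid", "mid", "high", "high"))

na.omit(df)                                   # ignore tuples with missing values

# fill with a global constant or the overall attribute mean
df$income_global <- ifelse(is.na(df$income), -1, df$income)
df$income_mean   <- ifelse(is.na(df$income), mean(df$income, na.rm = TRUE), df$income)

# smarter: fill with the attribute mean of the tuple's own class
class_mean <- ave(df$income, df$class, FUN = function(v) mean(v, na.rm = TRUE))
df$income_classmean <- ifelse(is.na(df$income), class_mean, df$income)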


Noisy Data
 Noise: random error or variance in a measured
variable
 Incorrect attribute values may be due to

faulty data collection instruments

data entry problems

data transmission problems

technology limitation

inconsistency in naming convention
 Other data problems which require data cleaning

duplicate records

incomplete data

inconsistent data
April 22, 2025 UNIT 1
How to Handle Noisy Data?
 Binning

first sort data and partition into (equal-frequency)
bins

then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
 Regression

smooth by fitting the data into regression functions
 Clustering

detect and remove outliers
 Combined computer and human inspection

detect suspicious values and check by human (e.g.,
deal with possible outliers)

April 22, 2025 UNIT 1


Simple Discretization Methods:
Binning
 Equal-width (distance) partitioning
 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
 The most straightforward, but outliers may dominate
presentation
 Skewed data is not handled well
 Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing
approximately same number of samples
 Good data scaling
 Managing categorical attributes can be tricky
April 22, 2025 UNIT 1
Binning Methods for Data
Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24,
25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
April 22, 2025 UNIT 1
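The same example can be reproduced in R. A sketch (smooth_bounds is a small helper written here for illustration):

# Binning-based smoothing of the price data in R
price <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)   # already sorted
bin   <- rep(1:3, each = 4)                                # equal-frequency bins of 4 values

ave(price, bin, FUN = mean)            # smoothing by bin means (9, 22.75, 29.25; the slide rounds)

smooth_bounds <- function(v) {         # replace each value by the closer bin boundary
  lo <- min(v); hi <- max(v)
  ifelse(v - lo <= hi - v, lo, hi)
}
ave(price, bin, FUN = smooth_bounds)   # 4 4 4 15 | 21 21 25 25 | 26 26 26 34

cut(price, breaks = 3)                 # equal-width partitioning instead: 3 intervals of equal size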
Regression

Figure: noisy data points are smoothed by fitting a regression line (here y = x + 1);
the observed value Y1 at X1 is replaced by the fitted value Y1’ on the line.

April 22, 2025 UNIT 1


Cluster Analysis

April 22, 2025 UNIT 1


Data Cleaning as a Process
 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)

 Check field overloading

 Check uniqueness rule, consecutive rule and null rule

 Use commercial tools


Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections

Data auditing: by analyzing data to discover rules and
relationship to detect violators (e.g., correlation and
clustering to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified

 ETL (Extraction/Transformation/Loading) tools: allow users to

specify transformations through a graphical user interface


 Integration of the two processes
 Iterative and interactive (e.g., Potter’s Wheel – an interactive data
cleaning tool)
April 22, 2025 UNIT 1
Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary

April 22, 2025 UNIT 1


Data Integration
 Data integration:
 Combines data from multiple sources into a

coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources

 Entity identification problem:


 Identify real world entities from multiple data

sources, e.g., Bill Clinton = William Clinton


 Detecting and resolving data value conflicts
 For the same real world entity, attribute values

from different sources are different


 Possible reasons: different representations,

different scales, e.g., metric vs. British units

April 22, 2025 UNIT 1


Handling Redundancy in Data
Integration
 Redundant data often occur when integrating
multiple databases

Object identification: The same attribute or object
may have different names in different databases

Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
 Redundant attributes may be able to be detected by
correlation analysis
 Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality

April 22, 2025 UNIT 1


Correlation Analysis (Numerical
Data)

 Correlation coefficient (also called Pearson’s product-moment coefficient):
   $r_{A,B} = \frac{\sum (a - \bar{A})(b - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (ab) - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
 where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective
means of A and B, σA and σB are the respective standard
deviations of A and B, and Σ(ab) is the sum of the AB cross-products.
 If rA,B > 0, A and B are positively correlated (A’s values
increase as B’s do). The higher the value, the stronger the correlation.
 rA,B = 0: independent; rA,B < 0: negatively correlated
April 22, 2025 UNIT 1
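In R, Pearson’s r is available directly. A tiny sketch on hypothetical attribute vectors:

# Correlation analysis of two numeric attributes in R
A <- c(2, 4, 6, 8, 10)        # hypothetical attribute values
B <- c(1, 3, 7, 9, 12)

cor(A, B)                     # Pearson's product-moment coefficient (default method)
cor.test(A, B)                # the same r plus a significance test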
Correlation Analysis (Categorical
Data)

 Χ² (chi-square) test:
   $\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$
 The larger the Χ2 value, the more likely the
variables are related
 The cells that contribute the most to the Χ2 value
are those whose actual count is very different from
the expected count
 Correlation does not imply causality

April 22, 2025 UNIT 1


Chi-Square Calculation: An Example

Gender vs. Preferred Reading     Male        Female       Sum (row)
Like science fiction             250 (90)    200 (360)      450
Not like science fiction          50 (210)  1000 (840)     1050
Sum (col.)                       300        1200           1500

 Χ² (chi-square) calculation (numbers in parentheses are the expected
counts, calculated from the marginal totals of the two categories):
   $\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
 It shows that like_science_fiction and male are
correlated in the group
April 22, 2025 UNIT 1
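The same test in R; correct = FALSE turns off the continuity correction that R applies to 2×2 tables by default, so the statistic matches the 507.93 computed above:

# Chi-square test of the gender vs. preferred-reading table in R
reading <- matrix(c(250, 50,       # male column:   like SF, not like SF
                    200, 1000),    # female column: like SF, not like SF
                  nrow = 2,
                  dimnames = list(scifi  = c("like", "not_like"),
                                  gender = c("male", "female")))

chisq.test(reading, correct = FALSE)            # X-squared ~ 507.93, tiny p-value
chisq.test(reading, correct = FALSE)$expected   # expected counts: 90, 210, 360, 840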
Data Transformation

 Smoothing: remove noise from data


 Aggregation: summarization, data cube
construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small,
specified range

min-max normalization

z-score normalization

normalization by decimal scaling
 Attribute/feature construction

New attributes constructed from the given ones
April 22, 2025 UNIT 1
Data Transformation:
Normalization
 Min-max normalization to [new_minA, new_maxA]:
   $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
 Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
 Z-score normalization (μA: mean, σA: standard deviation of A):
   $v' = \frac{v - \mu_A}{\sigma_A}$
 Ex. Let μ = 54,000 and σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
 Normalization by decimal scaling:
   $v' = \frac{v}{10^j}$, where j is the smallest integer such that max(|v'|) < 1
April 22, 2025 UNIT 1
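The three normalization methods, sketched in R on the income example:

# Normalization in R
income <- c(12000, 30000, 54000, 73600, 98000)          # hypothetical attribute values

# min-max normalization to [0, 1]
(income - min(income)) / (max(income) - min(income))    # 73,600 maps to ~0.716

# z-score normalization; scale() uses the sample mean and sd of the vector
(73600 - 54000) / 16000                                 # 1.225, with the slide's mu and sigma
scale(income)

# decimal scaling: divide by the smallest power of 10 that pushes every |v'| below 1
j <- ceiling(log10(max(abs(income)) + 1))
income / 10^j                                           # e.g., 98,000 -> 0.98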
Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
April 22, 2025 UNIT 1
Data Reduction Strategies
 Why data reduction?
 A database/data warehouse may store terabytes of data

 Complex data analysis/mining may take a very long time

to run on the complete data set


 Data reduction
 Obtain a reduced representation of the data set that is
much smaller in volume yet produces the same (or
almost the same) analytical results
 Data reduction strategies
 Data cube aggregation:

 Attribute Subset Selection

 Dimensionality reduction — e.g., remove unimportant

attributes
 Data Compression

 Numerosity reduction — e.g., fit data into models

 Discretization and concept hierarchy generation

April 22, 2025 UNIT 1


Data Cube Aggregation
 The lowest level of a data cube (base cuboid)

The aggregated data for an individual entity of
interest

E.g., a customer in a phone calling data warehouse
 Multiple levels of aggregation in data cubes

Further reduce the size of data to deal with
 Reference appropriate levels

Use the smallest representation which is enough to
solve the task
 Queries regarding aggregated information should be
answered using data cube, when possible
April 22, 2025 UNIT 1
Data Cube Aggregation

Figure: a sales data cube with dimensions Product (TV, PC, VCR), Date
(1Qtr–4Qtr), and Country (U.S.A, Canada, Mexico), aggregated with sum
along each dimension.

April 22, 2025 UNIT 1


Attribute Subset Selection
 Feature selection (i.e., attribute subset selection):
 Select a minimum set of features such that the

probability distribution of different classes given the


values for those features is as close as possible to the
original distribution given the values of all features
 reduces the number of patterns, making them easier to
understand
 Heuristic methods (due to exponential # of choices):
 Step-wise forward selection

 Step-wise backward elimination

 Combining forward selection and backward

elimination
 Decision-tree induction

April 22, 2025 UNIT 1


Example of Decision Tree
Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}

A decision tree induced on the data tests A4 at the root and A1 and A6
below it, with leaves labeled Class 1 / Class 2; only the attributes that
appear in the tree are retained.

=> Reduced attribute set: {A1, A4, A6}

April 22, 2025 UNIT 1


Heuristic Feature Selection
Methods
 There are 2^d possible attribute subsets of d features
 Several heuristic feature selection methods:
 Best single features under the feature

independence assumption: choose by significance


tests
 Best step-wise feature selection:


The best single-feature is picked first

Then next best feature condition to the first, ...
 Step-wise feature elimination:


Repeatedly eliminate the worst feature
 Best combined feature selection and elimination

 Optimal branch and bound:


Use feature elimination and backtracking
April 22, 2025 UNIT 1
Data Compression
 String compression
 There are extensive theories and well-tuned

algorithms
 Typically lossless

 But only limited manipulation is possible without

expansion
 Audio/video compression

 Typically lossy compression, with progressive

refinement
 Sometimes small fragments of signal can be

reconstructed without reconstructing the whole


 Time sequence is not audio

 Typically short and vary slowly with time


April 22, 2025 UNIT 1
Data Compression

Figure: lossless compression maps the original data to compressed data and
back exactly; lossy compression recovers only an approximation of the
original data.

April 22, 2025 UNIT 1


Dimensionality Reduction:
Wavelet Transformation
Haar2 Daubechie4
 Discrete wavelet transform (DWT): linear signal
processing, multi-resolution analysis
 Compressed approximation: store only a small fraction of
the strongest of the wavelet coefficients
 Similar to discrete Fourier transform (DFT), but better
lossy compression, localized in space
 Method:
 Length, L, must be an integer power of 2 (padding with 0’s, when
necessary)
 Each transform has 2 functions: smoothing, difference
 Applies to pairs of data, resulting in two sets of data of length L/2
 Applies two functions recursively, until reaches the desired length

April 22, 2025 UNIT 1


DWT for Image Compression
Figure: the image is passed through a low-pass / high-pass filter pair; at each
level the low-pass output is decomposed again, giving a multi-resolution
representation.

April 22, 2025 UNIT 1


Dimensionality Reduction:
Principal Component Analysis
(PCA)
 Given N data vectors from n-dimensions, find k ≤ n orthogonal
vectors (principal components) that can be best used to represent
data
 Steps

Normalize input data: Each attribute falls within the same range

Compute k orthonormal (unit) vectors, i.e., principal components

Each input data (vector) is a linear combination of the k
principal component vectors

The principal components are sorted in order of decreasing
“significance” or strength

Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with
low variance (i.e., using the strongest principal components, it
is possible to reconstruct a good approximation of the original
data)
 Works for numeric data only
 Used when the number of dimensions is large
April 22, 2025 UNIT 1
Principal Component
Analysis
Figure: data plotted in the original axes X1 and X2; the principal components
Y1 and Y2 form new orthogonal axes aligned with the directions of greatest
variance.

April 22, 2025 UNIT 1
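A short R sketch of PCA-based reduction, using the built-in USArrests data set:

# Principal Component Analysis in R
pca <- prcomp(USArrests, scale. = TRUE)   # normalize the attributes, then rotate

summary(pca)              # proportion of variance explained by each component
pca$rotation              # the principal component (loading) vectors

reduced <- pca$x[, 1:2]   # keep only the two strongest components as the reduced data
head(reduced)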


Numerosity Reduction
 Reduce data volume by choosing alternative,
smaller forms of data representation
 Parametric methods

Assume the data fits some model, estimate
model parameters, store only the parameters,
and discard the data (except possible outliers)

Example: Log-linear models—obtain value at a
point in m-D space as the product on
appropriate marginal subspaces
 Non-parametric methods

Do not assume models

Major families: histograms, clustering, sampling
April 22, 2025 UNIT 1
Regression and Log-Linear
Models

 Linear regression: Data are modeled to fit a straight


line

Often uses the least-square method to fit the line
 Multiple regression: allows a response variable Y to
be modeled as a linear function of multidimensional
feature vector
 Log-linear model: approximates discrete
multidimensional probability distributions
April 22, 2025 UNIT 1
Regression Analysis and Log-Linear
Models
 Linear regression: Y = w X + b
 Two regression coefficients, w and b, specify the

line and are to be estimated by using the data


at hand
 Using the least squares criterion to the known

values of Y1, Y2, …, X1, X2, ….


 Multiple regression: Y = b0 + b1 X1 + b2 X2.
 Many nonlinear functions can be transformed

into the above


 Log-linear models:
 The multi-way table of joint probabilities is
approximated by a product of lower-order tables
 Probability: $p(a, b, c, d) \approx \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$
April 22, 2025 UNIT 1
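A sketch in R: the fitted coefficients are the compact, parametric representation that stands in for the raw data (variable names and values here are invented for illustration):

# Regression as parametric numerosity reduction in R
set.seed(1)
x1 <- runif(100); x2 <- runif(100)
y  <- 3 + 2 * x1 - 1.5 * x2 + rnorm(100, sd = 0.2)

fit1 <- lm(y ~ x1)        # linear regression  Y = w X + b  (least squares)
fit2 <- lm(y ~ x1 + x2)   # multiple regression  Y = b0 + b1 X1 + b2 X2
coef(fit2)                # the few stored parameters, instead of the 100 tuples

# log-linear models for multi-way count data are available via loglin() or MASS::loglm()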
Data Reduction Method (2):
Histograms
 Divide data into buckets and store the
average (or sum) for each bucket
 Partitioning rules:
 Equal-width: equal bucket range
 Equal-frequency (or equal-depth): equal number of values per bucket
 V-optimal: the histogram with the least variance (weighted
sum of the original values that each bucket represents)
 MaxDiff: set bucket boundaries between the pairs of adjacent
values with the β–1 largest differences
(Figure: an equal-width histogram of prices from 10,000 to 100,000)
April 22, 2025 UNIT 1
Data Reduction Method (3):
Clustering

 Partition data set into clusters based on similarity, and


store cluster representation (e.g., centroid and diameter)
only
 Can be very effective if data is clustered but not if data is
“smeared”
 Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
 There are many choices of clustering definitions and
clustering algorithms

April 22, 2025 UNIT 1


Data Reduction Method (4):
Sampling
 Sampling: obtaining a small sample s to represent the
whole data set N
 Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
 Choose a representative subset of the data
 Simple random sampling may have very poor

performance in the presence of skew


 Develop adaptive sampling methods
 Stratified sampling:


Approximate the percentage of each class (or
subpopulation of interest) in the overall database

Used in conjunction with skewed data
 Note: Sampling may not reduce database I/Os (page
at a time)
April 22, 2025 UNIT 1
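A sketch of these sampling schemes in R, on a hypothetical data frame with a skewed class attribute:

# Simple random and stratified sampling in R
df <- data.frame(id = 1:1000,
                 class = sample(c("A", "B", "C"), 1000, replace = TRUE,
                                prob = c(0.7, 0.2, 0.1)))

srswor <- df[sample(nrow(df), 100), ]                  # SRS without replacement
srswr  <- df[sample(nrow(df), 100, replace = TRUE), ]  # SRS with replacement

# stratified sample: ~10% of every class, so rare classes stay represented
strata <- split(df, df$class)
strat  <- do.call(rbind, lapply(strata, function(s) s[sample(nrow(s), ceiling(0.1 * nrow(s))), ]))
table(strat$class)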
Sampling: with or without
Replacement

Figure: from the raw data, SRSWOR (simple random sampling without
replacement) and SRSWR (simple random sampling with replacement) each
draw a small sample.
April 22, 2025 UNIT 1
Sampling: Cluster or Stratified Sampling

Raw Data Cluster/Stratified Sample

April 22, 2025 UNIT 1


Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
April 22, 2025 UNIT 1
Discretization
 Three types of attributes:

Nominal — values from an unordered set, e.g., color, profession

Ordinal — values from an ordered set, e.g., military or academic
rank

 Continuous — numeric values, e.g., integer or real numbers
 Discretization:

Divide the range of a continuous attribute into intervals

Some classification algorithms only accept categorical attributes.

Reduce data size by discretization

Prepare for further analysis

April 22, 2025 UNIT 1
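Discretizing a numeric attribute is a one-liner with cut() in R. A sketch on hypothetical ages:

# Discretization in R
age <- c(13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 40, 45, 46, 52, 70)

cut(age, breaks = 3)                          # equal-width intervals
cut(age, breaks = quantile(age, seq(0, 1, 0.25)),
    include.lowest = TRUE)                    # equal-frequency (quantile-based) intervals
cut(age, breaks = c(0, 20, 50, Inf),
    labels = c("young", "middle_aged", "senior"))   # one level of a concept hierarchy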


Discretization and Concept
Hierarchy
 Discretization

Reduce the number of values for a given continuous
attribute by dividing the range of the attribute into intervals

Interval labels can then be used to replace actual data
values

Supervised vs. unsupervised

Split (top-down) vs. merge (bottom-up)

Discretization can be performed recursively on an attribute
 Concept hierarchy formation

Recursively reduce the data by collecting and replacing low
level concepts (such as numeric values for age) by higher
level concepts (such as young, middle-aged, or senior)
April 22, 2025 UNIT 1
Discretization and Concept Hierarchy
Generation for Numeric Data
 Typical methods: All the methods can be applied recursively
 Binning (covered above)

Top-down split, unsupervised,
 Histogram analysis (covered above)

Top-down split, unsupervised

Clustering analysis (covered above)

Either top-down split or bottom-up merge, unsupervised
 Entropy-based discretization: supervised, top-down split

Interval merging by χ² analysis: unsupervised, bottom-up merge
 Segmentation by natural partitioning: top-down split,
unsupervised

April 22, 2025 UNIT 1


Entropy-Based Discretization
 Given a set of samples S, if S is partitioned into two intervals S1
and S2 using boundary T, the information after partitioning is
   $I(S, T) = \frac{|S_1|}{|S|}\,Entropy(S_1) + \frac{|S_2|}{|S|}\,Entropy(S_2)$
 Entropy is calculated based on the class distribution of the samples in
the set. Given m classes, the entropy of S1 is
   $Entropy(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
 where pi is the probability of class i in S1


 The boundary that minimizes the entropy function over all
possible boundaries is selected as a binary discretization
 The process is recursively applied to partitions obtained until
some stopping criterion is met
 Such a boundary may reduce data size and improve classification
accuracy
April 22, 2025 UNIT 1
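A minimal R sketch of a single entropy-based split on hypothetical attribute/class vectors (entropy and info_after_split are helper functions written here, not library calls):

# One step of entropy-based discretization in R
entropy <- function(cls) {
  p <- table(cls) / length(cls)
  -sum(p * log2(p))
}

info_after_split <- function(t, x, cls) {   # I(S, T) for candidate boundary t
  left <- cls[x <= t]; right <- cls[x > t]
  length(left) / length(cls) * entropy(left) +
    length(right) / length(cls) * entropy(right)
}

x   <- c(1, 2, 3, 4, 5, 6, 7, 8)                  # numeric attribute
cls <- c("a", "a", "a", "a", "b", "b", "b", "b")  # class labels

cands <- head(sort(unique(x)), -1)                # candidate boundaries
cands[which.min(sapply(cands, info_after_split, x = x, cls = cls))]   # best boundary: 4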
Interval Merge by χ² Analysis
 Merging-based (bottom-up) vs. splitting-based methods
 Merge: Find the best neighboring intervals and merge them to
form larger intervals recursively
 ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]

Initially, each distinct value of a numerical attr. A is considered
to be one interval

χ² tests are performed for every pair of adjacent intervals

Adjacent intervals with the least χ² values are merged together,
since low χ² values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined
stopping criterion is met (such as significance level, max-
interval, max inconsistency, etc.)

April 22, 2025 UNIT 1


Segmentation by Natural
Partitioning

 A simple 3-4-5 rule can be used to segment numeric


data into relatively uniform, “natural” intervals.

If an interval covers 3, 6, 7 or 9 distinct values at the
most significant digit, partition the range into 3 equi-
width intervals

If it covers 2, 4, or 8 distinct values at the most
significant digit, partition the range into 4 intervals

If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 intervals

April 22, 2025 UNIT 1


Example of 3-4-5 Rule
Step 1: the attribute “profit” has Min = –$351, Low (5th percentile) = –$159,
High (95th percentile) = $1,838, Max = $4,700
Step 2: msd = 1,000, so round Low down to –$1,000 and High up to $2,000,
giving the working range (–$1,000 … $2,000)
Step 3: this range covers 3 distinct values at the most significant digit, so
partition it into 3 equi-width intervals:
(–$1,000 … 0], (0 … $1,000], ($1,000 … $2,000]
Step 4: adjust the boundary intervals to the actual Min and Max: the first
interval shrinks to (–$400 … 0] and a new interval ($2,000 … $5,000] is
added to cover Max
Step 5: recursively apply the rule inside each interval, e.g.
(–$400 … 0] → (–$400 … –$300], (–$300 … –$200], (–$200 … –$100], (–$100 … 0]
(0 … $1,000] → ($0 … $200], ($200 … $400], ($400 … $600], ($600 … $800], ($800 … $1,000]
($1,000 … $2,000] → ($1,000 … $1,200], …, ($1,800 … $2,000]
($2,000 … $5,000] → ($2,000 … $3,000], ($3,000 … $4,000], ($4,000 … $5,000]
April 22, 2025 UNIT 1
Concept Hierarchy Generation for
Categorical Data

 Specification of a partial/total ordering of attributes


explicitly at the schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by
explicit data grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute
levels) by the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state,
country}
April 22, 2025 UNIT 1
Automatic Concept Hierarchy
Generation
 Some hierarchies can be automatically generated
based on the analysis of the number of distinct
values per attribute in the data set
 The attribute with the most distinct values is

placed at the lowest level of the hierarchy


 Exceptions, e.g., weekday, month, quarter, year

 country: 15 distinct values
 province_or_state: 365 distinct values
 city: 3,567 distinct values
 street: 674,339 distinct values


April 22, 2025 UNIT 1
