Week2-1

Data preprocessing is essential for ensuring quality data in mining, as real-world data is often dirty, incomplete, noisy, or inconsistent. The document discusses various types of data attributes, methods for descriptive data summarization, and statistical measures such as central tendency and dispersion. It emphasizes that quality data leads to quality mining results and outlines the importance of data cleaning and transformation in building a data warehouse.

Uploaded by

sidramughal1011

DATA MINING:

LECTURE 4
Chapter 2-Data Preprocessing

Let's prepare data for mining!


Agenda
• Why preprocess the data?

• Descriptive data summarization


DATA PRE-PROCESSING: WHY?
Why Data Preprocessing?

• Data in the real world is dirty


• incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
• e.g., occupation=“ ”
• noisy: containing errors or outliers
• e.g., Salary=“-10”
• inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
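The three kinds of dirty data above can be checked mechanically. The following is a minimal sketch, not a definitive cleaning routine: the function name `find_quality_issues`, the field names, the reference date, and the reading of "03/07/1997" as 3 July 1997 are all assumptions made for illustration.

```python
from datetime import date

def find_quality_issues(record, today):
    """Return (kind, message) pairs describing problems in one record."""
    issues = []
    # Incomplete: an attribute of interest is missing or blank.
    for field in ("occupation", "salary", "age", "birthday"):
        value = record.get(field)
        if value is None or str(value).strip() == "":
            issues.append(("incomplete", f"{field} is missing"))
    # Noisy: a value that falls outside its plausible range.
    salary = record.get("salary")
    if isinstance(salary, (int, float)) and salary < 0:
        issues.append(("noisy", f"salary={salary} is negative"))
    # Inconsistent: the stated age disagrees with the age implied by the birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if isinstance(age, int) and isinstance(birthday, date):
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        derived = today.year - birthday.year - (0 if had_birthday else 1)
        if derived != age:
            issues.append(("inconsistent", f"age={age} but birthday implies {derived}"))
    return issues

# Hypothetical record combining the slide's examples.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
issues = find_quality_issues(record, today=date(2024, 1, 1))
for kind, message in issues:
    print(kind, "-", message)
```

Each check only flags a problem; deciding whether to fill in, correct, or drop the value is a separate cleaning step.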
Why Is Data Dirty?

• Incomplete data may come from


• “Not applicable” data value when collected
• Different considerations between the time when the data was
collected and when it is analyzed.
• Human/hardware/software problems
• Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission
• Inconsistent data may come from
• Different data sources
• Functional dependency violation (e.g., modify some linked
data)
• Duplicate records also need data cleaning
Why Is Data Preprocessing Important?

• No quality data, no quality mining results!


• Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or
even misleading statistics
• Data warehouse needs consistent integration of
quality data

• Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality

• A well-accepted multidimensional view:


• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility
• Broad categories:
• Intrinsic, contextual, representational, and accessible
DESCRIPTIVE DATA SUMMARIZATION
Data attributes and
Attribute types:
■ An attribute is a data field, representing a characteristic
or feature of a data object.
■ The type of attribute can be determined by the set of
values that an attribute can have.
• Nominal Attributes: values are symbols or names of things; such attributes are often referred to as categorical.
• Occupation: teacher, dentist, farmer, etc.
• Binary Attributes: a nominal attribute with only two values, e.g. 0 or 1.
• Smoker: 0 means the person is not a smoker and 1 means the person is a smoker
• Ordinal Attributes: values with a meaningful order or
ranking.
• Customer satisfaction: 0 very dissatisfied, 1
Data attributes and
Attribute types:
• Numeric Attributes: a measurable quantity represented as an integer or real value. Numeric attributes can be interval-scaled or ratio-scaled.
• Interval-Scaled: measured on a scale of equal-size units but with no true zero point, so differences are meaningful while ratios are not.
• Temperature in Celsius or Fahrenheit
• Ratio-Scaled: a numeric attribute with an inherent zero point, so one value can be described as a multiple of another.
• Years of experience
• Discrete versus Continuous Attributes: Discrete attributes have a finite or countably infinite set of values. Continuous attributes take real values and are typically represented as floating-point values.
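A short worked example may help show why ratios are not meaningful on an interval scale but are on a ratio scale. This is only an illustrative sketch: the temperatures, the years of experience, and the helper `celsius_to_kelvin` are assumptions, and Kelvin is used only because it has a true zero point.

```python
def celsius_to_kelvin(celsius):
    """Convert to Kelvin, a temperature scale with a true zero point."""
    return celsius + 273.15

c_low, c_high = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = c_high - c_low

# The naive Celsius ratio is 2.0, but 20 °C is not "twice as hot" as 10 °C:
# the ratio changes when the same temperatures are put on a ratio scale.
naive_ratio = c_high / c_low
kelvin_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

# Years of experience has an inherent zero, so this ratio is meaningful.
years_a, years_b = 5, 10
experience_ratio = years_b / years_a

print(difference, naive_ratio, round(kelvin_ratio, 4), experience_ratio)
```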
Mining Data Descriptive Characteristics

■ To better understand the data and to have an overall picture of the data, many statistical descriptions are used:
• Measure of central tendency: measure the location of
center or middle of a data distribution.
• Dispersion of Data: How are the data spread out?
• Graphical Display of statistical Description: Visual
representation of data.
Measuring the Central Tendency
• Mean:
• Arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

• Trimmed mean: mean after chopping off extreme values.

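• Median:
• Middle value if odd number of values, or average of the middle two values otherwise
• Estimated by interpolation (for grouped data): $\text{median} \approx L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$, where $L_1$ is the lower boundary of the median interval, $n$ is the number of values, $(\sum \text{freq})_l$ is the sum of the frequencies of the intervals below the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the median interval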

• Mode
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
• Empirical formula (for moderately skewed unimodal data):

$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
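The measures on this slide can be sketched with Python's standard `statistics` module. This is a minimal illustration under stated assumptions: the helper names (`weighted_mean`, `trimmed_mean`, `grouped_median`), the sample values, and the interval table are hypothetical. `grouped_median` uses the usual interpolation: lower boundary of the median interval, plus (n/2 minus the cumulative frequency below it) divided by the median interval's frequency, times the interval width.

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)

def grouped_median(intervals):
    """Interpolated median for (lower_boundary, width, frequency) rows
    sorted by lower boundary."""
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0  # frequency of all intervals below the current one
    for lower, width, freq in intervals:
        if cumulative + freq >= n / 2:
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq

# Hypothetical values with one large extreme value.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print("mean:", mean(values))
print("median:", median(values))
print("modes:", multimode(values))  # two modes, so the data is bimodal
print("trimmed mean:", trimmed_mean(values, 0.1))
print("weighted mean:", weighted_mean([1, 2, 3], [1, 1, 2]))
print("grouped median:", grouped_median([(0, 10, 5), (10, 10, 8), (20, 10, 7)]))
```

The trimmed mean is lower than the plain mean here because dropping the smallest and largest value removes the pull of the single extreme value at the top.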


Symmetric vs. Skewed Data

■ Median, mean and mode of symmetric, positively and negatively skewed data
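As a rough rule of thumb, the mean is pulled toward the longer tail, so comparing it with the median hints at the direction of skew. The sketch below assumes that heuristic; the function name `skew_direction` and the sample lists are hypothetical, and the comparison is not a formal skewness test.

```python
from statistics import mean, median

def skew_direction(values, tolerance=1e-9):
    """Guess the skew direction by comparing the mean with the median."""
    m, med = mean(values), median(values)
    if abs(m - med) <= tolerance:
        return "symmetric"
    # Mean above the median suggests a long right tail, and vice versa.
    return "positive" if m > med else "negative"

print(skew_direction([1, 2, 3, 4, 5]))
print(skew_direction([1, 2, 3, 4, 20]))
print(skew_direction([1, 17, 18, 19, 20]))
```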
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)

• Inter-quartile range: IQR = Q3 – Q1

• Five number summary: min, Q1, M, Q3, max


• Boxplot: ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually
• Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1

• Variance and standard deviation (sample: s, population: σ)


• Variance: (algebraic, scalable computation)

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$

• Standard deviation s (or σ) is the square root of variance s² (or σ²)
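The population variance can be computed either from deviations around the mean or, in a single pass, as the mean of the squares minus the square of the mean. The sketch below shows both, assuming hypothetical function names and sample values. The sample variance with an n - 1 divisor is included as the usual convention for s², which the slide mentions but does not spell out.

```python
from math import sqrt

def population_variance(values):
    """Two-pass form: average squared deviation from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def population_variance_one_pass(values):
    """Single-pass form: mean of squares minus squared mean.
    Only running totals are kept, so it scales to streamed data, but it
    can lose precision when the mean is large relative to the spread."""
    n, total, total_sq = 0, 0.0, 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    mu = total / n
    return total_sq / n - mu * mu

def sample_variance(values):
    """Sample variance, dividing by n - 1 instead of n."""
    n = len(values)
    mean_value = sum(values) / n
    return sum((x - mean_value) ** 2 for x in values) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
sigma_sq = population_variance(data)
print(sigma_sq, population_variance_one_pass(data), sqrt(sigma_sq))
print(sample_variance(data))
```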


Boxplot Analysis

• Five-number summary of a distribution:


• Minimum, Q1, M, Q3, Maximum

• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extend to
Minimum and Maximum
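The numbers behind a boxplot can be computed without drawing anything. The sketch below is one possible approach, not the only one: it assumes the "inclusive" quartile method of `statistics.quantiles` (other quartile conventions give slightly different values), the common 1.5 x IQR fence rule for flagging outliers, and a hypothetical data set in which 95 was added as an extreme value.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]

def boxplot_stats(values, k=1.5):
    """Five-number summary plus IQR, whisker ends, and flagged outliers."""
    low, q1, med, q3, high = five_number_summary(values)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "summary": (low, q1, med, q3, high),
        "iqr": iqr,
        # When outliers are plotted individually, whiskers stop at the
        # most extreme values that are still inside the fences.
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

data = [5, 7, 9, 12, 14, 18, 21, 24, 26, 30, 95]
stats = boxplot_stats(data)
for name, value in stats.items():
    print(name, value)
```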
Properties of Normal Distribution Curve

■ The normal (distribution) curve
– From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
– From μ–2σ to μ+2σ: contains about 95% of them
– From μ–3σ to μ+3σ: contains about 99.7% of them
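These percentages can be checked numerically with `statistics.NormalDist` from the Python standard library. The helper name `coverage` is an assumption for illustration. The result does not depend on the particular μ and σ chosen.

```python
from statistics import NormalDist

def coverage(k, mu=0.0, sigma=1.0):
    """Probability mass of a normal distribution within k standard
    deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")
```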
Histogram Analysis
• Graph displays of basic statistical class
descriptions
• Frequency histograms
• Consists of a set of rectangles that reflect the
counts or frequencies of the classes present in the
given data
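A frequency histogram reduces to counting how many values fall into each class. The sketch below assumes equal-width, half-open classes [lower, upper). The function name `frequency_histogram`, the class boundaries, and the sample values are illustrative choices.

```python
def frequency_histogram(values, start, width, bins):
    """Count values in `bins` equal-width classes beginning at `start`.
    Each class is half-open: lower <= value < upper."""
    counts = [0] * bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < bins:  # values outside the range are ignored
            counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]

data = [5, 7, 9, 12, 14, 18, 21, 24, 26, 30]
histogram = frequency_histogram(data, start=0, width=10, bins=4)
for lower, upper, count in histogram:
    # One '#' per observation stands in for the height of the rectangle.
    print(f"[{lower:>2}, {upper:>2}) {'#' * count}")
```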
Quantile Plot
• Displays all of the data (allowing the user to
assess both the overall behavior and unusual
occurrences)
• Plots quantile information
• For data xi sorted in increasing order, fi indicates that approximately 100 × fi% of the data are below or equal to the value xi
Quantile plot
■ "Rankit" method F(i) = (i - 0.5) / n
Sample data: 5, 7, 9, 12, 14, 18, 21, 24, 26, 30
■ For our dataset of size n = 10,
■ F values = [(1 - 0.5)/10, (2 - 0.5)/10, ..., (10 - 0.5)/10] = [0.05, 0.15, 0.25, ..., 0.95]
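The F values for the sample data can be generated directly from F(i) = (i - 0.5) / n. In this small sketch the function name `quantile_plot_points` is an assumption. Each returned pair is one point of the quantile plot, with F(i) on the horizontal axis and the sorted value on the vertical axis.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with F(i) = (i - 0.5) / n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

data = [5, 7, 9, 12, 14, 18, 21, 24, 26, 30]
points = quantile_plot_points(data)
for f, x in points:
    print(f"{f:.2f} {x}")
```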
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate
distribution against the corresponding
quantiles of another
• Allows the user to view whether there is a
shift in going from one distribution to another
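The points of a Q-Q plot are pairs of matching quantiles from the two distributions. The sketch below assumes quartiles computed with the "inclusive" method of `statistics.quantiles`. The function name `qq_points` and both samples are hypothetical, with the second sample built as the first shifted up by 3.

```python
from statistics import quantiles

def qq_points(sample_a, sample_b, n=4):
    """Pair the n-quantiles of sample_a with those of sample_b."""
    qa = quantiles(sample_a, n=n, method="inclusive")
    qb = quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))

sample_a = [5, 7, 9, 12, 14, 18, 21, 24, 26, 30]
sample_b = [x + 3 for x in sample_a]  # same shape, shifted upward

pairs = qq_points(sample_a, sample_b)
for a, b in pairs:
    print(a, b, b - a)
```

If the two distributions were identical, every point would lie on the line y = x. Here every point sits the same distance above that line, which is how a pure shift shows up in a Q-Q plot.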
Scatter plot
• Provides a first look at bivariate data to see
clusters of points, outliers, etc
• Each pair of values is treated as a pair of
coordinates and plotted as points in the plane
Positively and Negatively Correlated Data
Not Correlated Data
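What a scatter plot shows visually can also be summarized with a number. The slides do not name a measure, so the sketch below assumes the Pearson correlation coefficient as one common choice. The function name `pearson_r` and the three small data sets are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

xs = [1, 2, 3, 4, 5]
r_positive = pearson_r(xs, [2, 4, 6, 8, 10])  # y rises with x
r_negative = pearson_r(xs, [10, 8, 6, 4, 2])  # y falls as x rises
r_none = pearson_r(xs, [2, 1, 3, 1, 2])       # no linear trend
print(r_positive, r_negative, r_none)
```

A value near +1 or -1 corresponds to points lying close to a rising or falling line, while a value near 0 means there is no linear pattern. It does not rule out a nonlinear relationship.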
Graphic Displays of Basic Statistical Descriptions

• Histogram: (shown before)


• Boxplot: (covered before)
• Quantile plot: each value xi is paired with fi, indicating that approximately 100 fi % of the data are ≤ xi
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of
coordinates and plotted as points in the plane
• Loess (local regression) curve: add a smooth curve to
a scatter plot to provide better perception of the
pattern of dependence
