Week2-1

Data preprocessing is essential for ensuring quality data in mining, as real-world data is often dirty, incomplete, noisy, or inconsistent. The document discusses various types of data attributes, methods for descriptive data summarization, and statistical measures such as central tendency and dispersion. It emphasizes that quality data leads to quality mining results and outlines the importance of data cleaning and transformation in building a data warehouse.

Uploaded by

sidramughal1011

DATA MINING:

LECTURE 4
Chapter 2-Data Preprocessing

Let's prepare data for mining!

Agenda
• Why preprocess the data?

• Descriptive data summarization

DATA PRE-PROCESSING: WHY?
Why Data Preprocessing?

• Data in the real world is dirty

• incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
• e.g., occupation=“ ”
• noisy: containing errors or outliers
• e.g., Salary=“-10”
• inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
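The kinds of dirty data listed above can be flagged with a few simple checks. The following is a minimal sketch, not a complete cleaning routine; the records, the field names (`id`, `occupation`, `salary`) and the rule that a negative salary counts as noise are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "occupation": "teacher", "salary": 52000},
    {"id": 2, "occupation": "", "salary": 48000},        # incomplete: missing value
    {"id": 3, "occupation": "dentist", "salary": -10},   # noisy: impossible value
    {"id": 4, "occupation": "farmer", "salary": 31000},
    {"id": 4, "occupation": "farmer", "salary": 35000},  # duplicate id, conflicting salary
]

def find_dirty(rows):
    """Return record ids grouped by the kind of problem detected."""
    id_counts = Counter(r["id"] for r in rows)
    return {
        # Incomplete: the attribute value is empty or whitespace only.
        "incomplete": [r["id"] for r in rows if not str(r["occupation"]).strip()],
        # Noisy: assumed rule that a salary can never be negative.
        "noisy": [r["id"] for r in rows if r["salary"] < 0],
        # Duplicates: the same id appears more than once.
        "duplicate_ids": sorted(i for i, c in id_counts.items() if c > 1),
    }

report = find_dirty(records)
print(report)
```

Real data would need domain-specific rules for each attribute; the checks here only mirror the three examples on the slide.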
Why Is Data Dirty?

• Incomplete data may come from

• “Not applicable” data value when collected
• Different considerations between the time when the data was
collected and when it is analyzed.
• Human/hardware/software problems
• Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission
• Inconsistent data may come from
• Different data sources
• Functional dependency violation (e.g., modify some linked
data)
• Duplicate records also need data cleaning
Why Is Data Preprocessing Important?

• No quality data, no quality mining results!

• Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or
even misleading statistics
• Data warehouse needs consistent integration of
quality data

• Data extraction, cleaning, and transformation comprise
the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality

• A well-accepted multidimensional view:

• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility
• Broad categories:
• Intrinsic, contextual, representational, and accessibility
DESCRIPTIVE DATA SUMMARIZATION
Data attributes and
Attribute types:
■ An attribute is a data field, representing a characteristic
or feature of a data object.
■ The type of attribute can be determined by the set of
values that an attribute can have.
• Nominal Attributes: Values are symbols or names of things;
such attributes are often referred to as categorical.
• Occupation: teacher, dentist, farmer, etc.
• Binary Attributes: A nominal attribute with only two
values, e.g., 0 or 1.
• Smoker: 0 means the person is not a smoker and 1
means the person is a smoker
• Ordinal Attributes: values with a meaningful order or
ranking.
• Customer satisfaction: 0 very dissatisfied, 1
Data attributes and
Attribute types:
• Numeric Attributes: a measurable quantity represented as an
integer or real value. Numeric attributes can be interval-scaled
or ratio-scaled.
• Interval-Scaled: measured on a scale of equal-size units
with no true zero point, so differences are meaningful but
ratios are not.
• Temperature in Celsius or Fahrenheit
• Ratio-Scaled: a numeric attribute with an inherent
zero point, so ratios between values are meaningful.
• Years of experience
• Discrete versus Continuous Attributes: Discrete
attributes have a finite or countably infinite set of values.
Continuous attributes take real values and are typically
represented as floating-point values.
Mining Data Descriptive Characteristics

■ To better understand the data and to get an overall
picture of it, several statistical descriptions are used:
• Measures of central tendency: measure the location of the
center or middle of a data distribution.
• Dispersion of data: how are the data spread out?
• Graphical display of statistical descriptions: visual
representation of data.
Measuring the Central Tendency
• Mean:
• Arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

• Trimmed mean: mean after chopping off extreme values.
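The three means above can be sketched in a few lines of Python. This is an illustrative sketch only: the salary values are made-up sample data, and the convention of trimming the same fraction of values from each end is an assumption (other trimming rules exist).

```python
def arithmetic_mean(xs):
    """Sum of the values divided by their count."""
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(xs) != len(ws):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

def trimmed_mean(xs, fraction):
    """Mean after dropping int(fraction * n) values from each end."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    k = int(len(xs) * fraction)          # number of values cut per end
    kept = sorted(xs)[k:len(xs) - k]
    return sum(kept) / len(kept)

# Illustrative salaries (in thousands); 110 is an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(arithmetic_mean(salaries))         # mean of all twelve values
print(trimmed_mean(salaries, 0.1))       # drops the smallest and largest value
print(weighted_mean([1, 2, 3], [1, 1, 2]))
```

Trimming one value from each end pulls the result down from the plain mean, which shows how a single extreme value can inflate the arithmetic mean.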

• Median:
• Middle value if odd number of values, or average of the middle two
values otherwise

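• Estimated by interpolation (for grouped data):
$\text{median} = L_1 + \left(\frac{n/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
where $L_1$ is the lower boundary of the median interval, $n$ is the number of values, $\left(\sum \text{freq}\right)_l$ is the sum of the frequencies of all intervals below the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the median interval.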
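The interpolation estimate for grouped data can be sketched as follows. The function name `grouped_median` and the interval counts are hypothetical; the sketch assumes the intervals are sorted, contiguous, and given as (lower boundary, upper boundary, frequency).

```python
def grouped_median(intervals):
    """Estimate the median from (lower, upper, frequency) intervals.

    Assumes the intervals are sorted in increasing order and contiguous.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no data")
    half = n / 2
    below = 0  # sum of frequencies of the intervals below the current one
    for lower, upper, freq in intervals:
        if freq > 0 and below + freq >= half:
            # lower boundary + (n/2 - frequencies below) / freq_median * width
            return lower + (half - below) / freq * (upper - lower)
        below += freq
    raise ValueError("median interval not found")

# Hypothetical grouped data: 50 values in four intervals of width 10.
bins = [(0, 10, 5), (10, 20, 15), (20, 30, 20), (30, 40, 10)]
print(grouped_median(bins))
```

Here n/2 = 25 falls in the interval 20 to 30, which has 20 values below it, so the estimate is 20 + (25 - 20)/20 × 10.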

• Mode
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
• Empirical formula:

$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
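A short sketch of finding the mode(s) and applying the empirical relation between mean, median and mode. The function names are hypothetical, and treating data in which every value occurs once as having no mode is an assumption of this sketch. The empirical relation is only a rough guide for moderately skewed, unimodal data.

```python
from collections import Counter

def describe_modes(xs):
    """Return the most frequent value(s) and a label for the modality."""
    if not xs:
        raise ValueError("no data")
    counts = Counter(xs)
    top = max(counts.values())
    if top == 1:
        # Assumption: if every value occurs exactly once, report no mode.
        return [], "no mode"
    modes = sorted(v for v, c in counts.items() if c == top)
    labels = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return modes, labels.get(len(modes), "multimodal")

def mode_from_mean_median(mean, median):
    """Rearranged empirical relation: mode is roughly mean - 3 * (mean - median)."""
    return mean - 3 * (mean - median)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative data
print(describe_modes(salaries))
print(mode_from_mean_median(58, 54))
```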

Symmetric vs. Skewed Data

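■ Median, mean and mode of symmetric,
positively and negatively skewed data
• Symmetric: mean, median and mode coincide
• Positively skewed: typically mode < median < mean
• Negatively skewed: typically mean < median < mode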
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)

• Inter-quartile range: IQR = Q3 – Q1

• Five number summary: min, Q1, M, Q3, max

• Boxplot: the ends of the box are the quartiles, the median is marked,
whiskers extend from the box, and outliers are plotted individually
• Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
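The five-number summary and the 1.5 x IQR rule can be sketched with the standard library. Several quartile conventions exist and give slightly different values; this sketch assumes the "inclusive" method of `statistics.quantiles`, and the salary values are illustrative.

```python
from statistics import quantiles

def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max)."""
    data = sorted(xs)
    # Assumption: "inclusive" quartile convention; other conventions differ slightly.
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, med, q3, data[-1]

def iqr_outliers(xs, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < low or x > high]

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative data
print(five_number_summary(salaries))
print(iqr_outliers(salaries))
```

With these values the IQR is 15.5, so the fences sit at 26 and 88 and only 110 is flagged.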

• Variance and standard deviation (sample: s, population: σ)

• Variance: (algebraic, scalable computation)
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$

• Standard deviation s (or σ) is the square root of variance s² (or σ²)
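Both forms of the population variance can be checked against each other in code. The sketch below is illustrative: the function names and data are made up. The second form needs only running sums, which is why it scales to a single pass over the data, but it can lose precision in floating point when the mean is large relative to the spread.

```python
from math import sqrt

def population_variance(xs):
    """Definition: average squared deviation from the mean."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

def population_variance_one_pass(xs):
    """Equivalent form: mean of the squares minus the square of the mean."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in xs:            # a single pass keeping only three running values
        n += 1
        total += x
        total_sq += x * x
    mu = total / n
    return total_sq / n - mu * mu

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values with mean 5
print(population_variance(data))
print(population_variance_one_pass(data))
print(sqrt(population_variance(data)))  # standard deviation
```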

Boxplot Analysis

• Five-number summary of a distribution:

• Minimum, Q1, M, Q3, Maximum

• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extend to
Minimum and Maximum
Properties of Normal Distribution Curve

■ The normal (distribution) curve

– From μ–σ to μ+σ: contains about 68% of the
measurements (μ: mean, σ: standard deviation)
– From μ–2σ to μ+2σ: contains about 95% of it
– From μ–3σ to μ+3σ: contains about 99.7% of it
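The three percentages above follow from the normal distribution: the fraction of values within k standard deviations of the mean equals erf(k/√2). A minimal check, with the function name `normal_coverage` chosen here for illustration:

```python
from math import erf, sqrt

def normal_coverage(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_coverage(k):.4f}")
```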
Histogram Analysis
• Graph displays of basic statistical class
descriptions
• Frequency histograms
• Consists of a set of rectangles that reflect the
counts or frequencies of the classes present in the
given data
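The counts behind a frequency histogram can be computed by assigning each value to a class. This sketch assumes equal-width classes that are closed on the left and open on the right, except the last class, which also includes its upper edge; the function name and data are illustrative.

```python
def histogram_counts(values, low, width, bins):
    """Count values per equal-width class starting at `low`."""
    counts = [0] * bins
    high = low + bins * width
    for v in values:
        if v == high:
            idx = bins - 1                  # upper edge goes into the last class
        else:
            idx = int((v - low) // width)
        if 0 <= idx < bins:                 # values outside the range are ignored
            counts[idx] += 1
    return counts

data = [5, 7, 9, 12, 14, 18, 21, 24, 26, 30]  # illustrative values
counts = histogram_counts(data, low=0, width=10, bins=3)
for i, c in enumerate(counts):
    # One text "rectangle" per class, its length equal to the class count.
    print(f"{i * 10:>2} to {(i + 1) * 10:<2} {'#' * c}")
```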
Quantile Plot
• Displays all of the data (allowing the user to
assess both the overall behavior and unusual
occurrences)
• Plots quantile information
• For data xi sorted in increasing order, fi
indicates that approximately 100 × fi % of the data
are below or equal to the value xi
Quantile plot
■ "Rankit" method F(i) = (i - 0.5) / n
Sample data: 5, 7, 9, 12, 14, 18, 21, 24, 26, 30
■ For our dataset of size n = 10,
■ F values = [(1 - 0.5)/10, (2 - 0.5)/10, ..., (10 -
0.5)/10] = [0.05, 0.15, 0.25, ..., 0.95]
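The F values above can be generated directly from the formula F(i) = (i - 0.5) / n. A small sketch using the sample data from this slide; the function name `quantile_plot_points` is hypothetical.

```python
def quantile_plot_points(xs):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n, for i = 1..n."""
    data = sorted(xs)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]

sample = [5, 7, 9, 12, 14, 18, 21, 24, 26, 30]
points = quantile_plot_points(sample)
for f, x in points:
    print(f"{f:.2f}  {x}")
```

Plotting x against f gives the quantile plot: for example, the point (0.25, 9) says that roughly 25% of the data are at or below 9.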
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate
distribution against the corresponding
quantiles of another
• Allows the user to view whether there is a
shift in going from one distribution to another
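When two samples have the same size, the Q-Q pairs are simply the sorted values matched by rank. The sketch below makes that equal-size assumption explicit (samples of different sizes would need interpolated quantiles); the function name and the second sample are made up for illustration.

```python
def qq_pairs(a, b):
    """Pair the i-th smallest value of a with the i-th smallest value of b."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return list(zip(sorted(a), sorted(b)))

first = [5, 7, 9, 12, 14, 18, 21, 24, 26, 30]
second = [x + 3 for x in reversed(first)]   # same shape, shifted up by 3

pairs = qq_pairs(first, second)
shifts = [y - x for x, y in pairs]
print(pairs)
print(shifts)
```

A constant difference at every quantile means the points lie on a line parallel to y = x, which is how a pure shift between the two distributions shows up on the plot.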
Scatter plot
• Provides a first look at bivariate data to see
clusters of points, outliers, etc
• Each pair of values is treated as a pair of
coordinates and plotted as points in the plane
Positively and Negatively Correlated Data
Not Correlated Data
Graphic Displays of Basic Statistical Descriptions

• Histogram: (shown before)

• Boxplot: (covered before)
• Quantile plot: each value xi is paired with fi
indicating that approximately 100 fi % of data are ≤ xi
• Quantile-quantile (q-q) plot: graphs the quantiles of
one univariate distribution against the corresponding
quantiles of another
• Scatter plot: each pair of values is a pair of
coordinates and plotted as points in the plane
• Loess (local regression) curve: add a smooth curve to
a scatter plot to provide better perception of the
pattern of dependence
