Data Mining with R-Programming
— UNIT 1 —
Recommended Book:
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
April 22, 2025 UNIT 1
Introduction
Motivation (for Data Mining)
Data Mining: Definition & Functionalities
Data Preprocessing
Forms of Data Preprocessing
Data Cleaning: Missing Values, Noisy Data (Binning, Clustering, Regression, Computer and Human Inspection), Inconsistent Data
Data Integration and Transformation
Data Reduction: Data Cube Aggregation, Dimensionality Reduction, Data Compression, Numerosity Reduction, Clustering
Discretization and Concept Hierarchy Generation
Motivation
In real-world applications, data can be inconsistent, incomplete, and/or noisy.
Errors can arise from:
Faulty data collection instruments
Data entry problems
Human misjudgment during data entry
Data transmission problems
Technology limitations
Discrepancies in naming conventions
Results:
Duplicated records
Incomplete data
Contradictions in data.
Clean Data / Data Engineering
Database processing vs. data mining:
Data: operational data vs. not operational data
Output: precise vs. fuzzy; a subset of the database vs. not a subset of the database
Statistics vs. Data Mining:
Confirmatory vs. exploratory
Small samples vs. large samples
In-sample performance vs. out-of-sample performance
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than
$10,000 in the last month.
– Find all customers who have purchased milk
Data Mining
– Find all credit applicants who are poor credit risks.
(classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased
with milk. (association rules)
Pattern recognition
Prediction
Segmentation
Partitioning
Generalization
Association rules (“If X then Y”)
Sequential Analysis determines sequential
patterns.
(Artificial) Neural Networks
Genetic algorithms
Hypothesis Testing.
Data Mining and Business Intelligence
(Figure: pyramid of increasing potential to support business decisions, bottom to top, with the typical user at each layer:
Data Exploration — Statistical Summary, Querying, and Reporting;
Data Mining — Information Discovery — Data Analyst;
Data Presentation — Visualization Techniques — Business Analyst;
Decision Making — End User.)

KDD Process
(Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Data Selection → Task-relevant Data → Data Mining.)
Data mining may generate thousands of patterns: Not all of them are
interesting
Suggested approach: Human-centered, query-based, focused
mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid
on new or test data with some degree of certainty, potentially
useful, novel, or validates some hypothesis that a user seeks to
confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, actionability, etc.
Data Mining Development draws on several fields:
Similarity Measures, Hierarchical Clustering, IR Systems, Imprecise Queries, Textual Data, Web Search Engines (information retrieval)
Relational Data Model, SQL, Association Rule Algorithms, Data Warehousing, Scalability Techniques (databases)
Bayes Theorem, Regression Analysis, EM Algorithm, K-Means Clustering (statistics / machine learning)
Association rule example: Diaper → Beer [support = 0.5%, confidence = 75%] (correlation or causality?)
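The rule above reads: customers who buy diapers also tend to buy beer, with support 0.5% and confidence 75%. A minimal sketch of how support and confidence are computed (in Python here, though the course uses R; the transaction list is made-up toy data):

```python
# Support and confidence for an association rule X -> Y,
# computed over a toy list of market-basket transactions.
def support(transactions, itemset):
    # Fraction of transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    # P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)
    return support(transactions, lhs | rhs) / support(transactions, lhs)

transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]
s = support(transactions, {"diaper", "beer"})       # 2 of 4 baskets = 0.5
c = confidence(transactions, {"diaper"}, {"beer"})  # 2 of 3 diaper baskets
print(s, round(c, 3))
```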
Classification and prediction
Construct models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas mileage
Predict some unknown or missing numerical values
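As a concrete sketch of classification, here is a minimal 1-nearest-neighbour classifier: it predicts the class of the closest training point. The car data and mileage labels are invented purely for illustration:

```python
# Minimal 1-nearest-neighbour classifier (toy example).
def predict_1nn(train, x):
    # train: list of ((features...), label); x: tuple of features
    def dist2(a, b):
        # squared Euclidean distance between two feature tuples
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda pair: dist2(pair[0], x))[1]

# e.g., classify cars by gas mileage from (weight in tons, horsepower)
train = [((1.0, 70), "high-mileage"), ((2.0, 200), "low-mileage"),
         ((1.2, 90), "high-mileage")]
print(predict_1nn(train, (1.1, 75)))
```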
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing one: knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
Measures of data quality:
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories: intrinsic, contextual, representational, and accessibility
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
  x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ        μ = (Σ x) / N
Weighted arithmetic mean:
  x̄ = (Σᵢ₌₁ⁿ wᵢ·xᵢ) / (Σᵢ₌₁ⁿ wᵢ)
Trimmed mean: chopping extreme values
Median: a holistic measure
  Middle value if odd number of values; average of the middle two values otherwise
  Estimated by interpolation (for grouped data):
    median = L₁ + ((n/2 − (Σ f)ₗ) / f_median) · c
Mode
  Value that occurs most frequently in the data
  Unimodal, bimodal, trimodal
  Empirical formula: mean − mode ≈ 3 × (mean − median)
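These measures are all easy to compute directly; a short Python sketch on toy data (the course works in R, where `mean`, `median`, and weighted variants are built in, but the definitions are the same):

```python
# Central-tendency measures: mean, weighted mean, trimmed mean, median, mode.
from collections import Counter

def mean(xs):
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    # sum(w_i * x_i) / sum(w_i)
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def trimmed_mean(xs, k):
    # chop the k smallest and k largest values before averaging
    return mean(sorted(xs)[k:len(xs) - k])

def median(xs):
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    # middle value if odd, average of the two middle values if even
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def mode(xs):
    # value occurring most frequently
    return Counter(xs).most_common(1)[0][0]

data = [31, 62, 62, 63, 64, 65, 69]
print(mean(data), median(data), mode(data))
```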
Symmetric vs. Skewed Data
Variance (algebraic, scalable computation):
  s² = (1/(n−1)) Σᵢ₌₁ⁿ (xᵢ − x̄)² = (1/(n−1)) [ Σᵢ₌₁ⁿ xᵢ² − (1/n)(Σᵢ₌₁ⁿ xᵢ)² ]
Standard deviation s (or σ) is the square root of the variance s² (or σ²)
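The two forms of s² above are algebraically equal; the second needs only the running sums of x and x², which is what makes the computation scalable. A quick check in Python on toy data:

```python
# Two equivalent forms of the sample variance:
# the definitional form and the one-pass "scalable" form.
def sample_variance_definition(xs):
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_scalable(xs):
    # needs only sum(x) and sum(x^2), so it works in a single pass
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
v1 = sample_variance_definition(data)
v2 = sample_variance_scalable(data)
print(v1, v2)
```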
Graphic displays of basic statistical descriptions:
Histogram: shows the frequency of values falling into each bucket
Boxplot: graphic display of the five-number summary
Quantile plot: each value xᵢ is paired with fᵢ indicating that approximately 100·fᵢ % of the data are ≤ xᵢ
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Scatter plot: each pair of values is a pair of coordinates plotted as a point in the plane
Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence
Boxplot Analysis
Lower quartile = Q1; upper quartile = Q3
Interquartile range: IQR = Q3 − Q1
Outliers are the values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR
Example Problems – Study Outliers in a Set of Data:
Example 1 – Calculate the outlier for the given set of data 31, 64, 69, 65, 62, 63, 62.
Solution:
Organize the given set of data in ascending order: 31, 62, 62, 63, 64, 65, 69.
The median position is (7+1)/2 = 4, so the 4th value is the median. Thus median = 63 = Q2.
The lower-quartile position is (7+1)/4 = 2, so the 2nd value, 62, is the lower quartile Q1.
The upper-quartile position is 3·(7+1)/4 = 6, so the 6th value, 65, is the upper quartile Q3.
Interquartile range IQR = Q3 − Q1 = 65 − 62 = 3.
To find the outliers, calculate Q1 − 1.5·IQR and Q3 + 1.5·IQR:
Q1 − 1.5·IQR = 62 − 1.5·3 = 57.5
Q3 + 1.5·IQR = 65 + 1.5·3 = 69.5
The outliers lie below 57.5 or above 69.5.
Thus 31 is the outlier of the given set of data.
Example 2 – Calculate the outlier for the given set of data 50, 61, 65, 64, 67, 85, 70.
Solution: Organize the given set of data in ascending order: 50, 61, 64, 65, 67, 70, 85.
Lower quartile Q1 = 61 (2nd value); upper quartile Q3 = 70 (6th value).
Interquartile range IQR = Q3 − Q1 = 70 − 61 = 9.
Q1 − 1.5·IQR = 61 − 1.5·9 = 47.5
Q3 + 1.5·IQR = 70 + 1.5·9 = 83.5
The outliers lie below 47.5 or above 83.5, so 85 is an outlier of the given set of data.
Example 3 – Calculate the interquartile-range outliers for the given set of data 60, 61, 62, 55, 58, 59, 64, 65, 67, 90, 100.
Solution: Organize the given set of data in ascending order: 55, 58, 59, 60, 61, 62, 64, 65, 67, 90, 100.
The median position is (11+1)/2 = 6, so the 6th value is the median: 62 = Q2.
Lower quartile Q1 = 59 (3rd value); upper quartile Q3 = 67 (9th value).
Interquartile range IQR = Q3 − Q1 = 67 − 59 = 8.
Q1 − 1.5·IQR = 59 − 1.5·8 = 47
Q3 + 1.5·IQR = 67 + 1.5·8 = 79
The outliers lie below 47 or above 79, so 90 and 100 are outliers.
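All three worked examples can be checked with a short function that follows the same position-based quartile rule used above (it assumes the quartile positions (n+1)/4 and 3(n+1)/4 come out as whole numbers, as they do in these examples):

```python
# IQR outlier test using the position-based quartiles from the worked examples.
def iqr_outliers(xs):
    xs = sorted(xs)
    n = len(xs)
    q1 = xs[(n + 1) // 4 - 1]       # lower quartile at position (n+1)/4
    q3 = xs[3 * (n + 1) // 4 - 1]   # upper quartile at position 3(n+1)/4
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < low or x > high]

print(iqr_outliers([31, 64, 69, 65, 62, 63, 62]))
print(iqr_outliers([50, 61, 65, 64, 67, 85, 70]))
print(iqr_outliers([60, 61, 62, 55, 58, 59, 64, 65, 67, 90, 100]))
```

The three calls reproduce the outliers found by hand in Examples 1–3.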
(Figure: regression smoothing — data fitted by a line such as y = x + 1.)
Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., use correlation and clustering to find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
Data Preprocessing: Data Integration
Data integration: combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Correlation coefficient (Pearson's product-moment coefficient):
  r_{A,B} = Σ (a − Ā)(b − B̄) / ((n − 1)·σ_A·σ_B) = (Σ(a·b) − n·Ā·B̄) / ((n − 1)·σ_A·σ_B)
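A direct sketch of this coefficient (toy vectors; r = +1 means A and B rise together perfectly, r = −1 means one falls as the other rises):

```python
# Pearson correlation coefficient r_{A,B}:
# sum((a - mean_A)(b - mean_B)) / ((n - 1) * sigma_A * sigma_B),
# with sigma the sample standard deviation.
import math

def pearson_r(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)
    sa = math.sqrt(sum((x - ma) ** 2 for x in a) / (n - 1))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b) / (n - 1))
    return cov / (sa * sb)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfect positive relation
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # perfect negative relation
```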
Χ² (chi-square) test:
  χ² = Σ (Observed − Expected)² / Expected
The larger the Χ2 value, the more likely the
variables are related
The cells that contribute the most to the Χ2 value
are those whose actual count is very different from
the expected count
Correlation does not imply causality
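A sketch of the χ² computation for a 2×2 contingency table, with expected counts derived from the row and column totals; the counts below are illustrative, not taken from the text:

```python
# Chi-square statistic for a contingency table:
# chi2 = sum((observed - expected)^2 / expected),
# where expected_ij = row_total_i * col_total_j / grand_total.
def chi_square(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total
            chi2 += (obs - exp) ** 2 / exp
    return chi2

# toy counts, e.g., likes-science-fiction vs. plays-chess
table = [[250, 200], [50, 1000]]
print(round(chi_square(table), 2))
```

A large χ² value like this one suggests the two attributes are strongly related.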
z-score normalization: v' = (v − μ_A) / σ_A
  Ex. Let μ = 54,000 and σ = 16,000. Then v' = (73,600 − 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
  v' = v / 10ʲ, where j is the smallest integer such that Max(|v'|) < 1
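Both normalizations fit in a few lines; the first call below reproduces the 73,600 z-score example from the slide:

```python
# z-score normalization and normalization by decimal scaling.
def z_score(v, mu, sigma):
    # v' = (v - mu) / sigma
    return (v - mu) / sigma

def decimal_scaling(values):
    # divide every value by 10^j, where j is the smallest integer
    # such that max(|v'|) < 1
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(z_score(73_600, 54_000, 16_000))   # the slide's example: 1.225
print(decimal_scaling([-986, 917]))      # j = 3, so divide by 1000
```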
Data Preprocessing: Data Reduction — Data Cube Aggregation
(Figure: a sales data cube with dimensions Date (1Qtr–4Qtr), Product (TV, PC, VCR), and Country (U.S.A., Canada, Mexico), with “sum” cells aggregating along each dimension.)
Attribute Subset Selection
Reduces the data set size by removing redundant or irrelevant attributes and makes the mined patterns easier to understand
Heuristic methods (due to exponential # of choices):
Step-wise forward selection:
  The best single feature is picked first
  Then the next best feature conditioned on the first, ...
Step-wise feature elimination:
  Repeatedly eliminate the worst feature
Best combined feature selection and elimination
  Use feature elimination and backtracking
Decision-tree induction (figure: a tree branching on attributes A4, A1, A6)
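The step-wise forward heuristic can be sketched as follows. The scoring function here is a deliberately simplified stand-in (a fixed per-attribute usefulness, ignoring interactions); a real implementation would score subsets with, e.g., a classifier's accuracy:

```python
# Step-wise forward selection: greedily add the attribute that most
# improves a scoring function, until k attributes are chosen.
def forward_selection(attributes, score, k):
    # attributes: attribute names; score(subset) -> float, higher is better
    selected = []
    while len(selected) < k:
        best = max((a for a in attributes if a not in selected),
                   key=lambda a: score(selected + [a]))
        selected.append(best)
    return selected

# Toy score: pretend each attribute has a fixed, independent usefulness.
usefulness = {"A1": 0.9, "A2": 0.1, "A4": 0.95, "A6": 0.7}
score = lambda subset: sum(usefulness[a] for a in subset)
print(forward_selection(list(usefulness), score, 2))
```

With this toy score the greedy search picks A4 first, then A1.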
Data Compression
String compression
  There are extensive theories and well-tuned algorithms
  Typically lossless
  But only limited manipulation is possible without expansion
Audio/video compression
  Typically lossy compression, with progressive refinement
  Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
(Figure: original data vs. its lossy approximation.)
Approximate the percentage of each class (or
subpopulation of interest) in the overall database
Used in conjunction with skewed data
Note: Sampling may not reduce database I/Os (page
at a time)
Sampling: with or without Replacement
SRSWOR: simple random sample without replacement
SRSWR: simple random sample with replacement
(Figure: raw data reduced by SRSWOR and SRSWR samples.)
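Both sampling schemes are available directly in Python's standard library (a sketch on toy data; R offers the same choice via `sample(..., replace = TRUE/FALSE)`):

```python
# SRSWOR vs. SRSWR on a toy data set.
import random

random.seed(42)                      # fixed seed for reproducibility
data = list(range(100))

srswor = random.sample(data, 10)     # without replacement: no duplicates
srswr = random.choices(data, k=10)   # with replacement: duplicates possible

print(len(set(srswor)) == 10)        # True: all sampled items are distinct
print(len(srswr))
```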
Sampling: Cluster or Stratified Sampling
(Figure: step-by-step segmentation of a numeric range for discretization, e.g., (−$1,000 – $2,000) and (−$400 – $5,000).)