
Data Mining

The document provides an introduction to data mining, highlighting its significance due to the exponential growth of data and the need for knowledge extraction. It outlines the stages of data mining, including data preprocessing, modeling, and evaluation, as well as various functionalities such as classification, clustering, and association analysis. Additionally, it discusses the importance of data quality and the preprocessing tasks necessary to ensure accurate data mining outcomes.


Data Mining:

Introduction

1
Introduction
• Motivation
• What is data mining?
• Why data mining?
• Data Mining: On what kind of data?
• Data mining functionality

2
Motivation
• The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web, computerized
society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, etc.
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”—Data mining—Automated analysis of
massive data sets

3
Evolution of Database Technology
• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
– Data mining, data warehousing, multimedia databases, and Web databases
• 2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems

4
What Is Data Mining?

• Data mining (knowledge discovery from data)


– Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
patterns or knowledge from huge amounts of data
• Alternative name
– Knowledge discovery in databases (KDD)

5
Data Mining Stages
• Business understanding involves understanding the domain for which data mining has to be performed
• Data pre-processing involves cleaning the data, transforming the data, and selecting subsets of records that are of interest
• Data modelling involves building models such as decision trees, support vector machines (SVM), and neural networks from the pre-processed data
6
Data Mining Models
• Decision trees: The decision tree is one of the most popular classification models. It has a tree-like structure, where each internal node denotes a decision (test) on the value of an attribute, a branch represents an outcome of that decision, and the leaves represent target classes. A decision tree displays the various relationships found in the training data by executing a classification algorithm.
• Neural networks: Neural networks offer a mathematical model that attempts to
mimic the human brain. Knowledge is represented as a layered set of
interconnected processors called neurons. Each node has a weighted connection
with other nodes in adjacent layers. Learning in neural networks is accomplished
by network connection weight changes while a set of input instances is repeatedly
passed through the network. Once trained, an unknown instance passing through
the network is classified according to the values seen at the output layer.
• Naive Bayes classifier: This classifier offers a simple yet powerful supervised
classification technique. The model assumes all input attributes to be of equal
importance and independent of one another. Naive Bayes classifier is based on the
classical Bayes theorem presented in 1763 which works on the probability theory.

7
KDD Process: A Typical View from ML and Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing

Data pre-processing: data integration, normalization, feature selection, dimension reduction
Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

• This is a view from typical machine learning and statistics communities

8
Steps of KDD process
• Data cleaning: to remove noise and inconsistent data
• Data integration: where multiple data sources may be combined
• Data selection: where data relevant to the analysis task are retrieved from
the database
• Data transformation: where data are transformed and consolidated into
forms appropriate for mining by performing summary or aggregation
operations
• Data mining: an essential process where intelligent methods are applied
to extract data patterns
• Pattern evaluation: to identify the truly interesting patterns representing
knowledge based on interest
• Knowledge presentation: where visualization and knowledge
representation techniques are used to present mined knowledge to users

9
Why Data Mining?—Potential Applications

• In classification, the goal is to classify a new data record into one of several possible classes, which are already known. For example, in a loan database an applicant has to be classified as a prospective customer or a defaulter, given his or her personal and demographic features along with previous purchase characteristics.
• In estimation, unlike classification, we predict the attribute of a data instance—
usually a numeric value rather than a categorical class. An example can be “Estimate
the percentage of marks of a student whose previous marks are already available”.
• Market basket analysis or association rule mining analyses hidden rules called
association rules in a large transactional database. For example, the rule {pen, pencil
→ book} provides the information that whenever pen and pencil are purchased
together, book is also purchased; so these items can be placed together for sales or
supplied as a complementary product with one another to increase the overall sales
of each item.

10
Why Data Mining?—Potential Applications

• In clustering, we use an unsupervised learning technique, where target classes are unknown. For example, 1000 applicants have to be grouped based on certain similarity criteria, and it is not predefined which classes the applicants should finally be grouped into.
• Other categories of data available nowadays are scientific data collected by
satellites using sensors, data collected by telescopes scanning the skies, scientific
simulations generating terabytes of data, etc.

11
Market Analysis and Management

• Where does the data come from?


– Credit card transactions, discount coupons,
customer complaint calls
• Target marketing
– Find clusters of “model” customers who share the
same characteristics: interest, income level,
spending habits, etc.
– Determine customer purchasing patterns over time

12
Market Analysis and Management

• Cross-market analysis
– Associations/correlations between product sales, and prediction based on such associations
• Customer profiling
– What types of customers buy what products
• Customer requirement analysis
– Identifying the best products for different customers
– Predict what factors will attract new customers
13
Fraud Detection & Mining Unusual Patterns

• Approaches: Clustering & model construction for frauds, outlier analysis


• Applications: Health care, retail, credit card service, telecomm.
– Medical insurance
• Professional patients, and ring of doctors
• Unnecessary or correlated screening tests
– Telecommunications:
• Phone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm
– Retail industry
• Analysts estimate that 38% of retail shrink is due to dishonest
employees

14
Data Mining: On What Kinds of Data?
• Relational database
• Data warehouse
• Transactional database
• Advanced database and information repository
– Spatial and temporal data
– Time-series data
– Stream data
– Multimedia database
– Text databases & WWW

15
Data Mining Functionalities

• Concept description: Characterization and discrimination


– Generalize, summarize, and contrast data characteristics
• Association (correlation and causality)
– buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
• Classification and Prediction
– Construct models (functions) that describe and distinguish classes or
concepts for future prediction
– Presentation: decision-tree, classification rule, neural network

16
Data Mining Functionalities

• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
– Maximizing intra-class similarity & minimizing interclass similarity
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of
the data
– Useful in fraud detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis

17
Data Mining: Confluence of Multiple Disciplines

Data mining sits at the confluence of multiple disciplines: database systems, statistics, machine learning, visualization, algorithms, and other disciplines.

18
Quiz
1. _______________________ is a data mining activity that predicts a
target class for an instance.

2. Data mining and _______________________ are closely related science


branches.

3. The prediction of numeric values by a model is called ____________.

4. ____________ is the first activity in data mining.

5. ____________ is a model that uses Bayes’ theorem.

19
Data Preprocessing

20
Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view


– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how much the data are trusted to be correct?
– Interpretability: how easily the data can be understood?

21
Forms of Data Preprocessing

22
Major Tasks in Data Preprocessing

• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
23
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
24
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time
of entry
– not register history or changes of the data
• Missing data may need to be inferred
25
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same
class: smarter
– the most probable value: inference-based, such as a Bayesian formula or decision tree
26
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data

27
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal
with possible outliers)

28
Data smoothing by regression

29
Data smoothing by clustering

30
Binning
• Binning: Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it.
• Sorted values are distributed into a number of “buckets,” or bins.
• Because binning methods consult the neighborhood of values, they perform local smoothing.
• Smoothing by bin means: each value in a bin is replaced by the mean
value of the bin.
• Smoothing by bin medians can be employed, in which each bin value is
replaced by the bin median.
• Smoothing by bin boundaries: the minimum and maximum values in a
given bin are identified as the bin boundaries. Each bin value is then
replaced by the closest boundary value.

31
Example
• Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28,
34

32
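As a rough sketch of how the partitioning and smoothing described above could be carried out (assuming a bin depth of 3, i.e., three equal-frequency bins for the nine prices), here is a small Python example:

    # Equal-frequency binning with smoothing by bin means and by bin boundaries,
    # using the sorted price data from the slide.
    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
    depth = 3  # assumed bin depth: 3 values per bin

    bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]
    by_means = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]
    by_boundaries = [
        [min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins
    ]

    print(bins)           # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
    print(by_means)       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
    print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]] (ties go to the minimum)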
Exercise
• The following data (in increasing order) for the
attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22,
22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35,
36, 40, 45, 46, 52, 70.

• Use smoothing by bin means to smooth these


data, using a bin depth of 3.

33
Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust
– Integrate metadata from different sources
• Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different sources are
different
– Possible reasons: different representations, different scales, e.g., metric
vs. British units
34
Handling Redundancy in Data Integration

• Redundant data occur often when integration of multiple


databases
– Object identification: The same attribute or object may
have different names in different databases
– Derivable data: One attribute may be a “derived” attribute
in another table, e.g., annual revenue
• Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
35
Correlation Analysis (Nominal Data)

• χ² (chi-square) test
• For nominal data, the correlation between two attributes can be identified by this test:

      χ² = Σ (Observed − Expected)² / Expected

• The larger the χ² value, the more likely the variables are related

36
Chi-Square Calculation: An Example
• Suppose that a group of 1500 people was surveyed. The gender
of each person was noted. Each person was polled as to
whether his or her preferred type of reading material was
fiction or non-fiction. Thus, we have two attributes, gender and
preferred_reading. The observed frequency (or count) of each
possible joint event is summarized in the contingency table.
Are gender and preferred_reading correlated?

• The numbers in parentheses are the expected frequencies.

                 Male        Female       Sum (row)
  fiction        250 (90)    200 (360)      450
  non-fiction     50 (210)  1000 (840)     1050
  Sum (col.)     300        1200           1500
37


Chi-Square Calculation: An Example

• χ² (chi-square) calculation (numbers in parentheses are expected counts calculated from the data distribution in the two categories):

      χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

• The degrees of freedom are (rows − 1)(columns − 1) = 1
• It shows that gender and preferred_reading are correlated in the group

38
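The χ² value above can be reproduced in a few lines; the sketch below recomputes the expected counts from the row and column totals of the contingency table (plain Python, no extra libraries assumed):

    # Chi-square test of independence for the gender vs. preferred_reading table.
    observed = [[250, 200],    # fiction:     male, female
                [50, 1000]]    # non-fiction: male, female

    row_totals = [sum(row) for row in observed]        # [450, 1050]
    col_totals = [sum(col) for col in zip(*observed)]  # [300, 1200]
    n = sum(row_totals)                                # 1500

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected

    print(round(chi2, 2))  # 507.94 (the slide rounds the same sum to 507.93)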
Correlation Analysis (Numeric Data)

• Correlation coefficient (also called Pearson’s product-moment coefficient):

      r(A,B) = Σ_i (a_i − Ā)(b_i − B̄) / ((n − 1) σA σB) = (Σ_i a_i b_i − n·Ā·B̄) / ((n − 1) σA σB)

  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ a_i b_i is the sum of the AB cross-product.
• If r(A,B) > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
• r(A,B) = 0: independent; r(A,B) < 0: negatively correlated
39
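A minimal Python sketch of the correlation-coefficient formula above; the two attribute value lists are hypothetical and only exercise the function:

    import math

    # Pearson's product-moment correlation r(A, B), following the formula above.
    def pearson(a, b):
        n = len(a)
        mean_a, mean_b = sum(a) / n, sum(b) / n
        # sample standard deviations (n - 1 in the denominator, as in the formula)
        sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
        sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)
        return cov / (sd_a * sd_b)

    # hypothetical attribute values
    A = [6, 5, 4, 3, 2]
    B = [20, 10, 14, 5, 5]
    print(round(pearson(A, B), 3))  # positive value: A and B rise and fall together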
Example
• Is gender independent of education level? A random sample of 395 people
were surveyed and each person was asked to report the highest education
level they obtained. The data that resulted from the survey is summarized in
the following table:

40
Mining Frequent Patterns

41
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.)
that occurs frequently in a data set
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.

42
Example
• The information that customers who purchase computers also tend to buy
antivirus software at the same time is represented in the following
association rule:

Computer ⇒ antivirus software [support = 2%, confidence = 60%]

• Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. These thresholds can be set by users or domain experts.

43
Why Is Freq. Pattern Mining Important?

• Freq. pattern: An intrinsic and important property of datasets


• Foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential, structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series,
and stream data
– Classification: discriminative, frequent pattern analysis
– Cluster analysis: frequent pattern-based clustering

44
Basic Concepts: Frequent Patterns
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

• itemset: A set of one or more items
• k-itemset X = {x1, …, xk}
• (absolute) support, or support count, of X: frequency or number of occurrences of an itemset X
• (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X’s support is no less than a minsup threshold
45
Basic Concepts: Frequent Patterns

• The occurrence frequency of an itemset is the number of transactions that contain the itemset
• This is also known as the frequency, support count, or count of the itemset

46
Basic Concepts: Association Rules
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X ⇒ Y with minimum support and confidence
– support, s: probability that a transaction contains X ∪ Y
– confidence, c: conditional probability that a transaction having X also contains Y
• Let minsup = 50%, minconf = 50%
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (many more!): Beer ⇒ Diaper (60%, 100%)
47
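A small Python sketch that recomputes support and confidence for the rule Beer ⇒ Diaper on the five transactions above; the results match the 60% / 100% figures on the slide:

    # Support and confidence over the five example transactions.
    transactions = [
        {"Beer", "Nuts", "Diaper"},
        {"Beer", "Coffee", "Diaper"},
        {"Beer", "Diaper", "Eggs"},
        {"Nuts", "Eggs", "Milk"},
        {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
    ]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs):
        return support(lhs | rhs) / support(lhs)

    print(support({"Beer", "Diaper"}))       # 0.6 -> 60% support
    print(confidence({"Beer"}, {"Diaper"}))  # 1.0 -> 100% confidence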
Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
• Solution: Mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
• Closed pattern is a lossless compression of freq. patterns
– Reducing the # of patterns and rules

48
Closed Patterns and Max-Patterns
• Exercise. DB = {<a1, …, a100>, < a1, …, a50>}
– Min_sup = 1.
• What is the set of closed itemsets?
– <a1, …, a100>: 1
– < a1, …, a50>: 2
• What is the set of max-patterns?
– <a1, …, a100>: 1

49
Database
• If I = {I1, I2, ..., In} is a set of binary attributes called items and D = {T1, T2, ..., Tm} is a set of transactions called the database, then each transaction in D has a unique transaction ID and contains a subset of the items in I.

• The support of an itemset is defined as the proportion of transactions in


the dataset that contains the itemset.

50
Apriori: A Candidate Generation & Test Approach

• Apriori algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining


frequent itemsets for Boolean association rules
• The algorithm uses prior knowledge of frequent itemset properties
• All nonempty subsets of a frequent itemset must also be frequent
• Apriori pruning principle: If there is any itemset which is infrequent, its
superset should not be generated/tested!
• Method:
– Initially, scan DB once to get frequent 1-itemset
– Generate length (k+1) candidate itemsets from length k frequent itemsets
– Test the candidates against DB
– Terminate when no frequent or candidate set can be generated

51
Implementation of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd}

52
The Apriori Algorithm—An Example
Supmin = 2

Database TDB (1st scan):
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (2nd scan): {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (3rd scan): {B,C,E}
L3: {B,C,E}:2
53
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ⋃k Lk;
54
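A compact Python rendering of the pseudo-code above (self-join, prune, count, repeat), run on the small transaction database of the worked example with minimum support count 2; the function and variable names are our own, not from the slides:

    from itertools import combinations

    def apriori(transactions, min_sup=2):
        transactions = [frozenset(t) for t in transactions]
        items = {i for t in transactions for i in t}
        # L1: frequent 1-itemsets from one scan of the database
        L = [{frozenset([i]) for i in items
              if sum(i in t for t in transactions) >= min_sup}]
        all_frequent = set(L[0])
        k = 1
        while L[-1]:
            # self-join Lk to build candidate (k+1)-itemsets, then prune by the
            # Apriori property (every k-subset must itself be frequent)
            candidates = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k + 1}
            candidates = {c for c in candidates
                          if all(frozenset(s) in L[-1] for s in combinations(c, k))}
            # count support against the database and keep the frequent candidates
            next_L = {c for c in candidates
                      if sum(c <= t for t in transactions) >= min_sup}
            all_frequent |= next_L
            L.append(next_L)
            k += 1
        return all_frequent

    tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(sorted(sorted(s) for s in apriori(tdb)))  # includes ['B', 'C', 'E']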
Example
The AllElectronics transaction database, D, is provided in the
Table below. There are nine transactions in this database, that is,
|D|= 9. Using Apriori algorithm find out the frequent itemsets in
D. (Suppose that the minimum support count required is 2)

55
Solution

56
Further Improvement of the Apriori Method

• Major computational challenges


– Multiple scans of transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
• Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink number of candidates
– Facilitate support counting of candidates

57
Construct FP-tree from a Transaction Database
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Sort frequent items in frequency descending order
3. Scan DB again, construct FP-tree

min_support = 2

58
Construct FP-tree from a Transaction Database

Itemset   Support count
I2        7
I1        6
I3        6
I4        2
I5        2

59
From Conditional Pattern-bases to Conditional FP-trees
Benefits of the FP-tree Structure

• Completeness
– Preserve complete information for frequent pattern mining
– Never break a long pattern of any transaction
• Compactness
– Reduce irrelevant info—infrequent items are gone
– Items in frequency descending order: the more frequently
occurring, the more likely to be shared
– Never be larger than the original database (not count node-
links and the count field)

61
The Frequent Pattern Growth Mining Method

• Idea: Frequent pattern growth


– Recursively grow frequent patterns by pattern and database
partition
• Method
– For each frequent item, construct its conditional pattern-
base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-
tree
– Until the resulting FP-tree is empty, or it contains only one
path—single path will generate all the combinations of its
sub-paths, each of which is a frequent pattern

62
Advantages of the Pattern Growth Approach

• Divide-and-conquer:
– Decompose both the mining task and DB according to the frequent
patterns obtained so far
– Lead to focused search of smaller databases
• Other factors
– No candidate generation, no candidate test
– Compressed database: FP-tree structure
– No repeated scan of entire database
– Basic ops: counting local freq items and building sub FP-tree, no
pattern search and matching
• A good open-source implementation and refinement of FPGrowth
– FPGrowth+ (Grahne and J. Zhu, FIMI'03)
63
ECLAT (Equivalence Class Transformation, a vertical Apriori): Mining by Exploring the Vertical Data Format

64
Example
The AllElectronics transaction database, D, is provided in the
Table below. There are nine transactions in this database, that is,
|D|= 9. Using ECLAT algorithm find out the frequent itemsets in
D. (Suppose that the minimum support count required is 2 and
minimum confidence is 70%). Generate the association rules
from one of the frequent patterns.

65
Classification

66
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of training data are unknown
– Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
67
Prediction Problems: Classification vs. Numeric
Prediction
• Classification
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute
and uses it in classifying new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts unknown
or missing values
• Typical applications
– Credit/loan approval:
– Medical diagnosis: if a tumor is cancerous or benign
– Fraud detection: if a transaction is fraudulent
– Web page categorization: which category it is
68
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
– If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known

69
Process (1): Model Construction

Classification
Algorithms
Training
Data

NAME RANK YEARS TENURED Classifier


Mike Assistant Prof 3 no (Model)
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes IF rank = ‘professor’
Dave Assistant Prof 6 no
OR years > 6
Anne Associate Prof 3 no
THEN tenured = ‘yes’
70
Process (2): Using the Model in Prediction

The classifier is applied to the testing data and then to unseen data.

Testing Data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
71
Decision Tree Induction: An Example

 Training data set: Buys_computer
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

 Resulting tree:
age?
  <=30   → student?       (no → no, yes → yes)
  31..40 → yes
  >40    → credit rating? (excellent → no, fair → yes)
72
Decision Tree: Example

Name = Andrew; Social Security No = 199199; Age = 22; Sex =


Male; Status = Student; Annual Income = 2000; College GPA =
3.39

73
Algorithm for Decision Tree Induction
Let S = [(X1, C1), (X2, C2),..., (Xk, Ck)] be a training sample. The construction of a
decision tree from S can be done in a divide-and-conquer fashion as follows:

• Step 1: If all the examples in S are labelled with the same class, return a leaf
labelled with that class.
• Step 2: Choose some test t (according to some criterion) that has two or more mutually exclusive outcomes O1, O2, ..., Or for the set S.
• Step 3: Partition S into disjoint subsets S1, S2, ..., Sr, where Si contains the examples having outcome Oi for the test t, for i = 1, 2, ..., r.
• Step 4: Call this tree-construction procedure recursively on each of the subsets
S1, S2,..., Sr, and let the decision trees returned by these recursive calls be T1,
T2,..., Tr .
• Step 5: Return a decision tree T with a node labelled t as the root and trees T1,
T2,..., Tr as subtrees below that node.

74
Example

75
Splitting Criterion
 Select the attribute with the highest information gain
 For a sample S, the average amount of information needed to find the class of a case in S is estimated by the function

      Info(S) = − Σ_{i=1..k} (|Si| / |S|) · log2(|Si| / |S|)   bits

  where Si ⊆ S is the set of examples of S of class i and k is the number of classes.
 How much more information would we still need (after partitioning S by a test t into subsets S1, …, Sr) to arrive at an exact classification? This amount is measured by

      Info_t(S) = Σ_{i=1..r} (|Si| / |S|) · Info(Si)

 A test t can be evaluated based on the gain factor

      gain(t) = Info(S) − Σ_{i=1..r} (|Si| / |S|) · Info(Si)

 Gain tells us how much would be gained by branching on t
76
Splitting Criterion: Example

77
Splitting Criterion: Solution
• We first compute the expected information needed to
classify a tuple in S
• Next, we need to compute the expected information
requirement for each attribute
• For the age category “youth,” there are two yes tuples and
three no tuples. For the category “middle aged,” there are
four yes tuples and zero no tuples. For the category “senior,”
there are three yes tuples and two no tuples
• Next compute the gain. Because age has the highest
information gain among the attributes, it is selected as the
splitting attribute

Gain(income)= 0.029 bits, Gain(student) = 0.151 bits, and


Gain(credit rating) = 0.048 bits

78
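A short Python check of the information-gain computation described above, using the class counts per age partition from the slide (2 yes / 3 no, 4 / 0, 3 / 2 over 14 tuples); the result, about 0.246 bits, is consistent with age being chosen as the splitting attribute:

    import math

    def info(counts):
        # expected information (entropy) for a class distribution
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    info_S = info([9, 5])                                      # ≈ 0.940 bits
    partitions = [(2, 3), (4, 0), (3, 2)]                      # youth, middle_aged, senior
    info_age = sum(sum(p) / 14 * info(p) for p in partitions)  # ≈ 0.694 bits
    print(round(info_S - info_age, 3))                         # Gain(age) ≈ 0.246 bits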
Splitting Criterion
A bias exists in the gain criterion: it favors tests that split the data into many small subsets. It can be overcome by dividing the information gain of a test by the entropy of the test results (the split information), which measures the extent of partitioning done by the test:

      split(t) = − Σ_{i=1..r} (|Si| / |S|) · log2(|Si| / |S|)

giving the gain-ratio measure

      gain ratio(t) = gain(t) / split(t)

Note that split(t) increases as tests partition the examples into a large number of small subsets.
79
Tree Pruning
• Tree pruning methods address the problem of overfitting the data. Such methods typically use statistical measures to remove the least reliable branches
• Overfitting happens when the decision tree models too much detail or noise in the training data
• Pruned trees tend to be smaller and less complex and, thus, easier to comprehend

80
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data

81
Bayesian Theorem: Basics
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), (posteriori probability), the
probability that the hypothesis holds given the observed data
sample X
• P(H) (prior probability), the initial probability
– E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood), the probability of observing the sample X,
given that the hypothesis holds
– E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
82
Bayesian Theorem
• Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes’ theorem:

      P(H|X) = P(X|H) · P(H) / P(X)

• Informally, this can be written as
      posteriori = likelihood × prior / evidence
• Predict that X belongs to class Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
• Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
83
Towards Naïve Bayesian Classifier
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem:

      P(Ci|X) = P(X|Ci) · P(Ci) / P(X)

• Since P(X) is constant for all classes, only P(X|Ci) · P(Ci) needs to be maximized
84
Naïve Bayesian Classifier: Training Dataset

Class:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’

Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
85
Naïve Bayesian Classifier: An Example
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30, income = medium, student = yes, credit_rating = fair)
  P(X|Ci):        P(X|buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
                  P(X|buys_computer = “no”)  = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
  P(X|Ci)·P(Ci):  P(X|buys_computer = “yes”) · P(buys_computer = “yes”) = 0.028
                  P(X|buys_computer = “no”)  · P(buys_computer = “no”)  = 0.007
  Therefore, X belongs to class buys_computer = “yes”
86
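The arithmetic above can be reproduced directly; a minimal Python sketch using the conditional probabilities from the slide:

    # Naive Bayes score for X = (age <= 30, income = medium, student = yes,
    # credit_rating = fair), using the estimates computed above.
    priors = {"yes": 9 / 14, "no": 5 / 14}
    likelihoods = {
        "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],  # P(age|yes), P(income|yes), P(student|yes), P(credit|yes)
        "no":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],
    }

    scores = {}
    for c in priors:
        p_x_given_c = 1.0
        for p in likelihoods[c]:
            p_x_given_c *= p
        scores[c] = p_x_given_c * priors[c]

    print({c: round(v, 3) for c, v in scores.items()})
    # {'yes': 0.028, 'no': 0.007} -> predict buys_computer = 'yes'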
Avoiding the Zero-Probability
Problem
• Naïve Bayesian prediction requires each conditional prob. be
non-zero. Otherwise, the predicted prob. will be zero
      P(X|Ci) = Π_{k=1..n} P(xk|Ci)
• Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
– Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The “corrected” prob. estimates are close to their
“uncorrected” counterparts 87
Naïve Bayesian Classifier: Comments
• Advantages
– Easy to implement
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore loss of
accuracy
– Practically, dependencies exist among variables
• E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
• Dependencies among these cannot be modeled by Naïve
Bayesian Classifier

88
Example
• Predict the class label of a tuple using naïve Bayesian classification, given
the same training data as follows. The data tuples are described by the
attributes age, income, student, and credit rating.

89
Solution
• Let C1 correspond to the class buys computer = yes and C2 correspond to
buys computer = no.
• The tuple we wish to classify is X = (age = youth, income = medium,
student = yes, credit rating = fair)
• We need to maximize P(X|Ci)P(Ci), for i = 1, 2. P(Ci), the prior probability of each class, can be computed based on the training tuples:

90
Solution

91
Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
– Rule antecedent/precondition vs. rule consequent
• Assessment of a rule: coverage and accuracy
– ncovers = # of tuples covered by R
– ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
• If more than one rule are triggered, need conflict resolution
– Size ordering: assign the highest priority to the triggering rule that has the “toughest” requirement (i.e., with the most attribute tests)
– Class-based ordering: decreasing order of prevalence or misclassification
cost per class
– Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
92
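A tiny Python sketch of the coverage and accuracy definitions above, applied to a hypothetical four-tuple training set D and the rule R from the slide:

    # R: IF age = youth AND student = yes THEN buys_computer = yes
    D = [
        {"age": "youth",  "student": "yes", "buys_computer": "yes"},
        {"age": "youth",  "student": "yes", "buys_computer": "no"},
        {"age": "youth",  "student": "no",  "buys_computer": "no"},
        {"age": "senior", "student": "yes", "buys_computer": "yes"},
    ]

    def rule_antecedent(t):
        return t["age"] == "youth" and t["student"] == "yes"

    covered = [t for t in D if rule_antecedent(t)]
    correct = [t for t in covered if t["buys_computer"] == "yes"]
    print(len(covered) / len(D))        # coverage(R) = 2/4 = 0.5
    print(len(correct) / len(covered))  # accuracy(R) = 1/2 = 0.5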
Rule Extraction from a Decision Tree

 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
 Rules are mutually exclusive and exhaustive
• Example: Rule extraction from our buys_computer decision tree
  IF age = young AND student = no THEN buys_computer = no
  IF age = young AND student = yes THEN buys_computer = yes
  IF age = mid-age THEN buys_computer = yes
  IF age = old AND credit_rating = excellent THEN buys_computer = no
  IF age = old AND credit_rating = fair THEN buys_computer = yes
93
Rule Induction: Sequential Covering
Method
• Sequential covering algorithm: Extracts rules directly from training
data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
• Steps:
– Rules are learned one at a time
– Each time a rule is learned, the tuples covered by the rules are
removed
– The process repeats on the remaining tuples until a termination condition is met, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold

94
Basic sequential covering algorithm

95
Example
Suppose our training set, D, consists of loan application data. Attributes
regarding each applicant include their age, income, education level,
residence, credit rating, and the term of the loan. The classifying attribute is
loan decision, which indicates whether a loan is accepted (considered safe) or
rejected (considered risky)

96
Bayesian Belief Networks
 Bayesian belief networks (also known as Bayesian networks or probabilistic networks): allow class conditional independencies between subsets of variables
 A (directed acyclic) graphical model of causal relationships
   Represents dependency among the variables
   Gives a specification of the joint probability distribution
   Nodes: random variables
   Links: dependency
   Has no loops/cycles
 Example graph: X and Y are the parents of Z, and Y is the parent of P; there is no direct dependency between Z and P
97
Bayesian Belief Network: An Example

Network: FamilyHistory (FH) and Smoker (S) are parents of LungCancer (LC) and Emphysema; LungCancer is a parent of PositiveXRay and Dyspnea.

CPT: Conditional Probability Table for the variable LungCancer:
        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC       0.8        0.5        0.7        0.1
~LC      0.2        0.5        0.3        0.9

The CPT shows the conditional probability for each possible combination of values of its parents.

Derivation of the probability of a particular combination of values of X from the CPT:

      P(x1, …, xn) = Π_{i=1..n} P(xi | Parents(Yi))
98
Training Bayesian Networks:
Several Scenarios
 Scenario 1: Given both the network structure and all
variables observable: compute only the CPT entries
 Scenario 2: Network structure known, some variables hidden:
gradient descent (greedy hill-climbing) method, i.e., search for
a solution along the steepest descent of a criterion function
 Weights are initialized to random probability values

 At each iteration, it moves towards what appears to be the best solution at the moment, without backtracking


 Weights are updated at each iteration & converge to local

optimum
 Scenario 3: Network structure unknown, all variables
observable: search through the model space to reconstruct
network topology
 Scenario 4: Unknown structure, all hidden variables: No good
algorithms known for this purpose

99
Example
 Burglar alarm at home
 Fairly reliable at detecting a burglary
 Responds at times to minor earthquakes

 Two neighbors, on hearing alarm, calls police


 John always calls when he hears the alarm, but sometimes
confuses the telephone ringing with the alarm and calls
then, too.
 Mary likes loud music and sometimes misses the alarm
altogether

calculate the probability that the alarm has sounded, but neither a burglary nor an
earthquake has occurred, and both John and Mary call
Belief Network Example

Burglary:   P(B) = 0.001          Earthquake:   P(E) = 0.002

Alarm:      B  E   P(A)
            T  T   0.95
            T  F   0.95
            F  T   0.29
            F  F   0.001

JohnCalls:  A   P(J)              MaryCalls:    A   P(M)
            T   0.90                            T   0.70
            F   0.05                            F   0.01
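Using the CPT values above, the probability asked for on the previous slide (the alarm sounds, there is neither a burglary nor an earthquake, and both John and Mary call) factorizes over the network; a minimal Python sketch:

    # P(J and M and A and not B and not E)
    #   = P(J|A) * P(M|A) * P(A | not B, not E) * P(not B) * P(not E)
    p_b, p_e = 0.001, 0.002
    p_a_given_not_b_not_e = 0.001
    p_j_given_a, p_m_given_a = 0.90, 0.70

    p = (p_j_given_a * p_m_given_a * p_a_given_not_b_not_e
         * (1 - p_b) * (1 - p_e))
    print(round(p, 5))  # ≈ 0.00063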
Metrics for Evaluating Classifier
Performance

 True positives (TP): These refer to the positive tuples that were correctly labeled by the classifier. Let TP be the number of true positives.
 True negatives (TN): These are the negative tuples that were correctly labeled by the classifier. Let TN be the number of true negatives.
 False positives (FP): These are the negative tuples that were incorrectly labeled as positive (e.g., tuples of class buys_computer = no for which the classifier predicted buys_computer = yes). Let FP be the number of false positives.
 False negatives (FN): These are the positive tuples that were mislabeled as negative (e.g., tuples of class buys_computer = yes for which the classifier predicted buys_computer = no). Let FN be the number of false negatives.
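From these four counts, standard evaluation measures such as accuracy, precision, and recall can be derived; the formulas below are standard additions rather than content of the slide, and the counts are hypothetical:

    # Evaluation measures from confusion-matrix counts.
    TP, TN, FP, FN = 90, 9560, 140, 210

    accuracy = (TP + TN) / (TP + TN + FP + FN)   # fraction of correctly labeled tuples
    precision = TP / (TP + FP)                   # exactness of the positive predictions
    recall = TP / (TP + FN)                      # completeness: true positive rate

    print(round(accuracy, 3), round(precision, 3), round(recall, 3))  # 0.965 0.391 0.3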
Classification by
Backpropagation
 Backpropagation: A neural network learning
algorithm
 Started by psychologists and neurobiologists to
develop and test computational analogues of
neurons
 A neural network: A set of connected input/output
units where each connection has a weight
associated with it
 During the learning phase, the network learns
by adjusting the weights so as to be able to
predict the correct class label of the input tuples
103
Neural Network as a
Classifier
 Weakness
 Long training time
 Require a number of parameters typically best determined
empirically, e.g., the network topology or “structure.”
 Poor interpretability: Difficult to interpret the symbolic
meaning behind the learned weights and of “hidden units”
in the network
 Strength
 High tolerance to noisy data
 Ability to classify untrained patterns
 Well-suited for continuous-valued inputs and outputs
 Successful on an array of real-world data, e.g., hand-
written letters
 Algorithms are inherently parallel
 Techniques have recently been developed for the extraction of rules from trained neural networks
104
A Multi-Layer Feed-Forward Neural Network

(Figure: the input vector X enters the input layer; weighted connections w_ij feed a hidden layer, whose weighted outputs feed the output layer, which emits the output vector.)

Weight update during training:
      w_j^(k+1) = w_j^(k) + λ (y_i − ŷ_i^(k)) x_ij
105
How A Multi-Layer Neural
Network Works
 The inputs to the network correspond to the attributes
measured for each training tuple
 Inputs are fed simultaneously into the units making up the
input layer
 They are then weighted and fed simultaneously to a hidden
layer
 The number of hidden layers is arbitrary, although usually
only one
 The weighted outputs of the last hidden layer are input to
units making up the output layer, which emits the
network's prediction
 The network is feed-forward: None of the weights cycles
back to an input unit or to an output unit of a previous layer
 From a statistical point of view, networks perform nonlinear regression
106
Defining a Network Topology
 Decide the network topology: Specify # of units
in the input layer, # of hidden layers (if > 1), # of
units in each hidden layer, and # of units in the
output layer
 Normalize the input values for each attribute
measured in the training tuples to [0.0—1.0]
 One input unit per domain value, each initialized to
0
 Output, if for classification and more than two
classes, one output unit per class is used
 Once a network has been trained and its accuracy
is unacceptable, repeat the training process with
a different network topology or a different set of initial weights
107
Backpropagation
 Iteratively process a set of training tuples & compare the
network's prediction with the actual known target value
 For each training tuple, the weights are modified to minimize
the mean squared error between the network's prediction
and the actual target value
 Modifications are made in the “backwards” direction: from
the output layer, through each hidden layer down to the first
hidden layer, hence “backpropagation”
 Steps
   Initialize weights to small random numbers, associated with biases
   Propagate the inputs forward (by applying the activation function)
   Backpropagate the error (by updating the weights and biases)
   Terminating condition (when the error is very small, etc.)
108
Neuron: A Hidden/Output Layer Unit

(Figure: inputs x0 … xn with weights w0 … wn feed a weighted sum Σ; a bias μk is added and an activation function f produces the output y.)

For example:
      y = sign( Σ_{i=0..n} wi xi + μk )

 An n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping
 The inputs to the unit are outputs from the previous layer. They are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with the unit. Then a nonlinear activation function is applied to it.
109
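A minimal Python sketch of the unit described above: a weighted sum of the inputs plus a bias, passed through a nonlinear activation (a sigmoid is used here in place of sign, a common choice in backpropagation networks; all numbers are hypothetical):

    import math

    def unit_output(x, w, bias):
        weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + bias
        return 1.0 / (1.0 + math.exp(-weighted_sum))  # sigmoid activation

    x = [1.0, 0.0, 1.0]      # outputs of the previous layer
    w = [0.2, -0.3, 0.4]     # connection weights into this unit
    print(round(unit_output(x, w, bias=-0.4), 3))  # ≈ 0.55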
Efficiency and Interpretability
 Efficiency of backpropagation: Each epoch (one iteration
through the training set) takes O(|D| * w), with |D| tuples and
w weights, but # of epochs can be exponential to n, the
number of inputs, in worst case
 For easier comprehension: Rule extraction by network
pruning
 Simplify the network structure by removing weighted links
that have the least effect on the trained network
 Then perform link, unit, or activation value clustering
 The set of input and activation values are studied to derive
rules describing the relationship between the input and
hidden unit layers
 Sensitivity analysis: assess the impact that a given input
variable has on a network output. The knowledge gained from this analysis can be represented in rules
110
SVM—Support Vector Machines
 A relatively new classification method for both
linear and nonlinear data
 It uses a nonlinear mapping to transform the
original training data into a higher dimension
 With the new dimension, it searches for the linear
optimal separating hyperplane (i.e., “decision
boundary”)
 With an appropriate nonlinear mapping to a
sufficiently high dimension, data from two classes
can always be separated by a hyperplane
 SVM finds this hyperplane using support vectors
(“essential” training tuples) and margins (defined
by the support vectors) 111
SVM—History and Applications
 Vapnik and colleagues (1992)—groundwork from
Vapnik & Chervonenkis’ statistical learning theory
in 1960s
 Features: training can be slow but accuracy is high
owing to their ability to model complex nonlinear
decision boundaries (margin maximization)
 Used for: classification and numeric prediction
 Applications:
 handwritten digit recognition, object
recognition, speaker identification,
benchmarking time-series prediction tests 112
SVM—General Philosophy

(Figure: two linear separators, one with a small margin and one with a large margin; the circled training tuples closest to the separator are the support vectors.)

113
SVM—Margins and Support Vectors

114
SVM—When Data Is Linearly Separable

Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training
tuples associated with the class labels yi
There are infinite lines (hyperplanes) separating the two classes but
we want to find the best one (the one that minimizes classification
error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e.,
maximum marginal hyperplane (MMH)
115
SVM—Linearly Separable
 A separating hyperplane can be written as
      W · X + b = 0
  where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
 For 2-D it can be written as
      w0 + w1 x1 + w2 x2 = 0
 The hyperplanes defining the sides of the margin:
      H1: w0 + w1 x1 + w2 x2 ≥ 1   for yi = +1, and
      H2: w0 + w1 x1 + w2 x2 ≤ −1  for yi = −1
 Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors

116
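A small Python sketch illustrating the hyperplane equations above in 2-D: points are classified by the sign of W · X + b, and the margin width 2 / ||W|| follows from the hyperplanes H1 and H2; the weight vector, bias, and points are hypothetical:

    import math

    w = [1.0, 1.0]   # hypothetical weight vector W
    b = -3.0         # hypothetical bias

    def side(x):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        return +1 if score >= 0 else -1

    print(side([3.0, 2.0]), side([0.5, 1.0]))                 # +1 -1
    print(round(2 / math.sqrt(sum(wi * wi for wi in w)), 3))  # margin width ≈ 1.414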
Why Is SVM Effective on High
Dimensional Data?
 The complexity of trained classifier is characterized by the
# of support vectors rather than the dimensionality of the
data
 The support vectors are the essential or critical training
examples —they lie closest to the decision boundary (MMH)
 If all other training examples are removed and the training is
repeated, the same separating hyperplane would be found
 The number of support vectors found can be used to
compute an (upper) bound on the expected error rate of the
SVM classifier, which is independent of the data
dimensionality
 Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
117
SVM vs. Neural Network

 SVM
   Deterministic algorithm
   Nice generalization properties
   Hard to learn – learned in batch mode using quadratic programming techniques
 Neural Network
   Nondeterministic algorithm
   Generalizes well but doesn’t have a strong mathematical foundation
   Can easily be learned in incremental fashion
118
What is Cluster Analysis?
 Cluster: A collection of data objects
   similar (or related) to one another within the same group
   dissimilar (or unrelated) to the objects in other groups
 Cluster analysis (or clustering, data segmentation, …)
   Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
 Typical applications
   As a stand-alone tool to get insight into the data distribution
   As a preprocessing step for other algorithms
119
Clustering for Data Understanding
and Applications
 Biology: taxonomy of living things: kingdom, phylum, class,
order, family, genus and species
 Information retrieval: document clustering
 Land use: Identification of areas of similar land use in an
earth observation database
 Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
 City-planning: Identifying groups of houses according to their
house type, value, and geographical location
 Earth-quake studies: Observed earth quake epicenters
should be clustered along continent faults
 Climate: understanding the earth’s climate, finding patterns in atmospheric and ocean data
 Economic science: market research
120
Clustering as a Preprocessing Tool
(Utility)
 Summarization:
 Preprocessing for regression, PCA, classification,
and association analysis
 Compression:
 Image processing: vector quantization
 Finding K-nearest Neighbors
 Localizing search to one or a small number of
clusters
 Outlier detection
 Outliers are often viewed as those “far away”
from any cluster
121
Quality: What Is Good
Clustering?
 A good clustering method will produce high quality
clusters
 high intra-class similarity: cohesive within
clusters
 low inter-class similarity: distinctive between
clusters
 The quality of a clustering method depends on
 the similarity measure used by the method
 its implementation, and
 Its ability to discover some or all of the hidden patterns
122
Measure the Quality of
Clustering
 Dissimilarity/Similarity metric
 Similarity is expressed in terms of a distance
function, typically metric: d(i, j)
 The definitions of distance functions are usually
rather different for interval-scaled, boolean,
categorical, ordinal ratio, and vector variables
 Weights should be associated with different
variables based on applications and data
semantics
 Quality of clustering:
 There is usually a separate “quality” function
that measures the “goodness” of a cluster.
 It is hard to define “similar enough” or “good
enough” 123
Considerations for Cluster Analysis
 Partitioning criteria
 Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
 Separation of clusters
 Exclusive (e.g., one customer belongs to only one region)
vs. non-exclusive (e.g., one document may belong to more
than one class)
 Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
 Clustering space
 Full space (often when low dimensional) vs. subspaces
(often in high-dimensional clustering)

124
Requirements and Challenges
 Scalability
 Clustering all the data instead of only on samples

 Ability to deal with different types of attributes


 Numerical, binary, categorical, ordinal, linked, and mixture

of these
 Constraint-based clustering
 User may give inputs on constraints

 Use domain knowledge to determine input parameters

 Interpretability and usability


 Others
 Discovery of clusters with arbitrary shape

 Ability to deal with noisy data

 Incremental clustering and insensitivity to input order

 High dimensionality

125
Major Clustering
Approaches (I)
 Partitioning approach:
 Construct various partitions and then evaluate them by

some criterion, e.g., minimizing the sum of square errors


 Typical methods: k-means, k-medoids, CLARANS

 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or

objects) using some criterion


 Typical methods: Diana, Agnes, BIRCH, CAMELEON

 Density-based approach:
 Based on connectivity and density functions

 Typical methods: DBSCAN, OPTICS, DenClue

 Grid-based approach:
 based on a multiple-level granularity structure

 Typical methods: STING, WaveCluster, CLIQUE

126
Major Clustering
Approaches (II)
 Model-based:
 A model is hypothesized for each of the clusters and tries

to find the best fit of that model to each other


 Typical methods: EM, SOM, COBWEB

 Frequent pattern-based:
 Based on the analysis of frequent patterns

 Typical methods: p-Cluster

 User-guided or constraint-based:
 Clustering by considering user-specified or application-

specific constraints
 Typical methods: COD (obstacles), constrained clustering

 Link-based clustering:
 Objects are often linked together in various ways

 Massive links can be used to cluster objects: SimRank,

LinkClus
127
Partitioning Methods

128
Partitioning Algorithms: Basic
Concept
 Partitioning method: Partitioning a database D of n objects
into a set of k clusters, such that the sum of squared
distances is minimized (where ci is the centroid or medoid of
cluster Ci):

      E = Σ_{i=1..k} Σ_{p ∈ Ci} (p − ci)²

 Given k, find a partition of k clusters that optimizes the


chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67, Lloyd’57/’82): Each cluster is
represented by the center of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the
objects in the cluster 129
The K-Means Clustering
Method
 Given k, the k-means algorithm is implemented
in four steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the
clusters of the current partitioning (the
centroid is the center, i.e., mean point, of
the cluster)
 Assign each object to the cluster with the
nearest seed point
 Go back to Step 2, stop when the
assignment does not change
130
An Example of K-Means Clustering
[Figure: with K = 2, the initial data set is arbitrarily partitioned into k groups, the cluster centroids are updated, objects are reassigned to the nearest centroid, and the loop repeats if needed]
 Partition objects into k nonempty subsets
 Repeat
 Compute centroid (i.e., mean point) for each partition
 Assign each object to the cluster of its nearest centroid
 Until no change
131
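As a rough illustration of the loop above, here is a minimal NumPy sketch of k-means; the toy data, k = 2, and the random seed are illustrative assumptions, and the sketch assumes no cluster ever becomes empty.

    import numpy as np

    def k_means(X, k, max_iter=100, seed=0):
        # Step 1: pick k objects as the initial seed points (centroids)
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Step 2: assign each object to the cluster with the nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: recompute each centroid as the mean point of its cluster
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # Stop when the centroids (and hence the assignment) no longer change
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
                  [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
    labels, centroids = k_means(X, k=2)
    print(labels)
    print(centroids)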
Comments on the K-Means
Method
 Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t
is # iterations. Normally, k, t << n.

Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
 Comment: Often terminates at a local optimum.
 Weakness
 Applicable only to objects in a continuous n-dimensional space

Using the k-modes method for categorical data

In comparison, k-medoids can be applied to a wide range of
data
 Need to specify k, the number of clusters, in advance (there are
ways to automatically determine the best k (see Hastie et al.,
2009)
 Sensitive to noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
132
Variations of the K-Means Method
 Most of the variants of k-means differ in
 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
 Handling categorical data: k-modes
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical
objects
 Using a frequency-based method to update modes of
clusters
 A mixture of categorical and numerical data: k-prototype 133
What Is the Problem of the K-Means
Method?
 The k-means algorithm is sensitive to outliers !
 Since an object with an extremely large value may
substantially distort the distribution of the data
 K-Medoids: Instead of taking the mean value of the object in
a cluster as a reference point, medoids can be used, which is
the most centrally located object in a cluster

[Figure: two scatter plots contrasting a mean-based cluster center with a medoid, the most centrally located object in the cluster]
134
PAM: A Typical K-Medoids Algorithm
[Figure: PAM on a small 2-D data set with K = 2; the two medoid configurations shown have total costs of 20 and 26]
 Arbitrarily choose k objects as the initial medoids
 Assign each remaining object to the nearest medoid
 Randomly select a non-medoid object, O_random
 Do loop: compute the total cost of swapping a medoid O with O_random,
and swap O and O_random if the quality is improved
 Until no change
135
The K-Medoid Clustering Method
 K-Medoids Clustering: Find representative objects (medoids) in
clusters
 PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw
1987)

Starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the non-medoids
if it improves the total distance of the resulting
clustering

PAM works effectively for small data sets, but does not
scale well for large data sets (due to the computational
complexity)
136
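A compact sketch of the PAM swap loop described above, assuming the full pairwise distance matrix fits in memory (which is also why PAM does not scale to large data sets); the function name, toy data, and seed are illustrative.

    import numpy as np
    from itertools import product

    def pam(X, k, seed=0):
        # Precompute all pairwise distances (suits small data sets only)
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        rng = np.random.default_rng(seed)
        medoids = list(rng.choice(len(X), size=k, replace=False))
        cost = D[:, medoids].min(axis=1).sum()   # total distance to the nearest medoid
        improved = True
        while improved:
            improved = False
            # Try swapping every medoid with every non-medoid object
            for m, o in product(range(k), range(len(X))):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[m] = o
                trial_cost = D[:, trial].min(axis=1).sum()
                if trial_cost < cost:            # keep the swap only if it improves quality
                    medoids, cost, improved = trial, trial_cost, True
        labels = D[:, medoids].argmin(axis=1)
        return medoids, labels

    X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
    print(pam(X, k=2))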
Hierarchical Methods

137
Hierarchical Clustering
 Use distance matrix as clustering criteria. This
method does not require the number of clusters k
as an input, but needs a termination condition
Step 0 Step 1 Step 2 Step 3 Step 4
agglomerative
(AGNES)
a ab
b abcde
c
cde
d
de
e
divisive
Step 4 Step 3 Step 2 Step 1 Step 0 (DIANA)
138
AGNES (Agglomerative
Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., Splus
 Use the single-link method and the dissimilarity
matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
[Figure: three scatter plots showing AGNES progressively merging nearby points into larger clusters]
139
Dendrogram: Shows How Clusters are
Merged

Decompose data objects into several levels of nested
partitioning (tree of clusters), called a dendrogram

A clustering of the data objects is obtained by cutting


the dendrogram at the desired level, then each
connected component forms a cluster

140
DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)


 Implemented in statistical analysis packages, e.g.,
Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own

[Figure: three scatter plots showing DIANA progressively splitting one cluster into smaller clusters]

141
Distance between
Clusters
 Single link: smallest distance between an element in one
cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip,
tjq)
 Complete link: largest distance between an element in
one cluster and an element in the other, i.e., dist(Ki, Kj) =
max(tip, tjq)
 Average: avg distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
 Centroid: distance between the centroids of two clusters,
i.e., dist(Ki, Kj) = dist(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
142
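The single, complete, average, and centroid criteria above map directly onto SciPy's hierarchical clustering routines; a minimal usage sketch, assuming SciPy is available and using toy points and average linkage as illustrative choices.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

    # method can be 'single', 'complete', 'average', or 'centroid', matching the
    # inter-cluster distance definitions above
    Z = linkage(X, method='average')

    # Cut the dendrogram at the level that yields 2 clusters
    labels = fcluster(Z, t=2, criterion='maxclust')
    print(labels)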
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
 Centroid: the "middle" of a cluster:
    C_m = \frac{\sum_{i=1}^{N} t_i}{N}
 Radius: square root of the average distance from any
point of the cluster to its centroid:
    R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_i - c_m)^2}{N}}
 Diameter: square root of the average mean squared
distance between all pairs of points in the cluster:
    D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_i - t_j)^2}{N(N-1)}}
143
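A small NumPy sketch that evaluates the three definitions above on a handful of 2-D points; the sample points and function name are illustrative.

    import numpy as np

    def centroid_radius_diameter(points):
        t = np.asarray(points, dtype=float)
        n = len(t)
        c = t.sum(axis=0) / n                              # centroid C_m
        r = np.sqrt(((t - c) ** 2).sum(axis=1).mean())     # radius R_m
        pair_sq = ((t[:, None, :] - t[None, :, :]) ** 2).sum(axis=2)
        d = np.sqrt(pair_sq.sum() / (n * (n - 1)))         # diameter D_m
        return c, r, d

    print(centroid_radius_diameter([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]]))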
Extensions to Hierarchical Clustering
 Major weakness of agglomerative clustering
methods
 Can never undo what was done previously
 Do not scale well: time complexity of at least
O(n2), where n is the number of total objects
 Integration of hierarchical & distance-based
clustering
 BIRCH (1996): uses a Clustering Feature (CF)-tree
and incrementally adjusts the quality of sub-clusters
144
BIRCH (Balanced Iterative Reducing
and Clustering Using Hierarchies)
 Zhang, Ramakrishnan & Livny, SIGMOD’96
 Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve
the inherent clustering structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster the
leaf nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan and
improves the quality with a few additional scans
 Weakness: handles only numeric data, and sensitive to the
order of the data record

145
Clustering Feature Vector in
BIRCH
Clustering Feature (CF): CF = (N, LS, SS)
 N: number of data points
 LS: linear sum of the N points: \sum_{i=1}^{N} X_i
 SS: square sum of the N points: \sum_{i=1}^{N} X_i^2
[Figure: five 2-D points (3,4), (2,6), (4,5), (4,7), (3,8) with CF = (5, (16,30), (54,190))]
146
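A short sketch of how a CF vector is computed and how two CFs merge by simple component-wise addition, which is what makes the CF tree incremental; the five sample points reproduce the CF = (5, (16,30), (54,190)) example above, and the helper names are illustrative.

    import numpy as np

    def clustering_feature(points):
        # CF = (N, LS, SS): count, linear sum, and component-wise square sum
        X = np.asarray(points, dtype=float)
        return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

    def merge_cf(cf1, cf2):
        # Two CFs are merged by simple addition, which keeps the tree incremental
        return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

    print(clustering_feature([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]]))
    # expected: (5, array([16., 30.]), array([ 54., 190.]))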
CF-Tree in BIRCH
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
 A nonleaf node in a tree has descendants or “children”

 The nonleaf nodes store sums of the CFs of their children

 A CF tree has two parameters


 Branching factor: max no. of children per non-leaf node

 Threshold: max diameter of sub-clusters stored at the leaf

nodes

147
The CF Tree Structure
[Figure: CF tree with branching factor B = 7 and leaf capacity L = 6; the root holds entries CF1, CF2, CF3, ..., CF6, each pointing to a child node; non-leaf nodes hold entries CF1, ..., CF5 with child pointers; leaf nodes hold entries such as CF1, CF2, ..., CF6 and are chained by prev/next pointers]
148
The Birch Algorithm
 Cluster diameter: D = \sqrt{\frac{1}{n(n-1)} \sum_{i \ne j} (x_i - x_j)^2}
 For each point in the input
 Find closest leaf entry

 Add point to leaf entry and update CF

 If entry diameter > max_diameter, then split leaf, and

possibly parents
 Algorithm is O(n)
 Concerns
 Sensitive to insertion order of data points

 Since we fix the size of leaf nodes, clusters may not be
so natural
 Clusters tend to be spherical given the radius and diameter

measures
149
CHAMELEON: Hierarchical
Clustering Using Dynamic Modeling
(1999)
 CHAMELEON: G. Karypis, E. H. Han, and V. Kumar,
1999
 Measures the similarity based on a dynamic model
 Two clusters are merged only if the
interconnectivity and closeness (proximity)
between two clusters are high relative to the
internal interconnectivity of the clusters and
closeness of items within the clusters
 Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster
objects into a large number of relatively small
sub-clusters
2. Use an agglomerative hierarchical clustering
algorithm: find the genuine clusters by repeatedly
combining these sub-clusters
150
Overall Framework of CHAMELEON
[Figure: data set -> construct a sparse k-NN graph -> partition the graph -> merge partitions -> final clusters]
 K-NN graph: p and q are connected if q is among the top k closest neighbors of p
 Relative interconnectivity: connectivity of c1 and c2 over internal connectivity
 Relative closeness: closeness of c1 and c2 over internal closeness 151
Probabilistic Hierarchical Clustering
 Algorithmic hierarchical clustering
 Nontrivial to choose a good distance measure
 Hard to handle missing attribute values
 Optimization goal not clear: heuristic, local search
 Probabilistic hierarchical clustering
 Use probabilistic models to measure distances between
clusters
 Generative model: Regard the set of data objects to be
clustered as a sample of the underlying data generation
mechanism to be analyzed
 Easy to understand, same efficiency as algorithmic
agglomerative clustering method, can handle partially
observed data
 In practice, assume the generative models adopt common
distribution functions, e.g., Gaussian or Bernoulli distributions 152
Generative Model
 Given a set of 1-D points X = {x_1, ..., x_n} for
clustering analysis & assuming they are
generated by a Gaussian distribution N(\mu, \sigma^2)
 The probability that a point x_i \in X is generated
by the model:
    P(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}
 The likelihood that X is generated by the model:
    L(N(\mu, \sigma^2) : X) = \prod_{i=1}^{n} P(x_i \mid \mu, \sigma^2)
 The task of learning the generative model: find
the parameters \mu and \sigma^2 that maximize this
likelihood

153
A Probabilistic Hierarchical Clustering
Algorithm
 For a set of objects partitioned into m clusters C_1, ..., C_m, the
quality can be measured by
    Q(\{C_1, ..., C_m\}) = \prod_{i=1}^{m} P(C_i)
where P() is the maximum likelihood
 Distance between clusters C_1 and C_2:
    dist(C_1, C_2) = -\log \frac{P(C_1 \cup C_2)}{P(C_1)P(C_2)}
 Algorithm: Progressively merge points and clusters
Input: D = {o1, ..., on}: a data set containing n objects
Output: A hierarchy of clusters
Method
Create a cluster for each object Ci = {oi}, 1 ≤ i ≤ n;
For i = 1 to n {
Find pair of clusters Ci and Cj such that
Ci,Cj = argmaxi ≠ j {log (P(Ci∪Cj )/(P(Ci)P(Cj ))};
If log (P(Ci∪Cj )/(P(Ci)P(Cj )) > 0 then merge Ci and Cj }
154
Density-Based Clustering Methods
 Clustering based on density (local cluster criterion),
such as density-connected points

Major features:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters as termination
condition
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)

 OPTICS: Ankerst, et al (SIGMOD’99).

 DENCLUE: Hinneburg & D. Keim (KDD’98)

 CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-

based)
155
Density-Based Clustering: Basic
Concepts
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
 Directly density-reachable: A point p is directly
density-reachable from a point q w.r.t. Eps,
MinPts if

p belongs to NEps(q)
 core point condition: |NEps(q)| ≥ MinPts
[Figure: p lies within the Eps-neighborhood of core point q, with MinPts = 5 and Eps = 1 cm]
156
Density-Reachable and Density-
Connected
 Density-reachable:
 A point p is density-reachable p
from a point q w.r.t. Eps,
p1
MinPts if there is a chain of q
points p1, …, pn, p1 = q, pn = p
such that pi+1 is directly
density-reachable from pi
 Density-connected p q
 A point p is density-connected
to a point q w.r.t. Eps, MinPts o
if there is a point o such that
both, p and q are density-
reachable from o w.r.t. Eps 157
DBSCAN: Density-Based Spatial
Clustering of Applications with
Noise
 Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of density-
connected points
 Discovers clusters of arbitrary shape in spatial
databases with noise
[Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5]
158
DBSCAN: The Algorithm
 Arbitrarily select a point p
 Retrieve all points density-reachable from p w.r.t.
Eps and MinPts
 If p is a core point, a cluster is formed
 If p is a border point, no points are density-
reachable from p and DBSCAN visits the next
point of the database
 Continue the process until all of the points have
been processed

159
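A minimal usage sketch with scikit-learn's DBSCAN, where eps and min_samples play the roles of Eps and MinPts above; the toy points and parameter values are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],   # a dense group
                  [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],   # another dense group
                  [4.0, 15.0]])                         # an isolated point

    # eps and min_samples correspond to Eps and MinPts in the algorithm above
    model = DBSCAN(eps=0.5, min_samples=2).fit(X)
    print(model.labels_)   # points labeled -1 are treated as noise/outliers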
Mining Time-series Data

160
Time-series Database
• A time-series database consists of sequences of values or
events obtained over repeated measurements of time
• Popular in many applications, such as stock market analysis,
economic and sales forecasting, budgetary analysis, utility
studies, inventory studies etc.
• Time-series database is also a sequence database

161
Time-series representation
A time series involving a variable Y, representing, say, the daily
closing price of a share in a stock market, can be viewed as a
function of time t, that is, Y = F(t).

• Two goals in time-series analysis: (1) modelling time series


and (2) forecasting time series
162
Trend Analysis
Trend analysis consists of the following four major components
or movements for characterizing time-series data
• Trend or long-term movements: These indicate the general
direction in which a time-series graph is moving over a long
interval of time.
• Cyclic movements or cyclic variations: These refer to the
cycles, that is, the long-term oscillations about a trend line or
curve, which may or may not be periodic.
• Seasonal movements or seasonal variations: These are
systematic or calendar related.
• Irregular or random movements: These characterize the
sporadic motion of time series due to random or chance
events, such as labor disputes, floods, or announced
personnel changes within companies. 163
Determining Trends in Time-series
• Moving average: A moving average tends to reduce the
amount of variation present in the data set. Thus the process
of replacing the time series by its moving average eliminates
unwanted fluctuations and is therefore also referred to as the
smoothing of time series
• Least squares method: where we consider the best-fitting
curve C as the least-squares curve, that is, the curve having
the minimum value of \sum_{i=1}^{n} d_i^2

where the deviation or error, di, is the difference between the


value yi of a point (xi, yi) and the corresponding value as
determined from the curve C
164
Example
• Given a sequence of nine values, 3, 7, 2, 0, 4,
5, 9, 7, 2. Compute its moving average of
order 3, and its weighted moving average of
order 3 using the weights (1, 4, 1).

165
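A short Python sketch that computes the simple and weighted moving averages asked for in the example above; the helper function is illustrative.

    def moving_average(seq, weights=None, order=3):
        if weights is None:
            weights = [1] * order          # simple moving average
        total = sum(weights)
        return [sum(w * x for w, x in zip(weights, seq[i:i + len(weights)])) / total
                for i in range(len(seq) - len(weights) + 1)]

    data = [3, 7, 2, 0, 4, 5, 9, 7, 2]
    print(moving_average(data))                     # moving average of order 3
    print(moving_average(data, weights=[1, 4, 1]))  # weighted moving average, weights (1, 4, 1)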
Least Square Method Example 1

• Consider the set of points: (1, 1), (-2,-1), and


(3, 2). Plot these points and the least-squares
regression line in the same graph.

166
Solution
There are three points, so the value of n is 3

Now, find the value of m using the formula:
m = (n∑xy - ∑x∑y)/(n∑x^2 - (∑x)^2)
m = [(3×9) - (2×2)]/[(3×14) - (2)^2]
m = (27 - 4)/(42 - 4)
m = 23/38

Now, find the value of b using the formula:
b = (∑y - m∑x)/n
b = [2 - (23/38)×2]/3
b = [2 - (23/19)]/3
b = 15/(3×19)
b = 5/19

So, the required equation of least squares is y = mx + b = (23/38)x + 5/19
167
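The same computation can be scripted; this small sketch reproduces m = 23/38 and b = 5/19 for the three points (the function name is illustrative).

    def least_squares_line(points):
        n = len(points)
        sx = sum(x for x, _ in points)
        sy = sum(y for _, y in points)
        sxy = sum(x * y for x, y in points)
        sxx = sum(x * x for x, _ in points)
        m = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope
        b = (sy - m * sx) / n                           # intercept
        return m, b

    m, b = least_squares_line([(1, 1), (-2, -1), (3, 2)])
    print(m, b)   # 23/38 ≈ 0.605, 5/19 ≈ 0.263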
Similarity Search in Time-Series
• A similarity search finds data sequences that differ only
slightly from the given query sequence
• There are two types of similarity searches: subsequence
matching and whole sequence matching
• Subsequence matching finds the sequences in S that contain
subsequences that are similar to a given query sequence x.
• Whole sequence matching finds a set of sequences in S that
are similar to each other (as a whole).

168
Similarity Search Methods
• For similarity analysis of time-series data, Euclidean distance
is typically used as a similarity measure
• The smaller the distance between two sets of time-series
data, the more similar are the two series

169
Steps in Similarity Search
A similarity search that handles gaps and differences in offsets
and amplitudes can be performed by the following steps
• Atomic matching: Normalize the data. Find all pairs of gap-
free windows of a small length that are similar.
• Window stitching: Stitch similar windows to form pairs of
large similar subsequences, allowing gaps between atomic
matches.
• Subsequence ordering: Linearly order the subsequence
matches to determine whether enough similar pieces exist.

170
Subsequence matching in time-
series data

171
Mining Data Streams

172
Mining Data Streams
• Tremendous and potentially infinite volumes of data streams are often
generated by real-time surveillance systems, communication networks,
Internet traffic, on-line transactions in the financial market or retail
industry, electric power grids, industry production processes, scientific and
engineering experiments, remote sensors, and other dynamic
environments.
• It may be impossible to store an entire data stream or to scan through it
multiple times due to its tremendous volume
• To discover knowledge or patterns from data streams, it is necessary to
develop single-scan, on-line, multilevel, multidimensional stream
processing and analysis methods

173
Methodologies for Stream Data
Processing
• Random Sampling: Rather than deal with an entire data stream, we can
think of sampling the stream at periodic intervals. A technique called
reservoir sampling can be used to select an unbiased random sample of s
elements without replacement
• Sliding Windows: Instead of sampling the data stream randomly, we can
use the sliding window model to analyze stream data. The basic idea is
that rather than running computations on all of the data seen so far, or on
some sample, we can make decisions based only on recent data.
• Histograms: A histogram partitions the data into a set of contiguous
buckets. Depending on the partitioning rule used, the width (bucket value
range) and depth (number of elements per bucket) can vary.

174
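A minimal sketch of reservoir sampling (algorithm R), as mentioned above: it keeps an unbiased sample of s elements from a stream whose length is not known in advance; the stream, sample size, and seed are illustrative.

    import random

    def reservoir_sample(stream, s, seed=None):
        rng = random.Random(seed)
        reservoir = []
        for i, item in enumerate(stream):
            if i < s:
                reservoir.append(item)       # fill the reservoir with the first s elements
            else:
                j = rng.randint(0, i)        # element i survives with probability s/(i+1)
                if j < s:
                    reservoir[j] = item
        return reservoir

    print(reservoir_sample(range(100000), s=5, seed=42))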
Methodologies for Stream Data
Processing
• Multiresolution Methods: A more sophisticated way to form multiple
resolutions is to use a clustering method to organize stream data into a
hierarchical structure of trees. For example, we can use a typical
hierarchical clustering data structure like CF-tree in BIRCH to form a
hierarchy of microclusters.
• Sketches: Some techniques require multiple passes over the data, such as
histograms and wavelets, whereas other methods, such as sketches, can
operate in a single pass. Suppose that, ideally, we would like to maintain
the full histogram over the universe of objects or elements in a data
stream, which may be very large.
• When the amount of memory available is smaller, we need to employ a
synopsis. The estimation of the frequency moments can be done by
synopses that are known as sketches. These build a small-space summary
for a distribution vector (e.g., histogram) using randomized linear
projections of the underlying data vectors. Sketches provide probabilistic
guarantees on the quality of the approximate answer 175
Methodologies for Stream Data
Processing
• Randomized Algorithms: Randomized algorithms, in the form of random
sampling and sketching, are often used to deal with massive, high-
dimensional data streams. The use of randomization often leads to simpler
and more efficient algorithms in comparison to known deterministic
algorithms.
• If a randomized algorithm always returns the right answer but the running
times vary, it is known as a Las Vegas algorithm. In contrast, a Monte Carlo
algorithm has bounds on the running time but may not return the correct
result. We mainly consider Monte Carlo algorithms. One way to think of a
randomized algorithm is simply as a probability distribution over a set of
deterministic algorithms.

176
Frequent-Pattern Mining in Data
Streams
• Frequent-pattern mining finds a set of patterns that occur frequently in a
data set, where a pattern can be a set of items (called an itemset), a
subsequence, or a substructure.
• A pattern is considered frequent if its count satisfies a minimum support.
• Existing frequent-pattern mining algorithms require the system to scan the
whole data set more than once, but this is unrealistic for infinite data
streams.
• To overcome this difficulty, there are two possible approaches. One is to
keep track of only a predefined, limited set of items. This method has very
limited usage.
• The second approach is to derive an approximate set of answers. In
practice, approximate answers are often sufficient. Here we use one such
algorithm: the Lossy Counting algorithm. It approximates the frequency of
items or itemsets within a user-specified error bound, e.

177
Example
• Approximate frequent items: A router is
interested in all items whose frequency is at
least 1% (min support) of the entire traffic
stream seen so far. It is felt that 1/10 of min
support (i.e., e = 0.1%) is an acceptable margin
of error. This means that all frequent items
with a support of at least min support will be
output, but that some items with a support of
at least (min support-e) will also be output.
178
Lossy Counting Algorithm
The Lossy Counting algorithm identifies elements in a data
stream whose frequency count exceeds a user-given threshold. The algorithm
works by dividing the data stream into buckets, filling as many buckets as
possible in main memory at one time. The frequency computed by this
algorithm is not always accurate, but it has an error threshold that can be
specified by the user. The run-time space required by the algorithm is
inversely proportional to the specified error threshold; hence, the larger the
error, the smaller the footprint.
• Step 1: Divide the incoming data stream into buckets of width w = 1/e,
where e is specified by the user as the error bound (along with the
minimum support threshold).
• Step 2: Increment the frequency count of each item according to the new
bucket values. After each bucket, decrement all counters by 1.
• Step 3: Repeat: update counters and, after each bucket, decrement all
counters by 1.
179
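A simplified sketch of the bucket-and-decrement procedure described in the steps above, for single items rather than itemsets; the example stream and error bound are illustrative assumptions.

    from collections import defaultdict

    def lossy_counting(stream, e):
        width = int(1 / e)                   # bucket width w = 1/e
        counts = defaultdict(int)
        for i, item in enumerate(stream, start=1):
            counts[item] += 1
            if i % width == 0:               # a bucket boundary has been reached
                for key in list(counts):
                    counts[key] -= 1         # decrement every counter by 1
                    if counts[key] <= 0:
                        del counts[key]      # prune counters that drop to zero
        return dict(counts)

    stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "b"] * 100
    print(lossy_counting(stream, e=0.01))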
K-Nearest Neighbor(KNN) Algorithm

• K-Nearest Neighbour is one of the simplest machine learning algorithms,
based on the supervised learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the
available cases, and puts the new case into the category that is most similar
to the available categories.
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.

180
KNN Algorithm
The K-NN working can be explained on the basis of the below algorithm:
• Step-1: Select the number K of the neighbors
• Step-2: Calculate the Euclidean distance of K number of neighbors
• Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
• Step-4: Among these k neighbors, count the number of the data points in each
category.
• Step-5: Assign the new data points to that category for which the number of the
neighbor is maximum.
• Step-6: Our model is ready.

181
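A minimal sketch of the steps above; the training points (taken from the height/weight example that follows), the made-up class labels, and k = 3 are illustrative assumptions.

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=3):
        # Step 2: Euclidean distance from the new point to every training point
        dists = np.linalg.norm(np.asarray(X_train, dtype=float) - np.asarray(x_new, dtype=float), axis=1)
        # Step 3: take the k nearest neighbors
        nearest = np.argsort(dists)[:k]
        # Steps 4-5: count the neighbors per category and pick the majority
        votes = Counter(y_train[i] for i in nearest)
        return votes.most_common(1)[0][0]

    X_train = [[170, 56], [168, 60], [179, 68], [182, 72], [188, 77], [180, 71]]
    y_train = ["medium", "medium", "large", "large", "large", "large"]
    print(knn_predict(X_train, y_train, x_new=[172, 58], k=3))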
Example

182
Example
Divide the given data in 2 clusters using k-means algorithm
Height Weight
185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
180 67
177 76 183
Class imbalance problem
• The data set distribution reflects a significant majority of the negative
class and a minority positive class
• In medical data, there may be a rare class, such as “cancer.” Suppose that
you have trained a classifier to classify medical data tuples, where the
class label attribute is “cancer” and the possible class values are “yes” and
“no.” An accuracy rate of, say, 97% may make the classifier seem quite
accurate, but what if only, say, 3% of the training tuples are actually cancer?
• The sensitivity and specificity measures can be used, respectively, for this
purpose. Sensitivity is also referred to as the true positive (recognition)
rate (i.e., the proportion of positive tuples that are correctly identified),
while specificity is the true negative rate (i.e., the proportion of negative
tuples that are correctly identified).

184
Prediction models
• “What if we would like to predict a continuous value, rather than a
categorical label?” Numeric prediction is the task of predicting continuous
(or ordered) values for given input. For example, we may wish to predict
the salary of college graduates with 10 years of work experience, or the
potential sales of a new product given its price
• Regression analysis can be used to model the relationship between one or
more independent or predictor variables and a dependent or response
variable (which is continuous-valued). In the context of data mining, the
predictor variables are the attributes of interest describing the tuple. In
general, the values of the predictor variables are known. The response
variable is what we want to predict

185
Types of regression
• Linear regression
• Non-linear regression

186
Linear regression
• Straight-line regression analysis involves a response variable, y, and a
single predictor variable, x.
• Models y as a linear function of x: y = b + wx
• where the variance of y is assumed to be constant, and b and w are
regression coefficients specifying the Y-intercept and slope of the line,
respectively. The regression coefficients, w and b, can also be thought of
as weights, so that we can equivalently write y = w0 + w1x
• These coefficients can be solved for by the method of least squares, which
estimates the best-fitting straight line as the one that minimizes the error
between the actual data and the estimate of the line

187
Example
• Table below shows a set of paired data where x is the number of years of
work experience of a college graduate and y is the corresponding salary of
the graduate. Predict that the salary of a college graduate with 10 years of
experience.

188
Non-linear regression
• Polynomial regression is often of interest when there is just one predictor
variable. It can be modelled by adding polynomial terms to the basic linear
model. By applying transformations to the variables, we can convert the
nonlinear model into a linear one that can then be solved by the method
of least squares.
• Transformation of a polynomial regression model to a linear regression
model, e.g., y = w0 + w1x + w2x^2 + w3x^3
• To convert this equation to linear form, we define new variables:
x1 = x, x2 = x^2, x3 = x^3
• The converted equation: y = w0 + w1x1 + w2x2 + w3x3

• Multiple regression problems are instead commonly solved with the use
of statistical software packages, such as SAS, SPSS, and S-Plus
189
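A small sketch of the transformation above: the cubic model becomes an ordinary linear least-squares problem once x^2 and x^3 are treated as extra input variables; the synthetic data and coefficients are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 30)
    y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.3 * x**3 + rng.normal(0, 0.2, x.size)

    # New variables x1 = x, x2 = x^2, x3 = x^3 plus a column of ones for w0:
    # the polynomial model is now an ordinary linear regression problem
    X = np.column_stack([np.ones_like(x), x, x**2, x**3])

    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # solve by least squares
    print(w)   # estimates of (w0, w1, w2, w3)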
Graph Mining

190
Graph Mining
Graphs
Model sophisticated structures and their interactions
Chemical informatics
Bioinformatics
Computer vision
Video indexing
Text retrieval
Web analysis
Social networks
Mining frequent sub-graph patterns
Characterization, discrimination, classification and cluster
analysis, building graph indices and similarity search

191
Mining Frequent Subgraphs
Graph g
Vertex Set – V(g) Edge set – E(g)
Label function maps a vertex / edge to a label
Graph g is a sub-graph of another graph g' if there exists a subgraph
isomorphism from g to g'
Support(g) or frequency(g) – number of graphs in D = {G1, G2,..Gn}
where g is a sub-graph
Frequent graph – satisfies min_sup

192
Discovery of Frequent
Substructures
Step 1: Generate frequent sub-structure candidates
Step 2: Check for frequency of each candidate
Involves sub-graph isomorphism test which is
computationally expensive
Approaches
Apriori –based approach
Pattern Growth approach

193
Apriori based Approach

Start with graph of small size –


generate candidates with
extra vertex/edge or path

AprioriGraph
• Level wise mining method
• Size of new substructures is
increased by 1
• Generated by joining two similar but
slightly different frequent sub- graphs
• Frequency is then checked

Candidate generation in graphs


is complex

194
Apriori Approach
AGM (Apriori-based Graph Mining)
Vertex based candidate generation – increases sub structure size by one
vertex at each step
Two frequent k size graphs are joined only if they have the same (k-1)
subgraph (Size – number of vertices)
New candidate has (k-1) sized component and the additional two
vertices
Two different sub-structures can be formed

195
Apriori Approach
FSG (Frequent Sub-graph mining)
Edge-based Candidate generation – increases by one-edge at a
time
Two size k patterns are merged iff they share the same subgraph
having k-1 edges (core)
New candidate – has core and the two additional edges

196
Apriori Approach
Edge disjoint path method
• Classify graphs by number of disjoint paths they have
• Two paths are edge-disjoint if they do not share any common edge
• A substructure pattern with k+1 disjoint paths is generated by joining sub-
structures with k disjoint paths
Disadvantage of Apriori Approaches
• Overhead when joining two sub-structures
• Uses BFS strategy : level-wise candidate generation
To check whether a k+1 graph is frequent – it must check all of its size-k sub graphs

May consume more memory

197
Pattern-Growth Approach
Uses BFS as well as DFS
A graph g can be extended by adding a new edge e. The newly
formed graph is denoted by g x e.
Edge e may or may not introduce a new vertex to g.
If e introduces a new vertex, the new graph is denoted by g xf e,
otherwise, g xb e, where f or b indicates that the extension is in a forward
or backward direction.
Pattern Growth Approach
For each discovered graph g performs extensions recursively until all
frequent graphs with g are found
Simple but inefficient
Same graph is discovered multiple times – duplicate graph

198
Pattern Growth

196
gSpan Algorithm
Reduces generation of duplicate graphs
Does not extend duplicate graphs
Uses Depth First Order
A graph may have several DFS-trees
Visiting order of vertices forms a linear order - Subscript
In a DFS tree – starting vertex – root; last visited vertex – right-most vertex
Path from v0 to vn – right most path

Right most path: (b), (c) – (v0, v1, v3); (d) – (v0, v1, v2, v3)

200
gSpan Algorithm
gSpan restricts the extension method
A new edge e can be added
between the right-most vertex and another vertex on the right-most path (backward
extension);
or it can introduce a new vertex and connect to a vertex on the right-most path
(forward extension)
Right-most extension, denoted by G r e

201
gSpan Algorithm
 Chooses any one DFS tree – base subscripting and
extends it
 Each subscripted graph is transformed into an edge sequence –
DFS code
 Select the subscript that generates minimum sequence
 Edge Order – maps edges in a subscripted graph into a sequence
 Sequence Order – builds an order among edge sequences

Introduce backward edges:


Given a vertex v, all of its backward edges should appear before
its forward edges (if any); if there are two backward edges, (i, j1)
appears before (i, j2) when j1 < j2

Order of forward edges: (0,1) (1,2) (1,3)


Complete sequence: (0,1) (1,2) (2,0) (1,3)

202
gSpan Algorithm
Here 0 < 1 < 2
0 – Minimum DFS Code
Corresponding subscript – Base
Subscripting

DFS Lexicographic Ordering: Edge order, First Vertex label, Edge label, Second Vertex label

203
gSpan Algorithm

 Root – Empty code


 Each node is a DFS code encoding a graph
 Each edge – rightmost extension from a (k-1) length DFS code to a
k-length DFS code
 If codes s and s’ encode the same graph – search space s’ can be safely
pruned

204
gSpan Algorithm

205
Mining Closed Frequent
Substructures
Helps to overcome the problem of pattern explosion
A frequent graph G is closed if and only if there is no proper supergraph G'
that has the same support as G.
Closegraph Algorithm
A frequent pattern G is maximal if and only if there is no frequent super-
pattern of G.
Maximal pattern set is a subset of the closed pattern set.
But cannot be used to reconstruct entire set of frequent patterns

206
Mining Alternative Substructure
Patterns
Mining unlabeled or partially labeled graphs
A new empty label ∅ is assigned to vertices and edges that do not have labels
Mining non-simple graphs
A non simple graph may have a self-loop and multiple edges growing order -
backward edges, self-loops, and forward edges
To handle multiple edges - allow sharing of the same vertices in two neighboring
edges in a DFS code
Mining directed graphs
6-tuple (i; j; d; li; l(i; j) ; lj ); d = +1 / -1
Mining disconnected graphs
A graph or pattern may be disconnected
Disconnected graph: add a virtual vertex
Disconnected graph pattern: a set of connected graphs
Mining frequent subtrees
Tree – Degenerate graph

207
Constraint based Mining of
Substructure
Element, set, or subgraph containment constraint
user requires that the mined patterns contain a particular set of
subgraphs - Succinct constraint
Geometric constraint
A geometric constraint can be that the angle between each pair of
connected edges must be within a range – Anti-monotonic constraint
Value-sum constraint
the sum of (positive) weights on the edges must be within a range [low,
high]: (sum > low) is monotonic; (sum < high) is anti-monotonic
Multiple categories of constraints may also be enforced

208
Mining Approximate Frequent
Substructures
Approximate frequent substructures allow slight structural variations
Several slightly different frequent substructures can be represented
using one approximate substructure
SUBDUE – Substructure discovery system
based on the Minimum Description Length (MDL) principle
adopts a constrained beam search
SUBDUE performs approximate matching

209
Mining Coherent and Dense Sub
structures
A frequent substructure G is a coherent sub graph if the mutual information
between G and each of its own sub graphs is above some threshold
Reduces number of patterns mined
Application: coherent substructure mining selects a small subset of features that have high
distinguishing power between protein classes.
Relational graph –each label is used only once
Frequent highly connected or dense subgraph mining
People with strong associations in OSNs
Set of genes within the same functional module

Cannot judge based on average degree or minimal degree


Must ensure connectedness
Example: Average degree: 3.25
Minimum degree 3

210
Mining Dense Substructures
Dense graphs defined in terms of Edge Connectivity
Given a graph G, an edge cut is a set of edges Ec such that E(G) - Ec is
disconnected.
A minimum cut is the smallest set in all edge cuts.
The edge connectivity of G is the size of a minimum cut.

A graph is dense if its edge connectivity is no less than a specified minimum cut
threshold
Mining Dense substructures
Pattern-growth approach called Close-Cut (Scalable)
starts with a small frequent candidate graph and extends it until it finds the largest super graph with
the same support

Pattern-reduction approach called Splat (High performance)


directly intersects relational graphs to obtain highly connected graphs
A pattern g discovered in a set is progressively intersected with subsequent components to give g’
Some edges in g may be removed
The size of candidate graphs is reduced by intersection and decomposition operations.

211
Applications – Graph Indexing
Indexing is essential for efficient search and query processing
Traditional approaches are not feasible for graphs
Indexing based on nodes / edges / sub-graphs
Path based Indexing approach
Enumerate all the paths in a database up to maxL length and index them
Index is used to identify all graphs with the paths in query
Not suitable for complex graph queries
Structural information is lost when a query graph is broken apart
Many false positives maybe returned

gIndex – considers frequent and discriminative substructures as index features


A frequent substructure is discriminative if its support cannot be approximated by the intersection of the
graph sets
Achieves good performance at less cost

212
Graph Indexing

Only (c) is an exact match, but


others are also reported due to the
presence of sub-structures

213
Substructure Similarity Search
Bioinformatics and Chem-informatics applications involve query
based search in massive complex structural data

Form a set of sub-graph queries with one


or more edge deletions and then use
exact substructure search

214
Substructure Similarity Search
Grafil (Graph Similarity Filtering)
Feature based structural filtering
Models each query graph as a set of features
Edge deletions – feature misses
Too many features – reduce performance
Multi-filter composition strategy
Feature Set - group of similar features

215
Classification and Cluster Analysis
using Graph Patterns
Graph Classification
Mine frequent graph patterns
Features that are frequent in one class but less in another – Discriminative
features – Model construction
Can adjust frequency and connectivity thresholds; SVM, NBM, etc. are used

Cluster Analysis
Cluster Similar graphs based on graph connectivity (minimal cuts)
Hierarchical clusters based on support threshold
Outliers can also be detected
Inter-related process

216
Social Network Analysis

217
What Is a Social Network
A social network is a heterogeneous and multirelational data set represented
by a graph. The graph is typically very large, with nodes corresponding to
objects and edges corresponding to links representing relationships or
interactions between objects. Both nodes and links have attributes

218
Examples
• electrical power grids
• telephone call graphs
• spread of computer viruses
• World Wide Web
• Co-authorship and citation networks of
scientists
• Small world network

219
Models of Social Networks
• Random graph
• Scale free networks

220
Characteristics of Social Networks
• Densification power law
• Shrinking diameter
• Heavy-tailed out-degree and in-degree
distributions

221
Information on the Social Network
• Heterogeneous, multi-relational data represented as a graph or network
• Nodes are objects
• May have different kinds of objects
• Objects have attributes
• Objects may have labels or classes
• Edges are links
• May have different kinds of links
• Links may have attributes
• Links may be directed, are not required to be binary
• Links represent relationships and interactions between objects - rich
content for mining

222
Link Mining: Tasks and Challenges
“How can we mine social networks?”
• Link-based object classification. In traditional classification methods,
objects are classified based on the attributes that describe them. Link-
based classification predicts the category of an object based not only on
its attributes, but also on its links, and on the attributes of linked objects
– Web page classification is a well-recognized example of link-based classification. It
predicts the category of a Web page based on word occurrence and anchor text, both of
which serve as attributes
• Object type prediction. This predicts the type of an object, based on its
attributes and its links, and on the attributes of objects linked to it
• Link type prediction. This predicts the type or purpose of a link, based on
properties of the objects involved
• Predicting link existence. Unlike link type prediction, where we know a
connection exists between two objects and we want to predict its type,
here we want to predict whether a link exists between two objects
223
Link Mining: Tasks and Challenges
• Link cardinality estimation. There are two forms of link cardinality
estimation. First, we may predict the number of links to an object. This is
useful, for instance, in predicting the authoritativeness of a Web page
based on the number of links to it (in-links)
• Object reconciliation. In object reconciliation, the task is to predict
whether two objects are, in fact, the same, based on their attributes and
links
• Group detection. Group detection is a clustering task. It predicts when a
set of objects belong to the same group or cluster, based on their
attributes as well as their link structure
• Subgraph detection. Subgraph identification finds characteristic subgraphs
within networks
• Metadata mining. Metadata are data about data. Metadata provide semi-
structured data about unstructured data, ranging from text and Web data
to multimedia databases. It is useful for data integration tasks in many
domains
224
What is New for Link Mining
Traditional machine learning and data mining approaches assume:
– A random sample of homogeneous objects from single relation
• Real world data sets:
– Multi-relational, heterogeneous and semi-structured
• Link Mining
– Research area at the intersection of research in social network and link analysis,
hypertext and web mining, graph mining, relational learning and inductive logic
programming

225
What Is a Link in Link Mining?
Link: relationship among data
• Two kinds of linked networks
– homogeneous vs. heterogeneous
• Homogeneous networks
– Single object type and single link type
– Single model social networks (e.g., friends)
– WWW: a collection of linked Web pages
• Heterogeneous networks
– Multiple object and link types
– Medical network: patients, doctors, disease, contacts, treatments
– Bibliographic network: publications, authors, venues

226
Link-Based Object Ranking (LBR)
• LBR: Exploit the link structure of a graph to order or prioritize the set of
objects within the graph
– Focused on graphs with single object type and single link type
• This is a primary focus of link analysis community
• Web information analysis
– PageRank and Hits are typical LBR approaches
• In social network analysis (SNA), LBR is a core analysis task
– Objective: rank individuals in terms of “centrality”
– Degree centrality vs. eigen vector/power centrality
– Rank objects relative to one or more relevant objects in the graph vs. ranks object over
time in dynamic graphs

227
PageRank: Capturing Page
Popularity(Brin & Page’98)
• Intuitions
– Links are like citations in literature
– A page that is cited often can be expected to be more useful in general
• PageRank is essentially “citation counting”, but improves over simple
counting
– Consider “indirect citations” (being cited by a highly cited paper counts a lot…)
– Smoothing of citations (every page is assumed to have a non-zero citation count)
• PageRank can also be interpreted as random surfing (thus capturing
popularity)

228
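A small power-iteration sketch of the random-surfer view of PageRank described above; the toy link graph and the damping factor d = 0.85 are illustrative assumptions.

    import numpy as np

    def pagerank(links, d=0.85, iters=50):
        n = len(links)
        ranks = np.full(n, 1.0 / n)
        for _ in range(iters):
            new = np.full(n, (1.0 - d) / n)          # smoothing: every page gets a base score
            for page, outs in links.items():
                if outs:
                    for target in outs:              # spread rank over out-links ("citations")
                        new[target] += d * ranks[page] / len(outs)
                else:
                    new += d * ranks[page] / n       # dangling page: jump anywhere
            ranks = new
        return ranks

    # Toy Web graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0
    print(pagerank({0: [1, 2], 1: [2], 2: [0]}))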
WEB MINING

229
230
Data Mining vs. Web Mining
• Traditional data mining
– data is structured and relational
– well-defined tables, columns, rows, keys,
and constraints.
• Web data
– Semi-structured and unstructured
– readily available data
– rich in features and patterns

231
Web Mining

• The term was coined by Oren Etzioni (1996)

• Application of data mining techniques to automatically discover and


extract information from
Web data

232
Web Mining
• Web is the single largest data source in the
world
• Due to heterogeneity and lack of structure of
web data, mining is a challenging task
• Multidisciplinary field:
– data mining, machine learning, natural language
processing, statistics, databases, information
retrieval, multimedia, etc.

233
Mining the World-Wide Web
• The WWW is huge, widely distributed, global
information service center for
– Information services: news, advertisements,
consumer information, financial management,
education, government, e-commerce, etc.
– Hyper-link information
– Access and usage information
• WWW provides rich sources for data mining

234
235
Web Mining: A more challenging task

• Searches for
– Web access patterns
– Web structures
– Regularity and dynamics of Web contents
• Problems
– The “abundance” problem
– Limited coverage of the Web: hidden Web sources,
majority of data in DBMS
– Limited query interface based on keyword-oriented search
– Limited customization to individual users

236
Web Usage Mining

• Mining Web log records to discover user access


patterns of Web pages
• Applications
– Target potential customers for electronic commerce
– Enhance the quality and delivery of Internet information
services to the end user
– Improve Web server system performance
– Identify potential prime advertisement locations
• Web logs provide rich information about Web dynamics
– Typical Web log entry includes the URL requested, the IP
address from which the request originated, and a timestamp

249
Techniques for Web usage mining
• Construct multidimensional view on the Weblog
database
– Perform multidimensional OLAP analysis to find the top N
users, top N accessed Web pages, most frequently accessed
time periods, etc.
• Perform data mining on Weblog records
– Find association patterns, sequential patterns, and trends of
Web accessing
– May need additional information,e.g., user browsing
sequences of the Web pages in the Web server buffer
• Conduct studies to
– Analyze system performance, improve system design by Web
caching, Web page prefetching, and Web page swapping 250
Mining the World-Wide Web
• Design of a Web Log Miner
– Web log is filtered to generate a relational database
– A data cube is generated form database
– OLAP is used to drill-down and roll-up in the cube
– OLAM is used for mining interesting knowledge

[Figure: Web log miner pipeline: Web log -> (1) data cleaning -> database -> (2) data cube creation -> data cube -> (3) OLAP -> sliced and diced cube -> (4) data mining -> knowledge]
251
252
253
254