Data Mining
Introduction
1
Introduction
• Motivation
• What is data mining?
• Why data mining?
• Data Mining: On what kind of data?
• Data mining functionality
2
Motivation
• The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web, computerized
society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras,
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”—Data mining—Automated analysis of
massive data sets
3
Evolution of Database Technology
• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
– Data mining, data warehousing, multimedia databases, and Web databases
• 2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems
4
What Is Data Mining?
5
Data Mining Stages
• Business understanding involves understanding the domain for which data mining has to be performed
• Data pre-processing involves cleaning the data, transforming the data, and selecting subsets of records that are of interest
• Data modelling involves building models such as decision tree, support vector machine (SVM), and neural network from the pre-processed data
6
Data Mining Models
• Decision trees: The decision tree is one of the most popular classification models. It has a
tree-like structure in which each internal node denotes a test on the value of an attribute,
each branch represents an outcome of that test, and each leaf represents a target class. A
decision tree displays the various relationships that the classification algorithm found in the
training data.
• Neural networks: Neural networks offer a mathematical model that attempts to
mimic the human brain. Knowledge is represented as a layered set of
interconnected processors called neurons. Each node has a weighted connection
with other nodes in adjacent layers. Learning in neural networks is accomplished
by network connection weight changes while a set of input instances is repeatedly
passed through the network. Once trained, an unknown instance passing through
the network is classified according to the values seen at the output layer.
• Naive Bayes classifier: This classifier offers a simple yet powerful supervised
classification technique. The model assumes all input attributes to be of equal
importance and independent of one another. The naive Bayes classifier is based on the
classical Bayes’ theorem, published in 1763, which rests on probability theory.
7
KDD Process: A Typical View from ML and Statistics
8
Steps of KDD process
• Data cleaning: to remove noise and inconsistent data
• Data integration: where multiple data sources may be combined
• Data selection: where data relevant to the analysis task are retrieved from
the database
• Data transformation: where data are transformed and consolidated into
forms appropriate for mining by performing summary or aggregation
operations
• Data mining: an essential process where intelligent methods are applied
to extract data patterns
• Pattern evaluation: to identify the truly interesting patterns representing
knowledge, based on interestingness measures
• Knowledge presentation: where visualization and knowledge
representation techniques are used to present mined knowledge to users
9
Why Data Mining?—Potential Applications
• In classification, the goal is to classify a new data record into one of the many
possible classes, which are already known. For example, in a loan database an applicant has
to be classified as a prospective (creditworthy) customer or a likely defaulter, given his or her
personal and other demographic features along with previous purchase characteristics.
• In estimation, unlike classification, we predict the attribute of a data instance—
usually a numeric value rather than a categorical class. An example can be “Estimate
the percentage of marks of a student whose previous marks are already available”.
• Market basket analysis or association rule mining discovers hidden rules, called association
rules, in a large transactional database. For example, the rule {pen, pencil → book} states
that whenever pen and pencil are purchased together, book is also purchased; so these items
can be placed together for sale or offered as complementary products to one another to
increase the overall sales of each item.
10
Why Data Mining?—Potential Applications
11
Market Analysis and Management
12
Market Analysis and Management
• Cross-market analysis
– Associations/co-relations between product sales, &
prediction based on such association
• Customer profiling
– What types of customers buy what products
• Customer requirement analysis
– Identifying the best products for different customers
– Predict what factors will attract new customers
13
Fraud Detection & Mining Unusual Patterns
14
Data Mining: On What Kinds of Data?
• Relational database
• Data warehouse
• Transactional database
• Advanced database and information repository
– Spatial and temporal data
– Time-series data
– Stream data
– Multimedia database
– Text databases & WWW
15
Data Mining Functionalities
16
Data Mining Functionalities
• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
– Maximizing intra-class similarity & minimizing interclass similarity
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of
the data
– Useful in fraud detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
17
Data Mining: Confluence of Multiple Disciplines
Database
Statistics
Systems
Machine
Learning
Data Mining Visualization
Algorithm Other
Disciplines
18
Quiz
1. _______________________ is a data mining activity that predicts a
target class for an instance.
19
Data Preprocessing
20
Data Quality: Why Preprocess the Data?
21
Forms of Data Preprocessing
22
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
23
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
24
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time
of entry
– not register history or changes of the data
• Missing data may need to be inferred
25
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same
class: smarter
– the most probable value: inference-based such as Bayesian formula or decision tree
26
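As a concrete illustration of the automatic fill-in strategies above, here is a minimal pandas sketch; the column names and values are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical customer table with missing income values
df = pd.DataFrame({
    "income": [45000, None, 52000, None, 61000],
    "class":  ["low", "low", "high", "high", "high"],
})

# Global constant
df["income_const"] = df["income"].fillna(-1)

# Attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Attribute mean for samples of the same class (smarter)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)
```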
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
27
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal
with possible outliers)
28
Data smoothing by regression
29
Data smoothing by clustering
30
Binning
• Binning: Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it.
• Sorted values are distributed into a number of “buckets,” or bins.
• Because binning methods consult the neighborhood of values, they perform local
smoothing.
• Smoothing by bin means: each value in a bin is replaced by the mean
value of the bin.
• Smoothing by bin medians can be employed, in which each bin value is
replaced by the bin median.
• Smoothing by bin boundaries: the minimum and maximum values in a
given bin are identified as the bin boundaries. Each bin value is then
replaced by the closest boundary value.
31
Example
• Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28,
34
32
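A small Python sketch of equal-frequency binning and smoothing applied to the sorted price data above; the bin size of 3 is an assumption for illustration:

```python
# Equal-frequency binning and smoothing for the sorted price data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3  # three values per bin (equal frequency)

bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

for b in bins:
    mean = sum(b) / len(b)
    lo, hi = b[0], b[-1]                      # bin boundaries (data are sorted)
    by_mean = [round(mean, 1)] * len(b)       # smoothing by bin means
    by_bounds = [lo if v - lo <= hi - v else hi for v in b]  # by bin boundaries
    print(b, "-> by means:", by_mean, "by boundaries:", by_bounds)
```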
Exercise
• The following data (in increasing order) for the
attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22,
22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35,
36, 40, 45, 46, 52, 70.
33
Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
• Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different sources are
different
– Possible reasons: different representations, different scales, e.g., metric
vs. British units
34
Handling Redundancy in Data Integration
• Χ² (chi-square) test
• For nominal data, the correlation between two attributes can be identified by this test:

Χ² = Σ (Observed − Expected)² / Expected

• The larger the Χ² value, the more likely the variables are related
36
Chi-Square Calculation: An Example
• Suppose that a group of 1500 people was surveyed. The gender
of each person was noted. Each person was polled as to
whether his or her preferred type of reading material was
fiction or non-fiction. Thus, we have two attributes, gender and
preferred_reading. The observed frequency (or count) of each
possible joint event is summarized in the contingency table.
Are gender and preferred_reading correlated?
38
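The contingency table itself is not reproduced in the text, so the sketch below uses placeholder counts purely to show how the Χ² test of independence is computed (here with scipy):

```python
# Chi-square test of independence for two nominal attributes.
# The slide's contingency table is not reproduced here, so the counts
# below are illustrative placeholders, not the original survey numbers.
from scipy.stats import chi2_contingency

#                 fiction  non-fiction
observed = [[250,   50],    # male
            [200, 1000]]    # female

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}, dof = {dof}")
# A large chi2 (tiny p-value) suggests gender and preferred_reading are correlated.
```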
Correlation Analysis (Numeric Data)
r(A,B) = Σ_{i=1..n} (ai − Ā)(bi − B̄) / ((n − 1) σA σB) = ( Σ_{i=1..n} ai·bi − n·Ā·B̄ ) / ((n − 1) σA σB)
40
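A short numpy check that the formula above matches the library's correlation coefficient, on hypothetical attribute values:

```python
# Pearson correlation coefficient r(A,B) for two numeric attributes.
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical attribute A
b = np.array([1.5, 3.1, 6.2, 7.8, 10.1])   # hypothetical attribute B

n = len(a)
r_manual = ((a - a.mean()) * (b - b.mean())).sum() / (
    (n - 1) * a.std(ddof=1) * b.std(ddof=1))
r_numpy = np.corrcoef(a, b)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))  # both give the same r
```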
Mining Frequent Patterns
41
What Is Frequent Pattern
Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.)
that occurs frequently in a data set
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
42
Example
• The information that customers who purchase computers also tend to buy
antivirus software at the same time is represented in the following
association rule: computer ⇒ antivirus_software
43
Why Is Freq. Pattern Mining Important?
44
Basic Concepts: Frequent Patterns
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

• itemset: A set of one or more items
• k-itemset X = {x1, …, xk}
• (absolute) support, or support count, of X: frequency or number of occurrences of an itemset X
• (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X’s support is no less than a minsup threshold
45
Basic Concepts: Frequent Patterns
46
Basic Concepts: Association Rules
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X → Y with minimum support and confidence
– support, s: probability that a transaction contains X ∪ Y
– confidence, c: conditional probability that a transaction having X also contains Y
• Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (many more!): Beer → Diaper (60%, 100%)
47
Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains
C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
• Solution: Mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
• Closed pattern is a lossless compression of freq. patterns
– Reducing the # of patterns and rules
48
Closed Patterns and Max-Patterns
• Exercise. DB = {<a1, …, a100>, < a1, …, a50>}
– Min_sup = 1.
• What is the set of closed itemset?
– <a1, …, a100>: 1
– < a1, …, a50>: 2
• What is the set of max-pattern?
– <a1, …, a100>: 1
49
Database
• If I = {I1, I2,..., In} is a set of binary attributes called items and D = {T1,
T2,..., Tn} is a set of transactions called the database, each transaction in D
will have a unique transaction ID and a subset of the items in I.
50
Apriori: A Candidate Generation & Test Approach
51
Implementation of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
52
– C4 = {abcd}
The Apriori Algorithm—An Example
Supmin = 2

Database TDB (1st scan)
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (2nd scan): {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (3rd scan): {B,C,E}
L3: {B,C,E}:2
53
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
54
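A compact Python sketch of the pseudo-code above (candidate generation by self-joining Lk plus pruning, then support counting), run on the TDB example from the earlier slide:

```python
# Apriori sketch: levelwise candidate generation + support counting.
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # L1: frequent 1-itemsets
    L = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]
    k = 1
    while L[-1]:
        # self-join Lk, then prune candidates having an infrequent k-subset
        candidates = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in L[-1] for s in combinations(c, k))}
        L.append({c for c in candidates if support(c) >= min_support})
        k += 1
    return [itemset for level in L for itemset in level]

# The TDB example above (minimum support count = 2)
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(sorted(sorted(s) for s in apriori(tdb, 2)))
```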
Example
The AllElectronics transaction database, D, is provided in the
Table below. There are nine transactions in this database, that is,
|D| = 9. Using the Apriori algorithm, find the frequent itemsets in
D. (Suppose that the minimum support count required is 2)
55
Solution
56
Further Improvement of the Apriori Method
57
Construct FP-tree from a Transaction Database
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Sort frequent items in frequency descending order
3. Scan DB again, construct FP-tree
min_support = 2
58
Construct FP-tree from a Transaction Database
59
From Conditional Pattern-bases to Conditional FP-trees
Benefits of the FP-tree Structure
• Completeness
– Preserve complete information for frequent pattern mining
– Never break a long pattern of any transaction
• Compactness
– Reduce irrelevant info—infrequent items are gone
– Items in frequency descending order: the more frequently
occurring, the more likely to be shared
– Never larger than the original database (not counting node-links and the count field)
61
The Frequent Pattern Growth Mining Method
62
Advantages of the Pattern Growth Approach
• Divide-and-conquer:
– Decompose both the mining task and DB according to the frequent
patterns obtained so far
– Lead to focused search of smaller databases
• Other factors
– No candidate generation, no candidate test
– Compressed database: FP-tree structure
– No repeated scan of entire database
– Basic ops: counting local freq items and building sub FP-tree, no
pattern search and matching
• A good open-source implementation and refinement of FPGrowth
– FPGrowth+ (Grahne and J. Zhu, FIMI'03)
63
ECLAT (Equivalence Class Transformation): Mining by Exploring the Vertical Data Format
64
Example
The AllElectronics transaction database, D, is provided in the
Table below. There are nine transactions in this database, that is,
|D| = 9. Using the ECLAT algorithm, find the frequent itemsets in
D (suppose that the minimum support count required is 2 and the
minimum confidence is 70%). Generate the association rules
from one of the frequent patterns.
65
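A minimal ECLAT sketch in Python: the database is first converted into the vertical format (item → TID-set), and frequent itemsets are grown by intersecting TID-sets. Since the AllElectronics table is not reproduced in the text, the transactions below follow the commonly used nine-transaction layout and should be treated as illustrative:

```python
# ECLAT sketch: mine frequent itemsets by intersecting TID-sets.
def eclat(prefix, items, min_support, results):
    # items: list of (item, tidset) pairs that are already frequent
    while items:
        item, tids = items.pop()
        results[tuple(sorted(prefix + [item]))] = len(tids)
        # extend the prefix: intersect TID-sets with the remaining items
        suffix = [(other, tids & other_tids)
                  for other, other_tids in items
                  if len(tids & other_tids) >= min_support]
        eclat(prefix + [item], suffix, min_support, results)

transactions = {
    "T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"}, "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"}, "T500": {"I1", "I3"}, "T600": {"I2", "I3"},
    "T700": {"I1", "I3"}, "T800": {"I1", "I2", "I3", "I5"}, "T900": {"I1", "I2", "I3"},
}
# Build the vertical data format: item -> set of transaction IDs
vertical = {}
for tid, itemset in transactions.items():
    for i in itemset:
        vertical.setdefault(i, set()).add(tid)

min_sup = 2
frequent = {}
start = [(i, t) for i, t in vertical.items() if len(t) >= min_sup]
eclat([], start, min_sup, frequent)
print(frequent)   # itemset -> support count
```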
Classification
66
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of training data are unknown
– Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
67
Prediction Problems: Classification vs. Numeric
Prediction
• Classification
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute
and uses it in classifying new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts unknown
or missing values
• Typical applications
– Credit/loan approval:
– Medical diagnosis: if a tumor is cancerous or benign
– Fraud detection: if a transaction is fraudulent
– Web page categorization: which category it is
68
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
– If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known
69
Process (1): Model Construction
[Figure: a classification algorithm learns a classifier from the training data; the classifier is then applied to testing data and, if its accuracy is acceptable, to unseen data, e.g., (Jeff, Professor, 4) → Tenured?]

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
71
Decision Tree Induction: An Example
Training data set: buys_computer

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no

Resulting tree: the root tests age (branches <=30, 31…40, >40); the 31…40 branch leads to a yes leaf, while the <=30 and >40 branches are split further before reaching their no/yes leaves.
72
Decision Tree: Example
73
Algorithm for Decision Tree Induction
Let S = [(X1, C1), (X2, C2),..., (Xk, Ck)] be a training sample. The construction of a
decision tree from S can be done in a divide-and-conquer fashion as follows:
• Step 1: If all the examples in S are labelled with the same class, return a leaf
labelled with that class.
• Step 2: Choose some test t (according to some criterion) that has two or more
mutually exclusive outcomes O1, O2, ..., Or.
• Step 3: Partition S into disjoint subsets S1, S2, ..., Sr, where Si contains the examples
having outcome Oi for the test t, for i = 1, 2, ..., r.
• Step 4: Call this tree-construction procedure recursively on each of the subsets
S1, S2,..., Sr, and let the decision trees returned by these recursive calls be T1,
T2,..., Tr .
• Step 5: Return a decision tree T with a node labelled t as the root and trees T1,
T2,..., Tr as subtrees below that node.
74
Example
75
Splitting Criterion
Select the attribute with the highest information gain
For a sample S the average amount of information needed to find the class
of a case in S is estimated by the function
Info(S) = − Σ_{i=1..k} (|Si| / |S|) · log2(|Si| / |S|)  bits

where Si ⊆ S is the set of examples of S of class i and k is the number of classes.

How much more information would we still need (after partitioning S on a test with outcomes 1, …, r) to arrive at an exact classification? This amount is measured by

Info_t(S) = Σ_{i=1..r} (|Si| / |S|) · Info(Si)
77
Splitting Criterion: Solution
• We first compute the expected information needed to
classify a tuple in S
• Next, we need to compute the expected information
requirement for each attribute
• For the age category “youth,” there are two yes tuples and
three no tuples. For the category “middle aged,” there are
four yes tuples and zero no tuples. For the category “senior,”
there are three yes tuples and two no tuples
• Next compute the gain. Because age has the highest
information gain among the attributes, it is selected as the
splitting attribute
78
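The calculation outlined above can be checked numerically; the partition counts (2/3, 4/0, 3/2) come from the slide text and the 9-yes/5-no totals from the buys_computer table shown earlier:

```python
# Worked information-gain calculation for the buys_computer data.
from math import log2

def info(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

info_S = info([9, 5])                                   # ≈ 0.940 bits
# age partitions: youth (2 yes, 3 no), middle_aged (4 yes, 0 no), senior (3 yes, 2 no)
partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum(sum(p) / 14 * info(p) for p in partitions)   # ≈ 0.694 bits
gain_age = info_S - info_age                            # ≈ 0.246 bits
print(round(info_S, 3), round(info_age, 3), round(gain_age, 3))
```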
Splitting Criterion
A bias (towards tests with many outcomes) exists in the gain criterion. It can be overcome by
dividing the information gain of a test by the entropy of the test outcomes (the split
information), which measures the extent of partitioning done by the test; the resulting
measure is the gain ratio.
80
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
81
Bayesian Theorem: Basics
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), (posteriori probability), the
probability that the hypothesis holds given the observed data
sample X
• P(H) (prior probability), the initial probability
– E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood), the probability of observing the sample X,
given that the hypothesis holds
– E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
82
Bayesian Theorem
• Given training data X, posteriori probability of a hypothesis H,
P(H|X), follows the Bayes theorem
88
Example
• Predict the class label of a tuple using naïve Bayesian classification, given
the same training data as follows. The data tuples are described by the
attributes age, income, student, and credit rating.
89
Solution
• Let C1 correspond to the class buys computer = yes and C2 correspond to
buys computer = no.
• The tuple we wish to classify is X = (age = youth, income = medium,
student = yes, credit rating = fair)
• We need to maximize P(X|Ci)P(Ci), for i = 1, 2. P(Ci), the prior probability
of each class, can be computed based on the training tuples:
90
Solution
91
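A worked numeric version of this computation, with the conditional probabilities read off the buys_computer training table shown earlier:

```python
# Naive Bayes for X = (age=youth, income=medium, student=yes, credit_rating=fair)
p_yes, p_no = 9 / 14, 5 / 14

# Class-conditional probabilities read off the training table
p_x_given_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)   # age, income, student, credit
p_x_given_no  = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)

score_yes = p_x_given_yes * p_yes    # ≈ 0.028
score_no = p_x_given_no * p_no       # ≈ 0.007
print("predict buys_computer =", "yes" if score_yes > score_no else "no")
```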
Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
– Rule antecedent/precondition vs. rule consequent
• Assessment of a rule: coverage and accuracy
– ncovers = # of tuples covered by R
– ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
• If more than one rule is triggered, we need conflict resolution
– Size ordering: assign the highest priority to the triggering rule that has
the “toughest” requirement (i.e., with the most attribute tests)
– Class-based ordering: decreasing order of prevalence or misclassification
cost per class
– Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts
92
Rule Extraction from a Decision Tree
94
Basic sequential covering algorithm
95
Example
Suppose our training set, D, consists of loan application data. Attributes
regarding each applicant include their age, income, education level,
residence, credit rating, and the term of the loan. The classifying attribute is
loan decision, which indicates whether a loan is accepted (considered safe) or
rejected (considered risky)
96
Bayesian Belief Networks
Bayesian belief networks (also known as
Bayesian networks, probabilistic networks):
allow class conditional independencies between
subsets of variables
A (directed acyclic) graphical model of causal relationships
Represents dependency among the variables
Gives a specification of the joint probability distribution
Nodes: random variables
Links: dependency
[Figure: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P; the graph has no loops/cycles]
97
Bayesian Belief Network: An
Example
[Figure: a network in which Family History (FH) and Smoker (S) are parents of LungCancer; the CPT for LungCancer lists P(LungCancer | FH, S) for the parent combinations (FH, S), (FH, ~S), (~FH, S), (~FH, ~S).]

Training Bayesian belief networks: possible scenarios
Scenario 3: Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
Scenario 4: Unknown structure, all hidden variables: no good algorithms known for this purpose
99
Example
Burglar alarm at home
Fairly reliable at detecting a burglary
Responds at times to minor earthquakes
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both John and Mary call.

Belief Network Example
Burglary:  P(B) = 0.001          Earthquake:  P(E) = 0.002

Alarm CPT:     B  E   P(A)
               T  T   0.95
               T  F   0.95
               F  T   0.29
               F  F   0.001

JohnCalls CPT:  A  P(J)          MaryCalls CPT:  A  P(M)
                T  0.90                          T  0.70
                F  0.05                          F  0.01
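Using the CPT values above, the requested probability factorizes along the network, as in this short calculation:

```python
# P(J, M, A, ¬B, ¬E) = P(J|A) · P(M|A) · P(A|¬B,¬E) · P(¬B) · P(¬E)
p_b, p_e = 0.001, 0.002
p_a_given_not_b_not_e = 0.001
p_j_given_a, p_m_given_a = 0.90, 0.70

p = (p_j_given_a * p_m_given_a * p_a_given_not_b_not_e
     * (1 - p_b) * (1 - p_e))
print(round(p, 8))   # ≈ 0.00063
```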
Metrics for Evaluating Classifier
Performance
True positives (TP): These refer to the positive tuples that were correctly
labeled by the classifier. Let TP be the number of true positives.
True negatives (TN): These are the negative tuples that were correctly
labeled by the classifier. Let TN be the number of true negatives.
False positives (FP): These are the negative tuples that were incorrectly
labeled as positive (e.g., tuples of class buys_computer = no for which the
classifier predicted buys_computer = yes). Let FP be the number of false
positives.
False negatives (FN): These are the positive tuples that were mislabeled as
negative (e.g., tuples of class buys_computer = yes for which the classifier
predicted buys_computer = no). Let FN be the number of false negatives.
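From these four counts one can derive the usual evaluation measures; the counts below are hypothetical, just to show the arithmetic:

```python
# Metrics derived from confusion-matrix counts (hypothetical values).
TP, TN, FP, FN = 90, 9560, 140, 210

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
sensitivity = TP / (TP + FN)          # recall / true positive rate
specificity = TN / (TN + FP)          # true negative rate
print(accuracy, precision, sensitivity, specificity)
```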
Classification by
Backpropagation
Backpropagation: A neural network learning
algorithm
Started by psychologists and neurobiologists to
develop and test computational analogues of
neurons
A neural network: A set of connected input/output
units where each connection has a weight
associated with it
During the learning phase, the network learns
by adjusting the weights so as to be able to
predict the correct class label of the input tuples
103
Neural Network as a
Classifier
Weakness
Long training time
Require a number of parameters typically best determined
empirically, e.g., the network topology or “structure.”
Poor interpretability: Difficult to interpret the symbolic
meaning behind the learned weights and of “hidden units”
in the network
Strength
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and outputs
Successful on an array of real-world data, e.g., hand-
written letters
Algorithms are inherently parallel
Techniques have recently been developed for the extraction of rules from trained neural networks
104
A Multi-Layer Feed-Forward
Neural Network
[Figure: the input vector X feeds the input layer; weighted connections wij feed a hidden layer, whose weighted outputs feed the output layer, which emits the output vector.]

Weight update rule: wj(k+1) = wj(k) + λ (yi − ŷi(k)) xij
105
How A Multi-Layer Neural
Network Works
The inputs to the network correspond to the attributes
measured for each training tuple
Inputs are fed simultaneously into the units making up the
input layer
They are then weighted and fed simultaneously to a hidden
layer
The number of hidden layers is arbitrary, although usually
only one
The weighted outputs of the last hidden layer are input to
units making up the output layer, which emits the
network's prediction
The network is feed-forward: None of the weights cycles
back to an input unit or to an output unit of a previous layer
From a statistical point of view, networks perform nonlinear regression
106
Defining a Network Topology
Decide the network topology: Specify # of units
in the input layer, # of hidden layers (if > 1), # of
units in each hidden layer, and # of units in the
output layer
Normalize the input values for each attribute
measured in the training tuples to [0.0—1.0]
One input unit per domain value, each initialized to
0
Output: for classification with more than two classes, one output unit per class is used
If a network has been trained and its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
107
Backpropagation
Iteratively process a set of training tuples & compare the
network's prediction with the actual known target value
For each training tuple, the weights are modified to minimize
the mean squared error between the network's prediction
and the actual target value
Modifications are made in the “backwards” direction: from
the output layer, through each hidden layer down to the first
hidden layer, hence “backpropagation”
Steps
Initialize weights to small random numbers, associated with
biases
Propagate the inputs forward (by applying an activation function)
Backpropagate the error (by updating the weights and biases)
Repeat until a terminating condition is met (e.g., the error falls below a threshold)
108
Neuron: A Hidden/Output Layer Unit

[Figure: inputs x0, x1, …, xn with weights w0, w1, …, wn are combined into a weighted sum, a bias μk is added, and an activation function f produces the output y.]

For example: y = sign( Σ_{i=0..n} wi xi + μk )
113
SVM—Margins and Support Vectors
Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training
tuples associated with the class labels yi
There are infinite lines (hyperplanes) separating the two classes but
we want to find the best one (the one that minimizes classification
error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e.,
maximum marginal hyperplane (MMH)
115
SVM—Linearly Separable
A separating hyperplane can be written as
W●X+b=0
where W={w1, w2, …, wn} is a weight vector and b a scalar
(bias)
For 2-D it can be written as
w 0 + w 1 x1 + w 2 x2 = 0
The hyperplane defining the sides of the margin:
H 1: w 0 + w 1 x1 + w 2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
116
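A small scikit-learn sketch of a linearly separable case: the fitted linear-kernel SVC exposes W, b, and the support vectors described above (the data points are hypothetical):

```python
# Linear SVM on a tiny separable data set.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ≈ hard margin
print("W =", clf.coef_[0], "b =", clf.intercept_[0])
print("support vectors:", clf.support_vectors_)
```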
Why Is SVM Effective on High
Dimensional Data?
The complexity of trained classifier is characterized by the
# of support vectors rather than the dimensionality of the
data
The support vectors are the essential or critical training
examples —they lie closest to the decision boundary (MMH)
If all other training examples are removed and the training is
repeated, the same separating hyperplane would be found
The number of support vectors found can be used to
compute an (upper) bound on the expected error rate of the
SVM classifier, which is independent of the data
dimensionality
Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
117
SVM vs. Neural Network

What Is Cluster Analysis?
Cluster: a collection of data objects that are similar to one another within the same group and dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …): finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
124
Requirements and Challenges
Scalability
Clustering all the data instead of only samples of the data
Constraint-based clustering
User may give inputs on constraints
High dimensionality
125
Major Clustering
Approaches (I)
Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion
Density-based approach:
Based on connectivity and density functions
Grid-based approach:
based on a multiple-level granularity structure
126
Major Clustering
Approaches (II)
Model-based:
A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model
Frequent pattern-based:
Based on the analysis of frequent patterns
User-guided or constraint-based:
Clustering by considering user-specified or application-
specific constraints
Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
Objects are often linked together in various ways
LinkClus
127
Partitioning Methods
128
Partitioning Algorithms: Basic Concept
Partitioning method: partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci):

E = Σ_{i=1..k} Σ_{p∈Ci} (p − ci)²

[Figure: with K = 2, the objects are arbitrarily partitioned into k groups, cluster centroids are updated, and objects are reassigned; the loop continues as long as assignments change.]

The k-means algorithm:
Partition objects into k nonempty subsets
Repeat
    Compute the centroid (i.e., mean point) of each partition
    Assign each object to the cluster of its nearest centroid
Until no change
131
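A minimal from-scratch sketch of the k-means loop above (random initial centroids, assign, recompute, stop when the centroids no longer move); the toy data are hypothetical:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assignment step: index of the nearest centroid for every point
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # update step: recompute each centroid (keep old one if a cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
labels, centers = k_means(X, k=2)
print(labels, centers)
```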
Comments on the K-Means
Method
Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
Comment: Often terminates at a local optimal.
Weakness
Applicable only to objects in a continuous n-dimensional space
Using the k-modes method for categorical data
In comparison, k-medoids can be applied to a wide range of
data
Need to specify k, the number of clusters, in advance (there are
ways to automatically determine the best k (see Hastie et al.,
2009)
Sensitive to noisy data and outliers
Not suitable to discover clusters with non-convex shapes
132
Variations of the K-Means Method
Most of the variants of the k-means which differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical
objects
Using a frequency-based method to update modes of
clusters
A mixture of categorical and numerical data: k-prototype 133
What Is the Problem of the K-Means
Method?
The k-means algorithm is sensitive to outliers !
Since an object with an extremely large value may
substantially distort the distribution of the data
K-Medoids: Instead of taking the mean value of the object in
a cluster as a reference point, medoids can be used, which is
the most centrally located object in a cluster
134
PAM: A Typical K-Medoids Algorithm
[Figure: K = 2. Arbitrarily choose k objects as initial medoids and assign each remaining object to its nearest medoid (total cost = 20). Randomly select a non-medoid object, Orandom, compute the total cost of swapping a medoid with Orandom (here 26), and perform the swap only if the quality is improved. Repeat the loop until no change.]
135
The K-Medoid Clustering Method
K-Medoids Clustering: Find representative objects (medoids) in
clusters
PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw
1987)
Starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the non-medoids
if it improves the total distance of the resulting
clustering
PAM works effectively for small data sets, but does not
scale well for large data sets (due to the computational
complexity)
136
Hierarchical Methods
137
Hierarchical Clustering
Use a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.

[Figure: objects a, b, c, d, e. Agglomerative clustering (AGNES) runs from Step 0 to Step 4: a and b merge into ab, d and e into de, then c joins to form cde, and finally ab and cde merge into abcde. Divisive clustering (DIANA) performs the same steps in reverse, from Step 4 back to Step 0.]
138
AGNES (Agglomerative
Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical packages, e.g., Splus
Use the single-link method and the dissimilarity
matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
139
Dendrogram: Shows How Clusters are
Merged
140
DIANA (Divisive Analysis)
141
Distance between Clusters
Single link: smallest distance between an element in one
cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip,
tjq)
Complete link: largest distance between an element in
one cluster and an element in the other, i.e., dist(Ki, Kj) =
max(tip, tjq)
Average: avg distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
Centroid: distance between the centroids of two clusters,
i.e., dist(Ki, Kj) = dist(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
142
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
Centroid: the “middle” of a cluster:  Cm = ( Σ_{i=1..N} tip ) / N
Radius: square root of the average distance from any point of the cluster to its centroid:  Rm = sqrt( Σ_{i=1..N} (tip − cm)² / N )
Diameter: square root of the average mean squared distance between all pairs of points in the cluster:  Dm = sqrt( Σ_{i=1..N} Σ_{j=1..N} (tip − tjq)² / (N(N−1)) )
143
Extensions to Hierarchical Clustering
Major weakness of agglomerative clustering
methods
Can never undo what was done previously
Do not scale well: time complexity of at least O(n²), where n is the number of total objects
Integration of hierarchical & distance-based
clustering
BIRCH (1996): uses Clustering feature (CF)-tree
and incrementally adjusts the quality of sub-clusters
144
BIRCH (Balanced Iterative Reducing
and Clustering Using Hierarchies)
Zhang, Ramakrishnan & Livny, SIGMOD’96
Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve
the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the
leaf nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan and
improves the quality with a few additional scans
Weakness: handles only numeric data, and sensitive to the
order of the data record
145
Clustering Feature Vector in
BIRCH
Clustering Feature (CF): CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N points:  Σ_{i=1..N} Xi
SS: square sum of the N points:  Σ_{i=1..N} Xi²

Example: for the five 2-D points (3,4), (2,6), (4,5), (4,7), (3,8):
CF = (5, (16, 30), (54, 190))
146
CF-Tree in BIRCH
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
A nonleaf node in a tree has descendants or “children”
nodes
147
The CF Tree Structure
[Figure: the root and each non-leaf node hold entries CF1, CF2, CF3, …, CF5, each with a pointer child1, child2, child3, …, child5 to a child node.]
148
The Birch Algorithm
Cluster diameter:  D = sqrt( (1 / (n(n−1))) Σ_i Σ_j (xi − xj)² )

For each point in the input:
    Find the closest leaf entry
    Add the point to that leaf entry and update the CF; if the entry diameter exceeds the threshold, split the leaf (and possibly its parents)
The algorithm is O(n)
Concerns:
    Sensitive to the insertion order of data points
    Since the size of leaf nodes is fixed, the clusters may not be so natural
    Clusters tend to be spherical given the radius and diameter measures
149
CHAMELEON: Hierarchical
Clustering Using Dynamic Modeling
(1999)
CHAMELEON: G. Karypis, E. H. Han, and V. Kumar,
1999
Measures the similarity based on a dynamic model
Two clusters are merged only if the
interconnectivity and closeness (proximity)
between two clusters are high relative to the
internal interconnectivity of the clusters and
closeness of items within the clusters
Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters
150
Overall Framework of CHAMELEON
[Figure: from the data set, construct a sparse k-NN graph, partition the graph into many small sub-clusters, then merge the partitions to obtain the final clusters.]
K-NN graph: p and q are connected if q is among the top k closest neighbors of p
Relative interconnectivity: connectivity of c1 and c2 over the internal connectivity of each cluster
Relative closeness: closeness of c1 and c2 over the internal closeness of each cluster
151
Probabilistic Hierarchical Clustering
Algorithmic hierarchical clustering
Nontrivial to choose a good distance measure
Hard to handle missing attribute values
Optimization goal not clear: heuristic, local search
Probabilistic hierarchical clustering
Use probabilistic models to measure distances between
clusters
Generative model: Regard the set of data objects to be
clustered as a sample of the underlying data generation
mechanism to be analyzed
Easy to understand, same efficiency as algorithmic
agglomerative clustering method, can handle partially
observed data
In practice, assume the generative models adopt common distribution functions, e.g., the Gaussian or Bernoulli distribution
152
Generative Model
Given a set of 1-D points X = {x1, …, xn} for
clustering analysis & assuming they are
generated by a Gaussian distribution:
153
A Probabilistic Hierarchical Clustering
Algorithm
For a set of objects partitioned into m clusters C1, . . . , Cm, the quality can be measured by the product of the cluster probabilities, Q({C1, . . . , Cm}) = Π_{i=1..m} P(Ci)
155
Density-Based Clustering: Basic
Concepts
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-
neighbourhood of that point
NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
    p belongs to NEps(q), and
    q satisfies the core point condition: |NEps(q)| ≥ MinPts
[Figure: example with MinPts = 5, Eps = 1 cm]
156
Density-Reachable and Density-
Connected
Density-reachable:
    A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
Density-connected:
    A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
157
DBSCAN: Density-Based Spatial
Clustering of Applications with
Noise
Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of density-
connected points
Discovers clusters of arbitrary shape in spatial
databases with noise
[Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5]
158
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t.
Eps and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-
reachable from p and DBSCAN visits the next
point of the database
Continue the process until all of the points have
been processed
159
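A short scikit-learn sketch of DBSCAN on hypothetical 2-D points: eps plays the role of Eps, min_samples the role of MinPts, and the label −1 marks noise (outliers):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.3],
              [8, 8], [8.1, 7.9], [7.9, 8.2],
              [25, 25]])                       # the last point is an outlier
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # two clusters plus one noise point labelled -1
```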
Mining Time-series Data
160
Time-series Database
• A time-series database consists of sequences of values or
events obtained over repeated measurements of time
• Popular in many applications, such as stock market analysis,
economic and sales forecasting, budgetary analysis, utility
studies, inventory studies etc.
• Time-series database is also a sequence database
161
Time-series representation
A time series involving a variable Y, representing, say, the daily closing price of a share in a stock market, can be viewed as a function of time t: Y = F(t).
165
Least Square Method Example 1
166
Solution
There are three points, so the value of n is 3.

First, find the slope m:
m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
m = [(3×9) − (2×2)] / [(3×14) − (2)²]
m = (27 − 4) / (42 − 4)
m = 23/38

Then, find the intercept b:
b = (∑y − m∑x) / n
b = [2 − (23/38)×2] / 3
b = [2 − (23/19)] / 3
b = 15/(3×19)
b = 5/19

So, the required equation of least squares is y = mx + b = (23/38)x + 5/19
167
Similarity Search in Time-Series
• A similarity search finds data sequences that differ only
slightly from the given query sequence
• There are two types of similarity searches: subsequence
matching and whole sequence matching
• Subsequence matching finds the sequences in S that contain
subsequences that are similar to a given query sequence x.
• Whole sequence matching finds a set of sequences in S that
are similar to each other (as a whole).
168
Similarity Search Methods
• For similarity analysis of time-series data, Euclidean distance
is typically used as a similarity measure
• The smaller the distance between two sets of time-series
data, the more similar are the two series
169
Steps in Similarity Search
A similarity search that handles gaps and differences in offsets
and amplitudes can be performed by the following steps
• Atomic matching: Normalize the data. Find all pairs of gap-
free windows of a small length that are similar.
• Window stitching: Stitch similar windows to form pairs of
large similar subsequences, allowing gaps between atomic
matches.
• Subsequence ordering: Linearly order the subsequence
matches to determine whether enough similar pieces exist.
170
Subsequence matching in time-
series data
171
Mining Data Streams
172
Mining Data Streams
• Tremendous and potentially infinite volumes of data streams are often
generated by real-time surveillance systems, communication networks,
Internet traffic, on-line transactions in the financial market or retail
industry, electric power grids, industry production processes, scientific and
engineering experiments, remote sensors, and other dynamic
environments.
• It may be impossible to store an entire data stream or to scan through it
multiple times due to its tremendous volume
• To discover knowledge or patterns from data streams, it is necessary to
develop single-scan, on-line, multilevel, multidimensional stream
processing and analysis methods
173
Methodologies for Stream Data
Processing
• Random Sampling: Rather than deal with an entire data stream, we can
think of sampling the stream at periodic intervals. A technique called
reservoir sampling can be used to select an unbiased random sample of s
elements without replacement
• Sliding Windows: Instead of sampling the data stream randomly, we can
use the sliding window model to analyze stream data. The basic idea is
that rather than running computations on all of the data seen so far, or on
some sample, we can make decisions based only on recent data.
• Histograms: A histogram partitions the data into a set of contiguous
buckets. Depending on the partitioning rule used, the width (bucket value
range) and depth (number of elements per bucket) can vary.
174
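A minimal reservoir-sampling sketch, illustrating the random sampling method above: it keeps an unbiased sample of s elements from a stream of unknown length in a single pass with O(s) memory:

```python
import random

def reservoir_sample(stream, s, seed=42):
    random.seed(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < s:
            reservoir.append(item)
        else:
            # replace a reservoir slot with probability s / (i + 1)
            j = random.randint(0, i)
            if j < s:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), s=5))
```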
Methodologies for Stream Data
Processing
• Multiresolution Methods: A more sophisticated way to form multiple
resolutions is to use a clustering method to organize stream data into a
hierarchical structure of trees. For example, we can use a typical
hierarchical clustering data structure like CF-tree in BIRCH to form a
hierarchy of microclusters.
• Sketches: Some techniques require multiple passes over the data, such as
histograms and wavelets, whereas other methods, such as sketches, can
operate in a single pass. Suppose that, ideally, we would like to maintain
the full histogram over the universe of objects or elements in a data
stream, which may be very large.
• When the amount of memory available is smaller, we need to employ a
synopsis. The estimation of the frequency moments can be done by
synopses that are known as sketches. These build a small-space summary
for a distribution vector (e.g., histogram) using randomized linear
projections of the underlying data vectors. Sketches provide probabilistic
guarantees on the quality of the approximate answer 175
Methodologies for Stream Data
Processing
• Randomized Algorithms: Randomized algorithms, in the form of random
sampling and sketching, are often used to deal with massive, high-
dimensional data streams. The use of randomization often leads to simpler
and more efficient algorithms in comparison to known deterministic
algorithms.
• If a randomized algorithm always returns the right answer but the running
times vary, it is known as a Las Vegas algorithm. In contrast, a Monte Carlo
algorithm has bounds on the running time but may not return the correct
result. We mainly consider Monte Carlo algorithms. One way to think of a
randomized algorithm is simply as a probability distribution over a set of
deterministic algorithms.
176
Frequent-Pattern Mining in Data
Streams
• Frequent-pattern mining finds a set of patterns that occur frequently in a
data set, where a pattern can be a set of items (called an itemset), a
subsequence, or a substructure.
• A pattern is considered frequent if its count satisfies a minimum support.
• Existing frequent-pattern mining algorithms require the system to scan the
whole data set more than once, but this is unrealistic for infinite data
streams.
• To overcome this difficulty, there are two possible approaches. One is to
keep track of only a predefined, limited set of items. This method has very
limited usage.
• The second approach is to derive an approximate set of answers. In
practice, approximate answers are often sufficient. Here we use one such
algorithm: the Lossy Counting algorithm. It approximates the frequency of
items or itemsets within a user-specified error bound, e.
177
Example
• Approximate frequent items: A router is
interested in all items whose frequency is at
least 1% (min support) of the entire traffic
stream seen so far. It is felt that 1/10 of min
support (i.e., e = 0.1%) is an acceptable margin
of error. This means that all frequent items
with a support of at least min support will be
output, but that some items with a support of
at least (min support-e) will also be output.
178
Lossy Counting Algorithm
The Lossy Counting algorithm identifies elements in a data stream whose frequency count exceeds a user-given threshold. The algorithm works by dividing the data stream into “buckets” and filling as many buckets as possible in main memory at one time. The frequency computed by this algorithm is not always accurate, but it has an error threshold that can be specified by the user. The run-time space required by the algorithm is inversely proportional to the specified error threshold; hence, the larger the error, the smaller the footprint.
• Step 1: Divide the incoming data stream into buckets of width w = 1/e, where e is the user-specified error bound (along with the minimum support threshold).
• Step 2: Increment the frequency count of each item as the items of the current bucket arrive. At the end of each bucket, decrement all counters by 1, dropping counters that reach 0.
• Step 3: Repeat: keep updating counters bucket by bucket, decrementing all counters by 1 after each bucket.
179
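A minimal sketch of the decrement-after-each-bucket variant described above; with bucket width w = 1/e, any reported count underestimates the true frequency by at most e·N:

```python
from math import ceil

def lossy_count(stream, epsilon):
    w = ceil(1 / epsilon)                 # bucket width w = 1/e
    counts = {}
    for n, item in enumerate(stream, start=1):
        counts[item] = counts.get(item, 0) + 1
        if n % w == 0:                    # bucket boundary: decrement every counter
            counts = {i: c - 1 for i, c in counts.items() if c > 1}
    return counts

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "b"] * 50
print(lossy_count(stream, epsilon=0.1))   # counts may undershoot by at most e*N
```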
K-Nearest Neighbor(KNN) Algorithm
180
KNN Algorithm
The K-NN working can be explained on the basis of the below algorithm:
• Step-1: Select the number K of the neighbors
• Step-2: Calculate the Euclidean distance of K number of neighbors
• Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
• Step-4: Among these k neighbors, count the number of the data points in each
category.
• Step-5: Assign the new data points to that category for which the number of the
neighbor is maximum.
• Step-6: Our model is ready.
181
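A minimal Python sketch of these steps (Euclidean distance, the k nearest labelled points, majority vote); the training points are hypothetical:

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    # train: list of (point, label) pairs; query: a point (tuple of numbers)
    neighbours = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((7, 7), "B"), ((6, 7), "B")]
print(knn_predict(train, (2, 2), k=3))   # -> "A"
```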
Example
182
Example
Divide the given data in 2 clusters using k-means algorithm
Height Weight
185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
180 67
177 76
183
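One way to run this exercise is with scikit-learn's KMeans on the height/weight pairs listed above; the resulting labels and centers give the two clusters (cluster numbering may vary between runs):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
              [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
print(km.cluster_centers_)
```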
Class imbalance problem
• The data set distribution reflects a significant majority of the negative
class and a minority positive class
• In medical data, there may be a rare class, such as “cancer.” Suppose that
you have trained a classifier to classify medical data tuples, where the
class label attribute is “cancer” and the possible class values are “yes” and
“no.” An accuracy rate of, say, 97% may make the classifier seem quite
accurate, but what if only, say, 3% of the training tuples are actually cancer
• The sensitivity and specificity measures can be used, respectively, for this
purpose. Sensitivity is also referred to as the true positive (recognition)
rate (i.e., the proportion of positive tuples that are correctly identified),
while specificity is the true negative rate (i.e., the proportion of negative
tuples that are correctly identified).
184
Prediction models
• “What if we would like to predict a continuous value, rather than a
categorical label?” Numeric prediction is the task of predicting continuous
(or ordered) values for given input. For example, we may wish to predict
the salary of college graduates with 10 years of work experience, or the
potential sales of a new product given its price
• Regression analysis can be used to model the relationship between one or
more independent or predictor variables and a dependent or response
variable (which is continuous-valued). In the context of data mining, the
predictor variables are the attributes of interest describing the tuple. In
general, the values of the predictor variables are known. The response
variable is what we want to predict
185
Types of regression
• Linear regression
• Non-linear regression
186
Linear regression
• Straight-line regression analysis involves a response variable, y, and a
single predictor variable, x.
• Models y as a linear function of x: y = b + wx
• where the variance of y is assumed to be constant, and b and w are
regression coefficients specifying the Y-intercept and slope of the line,
respectively. The regression coefficients, w and b, can also be thought of
as weights, so that we can equivalently write ŷ = w0 + w1x
• These coefficients can be solved for by the method of least squares, which
estimates the best-fitting straight line as the one that minimizes the error
between the actual data and the estimate of the line
187
Example
• Table below shows a set of paired data where x is the number of years of
work experience of a college graduate and y is the corresponding salary of
the graduate. Predict the salary of a college graduate with 10 years of
experience.
188
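A least-squares sketch matching this setup: the closed-form slope/intercept formulas and numpy.polyfit agree, and the fitted line is then used to predict the salary at 10 years of experience. The (x, y) pairs below are illustrative stand-ins, since the table itself is not reproduced in the text:

```python
import numpy as np

x = np.array([3, 8, 9, 13, 3, 6, 11, 21, 1, 16])          # years of experience
y = np.array([30, 57, 64, 72, 36, 43, 59, 90, 20, 83])    # salary (in $1000s)

# Closed-form least-squares estimates of slope w and intercept b
w = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - w * x.mean()
w2, b2 = np.polyfit(x, y, deg=1)                          # same result via numpy
print(round(w, 2), round(b, 2), "|", round(w2, 2), round(b2, 2))
print("predicted salary for 10 years:", round(w * 10 + b, 1))
```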
Non-linear regression
• Polynomial regression is often of interest when there is just one predictor
variable. It can be modelled by adding polynomial terms to the basic linear
model. By applying transformations to the variables, we can convert the
nonlinear model into a linear one that can then be solved by the method
of least squares.
• Transformation of a polynomial regression model to a linear regression
model
190
Graph Mining
Graphs
Model sophisticated structures and their interactions
Chemical informatics
Bioinformatics
Computer vision
Video indexing
Text retrieval
Web analysis
Social networks
Mining frequent sub-graph patterns
Characterization, Discrimination, Classification and Cluster
Analysis, building graph indices and similarity search
191
Mining Frequent Subgraphs
Graph g
Vertex Set – V(g) Edge set – E(g)
Label function maps a vertex / edge to a label
Graph g is a sub-graph of another graph g’ if there exists a subgraph isomorphism from g to g’
Support(g) or frequency(g) – number of graphs in D = {G1, G2,..Gn}
where g is a sub-graph
Frequent graph – satisfies min_sup
192
Discovery of Frequent
Substructures
Step 1: Generate frequent sub-structure candidates
Step 2: Check for frequency of each candidate
Involves sub-graph isomorphism test which is
computationally expensive
Approaches
Apriori –based approach
Pattern Growth approach
193
Apriori based Approach
AprioriGraph
• Level wise mining method
• Size of new substructures is
increased by 1
• Generated by joining two similar but
slightly different frequent sub- graphs
• Frequency is then checked
194
Apriori Approach
AGM (Apriori-based Graph Mining)
Vertex based candidate generation – increases sub structure size by one
vertex at each step
Two frequent k size graphs are joined only if they have the same (k-1)
subgraph (Size – number of vertices)
New candidate has (k-1) sized component and the additional two
vertices
Two different sub-structures can be formed
195
Apriori Approach
FSG (Frequent Sub-graph mining)
Edge-based candidate generation: increases the substructure size by one edge at a
time
Two size-k patterns are merged iff they share the same subgraph
having k-1 edges (the core)
The new candidate has the core plus the two additional edges
196
Apriori Approach
Edge disjoint path method
• Classify graphs by the number of disjoint paths they have
• Two paths are edge-disjoint if they do not share any common edge
• A substructure pattern with k+1 disjoint paths is generated by joining sub-structures with k disjoint paths
Disadvantages of Apriori-based approaches
• Overhead when joining two sub-structures
• Uses a BFS strategy: level-wise candidate generation
• To check whether a size-(k+1) graph is frequent, all of its size-k sub-graphs must be checked first
197
Pattern-Growth Approach
Uses BFS as well as DFS
A graph g can be extended by adding a new edge e. The newly
formed graph is denoted by g ◇x e.
Edge e may or may not introduce a new vertex to g.
If e introduces a new vertex, the new graph is denoted by g ◇xf e;
otherwise, g ◇xb e, where f or b indicates that the extension is in a forward
or backward direction.
Pattern Growth Approach
For each discovered graph g, extensions are performed recursively until all
frequent graphs containing g are found
Simple but inefficient: the same graph can be discovered multiple times (duplicate graphs)
198
Pattern Growth
199
gSpan Algorithm
Reduces generation of duplicate graphs
Does not extend duplicate graphs
Uses Depth First Order
A graph may have several DFS-trees
Visiting order of vertices forms a linear order, captured by subscripts
In a DFS tree, the starting vertex is the root and the last visited vertex is the right-most vertex
The path from v0 to vn is the right-most path
Right-most path in the example DFS trees of the figure: (b), (c) – (v0, v1, v3); (d) – (v0, v1, v2, v3)
200
gSpan Algorithm
gSpan restricts the extension method:
A new edge e can be added between the right-most vertex and another vertex on the right-most path (backward
extension),
or it can introduce a new vertex and connect it to a vertex on the right-most path
(forward extension)
Right-most extension, denoted by G ◇r e
201
gSpan Algorithm
Chooses one DFS tree, i.e., one base subscripting, and extends it
Each subscripted graph is transformed into an edge sequence, its
DFS code
Select the subscripting that generates the minimum sequence
Edge order: maps the edges in a subscripted graph into a sequence
Sequence order: builds an order among edge sequences (DFS codes)
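A minimal sketch of DFS codes as edge sequences, under the usual convention that each edge is a 5-tuple (i, j, label_i, label_(i,j), label_j) of DFS subscripts and labels; the three codes below are hypothetical, and plain Python tuple comparison is used as a stand-in for the full DFS lexicographic order (which adds extra rules for forward vs. backward edges).

# Three hypothetical DFS codes (edge sequences) of the same graph.
code_a = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"), (2, 0, "X", "a", "X")]
code_b = [(0, 1, "Y", "b", "X"), (1, 2, "X", "a", "X"), (2, 0, "X", "b", "Y")]
code_c = [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Y"), (2, 0, "Y", "b", "X")]

# Lists of tuples compare lexicographically; the smallest sequence is the
# minimum DFS code, whose subscripting is taken as the base subscripting.
minimum_code = min([code_a, code_b, code_c])
print(minimum_code == code_c)   # True for these hypothetical codes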
202
gSpan Algorithm
In the example, the three DFS codes satisfy code_0 < code_1 < code_2
code_0 is the minimum DFS code, and its corresponding subscripting is the base subscripting
DFS lexicographic ordering compares: edge order, first vertex label, edge label, second vertex label
203
gSpan Algorithm
204
gSpan Algorithm
205
Mining Closed Frequent
Substructures
Helps to overcome the problem of pattern explosion
A frequent graph G is closed if and only if there is no proper supergraph G'
that has the same support as G
Closed frequent substructures are mined by the CloseGraph algorithm
A frequent pattern G is maximal if and only if there is no frequent super-
pattern of G
The maximal pattern set is a subset of the closed pattern set,
but it cannot be used to reconstruct the entire set of frequent patterns
206
Mining Alternative Substructure
Patterns
Mining unlabeled or partially labeled graphs
New empty label is assigned to vertices and edges that do not have labels
Mining non-simple graphs
A non-simple graph may have self-loops and multiple edges; the growing order becomes
backward edges, self-loops, and forward edges
To handle multiple edges, allow sharing of the same vertices by two neighboring
edges in a DFS code
Mining directed graphs
Each edge is represented by a 6-tuple (i, j, d, l_i, l_(i,j), l_j), where d = +1 or -1 encodes the edge direction
Mining disconnected graphs
The graph or the pattern may be disconnected
Disconnected graph: add a virtual vertex to connect the components
Disconnected graph pattern: treat it as a set of connected graphs
Mining frequent subtrees
A tree is a degenerate graph, so subtree mining is a special case of subgraph mining
207
Constraint based Mining of
Substructure
Element, set, or subgraph containment constraint
The user requires that the mined patterns contain a particular set of
subgraphs (a succinct constraint)
Geometric constraint
For example, the angle between each pair of connected edges must be within a
given range (an anti-monotonic constraint)
Value-sum constraint
The sum of (positive) edge weights must be within a range [low, high]:
sum >= low is monotonic, sum <= high is anti-monotonic
Multiple categories of constraints may also be enforced together
208
Mining Approximate Frequent
Substructures
Approximate frequent substructures allow slight structural variations
Several slightly different frequent substructures can be represented
using one approximate substructure
SUBDUE – Substructure discovery system
based on the Minimum Description Length (MDL) principle
adopts a constrained beam search
SUBDUE performs approximate matching
209
Mining Coherent and Dense Sub
structures
A frequent substructure G is a coherent subgraph if the mutual information
between G and each of its own subgraphs is above some threshold
This reduces the number of patterns mined
Application: coherent substructure mining selects a small subset of features that have high
distinguishing power between protein classes
Relational graph: each label is used only once per graph
Frequent highly connected or dense subgraph mining finds, for example:
People with strong associations in online social networks (OSNs)
Sets of genes within the same functional module
210
Mining Dense Substructures
Dense graphs are defined in terms of edge connectivity
Given a graph G, an edge cut is a set of edges Ec such that E(G) - Ec is
disconnected.
A minimum cut is the smallest set in all edge cuts.
The edge connectivity of G is the size of a minimum cut.
A graph is dense if its edge connectivity is no less than a specified minimum cut
threshold
Mining Dense substructures
Pattern-growth approach called Close-Cut (Scalable)
starts with a small frequent candidate graph and extends it until it finds the largest super graph with
the same support
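A minimal sketch of the edge-connectivity definition above, using networkx on a hypothetical graph and threshold.

import networkx as nx

G = nx.cycle_graph(5)                   # every vertex has degree 2
min_cut = nx.minimum_edge_cut(G)        # a smallest edge cut
connectivity = nx.edge_connectivity(G)  # size of a minimum cut

print(len(min_cut), connectivity)       # 2 2

min_cut_threshold = 2                   # hypothetical threshold
print(connectivity >= min_cut_threshold)   # True: G counts as dense here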
211
Applications – Graph Indexing
Indexing is essential for efficient search and query processing
Traditional approaches are not feasible for graphs
Indexing based on nodes / edges / sub-graphs
Path based Indexing approach
Enumerate all the paths in a database up to maxL length and index them
Index is used to identify all graphs with the paths in query
Not suitable for complex graph queries
Structural information is lost when a query graph is broken apart
Many false positives may be returned
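A minimal sketch of a path-based index, assuming node-labeled networkx graphs; all label paths with up to max_len edges are enumerated per graph and put in an inverted index, and a query's paths are intersected to get candidates (which still need verification). The function and variable names are hypothetical.

import networkx as nx
from collections import defaultdict

def label_paths(G, max_len):
    """Enumerate label sequences of simple paths with up to max_len edges."""
    paths = set()
    def dfs(path):
        paths.add(tuple(G.nodes[v]["label"] for v in path))
        if len(path) - 1 == max_len:
            return
        for w in G.neighbors(path[-1]):
            if w not in path:
                dfs(path + [w])
    for v in G.nodes:
        dfs([v])
    return paths

def build_index(graphs, max_len=2):
    """graphs: dict {graph_id: labeled nx.Graph}; returns inverted index path -> graph ids."""
    index = defaultdict(set)
    for gid, G in graphs.items():
        for p in label_paths(G, max_len):
            index[p].add(gid)
    return index

def candidates(index, query, max_len=2):
    """Intersect the posting lists of the query's paths (may contain false positives)."""
    cand = None
    for p in label_paths(query, max_len):
        hits = index.get(p, set())
        cand = hits if cand is None else cand & hits
    return cand or set()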
212
Graph Indexing
213
Substructure Similarity Search
Bioinformatics and Chem-informatics applications involve query
based search in massive complex structural data
214
Substructure Similarity Search
Grafil (Graph Similarity Filtering)
Feature-based structural filtering
Models each query graph as a set of features
Edge deletions in the query are translated into feature misses
Using too many features can reduce filtering performance
Multi-filter composition strategy: each filter works on a feature set, a group of similar features
215
Classification and Cluster Analysis
using Graph Patterns
Graph Classification
Mine frequent graph patterns
Features that are frequent in one class but infrequent in another are discriminative
features used for model construction
Frequency and connectivity thresholds can be adjusted; classifiers such as SVM and naive Bayes are then applied
Cluster Analysis
Cluster similar graphs based on graph connectivity (minimal cuts)
Hierarchical clusters can be built by varying the support threshold
Outliers can also be detected
Classification and cluster analysis are inter-related processes
216
Social Network Analysis
217
What Is a Social Network
A social network is a heterogeneous and multirelational data set represented
by a graph. The graph is typically very large, with nodes corresponding to
objects and edges corresponding to links representing relationships or
interactions between objects. Both nodes and links have attributes
218
Examples
• electrical power grids
• telephone call graphs
• spread of computer viruses
• World Wide Web
• Co-authorship and citation networks of
scientists
• Small world network
219
Models of Social Networks
• Random graph
• Scale free networks
220
Characteristics of Social Networks
• Densification power law
• Shrinking diameter
• Heavy-tailed out-degree and in-degree
distributions
221
Information on the Social Network
• Heterogeneous, multi-relational data represented as a graph or network
• Nodes are objects
• May have different kinds of objects
• Objects have attributes
• Objects may have labels or classes
• Edges are links
• May have different kinds of links
• Links may have attributes
• Links may be directed, are not required to be binary
• Links represent relationships and interactions between objects - rich
content for mining
222
Link Mining: Tasks and Challenges
“How can we mine social networks?”
• Link-based object classification. In traditional classification methods,
objects are classified based on the attributes that describe them. Link-
based classification predicts the category of an object based not only on
its attributes, but also on its links, and on the attributes of linked objects
– Web page classification is a well-recognized example of link-based classification. It
predicts the category of a Web page based on word occurrence and anchor text, both of
which serve as attributes
• Object type prediction. This predicts the type of an object, based on its
attributes and its links, and on the attributes of objects linked to it
• Link type prediction. This predicts the type or purpose of a link, based on
properties of the objects involved
• Predicting link existence. Unlike link type prediction, where we know a
connection exists between two objects and we want to predict its type, here
we want to predict whether a link exists between two objects
223
Link Mining: Tasks and Challenges
• Link cardinality estimation. There are two forms of link cardinality
estimation. First, we may predict the number of links to an object. This is
useful, for instance, in predicting the authoritativeness of a Web page
based on the number of links to it (in-links)
• Object reconciliation. In object reconciliation, the task is to predict
whether two objects are, in fact, the same, based on their attributes and
links
• Group detection. Group detection is a clustering task. It predicts when a
set of objects belong to the same group or cluster, based on their
attributes as well as their link structure
• Subgraph detection. Subgraph identification finds characteristic subgraphs
within networks
• Metadata mining. Metadata are data about data. Metadata provide semi-
structured data about unstructured data, ranging from text and Web data
to multimedia databases. It is useful for data integration tasks in many
domains
224
What is New for Link Mining
Traditional machine learning and data mining approaches assume:
– A random sample of homogeneous objects from single relation
• Real world data sets:
– Multi-relational, heterogeneous and semi-structured
• Link Mining
– Research area at the intersection of research in social network and link analysis,
hypertext and web mining, graph mining, relational learning and inductive logic
programming
225
What Is a Link in Link Mining?
Link: relationship among data
• Two kinds of linked networks
– homogeneous vs. heterogeneous
• Homogeneous networks
– Single object type and single link type
– Single model social networks (e.g., friends)
– WWW: a collection of linked Web pages
• Heterogeneous networks
– Multiple object and link types
– Medical network: patients, doctors, disease, contacts, treatments
– Bibliographic network: publications, authors, venues
226
Link-Based Object Ranking (LBR)
• LBR: Exploit the link structure of a graph to order or prioritize the set of
objects within the graph
– Focused on graphs with single object type and single link type
• This is a primary focus of link analysis community
• Web information analysis
– PageRank and Hits are typical LBR approaches
• In social network analysis (SNA), LBR is a core analysis task
– Objective: rank individuals in terms of “centrality”
– Degree centrality vs. eigenvector/power centrality
– Rank objects relative to one or more relevant objects in the graph, or rank objects over
time in dynamic graphs
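A minimal sketch contrasting degree centrality with eigenvector centrality on a hypothetical friendship network, using networkx.

import networkx as nx

G = nx.Graph([("ann", "bob"), ("ann", "cai"), ("ann", "dev"),
              ("bob", "cai"), ("dev", "eve")])

degree = nx.degree_centrality(G)       # how many others each person touches
eigen = nx.eigenvector_centrality(G)   # neighbors are weighted by their own centrality

for person in G:
    print(person, round(degree[person], 2), round(eigen[person], 2))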
227
PageRank: Capturing Page
Popularity(Brin & Page’98)
• Intuitions
– Links are like citations in literature
– A page that is cited often can be expected to be more useful in general
• PageRank is essentially “citation counting”, but improves over simple
counting
– Consider “indirect citations” (being cited by a highly cited paper counts a lot…)
– Smoothing of citations (every page is assumed to have a non-zero citation count)
• PageRank can also be interpreted as random surfing (thus capturing
popularity)
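A minimal power-iteration sketch of this random-surfer view, on a hypothetical link graph given as adjacency lists; the damping factor d plays the smoothing role described above, and dangling pages are treated as linking to every page.

def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}   # non-zero floor for every page
        for p, outs in links.items():
            targets = outs if outs else pages      # dangling-page fix
            share = pr[p] / len(targets)
            for q in targets:
                new[q] += d * share                # indirect citations propagate
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(links))   # "C" gets the highest score in this hypothetical graph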
228
WEB MINING
229
Data Mining vs. Web Mining
• Traditional data mining
– data is structured and relational
– well-defined tables, columns, rows, keys,
and constraints.
• Web data
– Semi-structured and unstructured
– readily available data
– rich in features and patterns
231
Web Mining
232
Web Mining
• Web is the single largest data source in the
world
• Due to heterogeneity and lack of structure of
web data, mining is a challenging task
• Multidisciplinary field:
– data mining, machine learning, natural language processing, statistics,
databases, information retrieval, multimedia, etc.
233
Mining the World-Wide Web
• The WWW is huge, widely distributed, global
information service center for
– Information services: news, advertisements,
consumer information, financial management,
education, government, e-commerce, etc.
– Hyper-link information
– Access and usage information
• WWW provides rich sources for data mining
234
Web Mining: A more challenging task
• Searches for
– Web access patterns
– Web structures
– Regularity and dynamics of Web contents
• Problems
– The “abundance” problem
– Limited coverage of the Web: hidden Web sources,
majority of data in DBMS
– Limited query interface based on keyword-oriented search
– Limited customization to individual users
236
Web Usage Mining
249
Techniques for Web usage mining
• Construct multidimensional view on the Weblog
database
– Perform multidimensional OLAP analysis to find the top N
users, top N accessed Web pages, most frequently accessed
time periods, etc.
• Perform data mining on Weblog records
– Find association patterns, sequential patterns, and trends of
Web accessing
– May need additional information, e.g., user browsing sequences of the Web
pages in the Web server buffer
• Conduct studies to
– Analyze system performance, improve system design by Web caching, Web
page prefetching, and Web page swapping
250
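A minimal sketch of the top-N style questions above, using pandas on a hypothetical, tiny Web log; real Weblog records would first be parsed and cleaned into such a table.

import pandas as pd

# Hypothetical Web log records (user, requested page, hour of access).
log = pd.DataFrame({
    "user": ["u1", "u2", "u1", "u3", "u2", "u1"],
    "page": ["/home", "/home", "/docs", "/home", "/docs", "/docs"],
    "hour": [9, 9, 10, 14, 14, 14],
})

print(log["page"].value_counts().head(2))                       # top-N accessed pages
print(log["user"].value_counts().head(2))                       # top-N users
print(log.groupby("hour").size().sort_values(ascending=False))  # busiest time periods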
Mining the World-Wide Web
• Design of a Web Log Miner
– Web log is filtered to generate a relational database
– A data cube is generated from the database
– OLAP is used to drill-down and roll-up in the cube
– OLAM is used for mining interesting knowledge
Pipeline: Web log → (1) data cleaning → database → (2) data cube creation → data cube → (3) OLAP → sliced and diced cube → (4) data mining → knowledge
251