Chapter 4. Classification and Prediction
■ Typical applications
■ Medical diagnosis
■ Fraud detection
Classification—A Two-Step Process
■ Model construction: describing a set of predetermined classes
■ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
■ Model usage: for classifying future or unknown objects
■ Estimate the accuracy of the model
■ The known label of each test sample is compared with the classified result from the model
■ Accuracy rate is the percentage of test set samples that are correctly classified by the model
■ Test set is independent of training set, otherwise over-fitting will occur
■ If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
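Below is a minimal code sketch of this two-step process (an illustration, not part of the original slides). It assumes scikit-learn is installed; the tiny dataset is made up purely for demonstration.

```python
# Sketch of the two-step classification process (assumes scikit-learn is installed).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative toy data: each row is a tuple of attribute values, y holds class labels.
X = [[2, 0], [7, 0], [5, 1], [7, 1], [3, 0], [6, 1]]
y = ["no", "no", "yes", "yes", "no", "yes"]

# Step 1: model construction on the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the independent test set,
# then classify unseen data if the accuracy is acceptable.
accuracy = accuracy_score(y_test, model.predict(X_test))
print("accuracy on test set:", accuracy)
print("prediction for unseen tuple [4, 1]:", model.predict([[4, 1]]))
```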
[Figure: Process (1) Model construction: the training data below are fed to a classification algorithm, which produces a classifier. Process (2) Model usage: the classifier is applied to testing data and then to unseen data such as (Jeff, Professor, 4) → Tenured?]
NAME    RANK            YEARS  TENURED
Tom     Assistant Prof  2      no
Merlisa Associate Prof  7      no
George  Professor       5      yes
Joseph  Assistant Prof  7      yes
Supervised vs. Unsupervised Learning
■ Data cleaning
■ Preprocess data in order to reduce noise and handle
missing values
■ Relevance analysis (feature selection)
■ Remove the irrelevant or redundant attributes
■ Speed
■ time to construct the model (training time)
■ time to use the model (classification/prediction time)
■ Robustness: handling noise and missing values
■ Scalability: efficiency in disk-resident databases
■ Interpretability
■ understanding and insight provided by the model
■ Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Decision Tree Induction: Training Dataset
age income student credit_rating buys_cloths
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for “ buys_cloths”
[Figure: decision tree with root node age; branch “<=30” tests student (no → no, yes → yes); branch “31…40” → yes; branch “>40” tests credit_rating (excellent → no, fair → yes)]
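A hedged sketch of how a tree for this training data could be fitted in code (not part of the slides). It assumes scikit-learn and pandas are installed; the categorical attributes are one-hot encoded, which is an implementation choice, and scikit-learn uses a CART-style procedure, so the resulting tree may differ in detail from the ID3-style tree shown above.

```python
# Sketch: fit a decision tree to the buys_cloths training data (assumes scikit-learn, pandas).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("<=30", "high", "no", "fair", "no"),       ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),   (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),       (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),      (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),  (">40", "medium", "no", "excellent", "no"),
]
df = pd.DataFrame(rows, columns=["age", "income", "student", "credit_rating", "buys_cloths"])

# One-hot encode the categorical attributes so the library can split on them.
X = pd.get_dummies(df[["age", "income", "student", "credit_rating"]])
y = df["buys_cloths"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```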
Algorithm for Decision Tree Induction
■ Basic algorithm (a greedy algorithm)
■ Tree is constructed in a top-down recursive divide-and-conquer
manner
■ At start, all the training examples are at the root
■ Attributes are categorical (if continuous-valued, they are
discretized in advance)
■ Examples are partitioned recursively based on selected attributes
■ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
■ Conditions for stopping partitioning
■ All samples for a given node belong to the same class
■ There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
■ There are no samples left
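A from-scratch sketch of this greedy top-down procedure, using only the Python standard library. The helper names (entropy, info_gain, build_tree) are illustrative, not from the slides; attributes are assumed categorical, as stated above.

```python
# ID3-style top-down recursive divide-and-conquer tree induction (illustrative sketch).
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes, default=None):
    if not rows:                        # no samples left
        return default
    if len(set(labels)) == 1:           # all samples belong to the same class
        return labels[0]
    if not attributes:                  # no remaining attributes: majority voting
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))   # heuristic selection
    majority = Counter(labels).most_common(1)[0][0]
    node = {}
    for value in set(row[best] for row in rows):                       # recursive partitioning
        sub = [(row, lab) for row, lab in zip(rows, labels) if row[best] == value]
        sub_rows, sub_labels = [r for r, _ in sub], [l for _, l in sub]
        remaining = [a for a in attributes if a != best]
        node[(best, value)] = build_tree(sub_rows, sub_labels, remaining, majority)
    return node

# Usage: rows is a list of dicts, e.g. {"age": "<=30", "student": "no", ...},
# labels the matching class values; the result nests {(attribute, value): subtree}.
```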
Tree construction general algorithm
[Figure: two candidate attributes for splitting S = [9+, 5−]: A1 = humidity (branches high / normal) and A2 = wind (branches weak / strong)]
Entropy
• Given a collection S of positive and negative objects, p⁺ is the proportion of positive objects in S and p⁻ is the proportion of negative objects in S
• In the play-tennis example, these numbers are 14, 9/14 and 5/14, respectively
• Entropy is defined as follows:
Entropy(S) = − p⁺ log2(p⁺) − p⁻ log2(p⁻)
Example: Training data for concept “play-tennis”
From 14 examples of Play-Tennis, 9 positive and 5 negative objects (denoted by [9+, 5−])
Entropy([9+, 5−]) = − (9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
Notice:
1. Entropy is 0 if all members of S belong to the same
class
2. Entropy is 1 if the collection contains an equal number
of positive and negative examples. If these numbers are
unequal, the entropy is between 0 and 1.
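A small numeric check of the definition and of the two notices above (a sketch, standard library only):

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative objects."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # 0 * log2(0) is taken as 0
            p = count / total
            result -= p * log2(p)
    return result

print(round(entropy(9, 5), 3))   # 0.94  (the [9+, 5-] play-tennis collection)
print(entropy(7, 0))             # 0.0   (all members in the same class)
print(entropy(7, 7))             # 1.0   (equal numbers of positive and negative)
```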
Information gain measures the
expected reduction in entropy
We define a measure, called information gain, of the
effectiveness of an attribute in classifying data. It is the expected
reduction in entropy caused by partitioning the objects according
to this attribute
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)
where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which A has value v.
Information gain measures the expected reduction in entropy
Example: for the attribute Wind, S = [9+, 5−], S_weak = [6+, 2−] (8 examples) and S_strong = [3+, 3−] (6 examples):
Gain(S, Wind) = Entropy(S) − Σ_{v ∈ {weak, strong}} (|Sv| / |S|) · Entropy(Sv)
= Entropy(S) − (8/14) · Entropy(S_weak) − (6/14) · Entropy(S_strong)
= 0.940 − (8/14) × 0.811 − (6/14) × 1.0 = 0.048
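The same computation in code (a sketch, standard library only; the [6+, 2−] / [3+, 3−] split for weak/strong wind is read off the play-tennis data):

```python
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

# S = [9+, 5-]; Wind splits it into S_weak = [6+, 2-] and S_strong = [3+, 3-].
gain_wind = entropy(9, 5) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(gain_wind, 3))  # 0.048
```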
Which attribute is the best classifier?
[Figure: splitting S = [9+, 5−] (E = 0.940) on Humidity (high / normal) versus Wind (weak / strong)]
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
Next step in growing the decision tree
Attributes with many values
■ convertible to simple and easy to understand classification rules
■ can use SQL queries for accessing databases
Bayesian Classification
■ Bayes’ theorem: P(Ci | X) = P(X | Ci) · P(Ci) / P(X)
■ Since P(X) is constant for all classes, only P(Ci | X) ∝ P(X | Ci) · P(Ci) needs to be maximized
Derivation of Naïve Bayes
Classifier
■ A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes): n
P(X | Ci) = Π_{k=1}^{n} P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)
■ This greatly reduces the computation cost: only counts the class distribution
■ If Ak is categorical, P(xk | Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,D| (# of tuples of Ci in D)
■ If Ak is continuous-valued, P(xk | Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) · σ)) · e^(−(x − μ)² / (2σ²))
and P(xk | Ci) = g(xk, μCi, σCi)
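A hedged sketch of the continuous-attribute case: estimate μ and σ of an attribute within a class and evaluate the Gaussian density g(x, μ, σ). The sample numbers below are illustrative only, not from the slides.

```python
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = 1 / (sqrt(2*pi)*sigma) * exp(-(x - mu)^2 / (2*sigma^2))"""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# Illustrative: ages of the tuples belonging to class Ci.
ages_in_class = [25, 28, 31, 35, 38, 42]
mu = sum(ages_in_class) / len(ages_in_class)
sigma = sqrt(sum((a - mu) ** 2 for a in ages_in_class) / len(ages_in_class))

# P(age = 30 | Ci) is approximated by the Gaussian density at x = 30.
print(gaussian(30, mu, sigma))
```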
Naïve Bayesian Classifier: Training Dataset
■ P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643, P(buys_computer = “no”) = 5/14 = 0.357
■ Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
■ X = (age <= 30, income = medium, student = yes, credit_rating = fair)
■ P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = “no”) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
■ P(X | Ci) · P(Ci): 0.044 × 0.643 = 0.028 for “yes”, 0.019 × 0.357 = 0.007 for “no”
■ Therefore, X belongs to class “buys_computer = yes”
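The same decision as straightforward arithmetic (a sketch, not from the slides; the probabilities are the ones read off the training data above):

```python
# Naive Bayes decision for X = (age<=30, income=medium, student=yes, credit_rating=fair).
prior = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9, "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5, "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

score = {}
for c in ("yes", "no"):
    p = prior[c]
    for attr_value in cond[c].values():
        p *= attr_value          # class-conditional independence assumption
    score[c] = p

print(score)                      # roughly {'yes': 0.028, 'no': 0.007}
print(max(score, key=score.get))  # 'yes'
```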
Naïve Bayesian Classifier:
Comments
■ Advantages
■ Easy to implement
■ Good results obtained in most of the cases
■ Disadvantages
■ Assumption: class conditional independence, therefore loss of accuracy
■ Practically, dependencies exist among variables
■ E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
■ Dependencies among these cannot be modeled by a Naïve Bayesian Classifier
■ How to deal with these dependencies?
■ Bayesian Belief Networks
Naive Bayesian Classifier
Example
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
■ Classifying an unseen sample X = <sunny, cool, high, false>:
P(X | P) · P(P) = 9/14 · 2/9 · 3/9 · 3/9 · 6/9 ≈ 0.0106
P(X | N) · P(N) = 5/14 · 3/5 · 1/5 · 4/5 · 2/5 ≈ 0.0137
■ Since P(X | N) · P(N) > P(X | P) · P(P), X is classified as class N (don’t play)
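A counting-based sketch of the same computation (not from the slides): the priors and conditionals are derived directly from the table above by counting.

```python
# Sketch: derive the conditional probabilities by counting over the play-tennis table,
# then score the unseen sample X (attribute order: Outlook, Temperature, Humidity, Windy).
from collections import Counter

data = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

class_counts = Counter(row[-1] for row in data)

def score(x):
    result = {}
    for c, n_c in class_counts.items():
        p = n_c / len(data)                            # prior P(Ci)
        for k, value in enumerate(x):                  # product of P(xk | Ci)
            n_match = sum(1 for row in data if row[-1] == c and row[k] == value)
            p *= n_match / n_c
        result[c] = p
    return result

print(score(("sunny", "cool", "high", "false")))       # N scores higher than P here
```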
Rule Induction: Sequential Covering Method
■ Each time a rule is learned, the tuples covered by the rule are removed
■ The process repeats on the remaining tuples until a termination condition holds, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
■ Compare with decision-tree induction, which learns a set of rules simultaneously
How to Learn-One-Rule?
■ Start with the most general rule possible: condition = empty
■ Add new attributes by adopting a greedy depth-first strategy
■ Picks the one that most improves the rule quality
■ Rule-Quality measures: consider both coverage and accuracy
■ Foil-gain (in FOIL & RIPPER): assesses info_gain by extending
condition
FOIL_Gain = pos′ × ( log2( pos′ / (pos′ + neg′) ) − log2( pos / (pos + neg) ) )
where pos (neg) and pos′ (neg′) are the numbers of positive (negative) tuples covered by the rule before and after adding the new condition
It favors rules that have high accuracy and cover many positive tuples
■ Rule pruning based on an independent set of test tuples
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos (neg) is the number of positive (negative) tuples covered by R; if FOIL_Prune is higher for the pruned version of R, then prune R
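A small sketch of the two measures (standard library only); the counts used in the example call are illustrative, not from the slides.

```python
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))."""
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg)."""
    return (pos - neg) / (pos + neg)

# Illustrative counts: extending the rule keeps 6 of 8 positives and drops 10 of 12 negatives.
print(foil_gain(pos=8, neg=12, pos_new=6, neg_new=2))
print(foil_prune(pos=6, neg=2))
```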
■ Classification: predicts categorical class labels
■ Example: x1 = # of occurrences of the word “homepage”, x2 = # of occurrences of the word “welcome”
■ Mathematically: x ∈ X = ℝⁿ, y ∈ Y = {+1, −1}
■ We want to learn a function f: X → Y
Linear Classification
■ Binary classification problem
[Figure: 2-D scatter plot of ‘x’ and ‘o’ points separated by a red line]
■ The data above the red line belongs to class ‘x’
■ The data below the red line belongs to class ‘o’
■ Examples: SVM, Perceptron, Probabilistic Classifiers
Discriminative Classifiers
■ Advantages
■ prediction accuracy is generally high
■ as compared to Bayesian methods – in general
■ robust, works when training examples contain errors
■ fast evaluation of the learned target function
■ Bayesian networks are normally slow
■ Criticism
■ long training time
■ difficult to understand the learned function (weights)
■ Bayesian networks can be used easily for pattern discovery
■ not easy to incorporate domain knowledge
■ easy in the form of priors on the data or distributions
Perceptron & Winnow
• Vector: x, w;  Scalar: x, y, w
• Classification function f(x): f(xi) > 0 for yi = +1 and f(xi) < 0 for yi = −1
• Decision boundary: w · x + b = 0, i.e., w1x1 + w2x2 + b = 0 in two dimensions
• Perceptron: update W additively
• Winnow: update W multiplicatively
[Figure: 2-D plot (axes x1, x2) of the separating line w1x1 + w2x2 + b = 0]
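A hedged sketch contrasting the two update rules on a misclassified example (x, y): the perceptron adds to W, winnow scales it multiplicatively. The learning rate and the winnow promotion factor below are illustrative choices, not taken from the slides.

```python
# Additive (perceptron) vs. multiplicative (winnow) weight updates -- illustrative sketch.

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

def perceptron_update(w, b, x, y, eta=1.0):
    """On a mistake, move w additively toward the correct side: w <- w + eta*y*x."""
    if predict(w, b, x) != y:
        w = [wi + eta * y * xi for wi, xi in zip(w, x)]
        b = b + eta * y
    return w, b

def winnow_update(w, b, x, y, alpha=2.0):
    """On a mistake, scale each weight multiplicatively: w_i <- w_i * alpha^(y*x_i)."""
    if predict(w, b, x) != y:
        w = [wi * (alpha ** (y * xi)) for wi, xi in zip(w, x)]
    return w, b

w, b = [0.5, 0.5], 0.0
w, b = perceptron_update(w, b, x=[1.0, 0.0], y=-1)
print(w, b)  # [-0.5, 0.5] -1.0
```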
Classification by Backpropagation
[Figure: a single neuron: input vector x = (x0, x1, …, xn), weight vector w = (w0, w1, …, wn), weighted sum with bias −μk, activation function f, output y]
For example: y = sign( Σ_{i=0}^{n} wi xi − μk )
■ The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping
Neural Networks
What are they?
Based on early research aimed at representing the
way the human brain works
Neural networks are composed of many processing
units called neurons
Neural Networks are great, but..
Problem 1: The black box model!
Solution 1: Do we really need to know?
Solution 2. Rule Extraction techniques
Neural Network Concepts
Neural computing
Artificial neural network (ANN)
[Figure: biological neuron with dendrites, soma, and axon; synapses connect the axon of one neuron to the dendrites of another]
[Figure: ANN model: inputs x1 … xn weighted by w1 … wn feed a summation and a transfer function, producing outputs Y1 … Yn]
Three-step process:
1. Compute temporary outputs
2. Compare outputs with desired targets
3. Adjust the weights and repeat the process
[Flowchart: Compute output → Is desired output achieved? If No, adjust weights and repeat; if Yes, stop learning]
How a Network Learns
Learning parameters:
Learning rate
Momentum
Backpropagation Learning
[Figure: a neuron with inputs x1 … xn, weights w1 … wn, a summation and a transfer function; the weights are adjusted using the error term a(Zi − Yi), where Zi is the desired output and Yi the actual output]
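A minimal sketch of one weight-adjustment step for a single output unit, in the spirit of the error term a(Zi − Yi) above. Treating a as a learning rate and using a sigmoid transfer function are assumptions of this sketch, not statements from the slides.

```python
from math import exp

def sigmoid(v):
    return 1.0 / (1.0 + exp(-v))

def train_step(w, x, z, a=0.1):
    """One learning step: compute output y, compare with target z, adjust weights by a*(z - y)*x."""
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))         # 1. compute temporary output
    error = z - y                                             # 2. compare with desired target
    return [wi + a * error * xi for wi, xi in zip(w, x)], y   # 3. adjust the weights

w = [0.2, -0.4, 0.1]
x = [1.0, 0.5, 1.0]          # inputs (the first could be a bias input)
w, y = train_step(w, x, z=1.0)
print(y, w)
```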
Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples
associated with the class labels yi
There are infinite lines (hyperplanes) separating the two classes but we want to
find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
SVM—Linearly Separable
■ A separating hyperplane can be written as W ● X + b = 0, where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
■ For 2-D it can be written as w0 + w1 x1 + w2 x2 = 0
■ The hyperplanes defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ −1 for yi = −1
■ Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
■ This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints ► Quadratic Programming ► Lagrangian multipliers
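A sketch with scikit-learn (assumed installed, not part of the slides): fit a linear SVM on a tiny linearly separable set and inspect the support vectors that define the maximum-margin hyperplane W ● X + b = 0. The toy points and the choice of a large C (to approximate a hard margin) are illustrative.

```python
# Linear SVM sketch (assumes scikit-learn and numpy are installed).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates a hard margin

print("W =", clf.coef_[0], " b =", clf.intercept_[0])   # hyperplane W.X + b = 0
print("support vectors:", clf.support_vectors_)          # tuples lying on H1 / H2
```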
Why Is SVM Effective on High Dimensional Data?
■ The complexity of the trained classifier is characterized by the # of support vectors rather than the dimensionality of the data
■ The support vectors are the essential or critical training examples — they lie closest to the decision boundary (MMH)
■ If all other training examples are removed and the training is repeated, the same separating hyperplane would be found
■ The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
■ Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high
SVM—Linearly Inseparable
[Figure: linearly inseparable data in the original 2-D input space (axes A1, A2)]
■ Transform the original input data into a higher dimensional space, then search for a linear separating hyperplane in the new space
■ SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters)
Scaling SVM by Hierarchical Micro-Clustering
Prediction
■ Linear regression: Y = w0 + w1 X, with coefficients estimated by the least squares method:
w1 = Σ_{i=1}^{|D|} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{|D|} (xi − x̄)² ,  w0 = ȳ − w1 x̄
■ Non-linear regression
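A sketch of the least-squares estimate above (standard library only; the (x, y) pairs are illustrative, not from the slides):

```python
# Least-squares fit of Y = w0 + w1*X -- a sketch of the formula above.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar

print("w1 =", w1, "w0 =", w0)
print("prediction for x = 6:", w0 + w1 * 6)
```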