Slide 10: Chapter 9, Classification: Advanced Methods

Chapter 9 discusses advanced classification methods, including Bayesian Belief Networks, Backpropagation, Support Vector Machines (SVM), and Lazy Learners. It covers the structure and training of Bayesian networks, the mechanics of neural networks, and the principles behind SVMs for both linear and nonlinear data. Additionally, it introduces ensemble learning techniques such as Bagging and Random Forests to improve classification accuracy.

Chapter 9.

Classification:
Advanced Methods
HUI-YIN CHANG (張彙音)

1
Bayesian Belief Networks
Bayesian belief networks (also known as Bayesian networks,
probabilistic networks): allow class conditional independencies
between subsets of variables

A (directed acyclic) graphical model of causal relationships


◦ Represents dependency among the variables
◦ Gives a specification of joint probability distribution
❑ Nodes: random variables
❑ Links: dependency
❑ Example DAG over nodes X, Y, Z, P: X and Y are the parents of Z, and Y is the parent of P
❑ No dependency between Z and P
❑ Has no loops/cycles
2
Bayesian Belief Network: An Example
[Figure: Bayesian belief network with nodes FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea; FH and S are the parents of LC.]

CPT: Conditional Probability Table for the variable LungCancer, showing the conditional probability for each possible combination of its parents, FamilyHistory (FH) and Smoker (S):

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9

Derivation of the probability of a particular combination of values of X, from the CPT:

P(x_1, ..., x_n) = ∏_{i=1}^{n} P(x_i | Parents(Y_i))
3
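To make the factorization concrete, here is a minimal sketch in Python. Only the LungCancer CPT comes from the slide; the priors for FamilyHistory and Smoker are made-up placeholder values, and the network is reduced to just these three variables.

```python
# A minimal sketch (our own toy, not from the slide) of the factorization above:
# the joint probability is the product of each variable's CPT entry given its parents.
lung_cancer_cpt = {              # P(LC = yes | FH, S), from the CPT on the slide
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}
p_fh = 0.3   # hypothetical prior P(FamilyHistory = yes)
p_s = 0.4    # hypothetical prior P(Smoker = yes)

def joint(fh: bool, s: bool, lc: bool) -> float:
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S)."""
    p_lc = lung_cancer_cpt[(fh, s)]
    return ((p_fh if fh else 1 - p_fh)
            * (p_s if s else 1 - p_s)
            * (p_lc if lc else 1 - p_lc))

print(joint(fh=True, s=True, lc=True))   # 0.3 * 0.4 * 0.8 = 0.096
```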
Training Bayesian Networks: Several Scenarios
Scenario 1: Given both the network structure and all variables observable:
compute only the CPT entries
Scenario 2: Network structure known, some variables hidden: gradient descent
(greedy hill-climbing) method, i.e., search for a solution along the steepest
descent of a criterion function
◦ Weights are initialized to random probability values
◦ At each iteration, it moves towards what appears to be the best solution at the
moment, without backtracking
◦ Weights are updated at each iteration & converge to local optimum
Scenario 3: Network structure unknown, all variables observable: search through
the model space to reconstruct network topology
Scenario 4: Unknown structure, all hidden variables: No good algorithms known
for this purpose
D. Heckerman. A Tutorial on Learning with Bayesian Networks. In Learning in
Graphical Models, M. Jordan, ed., MIT Press, 1999.

4
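For Scenario 1 (structure known, all variables observable), the CPT entries can be estimated by simple counting. Below is a minimal sketch assuming fully observed binary tuples; the toy data is hypothetical.

```python
# Estimate P(LungCancer = yes | FH, S) by counting: count(FH, S, LC=yes) / count(FH, S).
from collections import Counter

# hypothetical observed tuples: (family_history, smoker, lung_cancer)
data = [(True, True, True), (True, True, False), (True, True, True),
        (False, True, False), (False, False, False), (True, False, True)]

parent_counts = Counter((fh, s) for fh, s, _ in data)         # count(FH, S)
joint_counts = Counter((fh, s) for fh, s, lc in data if lc)   # count(FH, S, LC = yes)

def cpt_entry(fh: bool, s: bool) -> float:
    """Maximum-likelihood estimate of P(LungCancer = yes | FH = fh, S = s)."""
    return joint_counts[(fh, s)] / parent_counts[(fh, s)]

print(cpt_entry(True, True))   # 2/3 with the toy data above
```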
Chapter 9. Classification: Advanced Methods

Bayesian Belief Networks

Classification by Backpropagation

Support Vector Machines

Classification by Using Frequent Patterns

Lazy Learners (or Learning from Your Neighbors)

Other Classification Methods

Additional Topics Regarding Classification

Summary

5
A Multi-Layer Feed-Forward Neural Network
[Figure: a multi-layer feed-forward neural network. The input vector X feeds the input layer, which connects through weights w_ij to the hidden layer, and the hidden layer connects to the output layer, which produces the output vector.]

Weight update rule (for weight w_j at iteration k, with learning rate λ):

w_j^(k+1) = w_j^(k) + λ (y_i − ŷ_i^(k)) x_ij
6
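The update shown on the slide is a perceptron-style (delta-rule) step for a single unit rather than full backpropagation through hidden layers. A minimal sketch, with lam standing in for the learning rate λ and toy data of our own:

```python
import numpy as np

def train_linear_unit(X, y, lam=0.1, epochs=20):
    """X: (n_samples, n_features); y: targets. Applies w <- w + lam*(y_i - y_hat_i)*x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = float(np.dot(w, x_i))        # current prediction ŷ_i
            w = w + lam * (y_i - y_hat) * x_i    # w^(k+1) = w^(k) + λ (y_i − ŷ_i) x_i
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])               # target concept: y = first feature
print(train_linear_unit(X, y))                   # converges toward approximately [1, 0]
```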
Classification by Backpropagation
Backpropagation: A neural network learning algorithm
Started by psychologists and neurobiologists to develop and test
computational analogues of neurons
A neural network: A set of connected input/output units where
each connection has a weight associated with it
During the learning phase, the network learns by adjusting the
weights so as to be able to predict the correct class label of the
input tuples
Also referred to as connectionist learning due to the connections
between units
7
Neural Network as a Classifier
Weakness
◦ Long training time
◦ Require a number of parameters typically best determined empirically, e.g.,
the network topology or “structure.”
◦ Poor interpretability: Difficult to interpret the symbolic meaning behind the
learned weights and of “hidden units” in the network

Strength
◦ High tolerance to noisy data
◦ Ability to classify untrained patterns
◦ Well-suited for continuous-valued inputs and outputs
◦ Successful on an array of real-world data, e.g., hand-written letters
◦ Algorithms are inherently parallel
◦ Techniques have recently been developed for the extraction of rules from
trained neural networks
8
Classification: A Mathematical Mapping
Classification: predicts categorical class labels
◦ E.g., Personal homepage classification
◦ xi = (x1, x2, x3, …), yi = +1 or –1
◦ x1 : # of word “homepage”
◦ x2 : # of word “welcome”

Mathematically, x ∈ X = ℝ^n, y ∈ Y = {+1, −1}
◦ We want to derive a function f: X → Y

Linear Classification
◦ Binary classification problem
◦ Data above the red line belongs to class ‘x’
◦ Data below the red line belongs to class ‘o’
◦ Examples: SVM, Perceptron, Probabilistic Classifiers

9
Discriminative Classifiers
Advantages
◦ Prediction accuracy is generally high
  ◦ As compared to Bayesian methods, in general
◦ Robust, works when training examples contain errors
◦ Fast evaluation of the learned target function
  ◦ Bayesian networks are normally slow

Criticism
◦ Long training time
◦ Difficult to understand the learned function (weights)
  ◦ Bayesian networks can be used easily for pattern discovery
◦ Not easy to incorporate domain knowledge
  ◦ Easy in Bayesian networks, in the form of priors on the data or distributions

10
SVM—Support Vector Machines
A relatively new classification method for both linear and nonlinear
data
It uses a nonlinear mapping to transform the original training data
into a higher dimension
With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated by a
hyperplane
SVM finds this hyperplane using support vectors (“essential”
training tuples) and margins (defined by the support vectors)

11
SVM—General Philosophy

[Figure: two candidate separating hyperplanes, one with a small margin and one with a large margin; the training tuples closest to the hyperplane are the support vectors.]

12
SVM—History and Applications
Vapnik and colleagues (1992)—groundwork from Vapnik &
Chervonenkis’ statistical learning theory in 1960s

Features: training can be slow but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)

Used for: classification and numeric prediction

Applications:
◦ handwritten digit recognition, object recognition, speaker
identification, benchmarking time-series prediction tests

13
SVM—When Data Is Linearly Separable

Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple and yi its associated class label
There are infinite lines (hyperplanes) separating the two classes but we want to
find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)

14
SVM—Linearly Separable
◼ A separating hyperplane can be written as
W●X+b=0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
◼ For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
◼ The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
◼ Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
◼ This becomes a constrained (convex) quadratic optimization
problem: Quadratic objective function and linear constraints →
Quadratic Programming (QP) → Lagrangian multipliers
15
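As a hedged illustration of the separating hyperplane W·X + b = 0 and the support vectors, the sketch below uses scikit-learn (our library choice; the slides list LIBSVM, SVM-light, and SVM-torch later) on a tiny linearly separable set.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [2, 0], [0, 0], [1, 0], [0, 1]], dtype=float)
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)     # a very large C approximates the hard-margin case
clf.fit(X, y)

W, b = clf.coef_[0], clf.intercept_[0]
print("W =", W, "b =", b)             # the maximum marginal hyperplane W·X + b = 0
print("support vectors:")
print(clf.support_vectors_)           # training tuples falling on the margin sides H1/H2
```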
Why Is SVM Effective on High Dimensional Data?

◼ The complexity of trained classifier is characterized by the # of


support vectors rather than the dimensionality of the data
◼ The support vectors are the essential or critical training examples —
they lie closest to the decision boundary (MMH)
◼ If all other training examples are removed and the training is repeated,
the same separating hyperplane would be found
◼ The number of support vectors found can be used to compute an
(upper) bound on the expected error rate of the SVM classifier, which
is independent of the data dimensionality
◼ Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high

16
SVM: Different Kernel functions
◼ Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi)·Φ(Xj)
◼ Typical kernel functions:
  ◦ Polynomial kernel of degree h: K(Xi, Xj) = (Xi·Xj + 1)^h
  ◦ Gaussian radial basis function (RBF) kernel: K(Xi, Xj) = exp(−‖Xi − Xj‖² / 2σ²)
  ◦ Sigmoid kernel: K(Xi, Xj) = tanh(κ Xi·Xj − δ)

◼ SVM can also be used for classifying multiple (> 2) classes


and for regression analysis (with additional parameters)

17
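A small numeric check of the identity K(Xi, Xj) = Φ(Xi)·Φ(Xj), using the homogeneous degree-2 polynomial kernel and an explicit mapping Φ of our own choosing:

```python
# For K(a, b) = (a·b)^2 in 2-D, the explicit mapping Φ(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
# gives exactly the same value, so the higher-dimensional dot product never has to be
# formed explicitly.
import math
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1]])

def poly_kernel(a, b):
    return float(np.dot(a, b)) ** 2

a, b = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(poly_kernel(a, b))              # 16.0, computed in the original 2-D space
print(float(np.dot(phi(a), phi(b))))  # 16.0, the same value via the explicit mapping Φ
```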
SVM Related Links
SVM Website: http://www.kernel-machines.org/

Representative implementations
◦ LIBSVM: an efficient implementation of SVM, multi-class
classifications, nu-SVM, one-class SVM, including also various
interfaces with java, python, etc.
◦ SVM-light: simpler, but its performance is not better than LIBSVM; supports
only binary classification and is available only in C
◦ SVM-torch: another recent implementation also written in C

18
Lazy vs. Eager Learning
Lazy vs. eager learning
◦ Lazy learning (e.g., instance-based learning): Simply stores
training data (or only minor processing) and waits until it is
given a test tuple
◦ Eager learning (the above discussed methods): Given a set of
training tuples, constructs a classification model before
receiving new (e.g., test) data to classify
Lazy: less time in training but more time in predicting
Accuracy
◦ Lazy method effectively uses a richer hypothesis space since it
uses many local linear functions to form an implicit global
approximation to the target function
◦ Eager: must commit to a single hypothesis that covers the entire
instance space

19
Lazy Learner: Instance-Based Methods

Instance-based learning:
◦ Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
Typical approaches
◦ k-nearest neighbor approach
◦ Instances represented as points in a Euclidean space.

◦ Locally weighted regression


◦ Constructs local approximation

◦ Case-based reasoning
◦ Uses symbolic representations and knowledge-based inference

20
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean
distance, dist(X1, X2)
Target function could be discrete- or real- valued
For discrete-valued, k-NN returns the most common value
among the k training examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN for a
typical set of training examples

[Figure: query point x_q among training examples labeled “+” and “−”; k-NN classifies x_q by the majority label of its k nearest neighbors.]
21
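A minimal sketch of the discrete-valued k-NN rule just described; the toy training points are made up for illustration:

```python
# Rank the training tuples by Euclidean distance to the query x_q and return the most
# common label among the k closest.
import math
from collections import Counter

def knn_classify(train, x_q, k=3):
    """train: list of (feature_tuple, label); x_q: query feature tuple."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda item: dist(item[0], x_q))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1, 1), "+"), ((1, 2), "+"), ((5, 5), "-"), ((6, 5), "-"), ((5, 6), "-")]
print(knn_classify(train, (2, 1), k=3))   # '+': the two closest neighbors are positive
```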
Discussion on the k-NN Algorithm
k-NN for real-valued prediction for a given unknown tuple
◦ Returns the mean values of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
◦ Weight the contribution of each of the k neighbors according to their distance to the query x_q
◦ Give greater weight to closer neighbors: w ≡ 1 / d(x_q, x_i)²
Robust to noisy data by averaging k-nearest neighbors
Curse of dimensionality: distance between neighbors could be
dominated by irrelevant attributes
◦ To overcome it, stretch the axes or eliminate the least relevant attributes
22
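For real-valued prediction with the weighting w ≡ 1/d(x_q, x_i)², a minimal sketch (the toy data and the small epsilon guard against division by zero are our own additions):

```python
# Each of the k nearest neighbors contributes with weight 1/d^2; the prediction is the
# weighted mean of their target values.
import math

def weighted_knn_predict(train, x_q, k=3, eps=1e-12):
    """train: list of (feature_tuple, real_value); returns a distance-weighted average."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda item: dist(item[0], x_q))[:k]
    weights = [1.0 / (dist(x, x_q) ** 2 + eps) for x, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

train = [((1.0,), 2.0), ((2.0,), 4.0), ((3.0,), 6.0), ((10.0,), 20.0)]
print(weighted_knn_predict(train, (2.5,), k=3))   # ≈ 4.84, dominated by the neighbors at 2 and 3
```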
Genetic Algorithms (GA)
Genetic Algorithm: based on an analogy to biological evolution
An initial population is created consisting of randomly generated rules
◦ Each rule is represented by a string of bits
◦ E.g., if A1 and ¬A2 then C2 can be encoded as 100
◦ If an attribute has k > 2 values, k bits can be used
Based on the notion of survival of the fittest, a new population is formed to consist
of the fittest rules and their offspring
The fitness of a rule is represented by its classification accuracy on a set of training
examples
Offspring are generated by crossover and mutation
The process continues until a population P evolves when each rule in P satisfies a
prespecified fitness threshold
Slow but easily parallelizable

23
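A toy sketch of the GA loop described above. The 4-bit rule encoding (a lookup table over the attribute values A1, A2) is a hypothetical simplification, not the slide's "IF A1 AND NOT A2 THEN C2 → 100" scheme:

```python
# Each individual is a bit string, its fitness is classification accuracy on training
# tuples, the fittest rules survive, and offspring come from crossover and mutation.
import random

random.seed(0)

# toy training tuples (A1, A2, class); the target concept is class = A1
train = [(1, 0, 1), (1, 1, 1), (0, 1, 0), (0, 0, 0), (1, 0, 1), (0, 1, 0)]

def fitness(bits):
    # bits[2*A1 + A2] is the class the rule predicts for attribute values (A1, A2)
    return sum(int(bits[2 * a1 + a2] == y) for a1, a2, y in train) / len(train)

def crossover(a, b):
    p = random.randrange(1, len(a))       # single-point crossover
    return a[:p] + b[p:]

def mutate(bits, rate=0.1):
    return [1 - b if random.random() < rate else b for b in bits]

population = [[random.randint(0, 1) for _ in range(4)] for _ in range(20)]
for _ in range(30):                                        # evolve for 30 generations
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                              # survival of the fittest
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children

best = max(population, key=fitness)
print(best, fitness(best))     # converges to the rule [0, 0, 1, 1], i.e. class = A1
```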
Ensemble Learning
Ensemble learning
◦ Combines multiple models so that the final result is better than any single model. For example, on a given dataset one could make predictions with k-NN, linear regression, a decision tree, and so on; ensemble learning combines the results of these methods to obtain a better outcome than a single model. Common methods include Bagging, Boosting, AdaBoost, and Random Forest.
Ensemble Learning
Bagging
◦ The idea of Bagging is simple: draw random samples from the training data (with replacement, n < N) and train multiple classifiers (how many is up to you). Every classifier gets the same weight, and the final result is decided by majority vote. This sampling scheme is known in statistics as the bootstrap (a minimal Bagging sketch follows after this list).

◦ An advantage of Bagging is that when the original training samples contain noisy (bad) data, bootstrap sampling gives a chance that the noisy data is left out of training, so it can reduce the model's instability.
◦ Example: Random Forest
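A minimal sketch of Bagging as described above: bootstrap samples, one classifier per sample, majority vote. The decision-tree base learner from scikit-learn is an assumption; the slide leaves the choice of classifier open.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_classifiers=11, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, len(X), size=len(X))    # bootstrap: n draws with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    votes = [m.predict([x])[0] for m in models]       # every classifier gets equal weight
    return Counter(votes).most_common(1)[0][0]        # majority vote

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
models = bagging_fit(X, y)
print(bagging_predict(models, [5.5, 5.5]))            # expected: 1
```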
Random Forest
A random forest is essentially an advanced version of a decision tree: the "forest" is simply a collection of many decision trees. A random forest is an ensemble learning algorithm produced by combining Bagging with random feature sampling.
Random forest - steps
Step 1: Draw n' tuples from the training set
Step 2: From the n' tuples, randomly pick k features as the sample
Step 3: Repeat m times to produce m decision trees
Step 4: Classification: predict by majority vote; regression: predict by averaging

The "random" in random forest can be explained in two ways. The first is random sampling: when the model is trained, each tree is grown from n' tuples drawn at random from the training set, and these n' tuples may be drawn repeatedly (with replacement). This way of drawing data is also called the bootstrap, a statistical estimation method in common use. The second is random feature selection: every tree randomly picks k features from the n' tuples as its sample.
Random forest - advantages
Which training tuples and features each tree uses is decided at random
Uses a vote over many decision trees to improve on a single decision tree
Less prone to overfitting than a single decision tree
Every tree in the random forest is independent
Both training and prediction can run in parallel across the trees (see the sketch below)
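A hedged sketch of the random-forest recipe above using scikit-learn's RandomForestClassifier (a library choice of ours, not named on the slide). Note that scikit-learn resamples max_features candidate features at every split rather than once per tree as in Step 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

forest = RandomForestClassifier(n_estimators=50,      # m trees
                                max_features="sqrt",  # k randomly chosen features
                                bootstrap=True,       # n' resampled tuples per tree
                                random_state=0)
forest.fit(X, y)
print(forest.predict([[5, 6], [0, 1]]))   # majority vote over the trees: [1 0]
```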
Ensemble Learning
Boosting
◦ Boosting combines many weak classifiers into one strong classifier. Unlike Bagging, the classifiers are related to one another: the weights of the data misclassified by the previous classifier are increased before a new classifier is trained, so the new classifier learns the characteristics of the misclassified data and the classification result improves.

◦ Boosting has two key issues: how to change the weights of the training data, and how to combine several weak classifiers into one strong classifier. It also has a significant drawback: the algorithm requires a lower bound on the weak classifiers' recognition accuracy to be known in advance.
◦ Examples: AdaBoost, XGBoost
AdaBoost

AdaBoost: Adaptive Boosting

Proposed by Yoav Freund and Robert Schapire in 1995
Adaptive: samples misclassified by the previous base classifier are given more weight, and the reweighted samples are used to train the next base classifier. In each round a new weak classifier is added, until a predefined, sufficiently small error rate is reached or a prespecified maximum number of iterations is hit.
AdaBoost combines multiple different decision trees in a non-random way and shows remarkable performance!
AdaBoost properties
Raises decision-tree accuracy substantially, comparable to SVM
Fast, with essentially no parameters to tune
Hardly ever overfits
Steps of the AdaBoost algorithm
Initialize the training-instance weights uniformly, so every instance starts with equal importance.
For a given number of iterations (or until a stopping condition is reached):
◦ Train a weak classifier on the weighted training data, where the weights emphasize the instances misclassified in previous iterations.
◦ Compute the weak classifier's error rate, which indicates how well it performs on the training data.
◦ Compute the classifier's weight from its error rate, giving more accurate classifiers higher weight.
◦ Adjust the instance weights: increase the weights of misclassified instances and decrease the weights of correctly classified instances.

Combine the weak classifiers by assigning each a weight proportional to its accuracy.
The final classification is decided by the weighted combination of the weak classifiers, where each classifier's weight depends on its accuracy. (A minimal sketch of this loop follows below.)
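A minimal sketch of this loop with one-feature threshold stumps as the weak classifiers. The classifier weight α = 0.5·ln((1 − e)/e) and the exponential instance-weight update are the standard AdaBoost choices; they also reproduce the α values in the worked example that follows. The 1-D demo data is our own, not the slide's 10-point example (whose coordinates are in a figure not reproduced here).

```python
import math
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    # predicts `polarity` (+1 or -1) when X[:, feature] <= threshold, else -polarity
    return np.where(X[:, feature] <= threshold, polarity, -polarity)

def adaboost(X, y, stumps, rounds):
    n = len(y)
    D = np.full(n, 1.0 / n)                      # uniform initial instance weights
    ensemble = []
    for _ in range(rounds):
        # select the weak classifier with the smallest weighted error
        errs = [(D[stump_predict(X, f, t, p) != y].sum(), f, t, p) for f, t, p in stumps]
        e, f, t, p = min(errs)                   # assumes 0 < e < 1 for every candidate
        alpha = 0.5 * math.log((1 - e) / e)      # classifier weight from its error rate
        pred = stump_predict(X, f, t, p)
        D = D * np.exp(-alpha * y * pred)        # raise weights of misclassified instances,
        D = D / D.sum()                          # lower the others, then renormalize
        ensemble.append((alpha, f, t, p))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(a * stump_predict(X, f, t, p) for a, f, t, p in ensemble)
    return np.sign(score)                        # weighted combination of weak classifiers

# hypothetical 1-D demo: three candidate stumps, three boosting rounds
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
ens = adaboost(X, y, stumps=[(0, 2.5, 1), (0, 8.5, 1), (0, 5.5, -1)], rounds=3)
print((adaboost_predict(ens, X) == y).all())     # True: all 10 points classified correctly
```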
An example of Adaboost
Given the 10 data points (each has a label, Y) and three weak
models, how do we build a general model for classification?

Step 1:
An example of Adaboost
Step 2:
An example of Adaboost
Step 2:

Under the weight distribution D1, take the classifier with the smallest error rate among the three known weak classifiers h1, h2, h3 as the first base classifier H1(x) (all three weak classifiers have an error rate of 0.3, so just take the first one).

With H1(x) = h1, the sample points 5, 7, 8 are misclassified, so the error rate of the base classifier H1(x) is:
e1 = 0.1 + 0.1 + 0.1 = 0.3

The sum of the weights of the misclassified samples determines the error rate e, and the error rate e determines the weight α of the base classifier in the final classifier.
An example of Adaboost
Step 2:
Update the weight distribution of the training samples for the next iteration. The weights of the correctly classified training samples 1, 2, 3, 4, 6, 9, 10 (7 in total) are updated to 1/14 each, and those of the misclassified samples to 1/6 each.

After the first iteration, the new weight distribution of the samples is:
D2 = [1/14, 1/14, 1/14, 1/14, 1/6, 1/14, 1/6, 1/6, 1/14, 1/14]
An example of Adaboost
Step 2:
Because samples 5, 7, 8 were misclassified by H1(x), their weights increase from 0.1 to 1/6; conversely, all other samples were classified correctly, so their weights decrease from 0.1 to 1/14. The table shows how the weight distribution changes.

This gives the classification function f1(x) = α1 H1(x) = 0.4236 H1(x). At this point, the strong classifier sign(f1(x)) built from one base classifier has 3 misclassified points on the training set (namely 5, 7, 8), so the training error of the strong classifier is 0.3.
An example of Adaboost
Step 3:
Second iteration, t = 2:
Under the weight distribution D2, again take the classifier with the smallest error rate among h1, h2, h3 as the second base classifier H2(x):
① Weak classifier h1 (X1 = 2.5): the misclassified sample points are 5, 7, 8; error rate e = 1/6 + 1/6 + 1/6 = 3/6 = 1/2
② Weak classifier h2 (X1 = 8.5): the misclassified sample points are 3, 4, 6; error rate e = 1/14 + 1/14 + 1/14 = 3/14
③ Weak classifier h3 (X2 = 6.5): the misclassified sample points are 1, 2, 9; error rate e = 1/14 + 1/14 + 1/14 = 3/14

Take the classifier h2 with the currently smallest error as the second base classifier H2(x).
An example of Adaboost
Step 3:
H2(x) misclassifies samples 3, 4, 6. From D2 their weights are D2(3) = 1/14, D2(4) = 1/14, D2(6) = 1/14, so the error rate of H2(x) on the training data is:
e2 = 1/14 + 1/14 + 1/14 = 3/14

After the second iteration, the new weight distribution of the samples is:
D3 = [1/22, 1/22, 1/6, 1/6, 7/66, 1/6, 7/66, 7/66, 1/22, 1/22]
An example of Adaboost
Step 3:

Classification function: f2(x) = 0.4236 H1(x) + 0.6496 H2(x). At this point, the strong classifier sign(f2(x)) built from two base classifiers has 3 misclassified points on the training set (namely 3, 4, 6), so the training error of the strong classifier is 0.3.
An example of Adaboost
Step 4:
Third iteration, t = 3:
Under the weight distribution D3, again take the classifier with the smallest error rate among h1, h2, h3 as the third base classifier H3(x):
① Weak classifier h1 (X1 = 2.5): the misclassified sample points are 5, 7, 8; error rate e = 7/66 + 7/66 + 7/66 = 7/22
② Weak classifier h2 (X1 = 8.5): the misclassified sample points are 3, 4, 6; error rate e = 1/6 + 1/6 + 1/6 = 1/2 = 0.5
③ Weak classifier h3 (X2 = 6.5): the misclassified sample points are 1, 2, 9; error rate e = 1/22 + 1/22 + 1/22 = 3/22

Take the classifier h3 with the currently smallest error as the third base classifier H3(x).
An example of Adaboost
Step 4:

After the third iteration, the new weight distribution of the samples is:
D4 = [1/6, 1/6, 11/114, 11/114, 7/114, 11/114, 7/114, 7/114, 1/6, 1/38]
An example of Adaboost
Step 4:

Classification function: f3(x) = 0.4236 H1(x) + 0.6496 H2(x) + 0.9229 H3(x). At this point, the strong classifier sign(f3(x)) built from three base classifiers has 0 misclassified points on the training set, and the training process is complete.
Combining all the classifiers gives the final strong classifier:
Hfinal(x) = sign(f3(x)) = sign(0.4236 H1(x) + 0.6496 H2(x) + 0.9229 H3(x))

This strong classifier Hfinal has an error rate of 0 on the training samples!
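A quick numeric check, assuming the standard AdaBoost formula α = 0.5·ln((1 − e)/e), that the error rates in the example (e1 = 0.3, e2 = 3/14, e3 = 3/22) give the classifier weights used in f3(x):

```python
import math

for e in (0.3, 3 / 14, 3 / 22):
    print(round(0.5 * math.log((1 - e) / e), 4))
# prints 0.4236, 0.6496, 0.9229, matching the coefficients of H1, H2, H3 above
```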
Pros and Cons of Adaboost
Pros:
◦ AdaBoost provides a framework within which sub-classifiers can be built with a variety of methods. Simple weak classifiers can be used without feature selection, and overfitting is rare.
◦ AdaBoost needs no prior knowledge about the weak classifiers, and the accuracy of the final strong classifier depends on all of the weak classifiers. Whether applied to synthetic or real data, AdaBoost significantly improves learning accuracy.
◦ AdaBoost does not need to know an upper bound on the weak classifiers' error rates in advance; the accuracy of the final strong classifier depends on the accuracy of all the weak classifiers, so the classifiers' capabilities can be exploited fully. AdaBoost adaptively adjusts the assumed error rate based on feedback from the weak classifiers and runs efficiently.
◦ AdaBoost trains different weak classifiers on the same training set and combines them according to a given rule into a strong classifier with high classification ability, i.e., "three cobblers with their wits combined beat Zhuge Liang" (many weak learners together can outdo a single expert).

Cons:
◦ During training, AdaBoost makes the weights of hard-to-classify samples grow exponentially, so training becomes overly biased toward these difficult samples, which makes AdaBoost susceptible to noise. In addition, AdaBoost depends on its weak classifiers, and training the weak classifiers is often time-consuming.

Some properties of AdaBoost are very attractive; two are highlighted here:
◦ The upper bound on the training error decreases as the number of iterations increases.
◦ Even after many rounds of training, AdaBoost does not overfit.
Bagging vs. Boosting
Training samples:
◦ Bagging: each training set is drawn at random with replacement (every sample has the same weight); the weak classifiers are trained on independently and identically distributed subsets of the training samples.
◦ Boosting: the training set itself does not change, but the selections are not independent; each round's training set depends on the result of the previous round of learning and is sampled according to the error rate (the training samples are given different weights).

Classifiers:
◦ Bagging: every classifier has equal weight.
◦ Boosting: each weak classifier has its own weight, and classifiers with smaller classification error receive larger weights.

How the classifiers are obtained:
◦ Bagging: the classifiers can be generated in parallel.
◦ Boosting: each weak classifier depends on the previous one and can only be generated sequentially.
Thanks for Your Attention
Q&A

46
