Classification:
Advanced Methods
HUI-YIN CHANG (張彙音)
Bayesian Belief Networks
Bayesian belief networks (also known as Bayesian networks,
probabilistic networks): allow class conditional independencies
between subsets of variables
Chapter 9. Classification: Advanced Methods
Classification by Backpropagation
Summary
A Multi-Layer Feed-Forward Neural Network
Weight update after presenting training tuple Xi: w_j^{(k+1)} = w_j^{(k)} + (y_i − ŷ_i^{(k)}) x_{ij}
[Figure: input vector X feeds the input layer; weighted connections w_ij lead through a hidden layer to the output layer, which produces the output vector]
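A minimal numpy sketch of the update rule above for a single unit; the learning rate lr and the toy usage values are assumptions added for illustration, not part of the slide.

```python
import numpy as np

def update_weights(w, x, y, y_hat, lr=0.1):
    """One step of the rule above: w <- w + lr * (y - y_hat) * x.
    The learning rate lr is an assumed hyperparameter (often written as lambda)."""
    return w + lr * (y - y_hat) * x

# Hypothetical usage with one training tuple x and its true label y
w = np.zeros(3)                    # current weights on the connections
x = np.array([1.0, 0.5, -0.2])     # input vector X
y = 1.0                            # desired output
y_hat = float(np.dot(w, x))        # unit's current output
w = update_weights(w, x, y, y_hat)
```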
Classification by Backpropagation
Backpropagation: A neural network learning algorithm
Started by psychologists and neurobiologists to develop and test
computational analogues of neurons
A neural network: A set of connected input/output units where
each connection has a weight associated with it
During the learning phase, the network learns by adjusting the
weights so as to be able to predict the correct class label of the
input tuples
Also referred to as connectionist learning due to the connections
between units
Neural Network as a Classifier
Weaknesses
◦ Long training time
◦ Require a number of parameters typically best determined empirically, e.g.,
the network topology or “structure.”
◦ Poor interpretability: Difficult to interpret the symbolic meaning behind the
learned weights and of “hidden units” in the network
Strengths
◦ High tolerance to noisy data
◦ Ability to classify untrained patterns
◦ Well-suited for continuous-valued inputs and outputs
◦ Successful on an array of real-world data, e.g., hand-written letters
◦ Algorithms are inherently parallel
◦ Techniques have recently been developed for the extraction of rules from
trained neural networks
Classification: A Mathematical Mapping
Classification: predicts categorical class labels
◦ E.g., Personal homepage classification
◦ xi = (x1, x2, x3, …), yi = +1 or –1
◦ x1 : # of word “homepage”
◦ x2 : # of word “welcome”
Discriminative Classifiers
Advantages
◦ Prediction accuracy is generally high
◦ As compared to Bayesian methods – in general
Criticism
◦ Long training time
◦ Difficult to understand the learned function (weights), whereas Bayesian networks can be used easily for pattern discovery
SVM—Support Vector Machines
A relatively new classification method for both linear and nonlinear
data
It uses a nonlinear mapping to transform the original training data
into a higher dimension
With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated by a
hyperplane
SVM finds this hyperplane using support vectors (“essential”
training tuples) and margins (defined by the support vectors)
SVM—General Philosophy
SVM—History and Applications
Vapnik and colleagues (1992)—groundwork from Vapnik &
Chervonenkis’ statistical learning theory in 1960s
Applications:
◦ handwritten digit recognition, object recognition, speaker
identification, benchmarking time-series prediction tests
SVM—When Data Is Linearly Separable
Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple with associated class label yi
There are infinitely many lines (hyperplanes) that separate the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
SVM—Linearly Separable
◼ A separating hyperplane can be written as
W · X + b = 0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
◼ For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
◼ The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
◼ Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
◼ This becomes a constrained (convex) quadratic optimization
problem: Quadratic objective function and linear constraints →
Quadratic Programming (QP) → Lagrangian multipliers
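In practice this QP is usually solved by a library rather than by hand. Below is a hedged sketch using scikit-learn's SVC with a linear kernel (an assumed dependency, not something the slide prescribes); the toy points X and labels y are illustrative only, and coef_/intercept_ correspond to W and b in W · X + b = 0.

```python
import numpy as np
from sklearn import svm

# Toy, linearly separable training tuples and their +1/-1 class labels (illustrative only)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = svm.SVC(kernel="linear", C=1e6)   # very large C approximates the hard-margin SVM
clf.fit(X, y)

W = clf.coef_[0]                        # weight vector W = {w1, ..., wn}
b = clf.intercept_[0]                   # bias b
print("Separating hyperplane: W · X + b = 0, W =", W, ", b =", b)
print("Support vectors (the 'essential' training tuples):")
print(clf.support_vectors_)
```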
Why Is SVM Effective on High Dimensional Data?
SVM: Different Kernel functions
◼ Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj)
◼ Typical Kernel Functions
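A small numeric check of the identity K(Xi, Xj) = Φ(Xi) · Φ(Xj). The degree-2 polynomial kernel and its explicit feature map Φ are my illustrative choice of a typical kernel; they are not taken from the slide.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input: Φ(x) = (x1^2, x2^2, sqrt(2)·x1·x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x · z)^2, computed on the original data."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

# Both routes give 121.0, so the dot product in the transformed space
# never has to be computed explicitly (the "kernel trick").
print(np.dot(phi(x), phi(z)), poly_kernel(x, z))
```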
SVM Related Links
SVM Website: http://www.kernel-machines.org/
Representative implementations
◦ LIBSVM: an efficient implementation of SVM supporting multi-class classification, nu-SVM, and one-class SVM, with interfaces for Java, Python, etc.
◦ SVM-light: simpler, but its performance is not better than LIBSVM's; supports only binary classification and is written only in C
◦ SVM-torch: another implementation, also written in C
Lazy vs. Eager Learning
Lazy vs. eager learning
◦ Lazy learning (e.g., instance-based learning): Simply stores
training data (or only minor processing) and waits until it is
given a test tuple
◦ Eager learning (the above discussed methods): Given a set of
training tuples, constructs a classification model before
receiving new (e.g., test) data to classify
Lazy: less time in training but more time in predicting
Accuracy
◦ Lazy method effectively uses a richer hypothesis space since it
uses many local linear functions to form an implicit global
approximation to the target function
◦ Eager: must commit to a single hypothesis that covers the entire
instance space
Lazy Learner: Instance-Based Methods
Instance-based learning:
◦ Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
Typical approaches
◦ k-nearest neighbor approach
◦ Instances represented as points in a Euclidean space.
◦ Case-based reasoning
◦ Uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean
distance, dist(X1, X2)
Target function could be discrete- or real- valued
For discrete-valued, k-NN returns the most common value
among the k training examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN for a
typical set of training examples
[Figure: positive (+) and negative (−) training points with a query point xq; the nearest neighbors of xq determine its class]
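A minimal sketch of the algorithm just described (Euclidean distance, majority vote among the k nearest training examples); the function name and array-based interface are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_q, k=3):
    """Return the most common class label among the k training examples
    nearest to the query point x_q, using Euclidean distance."""
    dists = np.linalg.norm(X_train - x_q, axis=1)   # dist(x_q, X_i) for every training tuple
    nearest = np.argsort(dists)[:k]                 # indices of the k nearest neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```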
Discussion on the k-NN Algorithm
k-NN for real-valued prediction for a given unknown tuple
◦ Returns the mean values of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
◦ Weight the contribution of each of the k neighbors according to
their distance to the query xq
◦ Give greater weight to closer neighbors, e.g., w ≡ 1 / d(x_q, x_i)²
Robust to noisy data by averaging k-nearest neighbors
Curse of dimensionality: distance between neighbors could be
dominated by irrelevant attributes
◦ To overcome it, stretch the axes or eliminate the least relevant attributes
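A sketch of the distance-weighted variant for real-valued prediction, using the 1/d² weights above; the names and the eps guard are illustrative assumptions.

```python
import numpy as np

def knn_predict_weighted(X_train, y_train, x_q, k=5, eps=1e-12):
    """Distance-weighted k-NN prediction: each of the k nearest neighbors
    contributes with weight w_i = 1 / d(x_q, x_i)^2 (closer neighbors count more)."""
    dists = np.linalg.norm(X_train - x_q, axis=1)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] ** 2 + eps)           # eps guards against a zero distance
    return float(np.dot(w, y_train[nearest]) / w.sum())
```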
Genetic Algorithms (GA)
Genetic Algorithm: based on an analogy to biological evolution
An initial population is created consisting of randomly generated rules
◦ Each rule is represented by a string of bits
◦ E.g., if A1 and ¬A2 then C2 can be encoded as 100
◦ If an attribute has k > 2 values, k bits can be used
Based on the notion of survival of the fittest, a new population is formed to consist
of the fittest rules and their offspring
The fitness of a rule is represented by its classification accuracy on a set of training
examples
Offspring are generated by crossover and mutation
The process continues until a population P evolves in which every rule satisfies a prespecified fitness threshold
Slow but easily parallelizable
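A compact sketch of the loop described above: a population of bit-string rules, fitness given by classification accuracy on training examples, survival of the fittest, single-point crossover, and bit-flip mutation. The population size, rates, and the stand-in fitness function in the usage line are assumptions for illustration.

```python
import random

def evolve(fitness, rule_len=8, pop_size=20, generations=50,
           crossover_rate=0.8, mutation_rate=0.01, threshold=0.95):
    """Evolve bit-string rules until every rule's fitness (e.g., its classification
    accuracy on a training set) reaches `threshold`, or the generation budget runs out."""
    pop = [''.join(random.choice('01') for _ in range(rule_len)) for _ in range(pop_size)]
    for _ in range(generations):
        if all(fitness(r) >= threshold for r in pop):
            break                                    # every rule meets the threshold
        pop.sort(key=fitness, reverse=True)          # survival of the fittest
        parents = pop[:pop_size // 2]
        offspring = []
        while len(parents) + len(offspring) < pop_size:
            p1, p2 = random.sample(parents, 2)
            if random.random() < crossover_rate:     # single-point crossover
                cut = random.randrange(1, rule_len)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1
            child = ''.join(b if random.random() > mutation_rate   # bit-flip mutation
                            else ('1' if b == '0' else '0') for b in child)
            offspring.append(child)
        pop = parents + offspring
    return max(pop, key=fitness)

# Hypothetical usage: the fraction of 1-bits stands in for a rule's training accuracy.
best_rule = evolve(fitness=lambda r: r.count('1') / len(r))
```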
Ensemble Learning
Ensemble learning
◦ Combines multiple models so that the combined result performs better than any single model. For example, on a given dataset we could make predictions with k-NN, linear regression, or a decision tree; ensemble learning integrates the results of such methods to obtain a better result than any single model. Common approaches include Bagging, Boosting, AdaBoost, and Random Forest.
Ensemble Learning
Bagging
◦ The idea of Bagging is simple: draw random samples from the training data (with replacement, n < N) and train multiple classifiers on them (the number of classifiers is chosen by the user). Each classifier has equal weight, and the final result is obtained by majority vote. In statistics this sampling method is called the bootstrap.
◦ The advantage of Bagging is that when the original training data contain noisy (bad) samples, bootstrap sampling gives a chance that the noisy samples are not used in training, which reduces the model's instability.
◦ Example: Random Forest
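A sketch of the procedure just described: bootstrap samples drawn with replacement, one classifier per sample, and an equal-weight majority vote. Using scikit-learn's DecisionTreeClassifier as the base learner and assuming non-negative integer class labels are choices made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, seed=0):
    """Train n_estimators classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))        # sample with replacement (bootstrap)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Equal-weight majority vote over the classifiers (assumes non-negative integer labels)."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)   # (n_estimators, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```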
Random forest
A random forest is essentially an upgraded form of the decision tree: the "forest" is composed of many decision trees. A random forest is an ensemble learning algorithm produced by combining Bagging with random feature sampling.
Random forest - steps
Step 1: Draw n' samples from the training set.
Step 2: From the n' samples, randomly select k features.
Step 3: Repeat m times to grow m decision trees.
Step 4: Classification: predict by majority vote; regression: predict by averaging.
The "random" in random forest has two aspects. First, random sampling: to grow each tree, n' samples are drawn from the training set, and these n' samples may be drawn repeatedly (with replacement). This sampling scheme is called the bootstrap, a commonly used estimation method in statistics. Second, random feature selection: each tree randomly selects k features from the n' samples.
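The four steps map directly onto scikit-learn's RandomForestClassifier (an assumed dependency): n_estimators plays the role of m and max_features the role of k, with the minor difference that scikit-learn samples the k features at every split rather than once per tree. The toy dataset is generated purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data, purely for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # Step 3: m trees
    bootstrap=True,        # Step 1: each tree sees n' samples drawn with replacement
    max_features="sqrt",   # Step 2: k randomly chosen features (here, per split)
    n_jobs=-1,             # the trees are independent, so they can be built in parallel
    random_state=0,
)
forest.fit(X, y)

# Step 4: classification by majority vote over the trees
print(forest.predict(X[:5]))
```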
Random forest - advantages
Which training samples and features each tree uses is decided at random
Uses voting over many decision trees to improve on a single decision tree
Less prone to overfitting than a single decision tree
Every tree in a random forest is independent
Each tree can be trained and used for prediction in parallel
Ensemble Learning
Boosting
◦ Boosting combines many weak classifiers into a strong classifier. Unlike Bagging, the classifiers depend on one another: the weights of the samples misclassified by the previous classifier are increased before a new classifier is trained, so the new classifier learns the characteristics of the misclassified data and improves the overall result.
◦ Boosting has two key questions: how to change the weights of the training data, and how to combine the weak classifiers into a strong classifier. There is also a major drawback: the algorithm requires a lower bound on the weak classifiers' accuracy to be known in advance.
◦ Examples: AdaBoost, XGBoost
Adaboost
AdaBoost combines the weak classifiers by assigning each one a weight proportional to its accuracy.
The final classification is decided by a weighted combination of the weak classifiers, where each classifier's weight depends on its accuracy.
An example of Adaboost
Given the 10 data points (each has a label, Y) and three weak
models, how do we build a general model for classification?
Step 1:
An example of Adaboost
Step 2:
An example of Adaboost
Step 2:
Under the weight distribution D1, the classifier with the smallest error rate among the three given weak classifiers h1, h2, and h3 is chosen as the first base classifier H1(x). (All three weak classifiers have an error rate of 0.3, so we simply take the first one.)
With H1(x) = h1, the sample points 5, 7, and 8 are misclassified, so the error rate of the base classifier H1(x) is e1 = 0.1 + 0.1 + 0.1 = 0.3.
The sum of the weights of the misclassified samples gives the error rate e, and e in turn determines the weight α of the base classifier in the final classifier.
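For reference, the standard AdaBoost definitions behind these two quantities can be written in LaTeX as follows; they reproduce the numbers used in this example.

```latex
% Error rate of the t-th base classifier under the weight distribution D_t:
e_t = \sum_{i:\,H_t(x_i) \neq y_i} D_t(i)
% Weight of H_t in the final classifier:
\alpha_t = \tfrac{1}{2} \ln \frac{1 - e_t}{e_t}
% Weight update for the next round (Z_t normalizes D_{t+1} to sum to 1):
D_{t+1}(i) = \frac{D_t(i) \, \exp\!\bigl(-\alpha_t \, y_i \, H_t(x_i)\bigr)}{Z_t}
```

With e1 = 0.3 these give α1 = ½ ln(0.7/0.3) ≈ 0.4236 and the D2 values listed on the next slide, matching the coefficients used later in f2(x) and f3(x).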
An example of Adaboost
Step 2:
Update the weight distribution of the training samples for the next iteration. The weights of the correctly classified training samples 1, 2, 3, 4, 6, 9, 10 (7 samples) are each updated to 1/14.
After the first round of iteration, the new weight distribution over the samples is:
D2 = [1/14, 1/14, 1/14, 1/14, 1/6, 1/14, 1/6, 1/6, 1/14, 1/14]
An example of Adaboost
Step 2:
Because samples 5, 7, and 8 were misclassified by H1(x), their weights increase from 0.1 to 1/6; conversely, all other samples were classified correctly, so their weights decrease from 0.1 to 1/14.
The classifier with the smallest current error rate, h2, is then taken as the second base classifier H2(x).
An example of Adaboost
Step 3:
H2(x) misclassifies samples 3, 4, and 6. From D2, their weights are D2(3) = 1/14, D2(4) = 1/14, and D2(6) = 1/14, so the error rate of H2(x) on the training data is e2 = 1/14 + 1/14 + 1/14 = 3/14.
After the second round of iteration, the new weight distribution over the samples is:
D3 = [1/22, 1/22, 1/6, 1/6, 7/66, 1/6, 7/66, 7/66, 1/22, 1/22]
An example of Adaboost
Step 3:
Classification function: f2(x) = 0.4236·H1(x) + 0.6496·H2(x). Using sign(f2(x)), the combination of the two base classifiers, as the strong classifier still misclassifies 3 points of the training data (namely 3, 4, and 6), so the training error of the strong classifier at this stage is 0.3.
An example of Adaboost
Step 4:
Third iteration, t = 3:
Under the weight distribution D3, again take the classifier with the smallest error rate among the three weak classifiers h1, h2, and h3 as the third base classifier H3(x):
① With weak classifier h1 (threshold X1 = 2.5), the misclassified sample points are 5, 7, 8: error rate e = 7/66 + 7/66 + 7/66 = 7/22;
② With weak classifier h2 (threshold X1 = 8.5), the misclassified sample points are 3, 4, 6: error rate e = 1/6 + 1/6 + 1/6 = 1/2 = 0.5;
③ With weak classifier h3 (threshold X2 = 6.5), the misclassified sample points are 1, 2, 9: error rate e = 1/22 + 1/22 + 1/22 = 3/22;
The classifier with the smallest current error rate, h3, is taken as the third base classifier H3(x).
An example of Adaboost
Step 4:
After the third round of iteration, the new weight distribution over the samples is:
D4 = [1/6, 1/6, 11/114, 11/114, 7/114, 11/114, 7/114, 7/114, 1/6, 1/38]
An example of Adaboost
Step 4:
Classification function: f3(x) = 0.4236·H1(x) + 0.6496·H2(x) + 0.9229·H3(x). Using sign(f3(x)), the combination of the three base classifiers, as the strong classifier leaves 0 misclassified points on the training data, and the training process ends here.
Combining all the classifiers gives the final strong classifier:
Hfinal(x) = sign(0.4236·H1(x) + 0.6496·H2(x) + 0.9229·H3(x))
This strong classifier Hfinal has an error rate of 0 on the training samples!
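A sketch of the same training loop in Python with single-feature threshold stumps as the weak classifiers; the stump search and the data interface are assumptions for illustration, while the weight updates follow the formulas used in the example above.

```python
import numpy as np

def adaboost_stumps(X, y, T=3):
    """AdaBoost with threshold stumps. X: (n, d) array, y: labels in {-1, +1}.
    Returns a list of (alpha, feature, threshold, polarity) tuples."""
    n, d = X.shape
    D = np.full(n, 1.0 / n)                      # initial weight distribution D1
    ensemble = []
    for _ in range(T):
        best = None
        # Choose the stump with the smallest weighted error under the current D.
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = np.where(X[:, j] <= thr, s, -s)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, s, pred)
        err, j, thr, s, pred = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # classifier weight
        D *= np.exp(-alpha * y * pred)            # raise weights of misclassified samples
        D /= D.sum()                              # normalize (divide by Z_t)
        ensemble.append((alpha, j, thr, s))
    return ensemble

def adaboost_predict(ensemble, X):
    """Strong classifier: sign of the alpha-weighted sum of the stump outputs."""
    f = sum(a * np.where(X[:, j] <= thr, s, -s) for a, j, thr, s in ensemble)
    return np.sign(f)
```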
Pros and Cons of Adaboost
Pros:
◦ AdaBoost provides a framework within which the sub-classifiers can be built by a variety of methods. Simple weak classifiers can be used, no feature selection is required, and overfitting rarely occurs.
◦ AdaBoost needs no prior knowledge about the weak classifiers, and the accuracy of the final strong classifier depends on all of the weak classifiers. Whether applied to synthetic or real data, AdaBoost markedly improves learning accuracy.
◦ AdaBoost does not need to know an upper bound on the weak classifiers' error rates in advance; the accuracy of the final strong classifier depends on the accuracies of all the weak classifiers, so the full potential of the classifiers can be exploited. AdaBoost adaptively adjusts the assumed error rates based on feedback from the weak classifiers, and it runs efficiently.
◦ AdaBoost trains different weak classifiers on the same training set and combines them by a fixed rule into a strong classifier with high classification ability; in other words, many weak learners together outperform a single strong one ("three cobblers beat Zhuge Liang").
Cons:
◦ During training, AdaBoost makes the weights of hard-to-classify samples grow exponentially, so training becomes overly biased toward these difficult samples, which makes AdaBoost vulnerable to noise. In addition, AdaBoost depends on its weak classifiers, and training the weak classifiers is often time-consuming.
Some properties of AdaBoost are particularly attractive; two of them are highlighted here:
◦ The upper bound on the training error decreases gradually as the number of iterations increases.
◦ Even after many training iterations, AdaBoost does not tend to overfit.
Bagging vs. Boosting
Training samples:
◦ Bagging: each training set is drawn at random with replacement (every sample has equal weight); the weak classifiers are trained on independently and identically distributed subsets of the training samples.
◦ Boosting: the training set itself does not change, but the rounds are not independent; each round's training depends on the result of the previous round, and samples are weighted according to the error rate.
Classifiers:
◦ Bagging: every classifier has equal weight.
◦ Boosting: every weak classifier has its own weight; classifiers with smaller classification error receive larger weights.
How the classifiers are obtained:
◦ Bagging: the classifiers can be generated in parallel.
◦ Boosting: each weak classifier can only be generated sequentially, since it depends on the previous classifier.
Thanks for Your Attention
Q&A