Chapter 8.
Learning Methods
We review some of the most commonly used learning methods, including k-nearest neighbors (KNN),
decision trees and random forests, support vector machines, neural networks and cluster analysis.
Ensemble methods, which can also be viewed as part of regularization in a broad sense, can be very
useful in combining weaker methods into a strong method. We shall consider both regression and
classification problems, with data (y_i, x_i), i = 1, ..., n, where the y_i are outputs and the x_i are
p-dimensional inputs. The regression model throughout this chapter is
$$ y_i = f(x_i) + \epsilon_i, \qquad i = 1, \ldots, n. $$
Unless otherwise stated, the loss function is the squared loss. For classification problems, y_i = ±1,
representing the two classes: class 1 and class −1. In this chapter, we use the terms learners,
predictors and estimators interchangeably.
K-nearest neighbors is one of the simplest methods. For any x in the p-dimensional real space, let
K(x) denote the indices of the K data points that are closest to x. In a regression problem, f
is estimated by
$$ \hat f(x) = \frac{1}{K} \sum_{i \in K(x)} y_i, $$
which is just the average of the responses of the K data points that are closest to x. The norm
||·|| is usually the Euclidean norm. Notice that f̂(x) is a step function of x. In other words, the
entire input space is partitioned into $\binom{n}{K}$ sets, each corresponding to one cluster of K data points.
Arrange these $\binom{n}{K}$ clusters as clusters $1, 2, \ldots, \binom{n}{K}$. The set corresponding to the j-th
cluster consists of all the points that are closer to cluster j than to any other cluster of data points.
Some of the sets could be empty. Notice that the smaller the K, the larger the model, and the larger
the variance and the smaller the bias. For prediction purposes, the best K is generally determined
by cross-validation.
For classification, simply classify a new observation with input x into class 1 if f̂(x) > 0 and into
class −1 otherwise. This is a majority-vote scheme: the observation is classified into the majority
class among its K nearest data points.
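To make this concrete, here is a minimal NumPy sketch of KNN regression and the majority-vote classification rule; the function names and toy data are illustrative choices, not part of any particular library.

```python
# A minimal NumPy sketch of K-nearest-neighbor regression and classification
# (illustrative only; names and toy data are our own choices).
import numpy as np

def knn_predict(X_train, y_train, x, K):
    """Average the responses of the K training points closest to x (Euclidean norm)."""
    dist = np.linalg.norm(X_train - x, axis=1)     # distances to all training points
    nearest = np.argsort(dist)[:K]                 # indices of the K closest points
    return y_train[nearest].mean()

def knn_classify(X_train, y_train, x, K):
    """Majority vote among the K nearest neighbors; labels are +1 / -1."""
    return 1.0 if knn_predict(X_train, y_train, x, K) > 0 else -1.0

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # n = 100 points in p = 2 dimensions
y_reg = X[:, 0] ** 2 + rng.normal(scale=0.1, size=100)
y_cls = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

print(knn_predict(X, y_reg, np.array([0.5, -0.2]), K=5))
print(knn_classify(X, y_cls, np.array([0.5, -0.2]), K=5))
```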
Decision trees are popular nonlinear methods with good interpretability. In fact, they may
even be easier to interpret than linear regression models. Decision trees can also be viewed as
step-function regression: the predicted outputs are constants over specially constructed partitions
of the input space. In the following, we introduce classification and regression trees
(CART).
Under squared loss, if f̂ is a constant over a region R, then f̂(x) is the average of the responses y_i
of the data points x_i in R. Regression trees use a greedy algorithm to grow a tree that eventually
leads to a partition of the input space. At every step, the algorithm determines the way to split a
given region according to the values of one variable. The critical issue is the choice of which variable
and which cutpoint to use to split the data. Set $R_{j,s}^- = \{i : x_i \in R,\ x_{ij} \le s\}$ and
$R_{j,s}^+ = \{i : x_i \in R,\ x_{ij} > s\}$. Then R can be split into two regions $R_{j,s}^+$ and $R_{j,s}^-$
according to whether the j-th variable is greater than s or not.
The reduction of the residual sum of squares (RSS) is
$$ \sum_{i \in R} (y_i - \bar y)^2 - \Big[ \sum_{i \in R_{j,s}^+} (y_i - \bar y^+)^2 + \sum_{i \in R_{j,s}^-} (y_i - \bar y^-)^2 \Big], $$
where $\bar y^+$ and $\bar y^-$ are the averages of $y_i$ for $i \in R_{j,s}^+$ and $i \in R_{j,s}^-$, respectively. Then, it is natural that
one chooses the variable j and the cutpoint s so that the reduction of RSS is the largest, since the
smaller the RSS, the better the fit. In this way, the method of regression trees defines a recursive
binary split algorithm to partition the space. Specifically, first split the entire input space into two
regions. Then, for each of the two regions, perform a split by the same rule. Continue splitting
until a stopping rule is met. Every split chooses the best variable and best cutpoint to achieve the largest
reduction of RSS. In the end, each partitioned region corresponds to a terminal node, also called a leaf,
while every split corresponds to an internal node. The entire process resembles the growth of a tree:
each node is joined to the nodes that follow it, and the connecting lines are viewed as branches. Traditionally,
the tree-like structure of the data splits is drawn upside-down. The size of the tree can be conveniently
measured by the number of terminal nodes. A simple tree is shown below with three terminal nodes
(leaves) and two internal nodes. The leaves are R1 = {X1 ≤ 0.9}, R2 = {X1 > 0.9, X2 ≥ −0.3}
and R3 = {X1 > 0.9, X2 < −0.3}.
[Tree diagram: the root splits on X1 at 0.9; the left branch is leaf R1, and the right branch splits on X2 at −0.3 into leaves R2 (X2 ≥ −0.3) and R3 (X2 < −0.3).]
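To illustrate how a single greedy split is chosen, the sketch below scans every variable j and candidate cutpoint s and returns the pair with the largest reduction in RSS; the helper names (rss, best_split) and the toy data are hypothetical.

```python
# A sketch of one greedy CART split: for each variable j and cutpoint s, compute the
# reduction in RSS over region R and keep the best (j, s). Not a library API.
import numpy as np

def rss(y):
    return np.sum((y - y.mean()) ** 2) if y.size else 0.0

def best_split(X, y):
    """Return (j, s, reduction) maximizing the RSS reduction for one split."""
    base = rss(y)
    best = (None, None, 0.0)
    n, p = X.shape
    for j in range(p):
        for s in np.unique(X[:, j])[:-1]:          # candidate cutpoints for variable j
            left = X[:, j] <= s                    # the region R^-_{j,s}
            reduction = base - rss(y[left]) - rss(y[~left])
            if reduction > best[2]:
                best = (j, s, reduction)
    return best

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 2))
y = np.where(X[:, 0] > 0.3, 2.0, -1.0) + rng.normal(scale=0.1, size=50)
print(best_split(X, y))                            # should pick j = 0 with s near 0.3
```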
A major issue is when to stop splitting. One way is to pre-specify a threshold for the
amount of reduction in RSS or for the number of data points in a region. A drawback of these
stopping rules is that they are usually near-sighted, as they are based only on the current
data split. A commonly adopted strategy is to grow a very large tree, which tends to overfit the
data, and then prune the tree. Cost-complexity pruning is the most popular approach and is
computationally efficient. For a large tree, denoted T0, we consider any subtree T that can be
obtained from T0 by collapsing any number of internal nodes. Consider the common regularization
formula of “Loss + Penalty”:
$$ \sum_{m=1}^{|T|} \sum_{i \in R_m} (y_i - \bar y_m)^2 + \alpha |T|, $$
where |T| denotes the number of terminal nodes of the subtree T, the R_m are the regions corresponding
to the terminal nodes, ȳ_m is the mean output of the data points in R_m, and α is a tuning parameter.
Denote the minimizing subtree by T_α. If α = 0, there is no penalty and the optimal tree is the original T0.
If α = ∞, the tree has no split at all and the predictor is just ȳ. The larger the α, the heavier the penalty
on model complexity. Just like the Lasso, there exists an efficient algorithm to compute the entire path
of T_α over all α. We can then use cross-validation to find the α that minimizes the test error.
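As an illustration of choosing α by cross-validation, the sketch below uses scikit-learn (assumed available), whose decision trees expose cost-complexity pruning through cost_complexity_pruning_path and the ccp_alpha parameter; the simulated data are illustrative.

```python
# Sketch: compute the cost-complexity pruning path of a large tree T0 and pick alpha
# by 5-fold cross-validation. ccp_alpha plays the role of alpha in the text.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=200)

# Candidate alphas along the pruning path of a fully grown tree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
cv_mse = [
    -cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                     X, y, cv=5, scoring="neg_mean_squared_error").mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmin(cv_mse))]
print("best alpha:", best_alpha)
```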
For classification trees, one can follow the same procedure as for regression trees, but using,
instead of RSS, error measurements that are more appropriate for classification. For a region
R, let p̂_k be the proportion of observations in this region that belong to class k. We introduce an
impurity measure that corresponds to the RSS in regression. The typical impurity measures are the
classification error rate, the Gini index, and the cross-entropy:
$$ E = 1 - \max_k \hat p_k, \qquad G = \sum_k \hat p_k (1 - \hat p_k), \qquad D = -\sum_k \hat p_k \log \hat p_k. $$
If a region R is nearly pure, that is, most of the observations are from one class, then the Gini index and
cross-entropy take smaller values than the classification error rate; the Gini index and cross-entropy
are more sensitive to node purity. To evaluate the quality of a particular split, the Gini index and
cross-entropy are therefore more popular error measurement criteria than the classification error rate.
Any of these three approaches might be used when pruning the tree, but the classification error rate
is preferable if prediction accuracy of the final pruned tree is the goal.
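A small sketch of the three impurity measures for a single node; the function name and the toy label vectors are illustrative.

```python
# Node impurity measures for the class labels falling in a region R (a small sketch).
import numpy as np

def impurities(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                      # class proportions p_k in the region
    error = 1.0 - p.max()                          # classification error rate
    gini = np.sum(p * (1.0 - p))                   # Gini index
    entropy = -np.sum(p * np.log(p))               # cross-entropy
    return error, gini, entropy

print(impurities(np.array([1, 1, 1, 1, -1])))      # a nearly pure node
print(impurities(np.array([1, 1, -1, -1])))        # a maximally impure node
```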
Large regression trees are known to have high variance, meaning that with different training data,
the estimates can be quite different. As a result, prediction based on a single large tree can be
unstable and inaccurate. A different way to boost the prediction accuracy of trees is to grow many
small trees and combine them. This will be introduced in Section 8.5.
Support vector machines are commonly applied to classification problems, though they also have
applications to regression. We consider here the classification problem. Consider a fixed p-dimensional
vector b = (b_1, ..., b_p). Every p-vector x = (x_1, ..., x_p) can be written as ab + z, where z is
perpendicular to b. Consider the linear function
$$ f(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p. $$
A classification rule would simply classify a subject with input x into class 1 if f(x) > 0 and into
class −1 if f(x) < 0.
A maximal margin classifier seeks the hyperplane for which the minimal distance of all
points to the separating hyperplane is the largest. Allowing the margin to be violated leads to the
support vector classifier, given by the following optimization problem:
$$ \max_{\beta_0, \beta, \epsilon}\; M \quad \text{subject to} \quad \|\beta\| = 1, \quad y_i(\beta_0 + \beta^T x_i) \ge M(1 - \epsilon_i), \quad \epsilon_i \ge 0, \quad \sum_{i=1}^n \epsilon_i \le C, $$
where C is a nonnegative tuning parameter and the ε_i are the so-called slack variables. The classification
rule is: classify an observation with input x into class +1 if f(x) > 0, and into class −1 otherwise.
To understand the slack variables ε_i, we note that ε_i = 0 ⇐⇒ the i-th observation is on the
correct side of the margin; ε_i > 0 ⇐⇒ the i-th observation is on the wrong side of the margin; and
ε_i > 1 ⇐⇒ the i-th observation is on the wrong side of the hyperplane. The tuning parameter C
is a budget for the amount by which the margin can be violated by the n observations. C = 0 ⇐⇒ no
budget and, as a result, ε_i = 0 for all i. The classifier is then the maximal margin classifier, which exists
only if the two classes are separable by a hyperplane. Larger C implies more tolerance of margin
violation. Note that no more than C observations can be on the wrong side of the
hyperplane. As C increases, the margin widens and more violations of the margin are allowed. Observations
that lie directly on the margin, or on the wrong side of the margin for their class, are known as
support vectors. Only the support vectors affect the support vector classifier. Those strictly on the
correct side of the margin do not, just as outliers do not change the median. Larger C implies more
violations, more support vectors, smaller variance and a more robust classifier.
In summary, the linear support vector classifier can be represented as
$$ f(x) = \beta_0 + \sum_{i=1}^n \alpha_i \langle x_i, x \rangle, $$
where α_i ≠ 0 only for the support vectors. Moreover, the α_i can be computed based on the inner products
⟨x_j, x_k⟩. Only the inner products of the feature space are relevant in computing the linear support vector
classifier. The above support vector classifier has a linear boundary, which may not be the “ground
truth” in practice. To cope with more general cases, one can consider enlarging the feature space.
A straightforward method is to include power functions of the inputs. A better approach is the
use of the kernel trick, which gives rise to the support vector machine. The support vector machine
in effect enlarges the original feature space to a space of kernel functions:
$$ x_i \rightarrow K(\cdot, x_i). $$
The kernel functions are bivariate functions satisfying the property of nonnegative definiteness:
$\sum_{i,j} a_i a_j K(x_i, x_j) \ge 0$. The original feature space is the p-dimensional input space. The enlarged
feature space is the space of kernel functions, which is in fact of infinite dimension. In the actual fitting
of the support vector machine, we only need to compute K(x_i, x_j) for all x_i, x_j in the training data.
The support vector machine classifier can then be written as
$$ f(x) = \beta_0 + \sum_{i=1}^n \alpha_i K(x, x_i). $$
With f(x) = β_0 + β_1 x_1 + ... + β_p x_p, the support vector classifier can also be viewed as minimizing
the hinge loss (1 − x)_+ applied to the margins y_i f(x_i), together with a ridge penalty on the coefficients.
In comparison, the classifier from logistic regression with an l2 penalty minimizes
$$ \sum_{i=1}^n \log\big(1 + e^{-y_i f(x_i)}\big) + \lambda \sum_{j=1}^p \beta_j^2. $$
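For a concrete example, the sketch below fits a kernel support vector machine with scikit-learn's SVC (assumed available) on simulated data with a circular class boundary. Note that SVC's C parameter penalizes the slack variables, so it acts roughly inversely to the budget C described above.

```python
# Sketch: a kernel support vector machine via scikit-learn's SVC (assumed available).
# SVC's C is a penalty on the slack variables, the opposite role of the "budget" C above.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)   # nonlinear (circular) boundary

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)  # K(x, x') = exp(-gamma ||x - x'||^2)
print("support vectors per class:", clf.n_support_)
print("prediction at the origin:", clf.predict([[0.0, 0.0]]))
```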
Neural networks, particularly deep learning, have achieved spectacular success in the past decades and
are now among the most successful learning methods. With a deep architecture, the neural network
is a particular kind of machine learning method that achieves great power and flexibility by learning to
represent the world as a nested hierarchy of concepts, with each concept defined in relation to
simpler concepts, and more abstract representations computed in terms of less abstract ones. The
main idea is to process, in a layerwise structure, linear combinations of the outputs of the previous
layer through nonlinear functions; these processing units make up the neural network. In the basic
feed-forward neural network, one can understand the network as information flowing forward from the
input x to the output ŷ. The following schematic shows a neural network with three layers: an input
layer (bottom), a hidden layer (middle) and an output layer (top).
[Schematic of a single-hidden-layer feed-forward network: inputs X_1, ..., X_p at the bottom, hidden units Z_1, ..., Z_M in the middle, outputs Y_1, ..., Y_K at the top.]
Here h_i = σ(W x_i) denotes the vector of hidden-unit outputs, σ is the hidden-layer activation function
applied componentwise, and w_j is the j-th row of W, which is an m × (p + 1) matrix. The output is
ŷ_i = g(β^T h_i), with output activation function g. The parameters are θ = (β, W), of dimension
m + 1 + m(p + 1).
Consider the least squares loss:
$$ \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n \big(y_i - g(\beta^T h_i)\big)^2 = \sum_{i=1}^n R_i(\theta). $$
Gradient-based minimization differentiates the loss function, as a function of θ, layer by layer,
backwards. The errors from an upper layer are propagated into the next lower layer as in the
following procedure:
$$ \frac{\partial R_i(\theta)}{\partial \beta} = -2(y_i - \hat y_i)\,\dot g(\beta^T h_i)\, h_i = \delta_i h_i, \text{ say,} \qquad \frac{\partial R_i(\theta)}{\partial w_j} = s_{ji}\, x_i. $$
Here δ_i and s_{ji} are the “errors” from the current model at the output unit and the hidden-layer units.
The errors satisfy
$$ s_{ji} = \dot\sigma(w_j^T x_i)\, \delta_i\, \beta_j. $$
These are the back-propagation equations, which can be used to quickly update the parameters (with
learning rate γ):
$$ \beta^{new} = \beta - \gamma \sum_{i=1}^n \frac{\partial R_i(\theta)}{\partial \beta} = \beta - \gamma \sum_{i=1}^n \delta_i h_i, \qquad w_j^{new} = w_j - \gamma \sum_{i=1}^n \frac{\partial R_i(\theta)}{\partial w_j} = w_j - \gamma \sum_{i=1}^n s_{ji} x_i. $$
Back-propagation can be understood as a two-pass algorithm. The forward pass: given the
current parameters (weights), compute ŷ_i = g(β^T h_i) = g(β^T σ(W x_i)). The backward pass:
compute the error δ_i, then use the back-propagation equation to propagate it back into the errors
s_{ji} of the next lower layer (and continue on if there are more layers in the network). All the errors are
then used to update the parameters with a proper choice of the learning rate. The advantages of
back-propagation are its simplicity of computation; its local nature: each unit
passes and receives information only to and from the units it is connected with; and batch learning:
the parameter updates do not have to take place over all training samples, i.e., the summation
could be over a subset (batch) of the training samples. It can even be just one single sample. This
is the widely used stochastic gradient descent (SGD).
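The sketch below implements the single-hidden-layer network and the back-propagation updates in NumPy, trained with single-sample SGD; for brevity it uses a sigmoid hidden activation, an identity output g, and omits intercept terms, so it is an illustration of the equations above rather than a full implementation.

```python
# A minimal NumPy sketch of the single-hidden-layer network trained by single-sample SGD.
import numpy as np

def sigma(z):                               # hidden-layer activation
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
n, p, m = 200, 3, 8
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

W = rng.normal(scale=0.5, size=(m, p))      # hidden-layer weights (rows w_j)
beta = rng.normal(scale=0.5, size=m)        # output weights
gamma = 0.05                                # learning rate

for epoch in range(200):
    for i in rng.permutation(n):            # one sample at a time: SGD
        h = sigma(W @ X[i])                 # forward pass: h_i = sigma(W x_i)
        yhat = beta @ h                     # output g is the identity for regression
        delta = -2.0 * (y[i] - yhat)        # "error" at the output unit (g' = 1)
        s = delta * beta * h * (1.0 - h)    # s_ji = sigma'(w_j^T x_i) * delta_i * beta_j
        beta -= gamma * delta * h           # gradient step for beta
        W -= gamma * np.outer(s, X[i])      # gradient step for each row w_j

print("training MSE:", np.mean((y - sigma(X @ W.T) @ beta) ** 2))
```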
Ensemble methods generally involve two tasks: 1. build a bank of base learners; 2. combine these
base learners to form a final learner. The base learners are usually weak learners while the
combined one is expected to be a strong learner.
Bagging. Bagging stands for bootstrap aggregating. By perturbing the data, one generates many
predictors based on the same model; bagging perturbs the data by the bootstrap. Recall that f̂ is
the predictor based on the original data D = {(y_1, x_1), ..., (y_n, x_n)}. A bootstrap sample, denoted
(y_1^*, x_1^*), ..., (y_n^*, x_n^*), is a random sample from the original data D, drawn with replacement.
Based on this bootstrap sample, denoted D_1^*, one can construct a predictor f̂_1^*. Repeating this
process, one can construct many predictors, say f̂_2^*, ..., f̂_B^*, where B can be as large as desired.
Then the bagged predictor is
$$ \hat f_{bagging}(x) = \frac{1}{B} \sum_{j=1}^B \hat f_j^*(x). $$
In the case of classification with two classes, 1 or −1, each f̂_j^* takes values 1 or −1. Then the
bagged classifier is a majority-vote classifier: it classifies a new/old observation into the class
which receives the most “votes” from f̂_j^*, j = 1, ..., B.
Associated with bagging, there is an easily available estimate of the test error. Let
$$ \hat f^*(x_i) = \frac{\sum_{j=1}^B \hat f_j^*(x_i)\, I\big((y_i, x_i) \notin D_j^*\big)}{\sum_{j=1}^B I\big((y_i, x_i) \notin D_j^*\big)} $$
be the average of the bootstrap predictors based on the bootstrap samples that do not contain the
i-th data point (y_i, x_i). Then the OOB (out-of-bag) mean squared error is
$$ \frac{1}{n} \sum_{i=1}^n \big(y_i - \hat f^*(x_i)\big)^2, $$
which serves as an appropriate estimate of the test mean squared error. In the case of classification,
f̂^*(x_i) is the class of the majority vote of the OOB bootstrap classifiers, and the OOB classification
error is
$$ \frac{1}{n} \sum_{i=1}^n I\big(y_i \ne \hat f^*(x_i)\big). $$
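The following sketch carries out bagging of regression trees by hand (scikit-learn trees are assumed available as the base learner) and computes the OOB mean squared error following the formula above; all names and data are illustrative.

```python
# Sketch: bagging regression trees with the out-of-bag (OOB) error estimate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
n = 300
X = rng.uniform(-2, 2, size=(n, 2))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=n)

B = 100
oob_sum = np.zeros(n)       # running sum of OOB predictions for each training point
oob_cnt = np.zeros(n)       # number of bootstrap fits for which point i was out of bag
preds = []

for b in range(B):
    idx = rng.integers(0, n, size=n)               # bootstrap sample (with replacement)
    tree = DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx])
    preds.append(tree.predict(X))
    oob = np.setdiff1d(np.arange(n), idx)          # points not in this bootstrap sample
    oob_sum[oob] += tree.predict(X[oob])
    oob_cnt[oob] += 1

f_bag = np.mean(preds, axis=0)                     # bagged predictor on the training inputs
oob_pred = oob_sum / np.maximum(oob_cnt, 1)        # f^*(x_i), averaging only OOB fits
print("OOB MSE:", np.mean((y - oob_pred) ** 2))
```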
Boosting. A simple form of boosting under squared loss repeatedly fits a base learner ĝ to the
current residuals and adds λĝ to the current fit. Continue this iteration until a stopping rule is met,
and output f̂, which is the sum of the λĝ from each step. Here λ is the learning rate, which is usually
chosen to be small. For a learning method that minimizes a loss function L, the boosting algorithm
can be described as follows. Start with f̂_0(x), a constant γ such that $\sum_{i=1}^n L(y_i, \gamma)$ is
minimized. For k = 1, 2, ..., K, set
$$ \hat f_k(x) = \hat f_{k-1}(x) + \hat\lambda\, \hat g(x), $$
where
$$ (\hat\lambda, \hat g(x)) = \arg\min_{g \in \mathcal G,\, \lambda} \sum_{i=1}^n L\big(y_i, \hat f_{k-1}(x_i) + \lambda g(x_i)\big) $$
and G refers to the collection of base learners for this learning method. For example, for tree boosting,
G is the collection of trees of a certain size. The final output is f̂_K. This is the forward stagewise
boosting algorithm. Very often, λ̂ is not chosen to be the optimal one but rather a learning rate that
is controlled to be small.
The above minimization to obtain g might be computationally difficult. Gradient boosting is a
shortcut to reduce the computational complexity. Set
$$ g_{ik} = -\frac{\partial}{\partial z} L(y_i, z)\Big|_{z = \hat f_{k-1}(x_i)}. $$
A base learner, denoted ĥ_k, is fit to the data (g_{ik}, x_i), i = 1, ..., n. The iteration becomes
$$ \hat f_k(x) = \hat f_{k-1}(x) + \lambda\, \hat h_k(x), $$
where λ is the learning rate. For squared loss, gradient boosting and residual boosting are the
same.
Three techniques play important roles in boosting: 1. shrinkage, the use of a small learning rate;
2. early stopping to avoid overfitting; 3. random subsampling to construct the base learners. The use
of a small learning rate may increase the computational workload, but the iterations are more stable.
Without early stopping, putting together too many base learners can cause overfitting. The stopping
point of the boosting iteration is mostly determined by checking the test error through validation or
cross-validation. Random subsampling is intended to lower the correlation between the constructed
base learners.
Tree boosting and random forest. The basic form of boosting can be applied to trees, either based
on residuals or based on gradients. Under squared loss, the tree boosting algorithm fits a tree to the
current residuals at each step, adds λ times the fitted tree to the current fit, and updates the residuals
accordingly (see the sketch below).
Boosted trees can greatly improve over the prediction of a single tree. In fact, each single tree
in the boosting process can be of small size, such as a stump. Stumps are trees with only two
leaves.
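A minimal sketch of boosting stumps under squared loss, i.e., residual boosting with λ as the learning rate; the stump is a depth-one scikit-learn tree (assumed available), and the data are simulated for illustration.

```python
# Sketch: residual (equivalently, for squared loss, gradient) boosting with stumps.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
n = 300
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=n)

lam, K = 0.1, 200                                  # small learning rate, K boosting rounds
resid = y.copy()
stumps = []

for k in range(K):
    stump = DecisionTreeRegressor(max_depth=1).fit(X, resid)   # fit a stump to the residuals
    stumps.append(stump)
    resid -= lam * stump.predict(X)                # update the residuals

f_hat = lam * sum(s.predict(X) for s in stumps)    # boosted predictor on the training inputs
print("training MSE:", np.mean((y - f_hat) ** 2))
```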
Random forest is a method of bagging trees, with an additional element of creating less
correlated trees. The specific procedure is as follows.
(a) For k = 1, ..., B:
i. Draw a bootstrap sample D_k^* = {(x_1^*, y_1^*), ..., (x_n^*, y_n^*)}.
ii. Grow a tree T_k^* based on D_k^* in the following fashion. Before every split, select m variables
at random from the p variables, and pick the best variable and cutpoint among these m
variables to perform the split.
(b) Output $\hat f(x) = \frac{1}{B} \sum_{k=1}^B T_k^*(x)$.
For classification problems, each tree T_k^* provides a class prediction for input x, and the random
forest outputs the majority vote of them as the predicted class for x. The purpose of the random
subset selection in step ii. is to create less correlated bootstrap trees, so that the bagging has a
stronger variance-reduction effect. A common choice of m is √p or smaller, and it can be as small
as 1.
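As a usage sketch, scikit-learn's RandomForestRegressor (assumed available) implements this type of procedure, with max_features playing the role of m; the choice max_features="sqrt" corresponds to m = √p.

```python
# Sketch: a random forest with m = sqrt(p) variables considered at each split
# and B = 200 bootstrap trees, via scikit-learn (assumed available).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-2, 2, size=(300, 4))
y = X[:, 0] ** 2 - X[:, 1] + rng.normal(scale=0.3, size=300)

rf = RandomForestRegressor(n_estimators=200, max_features="sqrt",
                           oob_score=True, random_state=0).fit(X, y)
print("OOB R^2:", rf.oob_score_)                   # out-of-bag estimate of fit quality
print("prediction:", rf.predict([[0.5, -0.5, 0.0, 0.0]]))
```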
Stacking. Stacking is also called stacked generalization. Suppose we have J models with predictors
f̂_1, ..., f̂_J based on the training data (y_i, x_i), i = 1, ..., n. A simple idea is to consider a linear
combination of the predictors in the hope of forming a predictor with better accuracy. Let
w = (w_1, ..., w_J)^T be the weights. One might consider
$$ \sum_{i=1}^n \Big(y_i - \sum_{j=1}^J w_j \hat f_j(x_i)\Big)^2, $$
and wish to minimize this training error to find the optimal weights. In terms of computation, the
minimizer is easily solved and has an expression like the least squares estimator. However, such a
minimization can easily place large weights on the more complex models and small weights on the
simpler models. In other words, model complexity is not taken into account in the model combination.
Stacking deals with the problem through leave-one-out cross-validation (LOOCV). Let f̂_j^{(−i)} be the
estimator of f based on model j and the data excluding the i-th data point (y_i, x_i). Let
$$ \hat w = \arg\min_w \sum_{i=1}^n \Big(y_i - \sum_{j=1}^J w_j \hat f_j^{(-i)}(x_i)\Big)^2. $$
The resulting stacked predictor is $\sum_{j=1}^J \hat w_j \hat f_j$. Alternatively, one can restrict the ŵ_j to be
nonnegative in the above minimization. It can be proved that, at the population level, the weighted
model average with optimal weights produces a model as good as or better than any of the individual
models, in the sense that it has smaller error than any single model.
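The sketch below carries out stacking with LOOCV for two hypothetical base models (a linear model and a small tree, via scikit-learn, assumed available): it builds the matrix of leave-one-out predictions f̂_j^{(−i)}(x_i) and solves for the weights by least squares, unconstrained for simplicity.

```python
# Sketch: stacking with leave-one-out cross-validation and least-squares weights.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
n = 120
X = rng.uniform(-2, 2, size=(n, 2))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=n)

models = [LinearRegression(), DecisionTreeRegressor(max_depth=3)]
P = np.zeros((n, len(models)))                     # P[i, j] = f_j^(-i)(x_i)
for i in range(n):
    mask = np.arange(n) != i                       # leave the i-th point out
    for j, m in enumerate(models):
        P[i, j] = m.fit(X[mask], y[mask]).predict(X[i:i + 1])[0]

w_hat, *_ = np.linalg.lstsq(P, y, rcond=None)      # stacking weights
print("stacking weights:", w_hat)
```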
There exist many other ensemble methods. The richest is Bayesian model averaging or combination.
In a broad sense, regularized regression such as the Lasso or ridge regression, where there are tuning
parameters, can also be viewed as an ensemble method, called buckets of models. There is a collection
of models, each indexed by a tuning parameter, and the buckets-of-models method tries to select the
best of them by using, for example, cross-validation. Another method, called bumping, also finds the
best single model from a bucket of models, by bootstrap sampling. Let the original prediction based
on the training data be f̂. Draw B sets of bootstrap samples, and fit each bootstrap sample to generate
a bootstrap prediction f̂_j^*, j = 1, ..., B. Choose the prediction among f̂, f̂_1^*, ..., f̂_B^* with the
smallest training error over the entire training data, which is $\sum_{i=1}^n (y_i - \hat f_j^*(x_i))^2$ for
squared loss and prediction f̂_j^*. Bumping picks the best prediction in terms of training error, while
bagging averages the bootstrap predictions.
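A small sketch of bumping: fit the same tree model on B bootstrap samples, include the fit on the original data, and keep the single fit with the smallest training error; all names and data are illustrative.

```python
# Sketch: bumping keeps the single bootstrap fit with the smallest training error.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(9)
n = 200
X = rng.uniform(-2, 2, size=(n, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0) + rng.normal(scale=0.2, size=n)

def fit(Xs, ys):
    return DecisionTreeRegressor(max_depth=2).fit(Xs, ys)

candidates = [fit(X, y)]                           # the fit on the original data
for b in range(20):
    idx = rng.integers(0, n, size=n)               # bootstrap sample
    candidates.append(fit(X[idx], y[idx]))

train_err = [np.mean((y - m.predict(X)) ** 2) for m in candidates]
best = candidates[int(np.argmin(train_err))]       # bumping keeps this single model
print("smallest training MSE:", min(train_err))
```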