
STAT 535: Statistical Machine Learning Winter 2019

Lecture 5: Classification
Instructor: Yen-Chi Chen

5.1 Introduction

Classification is one of the most important data analysis problems. Much of the early work on this topic was done by statisticians, but in the past 20 years the computer science and machine learning communities have made substantial further progress on it.
Here are some classical applications of classification.

• Email spam. Email service providers such as Google frequently face the problem of classifying a new email: given an email, how do you decide whether it is spam, an ordinary email, or an important one?
• Image classification. If you have used Facebook, you may have noticed that whenever your photo contains a picture of one of your friends, Facebook may ask you if you want to tag that friend, even if you never manually told the computer who your friend is. How does it know that the picture contains a human and that this person is your friend?

We consider a simple scenario – binary classification. Namely, there are only two possible classes, which we denote 0 and 1.
The classification problem can be formalized as follows. Given a feature vector x0, we want to create a classifier c that maps x0 into 0 or 1. Namely, we want to find a function c(x0) that outputs only two possible numbers: 0 and 1. Moreover, we want our classification error to be small. Let y0 be the actual label of x0. We measure the classification error using a loss function L such that the loss of making a prediction c(x0) when the actual class label is y0 is L(c(x0), y0). A common loss function is the 0-1 loss, L(c(x0), y0) = I(c(x0) ≠ y0). Namely, when we make a wrong classification we lose 1 point, and we lose nothing when the classification is correct.
How do we find the classifier c? The good news is that we often have a labeled sample (data) (X1, Y1), · · · , (Xn, Yn) available. We will find c using this dataset.
In statistics, we often model the data as an IID random sample from a distribution. We now define several
useful distribution functions:
p0(x) = p(X = x|Y = 0) : the density of X when the actual label is 0,
p1(x) = p(X = x|Y = 1) : the density of X when the actual label is 1,
P(y|x) = P(Y = y|X = x) : the probability of being in class y when the feature is x,     (5.1)
PY(y) = P(Y = y) : the probability of observing class y, regardless of the feature value.

Using a probability model, we will define the risk function, which is the expected value of the loss function
when the input is random. The risk of a classifier c is

R(c) = E(L(c(X), Y )).


Ideally, we want to find a classifier that minimizes the risk because such a classifier will minimize our expected
losses.
Assuming we know the four quantities in equation (5.1), what class label should we predict when seeing a feature X = x? An intuitive choice is to predict the value y that maximizes P(y|x). Namely, we predict the label with the highest probability. Such a classifier can be written as
c∗(x) = argmax_{y=0,1} P(y|x) = { 0, if P(0|x) ≥ P(1|x);  1, if P(1|x) > P(0|x). }     (5.2)

Is this classifier good in the sense of the classification error (risk)? The answer depends on the loss function. The good news is that this classifier is the optimal classifier for the 0 − 1 loss. Namely,

R(c∗) = min_c R(c)

when using the 0 − 1 loss. However, if we use another loss function, this classifier need not be the best one (the one with the smallest expected loss).
Derivation that c∗ is optimal under the 0 − 1 loss. Given a classifier c, the risk function is R(c) = E(L(c(X), Y)). Using the tower property of expectation, we can further write it as

R(c) = E(L(c(X), Y)) = E( E(L(c(X), Y) | X) ),

where we denote the inner conditional expectation E(L(c(X), Y) | X) by (A).

For the quantity (A), we have

E(L(c(X), Y) | X) = L(c(X), 1) P(Y = 1|X) + L(c(X), 0) P(Y = 0|X)
                  = I(c(X) ≠ 1) P(Y = 1|X) + I(c(X) ≠ 0) P(Y = 0|X)
                  = { P(Y = 1|X), if c(X) = 0;  P(Y = 0|X), if c(X) = 1. }

Thus, after seeing a feature X, the expected loss of predicting c(X) = 0 is P(Y = 1|X), whereas the expected loss of predicting c(X) = 1 is P(Y = 0|X). The optimal choice is to predict c(X) = 0 if P(Y = 1|X) ≤ P(Y = 0|X) and c(X) = 1 if P(Y = 1|X) > P(Y = 0|X) (the case of equality does not matter), which is exactly the classifier c∗.
When a classifier attains the optimal risk (i.e., has a risk of min_c R(c)), it is called a Bayes classifier. Thus, the classifier c∗ is the Bayes classifier under the 0 − 1 loss.
For a classifier c, we define its excess risk (regret) as

E(c) = R(c) − min_c R(c).

The excess risk measures how far the quality of c is from that of the optimal/Bayes classifier. If we cannot find the Bayes classifier, we at least try to find a classifier whose excess risk is small.
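To make these definitions concrete, here is a small numerical sketch with a made-up discrete distribution (the numbers are purely illustrative, not from the lecture). It computes the Bayes classifier, its risk under the 0 − 1 loss, and the excess risk of a naive competitor.

```python
import numpy as np

# A made-up toy distribution over a discrete feature x in {0, 1, 2}.
# p_joint[x, y] = P(X = x, Y = y); these numbers are purely illustrative.
p_joint = np.array([[0.30, 0.05],
                    [0.10, 0.15],
                    [0.05, 0.35]])

p_x = p_joint.sum(axis=1)              # P(X = x)
p_y_given_x = p_joint / p_x[:, None]   # P(Y = y | X = x)

# Bayes classifier: predict the label with the highest conditional probability.
bayes = np.argmax(p_y_given_x, axis=1)

def risk(classifier):
    """Expected 0-1 loss: sum over x of P(X = x) * P(Y != c(x) | X = x)."""
    return sum(p_x[x] * (1 - p_y_given_x[x, classifier[x]]) for x in range(len(p_x)))

always_zero = np.zeros(3, dtype=int)   # a naive competing classifier
print("Bayes risk:", risk(bayes))
print("Excess risk of 'always 0':", risk(always_zero) - risk(bayes))
```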

5.2 Regression Approach

If we know P(y|x), we can build the Bayes classifier, and this classifier is the optimal one in terms of the risk function. However, P(y|x) is a population quantity, which is often unknown to us. All we have is a random sample (X1, Y1), · · · , (Xn, Yn). So the question becomes: how do we estimate P(y|x) using the data?

It turns out that this is a problem we already know how to solve. Here is a hint: because the response variable Y only takes two possible values {0, 1}, it is actually a Bernoulli random variable! Thus, E(Y) = P(Y = 1), which implies

E(Y |X = x) = P(Y = 1|X = x) = P(1|x).     (5.3)

Namely, P(1|x) is the regression function! Using the fact that P(0|x) + P(1|x) = 1, an estimator of P(1|x) leads to an estimator of P(y|x) for both y = 0 and y = 1.
Thus, as long as we have a regression estimator, we can convert it into a classifier. Here is one example using kernel regression. Let

m̂_K(x) = Σ_{i=1}^n Yi K((Xi − x)/h) / Σ_{i=1}^n K((Xi − x)/h)

be the kernel regression estimator. Then

P̂_K(1|x) = m̂_K(x),   P̂_K(0|x) = 1 − m̂_K(x).

Thus, a classifier based on kernel regression is

ĉ_K(x) = { 0, if P̂_K(0|x) ≥ P̂_K(1|x);  1, if P̂_K(1|x) > P̂_K(0|x) }
        = { 0, if 1 − P̂_K(1|x) ≥ P̂_K(1|x);  1, if P̂_K(1|x) > 1 − P̂_K(1|x) }
        = { 0, if P̂_K(1|x) ≤ 1/2;  1, if P̂_K(1|x) > 1/2 }
        = { 0, if m̂_K(x) ≤ 1/2;  1, if m̂_K(x) > 1/2 }.

Namely, the classifier will output 1 whenever the estimated regression function is greater than one half and 0 otherwise.
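A minimal sketch of this kernel classifier, assuming a Gaussian kernel and a one-dimensional feature (the kernel choice and bandwidth are illustrative, not specified in the lecture):

```python
import numpy as np

def kernel_classifier(x0, X, Y, h=0.5):
    """Classify x0 by thresholding the kernel regression estimate at 1/2.

    X, Y: training features (1-d array) and 0/1 labels; h: bandwidth.
    """
    weights = np.exp(-0.5 * ((X - x0) / h) ** 2)   # Gaussian kernel K((X_i - x0)/h)
    m_hat = np.sum(weights * Y) / np.sum(weights)  # kernel regression estimate of P(1|x0)
    return int(m_hat > 0.5)

# Toy usage: class 1 tends to have larger feature values.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
Y = np.concatenate([np.zeros(50), np.ones(50)])
print(kernel_classifier(0.2, X, Y), kernel_classifier(2.8, X, Y))  # likely 0 and 1
```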
Will this classifier be a good one? Intuitively, it should be: if we have a good regression estimator, the corresponding classifier should also be good. In fact, we have the following powerful result linking the quality of a regression estimator to the excess risk.

Theorem 5.1 Assume we use the 0 − 1 loss. Let m̂ be a regression estimator and ĉ_m be the corresponding classifier. Then

E(ĉ_m) ≤ 2 ∫ |m̂(x) − m(x)| dP(x) ≤ 2 √( ∫ |m̂(x) − m(x)|² dP(x) ).

Namely, if we have a regression estimator whose overall quality is good, the corresponding classifier will also have a small excess risk (i.e., it performs comparably to the optimal classifier).

5.3 Density Estimation Approach (Naive Bayes)

In addition to using a regression function to construct a classifier, we can use a density estimator for
classification. This approach is often known as the naive Bayes approach.

A key insight comes from Bayes' rule:

P(y|x) = P(Y = y|X = x) = p(x, y)/p(x) = p(x|y)P(Y = y)/p(x) = p_y(x)P_Y(y)/p(x),

where p(x) = Σ_y p(x, y) = p(x, 0) + p(x, 1) = p0(x)P_Y(0) + p1(x)P_Y(1). Thus, the Bayes classifier can be written as

c∗(x) = { 0, if P(0|x) ≥ P(1|x);  1, if P(1|x) > P(0|x) }
      = { 0, if p0(x)P_Y(0)/p(x) ≥ p1(x)P_Y(1)/p(x);  1, if p1(x)P_Y(1)/p(x) > p0(x)P_Y(0)/p(x) }
      = { 0, if p0(x)P_Y(0) ≥ p1(x)P_Y(1);  1, if p1(x)P_Y(1) > p0(x)P_Y(0) }.
Thus, if we can estimate p0(x), p1(x), and P_Y(y), we can construct a classifier.
P_Y(y) is very easy to estimate: it is the probability of seeing an observation with label y, so a simple estimator is the proportion of observations with that label. Namely,

P̂_Y(y) = (1/n) Σ_{i=1}^n I(Yi = y).

p_y(x) is just the conditional density of X given the label being y. Thus, we can simply apply a density estimator to the observations with class label y.
Example: kernel density estimator. Using a kernel density estimator (KDE), we obtain

p̂_{y,kde}(x) = (1/(n_y h)) Σ_{i=1}^n I(Yi = y) K((Xi − x)/h),

where n_y = Σ_{i=1}^n I(Yi = y) is the number of observations with label y. Note that P̂_Y(y) = n_y/n. Thus, a classifier based on a KDE is

ĉ_KDE(x) = { 0, if p̂_{0,kde}(x)P̂_Y(0) ≥ p̂_{1,kde}(x)P̂_Y(1);  1, if p̂_{1,kde}(x)P̂_Y(1) > p̂_{0,kde}(x)P̂_Y(0) }
         = { 0, if Σ_{i=1}^n I(Yi = 0)K((Xi − x)/h) ≥ Σ_{i=1}^n I(Yi = 1)K((Xi − x)/h);  1, if Σ_{i=1}^n I(Yi = 1)K((Xi − x)/h) > Σ_{i=1}^n I(Yi = 0)K((Xi − x)/h) }.


The classifier ĉ_KDE(x) is also called the kernel classifier.
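A short sketch of this KDE-based naive Bayes classifier, again assuming a Gaussian kernel and a one-dimensional feature for illustration:

```python
import numpy as np

def kde_classifier(x0, X, Y, h=0.5):
    """Naive Bayes classification via class-conditional KDEs.

    Predicts 1 if p1_hat(x0)*P_Y(1) > p0_hat(x0)*P_Y(0), else 0.
    """
    weights = np.exp(-0.5 * ((X - x0) / h) ** 2)   # Gaussian kernel values
    score0 = np.sum(weights[Y == 0])               # proportional to p0_hat(x0) * P_Y(0)
    score1 = np.sum(weights[Y == 1])               # proportional to p1_hat(x0) * P_Y(1)
    return int(score1 > score0)

# Toy usage with the same kind of data as before.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
Y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])
print(kde_classifier(0.2, X, Y), kde_classifier(2.8, X, Y))  # likely 0 and 1
```

Note that the common factor 1/(nh) cancels in the comparison, which is why the two class-wise kernel sums can be compared directly, matching the last line of the classifier above.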
Example: density basis approach. We can use the basis approach as well. Assume that we consider M basis functions from the cosine basis {φ1(x), φ2(x), · · · }. The estimator of p_y(x) is the density estimator using only the observations with label y. Let

θ̂_{y,ℓ} = (1/n_y) Σ_{i=1}^n I(Yi = y) φ_ℓ(Xi)

be the estimator of the ℓ-th coefficient using only the observations with label y. The corresponding density estimator is

p̂_{y,M}(x) = Σ_{ℓ=1}^M θ̂_{y,ℓ} φ_ℓ(x) = Σ_{ℓ=1}^M (1/n_y) Σ_{i=1}^n I(Yi = y) φ_ℓ(Xi) φ_ℓ(x) = (1/n_y) Σ_{i=1}^n I(Yi = y) Σ_{ℓ=1}^M φ_ℓ(Xi) φ_ℓ(x).

Thus, the corresponding classifier is

ĉ_M(x) = { 0, if p̂_{0,M}(x)P̂_Y(0) ≥ p̂_{1,M}(x)P̂_Y(1);  1, if p̂_{1,M}(x)P̂_Y(1) > p̂_{0,M}(x)P̂_Y(0) }
        = { 0, if Σ_{i=1}^n I(Yi = 0) Σ_{ℓ=1}^M φ_ℓ(Xi)φ_ℓ(x) ≥ Σ_{i=1}^n I(Yi = 1) Σ_{ℓ=1}^M φ_ℓ(Xi)φ_ℓ(x);  1, if Σ_{i=1}^n I(Yi = 1) Σ_{ℓ=1}^M φ_ℓ(Xi)φ_ℓ(x) > Σ_{i=1}^n I(Yi = 0) Σ_{ℓ=1}^M φ_ℓ(Xi)φ_ℓ(x) }.

5.4 Confusion Matrix

Given a classifier and a set of labeled data, we can illustrate the quality of classification using a confusion matrix. In binary classification, a confusion matrix is a 2 × 2 matrix (you can view it as a contingency table) as follows:

                     Actual label: 0    Actual label: 1
Predicted label: 0        n00                n01
Predicted label: 1        n10                n11

nij is the number of instances/observations where the predicted label is i and the actual label is j.
The quantity

(n10 + n01) / (n00 + n01 + n10 + n11)

is called the misclassification rate and is an empirical estimate of the risk of the classifier.
If the class label 0 stands for 'normal case' while the label 1 stands for 'anomaly', then we can interpret the confusion matrix as

                     Actual label: 0    Actual label: 1
Predicted label: 0    True negative      False negative
Predicted label: 1    False positive     True positive

This interpretation is commonly used in engineering problems and medical research for detecting abnormal situations.
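A small sketch computing a confusion matrix and the misclassification rate from predicted and actual labels (the arrays here are illustrative):

```python
import numpy as np

def confusion_matrix(pred, actual):
    """Return the 2x2 confusion matrix n[i, j] = #{predicted i, actual j}."""
    n = np.zeros((2, 2), dtype=int)
    for i in (0, 1):
        for j in (0, 1):
            n[i, j] = np.sum((pred == i) & (actual == j))
    return n

pred = np.array([0, 1, 1, 0, 1, 0, 1, 1])
actual = np.array([0, 1, 0, 0, 1, 1, 1, 1])
n = confusion_matrix(pred, actual)
misclassification_rate = (n[1, 0] + n[0, 1]) / n.sum()
print(n)
print("misclassification rate:", misclassification_rate)
```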

5.5 k-NN Approach

The k-NN approach can be applied to classification as well. The idea is very simple – for a given point x0, we find its k nearest data points. Then we compare the labels of these k points and assign x0 the majority label among these k points.
Take the data in the following picture as an example. There are two classes: black dots and red crosses. We are interested in the class label at the two blue boxes (x1 and x2).
Assume we use a 3-NN classifier ĉ_{3-NN}. At point x1, its 3 nearest neighbors contain two black dots and one red cross, so ĉ_{3-NN}(x1) = black dot. At point x2, its 3 nearest neighbors contain one black dot and two red crosses, so ĉ_{3-NN}(x2) = red cross. Note that if there is a tie, we randomly assign one of the class labels attaining the tie.

[Figure: scatterplot of the two classes (black dots and red crosses) with the two query points x1 and x2 marked as blue boxes.]

The k-NN approach is simple and easy to operate. It can be easily generalized to multiple classes – the idea is the same: we assign the class label according to the majority in the neighborhood.
How do we choose k? The choice of k is often made by a technique called cross-validation. The basic principle is: we split the data into two parts, use one part to train our classifier, and evaluate the performance on the other part. Repeating this procedure multiple times and applying it to each k, we obtain an estimate of the performance, and we then choose the k with the best performance. We will talk about this topic later.
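A minimal k-NN classifier sketch (Euclidean distance and a majority vote; the distance metric and the tie-breaking rule here are illustrative choices):

```python
import numpy as np

def knn_classify(x0, X, Y, k=3):
    """Predict the label of x0 by majority vote among its k nearest neighbors.

    X: (n, d) array of training features; Y: length-n array of 0/1 labels.
    """
    dists = np.linalg.norm(X - x0, axis=1)   # Euclidean distances to x0
    nearest = np.argsort(dists)[:k]          # indices of the k nearest points
    votes = Y[nearest]
    # Majority vote; ties (only possible for even k) are broken toward label 1 here.
    return int(votes.sum() * 2 >= k)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
Y = np.concatenate([np.zeros(30, dtype=int), np.ones(30, dtype=int)])
print(knn_classify(np.array([0.1, -0.2]), X, Y), knn_classify(np.array([3.2, 2.9]), X, Y))
```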

5.6 Logistic Regression

Logistic regression is a regression model that is commonly applied to classification problems as well. Like the method of converting a regression estimator into a classifier, logistic regression uses a regression function as an intermediate step and then forms a classifier. We first discuss some motivating examples.
Example. In graduate school admission, we wonder how a student's GPA affects the chance that the applicant receives admission. In this case, each observation is a student and the response variable Y represents whether the student received admission (Y = 1) or not (Y = 0). GPA is the covariate X. Thus, we can model the probability

P(admitted | GPA = x) = P(Y = 1|X = x) = q(x).

Example. In medical research, people often wonder whether the heritability of type-2 diabetes is related to some mutation of a gene. Researchers record whether the subject has type-2 diabetes (the response) and measure the mutation signature of genes (the covariate X). Thus, the response variable Y = 1 if the subject has type-2 diabetes. A statistical model associating the covariate X and the response Y is

P(subject has type-2 diabetes | mutation signature = x) = P(Y = 1|X = x) = q(x).

Thus, the function q(x) plays a key role in determining how the response Y and the covariate X are associated. Logistic regression provides a simple and elegant way to characterize the function q(x) in a 'linear' way. Because q(x) represents a probability, it ranges within [0, 1], so naively using a linear regression will not work. However, consider the following quantity:

O(x) = q(x)/(1 − q(x)) = P(Y = 1|X = x)/P(Y = 0|X = x) ∈ [0, ∞).

The quantity O(x) is called the odds; it measures the contrast between the events Y = 1 and Y = 0. When the odds is greater than 1, we have a higher chance of getting Y = 1 than Y = 0. The odds has an interestingly asymmetric form – if P(Y = 1|X = x) = 2P(Y = 0|X = x), then O(x) = 2, but if P(Y = 0|X = x) = 2P(Y = 1|X = x), then O(x) = 1/2. To symmetrize the odds, a straightforward approach is to take the (natural) logarithm:

log O(x) = log( q(x)/(1 − q(x)) ).

This quantity is called the log odds. The log odds has several beautiful properties; for instance, when the two probabilities are the same (P(Y = 1|X = x) = P(Y = 0|X = x)), log O(x) = 0, and

P(Y = 1|X = x) = 2P(Y = 0|X = x) ⇒ log O(x) = log 2,
P(Y = 0|X = x) = 2P(Y = 1|X = x) ⇒ log O(x) = − log 2.

Logistic regression imposes a linear model on the log odds. Namely, the logistic regression models

log O(x) = log( q(x)/(1 − q(x)) ) = β0 + β^T x,

which leads to

P(Y = 1|X = x) = q(x) = e^{β0 + β^T x} / (1 + e^{β0 + β^T x}).

Thus, the quantity q(x) = q(x; β0, β) depends on the two parameters β0, β. Here β0 behaves like the intercept and β behaves like the slope vector (they are the intercept and slope in terms of the log odds).
When we observe data, how can we estimate these two parameters? In general, we will use the maximum likelihood approach to estimate them. You can view the negative likelihood function as the loss function in the classification (actually, we will use the negative log-likelihood as the loss function), and the goal is to find the parameters by minimizing this loss.
Recall that we observe an IID random sample

(X1, Y1), · · · , (Xn, Yn).

Let pX(x) denote the probability density of X; note that we will not use it in estimating β0, β. For a given pair Xi, Yi, recall that the random variable Yi given Xi is just a Bernoulli random variable with parameter q(Xi). Thus, the PMF of Yi given Xi is

L(β0, β | Xi, Yi) = P(Y = Yi|Xi) = q(Xi)^{Yi} (1 − q(Xi))^{1−Yi}
                  = ( e^{β0 + β^T Xi} / (1 + e^{β0 + β^T Xi}) )^{Yi} ( 1 / (1 + e^{β0 + β^T Xi}) )^{1−Yi}
                  = e^{β0 Yi + β^T Xi Yi} / (1 + e^{β0 + β^T Xi}).

Note that here we construct the likelihood function using only the conditional PMF because, similarly to linear regression, the distribution of the covariate X does not depend on the parameters β0, β. Thus, the

log-likelihood function is

ℓ(β0, β | X1, Y1, · · · , Xn, Yn) = Σ_{i=1}^n log L(β0, β | Xi, Yi)
                                  = Σ_{i=1}^n log( e^{β0 Yi + β^T Xi Yi} / (1 + e^{β0 + β^T Xi}) )
                                  = Σ_{i=1}^n ( β0 Yi + β^T Xi Yi − log(1 + e^{β0 + β^T Xi}) ).

Our estimates are

(β̂0, β̂) = argmax_{β0,β} ℓ(β0, β | X1, Y1, · · · , Xn, Yn)
          = argmin_{β0,β} [ −ℓ(β0, β | X1, Y1, · · · , Xn, Yn) ]
          = argmin_{β0,β} (1/n) Σ_{i=1}^n [ −ℓ(β0, β | Xi, Yi) ],

where the summand −ℓ(β0, β | Xi, Yi) plays the role of the loss function and its average over i is the empirical estimate of the loss (the empirical risk). Here the loss function is

−ℓ(β0, β | Xi, Yi) = log(1 + e^{β0 + β^T Xi}) − β0 Yi − β^T Xi Yi.

(β̂0, β̂) does not have a closed-form solution in general, so we cannot write down a simple expression for the estimator. Despite this disadvantage, the log-likelihood function can be optimized by an iterative approach such as gradient ascent or Newton-Raphson1.
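A minimal sketch of fitting β0, β by gradient ascent on the log-likelihood (plain gradient ascent with a fixed step size rather than Newton-Raphson, to keep the example short; the step size and iteration count are illustrative):

```python
import numpy as np

def fit_logistic(X, Y, lr=0.1, n_iter=2000):
    """Gradient ascent on the logistic log-likelihood.

    X: (n, d) covariates, Y: length-n 0/1 labels. Returns (beta0, beta).
    """
    n, d = X.shape
    beta0, beta = 0.0, np.zeros(d)
    for _ in range(n_iter):
        q = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))  # q(X_i) = P(Y=1 | X_i)
        resid = Y - q
        # Gradient of the log-likelihood: sum_i (Y_i - q(X_i)) * (1, X_i)
        beta0 += lr * resid.mean()
        beta += lr * (X.T @ resid) / n
    return beta0, beta

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
true_q = 1.0 / (1.0 + np.exp(-(0.5 + X @ np.array([2.0, -1.0]))))
Y = rng.binomial(1, true_q)
print(fit_logistic(X, Y))  # roughly recovers (0.5, [2, -1])
```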

5.7 Decision Tree

The decision tree is another common approach to classification. A decision tree is like a regression tree – we partition the space of covariates into rectangular regions and then assign each region a class label. If the tree is given, the class label of each region is determined by majority vote (using the majority of labels in that region).
The following picture provides an example of a decision tree using the same data as in the previous section.
In the left panel, we display the scatterplot and the regions separated by a decision tree; the background color denotes the estimated label. The right panel displays the tree structure.
In the case of binary classification (the class label Y = 0 or 1), the decision tree can be written as follows. Let (X1, Y1), · · · , (Xn, Yn) denote the data and let R1, · · · , Rk be a rectangular partition of the space of the covariates. Let R(x) denote the rectangular region that x falls within. The decision tree is

ĉ_DT(x) = I( Σ_{i=1}^n Yi I(Xi ∈ R(x)) / Σ_{i=1}^n I(Xi ∈ R(x)) > 1/2 ).

Here, you can see that the decision tree is essentially a classifier converted from a regression tree.
1 some references can be found: https://www.cs.princeton.edu/~bee/courses/lec/lec_jan24.pdf

[Figure: left panel – scatterplot with the rectangular regions produced by the decision tree (background color = estimated label); right panel – the tree structure, with splits on x2 (at 5 and 6.6) and on x1 (at 5.4 and 5).]
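A small sketch of the majority-vote rule above for a fixed rectangular partition (the partition and data here are hypothetical, just to illustrate the formula; growing the tree itself is not shown):

```python
import numpy as np

# A hypothetical fixed partition of [0, 10]^2 into rectangles, each a dict of bounds.
regions = [
    {"x1": (0, 5.4), "x2": (0, 5)},
    {"x1": (5.4, 10), "x2": (0, 5)},
    {"x1": (0, 10), "x2": (5, 10)},
]

def in_region(x, r):
    return r["x1"][0] <= x[0] < r["x1"][1] and r["x2"][0] <= x[1] < r["x2"][1]

def tree_classify(x, X, Y):
    """Predict by majority vote of the training labels in the region containing x."""
    for r in regions:
        if in_region(x, r):
            labels = Y[np.array([in_region(xi, r) for xi in X])]
            return int(labels.mean() > 0.5) if len(labels) > 0 else 0
    return 0  # fallback if x lies outside all regions

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(100, 2))
Y = (X[:, 1] > 5).astype(int)        # toy labels determined by x2
print(tree_classify(np.array([2.0, 7.0]), X, Y))  # likely 1
```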

5.8 Random Forest

A modification of the decision tree is the random forest, which is a bagging approach. The idea of bagging is as follows. Suppose that we have one method of constructing a classifier. Given the original data set, we perform the bootstrap – sampling with replacement from the original data – to generate several bootstrap samples. Suppose that we have generated B bootstrap samples. Training the classifier on a bootstrap sample yields one classifier, so by training the classifier independently on each bootstrap sample we obtain B classifiers ĉ1, · · · , ĉB. We then combine these B classifiers into our final classifier using a majority vote, i.e.,

c̃(x) = { 1, if (1/B) Σ_{b=1}^B ĉ_b(x) ≥ 1/2;  0, otherwise. }

The formal name of bagging is bootstrap aggregation.


The random forest is the bagging decision tree with one additional modification – when training each classifier, instead of using all features, we only use a randomly selected subset of features. Informally, you can say

Random Forest = Decision Tree + Bagging + Random features.

There are three tuning parameters in training a random forest: the number of leaves in each decision tree k, the number of bootstrap samples B, and the number of randomly selected features d0. Note that sometimes, instead of using the bootstrap to generate a bootstrap sample, people use a subsampling method that randomly selects only a fraction of the original sample in the bagging stage (so each bagging sample is a different subset of the original data). If we use the subsampling method, there is one additional tuning parameter – the size of the subsample.
The random forest can be applied not only to the classification problem but also to the regression problem. In the regression problem, the bagging stage averages the regression functions from the bootstrap samples, and the averaged value is the final regression estimator. Namely,

m̃(x) = (1/B) Σ_{b=1}^B m̂_b(x),

where m̂_1(x), · · · , m̂_B(x) are the regression tree estimators from each bootstrap sample.
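A compact sketch of the bagging majority vote, treating the base classifier as a plug-in function (the base learner, the number of bootstrap samples, and the data are illustrative; the random feature selection that would make this a random forest is omitted for brevity):

```python
import numpy as np

def bagging_classifier(x0, X, Y, base_fit_predict, B=50, seed=0):
    """Majority vote over B classifiers, each trained on a bootstrap sample.

    base_fit_predict(Xb, Yb, x0) trains a classifier on (Xb, Yb) and returns
    its 0/1 prediction at x0.
    """
    rng = np.random.default_rng(seed)
    n = len(Y)
    votes = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap: sample with replacement
        votes.append(base_fit_predict(X[idx], Y[idx], x0))
    return int(np.mean(votes) >= 0.5)

# Example base learner: 1-nearest-neighbor (purely illustrative).
def one_nn(Xb, Yb, x0):
    return int(Yb[np.argmin(np.linalg.norm(Xb - x0, axis=1))])

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
Y = np.concatenate([np.zeros(40, dtype=int), np.ones(40, dtype=int)])
print(bagging_classifier(np.array([3.0, 3.0]), X, Y, one_nn))  # likely 1
```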

5.8.1 Some theory about random forest

A useful tutorial summarizing recent advances in the theoretical behaviors of random forest can be found in

Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25(2), 197-227.

One useful theoretical result on random forests is the following theorem:

Theorem 5.2 (Biau 2012) Suppose the regression function m is Lipschitz and only depends on a subset of features S ⊂ {1, · · · , d}, and the probability of selecting a feature j ∈ S is (1/‖S‖)(1 + o(1)), where ‖S‖ is the cardinality of S. Then, when the number of leaves k_n ≍ n^{4‖S‖ log 2/(4‖S‖ log 2 + 3)},

E(|m̃(X) − m(X)|²) = O( (1/n)^{3/(4‖S‖ log 2 + 3)} ).

This is from

Biau, G. (2012). Analysis of a random forests model. Journal of Machine Learning Research,
13(Apr), 1063-1095.

When using a subsampling method in training the random forest, here is an improved theoretical result that also regularizes the tuning parameters under the additive model.

Theorem 5.3 (Scornet, Biau and Vert (2015)) Suppose that Y = Σ_j m_j(X_j) + ε, where X ∼ Uni[0, 1]^d, ε ∼ N(0, σ²), and each m_j is continuous. Assume that the split is chosen using the maximum drop in sums of squares. Let t_n be the number of leaves on each tree and a_n be the size of each subsample. If t_n, a_n → ∞ and t_n (log a_n)^9 / a_n → 0, then

E(|m̃(X) − m(X)|²) → 0.

The above result is from

Scornet, E., Biau, G., & Vert, J. P. (2015). Consistency of random forests. The Annals of
Statistics, 43(4), 1716-1741.

Random forest is also related to the kNN approach; see the following paper

Lin, Y. and Jeon, Y. (2006). Random Forests and Adaptive Nearest Neighbors. Journal of the
American Statistical Association, 101, p 578.

It is possible to use the random forest to perform statistical inference (e.g., hypothesis tests and/or confidence intervals). See the following two papers:

Mentch, L., & Hooker, G. (2014). Ensemble trees and clts: Statistical inference for supervised
learning. stat, 1050, 25.
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using
random forests. Journal of the American Statistical Association, 113(523), 1228-1242.

5.8.2 Kernel based random forest

Recently, there is a new approach that combines the random forest and kernel regression. For a point x, let A_b(x) be the cell of the b-th regression tree that x falls in. The b-th regression tree estimator can then be written as

m̂_b(x) = Σ_{i=1}^n Yi I(Xi ∈ A_b(x)) / Σ_{i=1}^n I(Xi ∈ A_b(x)).
Thus, the random forest regression function is

m̃(x) = (1/B) Σ_{b=1}^B m̂_b(x)
      = (1/B) Σ_{b=1}^B Σ_{i=1}^n Yi I(Xi ∈ A_b(x)) / Σ_{ℓ=1}^n I(X_ℓ ∈ A_b(x))
      = (1/B) Σ_{b=1}^B Σ_{i=1}^n Yi I(Xi ∈ A_b(x)) / N_b(x)
      = (1/B) Σ_{i=1}^n Σ_{b=1}^B W_{bi} Yi
      = Σ_{i=1}^n W_i Yi,

where N_b(x) = Σ_{i=1}^n I(Xi ∈ A_b(x)) is the total number of observations of the b-th sample in the cell A_b(x), and W_{bi} = I(Xi ∈ A_b(x))/N_b(x) and W_i = Σ_{b=1}^B W_{bi}/B are both weights.

Notice that the weight W_{bi} ∝ I(Xi ∈ A_b(x)). This motivates us to construct a kernel function that is proportional to Σ_{b=1}^B I(Xi ∈ A_b(x)), which leads to

m̂_KeRF(x) = Σ_{i=1}^n Yi K(x, Xi) / Σ_{ℓ=1}^n K(x, X_ℓ),

where

K(x, y) = (1/B) Σ_{b=1}^B I(y ∈ A_b(x)).
This method is called the kernel based random forest (KeRF). Note that in KeRF, the kernel function is a random function. Its randomness can be described by the IID random regions

A_1, · · · , A_B ∼ G,

where G is a distribution of random regions. Although this looks complicated, the randomness of each A_b is determined by (1) the bootstrap sample and (2) the randomly chosen features, so G is not so difficult to analyze.
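A small sketch of the forest-induced kernel K(x, y) and the KeRF estimator. It represents each tree's cells by a hypothetical `cell_id(b, x)` function that returns which cell of tree b the point x falls in; how the trees themselves are grown is not shown, and the toy "trees" below are just random binnings used for illustration.

```python
import numpy as np

def kerf_predict(x0, X, Y, cell_id, B):
    """KeRF regression estimate at x0.

    cell_id(b, x) is assumed to return an identifier of the cell of the b-th
    tree containing x; K(x0, Xi) = (1/B) * #{b : Xi falls in the same cell as x0}.
    """
    K = np.zeros(len(Y))
    for b in range(B):
        target_cell = cell_id(b, x0)
        K += np.array([cell_id(b, xi) == target_cell for xi in X], dtype=float)
    K /= B
    return np.sum(K * Y) / np.sum(K) if np.sum(K) > 0 else Y.mean()

# Toy illustration: "trees" that just bin the first coordinate with random offsets.
rng = np.random.default_rng(6)
offsets = rng.uniform(0, 1, size=10)                      # one random offset per "tree"
cell_id = lambda b, x: int(np.floor(x[0] + offsets[b]))   # hypothetical cell membership
X = rng.uniform(0, 5, size=(200, 2))
Y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)
print(kerf_predict(np.array([2.0, 0.5]), X, Y, cell_id, B=10))  # roughly sin(2)
```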
Warning! The KeRF is different from the random forest although the construction of KeRF is motivated
by random forest.
More details about KeRF can be found in the following paper:

Scornet, E. (2016). Random forests and kernel methods. IEEE Transactions on Information Theory, 62(3), 1485-1500.

5.9 Training Classifiers as an Optimization Problem

Even without any probabilistic model, the classification problem can be viewed as an optimization problem. The key element is to replace the risk function E(L(c(X), Y)) by the empirical risk (training error)

R̂_n(c) = (1/n) Σ_{i=1}^n L(c(Xi), Yi).

Consider a collection of classifiers C. The goal is then to find the best classifier c∗ ∈ C such that the empirical risk R̂_n(c) is minimized. Namely,

c∗ = argmin_{c∈C} R̂_n(c).

Ideally, this should work because

E(R̂_n(c)) = E( (1/n) Σ_{i=1}^n L(c(Xi), Yi) ) = E(L(c(X1), Y1)) = R(c)     (5.4)

is the actual risk function. Namely, R̂_n(c) is an unbiased estimator of the risk function.

Here are some concrete examples.


Linear classifier. Assume that we consider a linear classifier:

c_{β0,β}(x) = I(β0 + β^T x > 0),

where β0 is a number like the intercept and β is a vector like the slopes of each covariate/feature. Namely, how we assign the class label depends purely on the value of β0 + β^T x: if this value is positive, we assign the label 1; if it is negative, we assign the label 0. Then the set

C_lin = {c_{β0,β} : β0 ∈ R, β ∈ R^d}

is the collection of all linear classifiers. The idea of empirical risk minimization is to find the classifier in C_lin such that the empirical risk R̂_n(·) is minimized. Because every classifier is indexed by the two quantities (parameters) β0, β, finding the one that minimizes the empirical risk is equivalent to finding the best β0, β minimizing R̂_n(·).
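A crude sketch of empirical risk minimization over linear classifiers under the 0-1 loss, using random search over (β0, β) since the 0-1 loss is not differentiable (the number of random candidates is an arbitrary illustrative choice):

```python
import numpy as np

def erm_linear_classifier(X, Y, n_candidates=5000, seed=0):
    """Pick (beta0, beta) minimizing the empirical 0-1 risk over random candidates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best, best_risk = None, np.inf
    for _ in range(n_candidates):
        beta0 = rng.normal()
        beta = rng.normal(size=d)
        pred = (beta0 + X @ beta > 0).astype(int)
        emp_risk = np.mean(pred != Y)          # empirical 0-1 risk R_hat_n(c)
        if emp_risk < best_risk:
            best, best_risk = (beta0, beta), emp_risk
    return best, best_risk

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
Y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])
(beta0, beta), risk = erm_linear_classifier(X, Y)
print("empirical risk:", risk)
```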

Logistic regression. Similar to the linear classifier, the logistic regression can be viewed as a classifier of the form

c̃_{β0,β}(x) = I( e^{β0 + β^T x} / (1 + e^{β0 + β^T x}) > 1/2 ).

Then we can define the set

C_logistic = {c̃_{β0,β} : β0 ∈ R, β ∈ R^d}

as the collection of all classifiers from a logistic regression model. The MLE approach is then an empirical risk minimization method using a particular loss function – the negative log-likelihood.
Decision tree (fixed k, fixed leaves). The decision tree classifier ĉ_DT can also be viewed as a classifier obtained from empirical risk minimization. For simplicity, we assume that the regions/leaves of the decision tree R1, · · · , Rk are fixed. Then any decision tree classifier can be written as

c_DT(x) = Σ_{j=1}^k α_j I(x ∈ R_j),
where α1, · · · , αk are quantities/parameters that determine how we predict the class label of regions R1, · · · , Rk, respectively. The collection of all possible classifiers is then

C_DT = {c_DT(x) : α_j ∈ {0, 1}, j = 1, · · · , k}.

Note that there are only 2^k classifiers in the set C_DT. The estimator ĉ_DT is just the one that minimizes the empirical risk R̂_n(·) with the 0 − 1 loss.
Decision tree (fixed k, non-fixed leaves). When the regions/leaves are not fixed, the collection of all possible decision tree classifiers is much more complex. Here is an abstract way of describing such a collection. Recall that at each split of the tree, we pick one region, one feature, and a threshold (e.g., x1 > 10 versus x1 ≤ 10), so every split can be represented by three indices: the region we are splitting, the feature index (which feature the split occurs on), and the threshold index (the split level). Because at the ℓ-th split there are only ℓ regions, the ℓ-th split is characterized by a triplet (ω, m, λ), where ω ∈ {1, 2, · · · , ℓ}, m ∈ {1, 2, · · · , d}, and λ ∈ R. If a tree has k leaves, there will be k − 1 splits. So any decision tree classifier is indexed by

α1, · · · , αk, (ω1, m1, λ1), · · · , (ω_{k−1}, m_{k−1}, λ_{k−1}).

Namely, given a set of these values, we can construct a unique decision tree classifier. Thus, the collection of all decision trees with k leaves can be written as

C_DT(k) = {c_DT(x) = c_{α,ω,m,λ}(x) : α_j ∈ {0, 1}, (ω_ℓ, m_ℓ, λ_ℓ) ∈ {1, · · · , ℓ} × {1, 2, · · · , d} × R, j = 1, · · · , k, ℓ = 1, · · · , k − 1}.

In reality, when we train a decision tree classifier with a fixed number k, the regions are also computed from the data. Thus, we are actually finding ĉ_DT such that

ĉ_DT = argmin_{c ∈ C_DT(k)} R̂_n(c).

Decision tree (both k and leaves are non-fixed). If we train the classifier with k unspecified, then we are finding ĉ_DT from ∪_{k∈N} C_DT(k) that minimizes the empirical risk. However, if we really consider all possible k, such an optimal decision tree is not unique and may be problematic. When k > n, we can make each leaf contain at most one observation, so the predicted label is just the label of that observation. Such a classifier has 0 empirical risk, but it may have very poor performance in future prediction because we are overfitting the data. Overfitting the data implies that R̂_n(c) and R(c) are very different, even though R̂_n(c) is an unbiased estimator of R(c)!

Why does this happen? R̂_n(c) is just the sample-average version of R(c), right? Does this contradict the law of large numbers, which says R̂_n(c) converges to R(c)?

It is true that R̂_n(c) is an unbiased estimator of R(c), and indeed the law of large numbers is applicable in this case. BUT a key requirement for using the law of large numbers is that the classifier c is fixed. Namely, if the classifier c is fixed, then the law of large numbers guarantees that the empirical risk R̂_n(c) converges to the true risk R(c).
However, when we are searching for the best classifier, we are considering many possible classifiers c. Although the law of large numbers works for any given classifier c, it may not work uniformly when we consider many classifiers. Empirical risk minimization works if

sup_{c∈C} |R̂_n(c) − R(c)| → 0 in probability.

Namely, the convergence is uniform for all classifiers in the collection that we are considering. In the next
few lectures we will be talking about how the above uniform convergence may be established.
Remark.

• Regression problem. The empirical risk minimization method can also be applied to the regression problem. We just replace the classifier by the regression function, and the loss function can be chosen as the L2 loss (squared distance). In this formulation, we obtain

R(m) = E( ‖Y − m(X)‖² )

and

R̂_n(m) = (1/n) Σ_{i=1}^n ‖Yi − m(Xi)‖².

Estimating a regression function can thus be written as an empirical risk minimization – we minimize R̂_n(m) over m ∈ M to obtain our regression estimator. The set M is a collection of regression functions. Similar to the classification problem, we need a uniform convergence of the empirical risk to make sure we obtain a good regression estimator.
• Penalty function. Another approach to handling the difference sup_{c∈C} |R̂_n(c) − R(c)| is to add an extra quantity to R̂_n(c) so that this uniform difference is controlled. Instead of minimizing R̂_n(c), we minimize

R̂_n(c) + P_λ(c),

where P_λ is a penalty function. This is just the penalized regression approach, now applied to a classification problem. When the penalty function is chosen well, we can make sure the optimal classifier/regression estimator from the (penalized) empirical risk minimization indeed has a very small risk.

5.10 Other approaches

There are many other classification methods not covered in this lecture, but I highly recommend that you learn them. Here are some famous methods/keywords: boosting (AdaBoost), neural nets (deep learning), ensemble learning.
