
STAT 535: Statistical Machine Learning Winter 2019

Lecture 5: Classification
Instructor: Yen-Chi Chen

5.1 Introduction

Classification is one of the most important data analysis problems. Much of the early work on this topic was done by statisticians, but in the past 20 years the computer science and machine learning communities have made substantial further progress on it.
Here are some classical applications of classification.

• Email spam. Email service providers such as Google frequently face the problem of classifying a new email: given an email, how do you decide whether it is spam, an ordinary email, or an important one?
• Image classification. If you have used Facebook, you may have noticed that whenever your photo contains a picture of one of your friends, Facebook may ask you if you want to tag that friend, even if you never manually told the computer who your friend is. How does it know that the picture contains a human and that this person is your friend?

We consider a simple scenario – binary classification. Namely, there are only two possible classes, which we denote 0 and 1.
The classification problem can be formalized as follows. Given a feature vector x0, we want to create a classifier c that maps x0 into 0 or 1. Namely, we want to find a function c(x0) that outputs only two possible numbers: 0 and 1. Moreover, we want our classification error to be small. Let y0 be the actual label of x0. We measure the classification error using a loss function L such that the loss of making a prediction c(x0) when the actual class label is y0 is L(c(x0), y0). A common loss function is the 0-1 loss, L(c(x0), y0) = I(c(x0) ≠ y0). Namely, when we make a wrong classification we lose 1 point, and we lose nothing when the classification is correct.
How do we find the classifier c? The good news is that we often have a labeled sample (data) (X1, Y1), · · · , (Xn, Yn) available. We will find c using this dataset.
In statistics, we often model the data as an IID random sample from a distribution. We now define several
useful distribution functions:
p0(x) = p(X = x|Y = 0) : the density of X when the actual label is 0,
p1(x) = p(X = x|Y = 1) : the density of X when the actual label is 1,
P(y|x) = P(Y = y|X = x) : the probability of being in class y when the feature is x,     (5.1)
PY(y) = P(Y = y) : the probability of observing class y, regardless of the feature value.

Using a probability model, we will define the risk function, which is the expected value of the loss function
when the input is random. The risk of a classifier c is

R(c) = E(L(c(X), Y )).


Ideally, we want to find a classifier that minimizes the risk because such a classifier will minimize our expected
losses.
Assuming we know the four quantities in equation (5.1), what class label should we predict when seeing a feature X = x? An intuitive choice is to predict the value y that maximizes P(y|x). Namely, we predict the label with the highest probability. Such a classifier can be written as
c∗(x) = argmax_{y=0,1} P(y|x) = { 0, if P(0|x) ≥ P(1|x);  1, if P(1|x) > P(0|x). }     (5.2)

Is this classifier good in the sense of the classification error (risk)? The answer depends on the loss function. The good news is that this classifier is the optimal classifier for the 0 − 1 loss. Namely,

R(c∗) = min_c R(c)

when using the 0 − 1 loss. However, if we use another loss function, this classifier need not be the best one (the one with the smallest expected loss).
Derivation that c∗ is optimal under the 0 − 1 loss. Given a classifier c, the risk function is R(c) = E(L(c(X), Y)). Using the tower property of expectation, we can further write it as

R(c) = E(L(c(X), Y)) = E( E(L(c(X), Y) | X) ),

where we denote the inner conditional expectation E(L(c(X), Y) | X) by (A).

For the quantity (A), we have

E(L(c(X), Y) | X) = L(c(X), 1) P(Y = 1|X) + L(c(X), 0) P(Y = 0|X)
                  = I(c(X) ≠ 1) P(Y = 1|X) + I(c(X) ≠ 0) P(Y = 0|X)
                  = { P(Y = 1|X), if c(X) = 0;  P(Y = 0|X), if c(X) = 1. }

Thus, after seeing a feature X, the expected loss of predicting c(X) = 0 is P(Y = 1|X), whereas the expected loss of predicting c(X) = 1 is P(Y = 0|X). The optimal choice is to predict c(X) = 0 if P(Y = 1|X) ≤ P(Y = 0|X) and c(X) = 1 if P(Y = 1|X) > P(Y = 0|X) (the case of equality does not matter), which is exactly the classifier c∗.
When a classifier attains the optimal risk (i.e., has a risk of min_c R(c)), it is called a Bayes classifier. Thus, the classifier c∗ is the Bayes classifier under the 0 − 1 loss.
For a classifier c, we define its excess risk (regret) as

E(c) = R(c) − min_c R(c).

The excess risk measures how far the quality of c is from that of the optimal/Bayes classifier. If we cannot find the Bayes classifier, we at least try to find a classifier whose excess risk is small.
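To make these definitions concrete, here is a small numerical sketch with a made-up discrete distribution (the numbers are purely illustrative, not from the lecture). It computes the Bayes classifier, its risk under the 0 − 1 loss, and the excess risk of a naive competitor.

```python
import numpy as np

# A made-up toy distribution over a discrete feature x in {0, 1, 2}.
# p_joint[x, y] = P(X = x, Y = y); these numbers are purely illustrative.
p_joint = np.array([[0.30, 0.05],
                    [0.10, 0.15],
                    [0.05, 0.35]])

p_x = p_joint.sum(axis=1)              # P(X = x)
p_y_given_x = p_joint / p_x[:, None]   # P(Y = y | X = x)

# Bayes classifier: predict the label with the highest conditional probability.
bayes = np.argmax(p_y_given_x, axis=1)

def risk(classifier):
    """Expected 0-1 loss: sum over x of P(X = x) * P(Y != c(x) | X = x)."""
    return sum(p_x[x] * (1 - p_y_given_x[x, classifier[x]]) for x in range(len(p_x)))

always_zero = np.zeros(3, dtype=int)   # a naive competing classifier
print("Bayes risk:", risk(bayes))
print("Excess risk of 'always 0':", risk(always_zero) - risk(bayes))
```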

5.2 Regression Approach

If we know P(y|x), we can build the Bayes classifier, and this classifier is the optimal one in terms of the risk function. However, P(y|x) is a population quantity, which is often unknown to us. All we have is a random sample (X1, Y1), · · · , (Xn, Yn). So the question becomes: how do we estimate P(y|x) using the data?

It turns out that this is a problem we already know how to solve. Here is a hint: because the response variable Y only takes two possible values {0, 1}, it is actually a Bernoulli random variable! Thus, E(Y) = P(Y = 1), which implies

E(Y |X = x) = P(Y = 1|X = x) = P(1|x).     (5.3)

Namely, P(1|x) is the regression function! Using the fact that P(0|x) + P(1|x) = 1, an estimator of P(1|x) leads to an estimator of P(y|x) for both y = 0 and y = 1.
Thus, as long as we have a regression estimator, we can convert it into a classifier. Here is one example using kernel regression. Let

m̂_K(x) = Σ_{i=1}^n Yi K((Xi − x)/h) / Σ_{i=1}^n K((Xi − x)/h)

be the kernel regression estimator. Then

P̂_K(1|x) = m̂_K(x),   P̂_K(0|x) = 1 − m̂_K(x).

Thus, a classifier based on kernel regression is

ĉ_K(x) = { 0, if P̂_K(0|x) ≥ P̂_K(1|x);  1, if P̂_K(1|x) > P̂_K(0|x) }
        = { 0, if 1 − P̂_K(1|x) ≥ P̂_K(1|x);  1, if P̂_K(1|x) > 1 − P̂_K(1|x) }
        = { 0, if P̂_K(1|x) ≤ 1/2;  1, if P̂_K(1|x) > 1/2 }
        = { 0, if m̂_K(x) ≤ 1/2;  1, if m̂_K(x) > 1/2 }.

Namely, the classifier will output 1 whenever the estimated regression function is greater than one half and 0 otherwise.
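A minimal sketch of this kernel classifier, assuming a Gaussian kernel and a one-dimensional feature (the kernel choice and bandwidth are illustrative, not specified in the lecture):

```python
import numpy as np

def kernel_classifier(x0, X, Y, h=0.5):
    """Classify x0 by thresholding the kernel regression estimate at 1/2.

    X, Y: training features (1-d array) and 0/1 labels; h: bandwidth.
    """
    weights = np.exp(-0.5 * ((X - x0) / h) ** 2)   # Gaussian kernel K((X_i - x0)/h)
    m_hat = np.sum(weights * Y) / np.sum(weights)  # kernel regression estimate of P(1|x0)
    return int(m_hat > 0.5)

# Toy usage: class 1 tends to have larger feature values.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
Y = np.concatenate([np.zeros(50), np.ones(50)])
print(kernel_classifier(0.2, X, Y), kernel_classifier(2.8, X, Y))  # likely 0 and 1
```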
Will this classifier be a good one? Intuitively, it should be: if we have a good regression estimator, the corresponding classifier should also be good. In fact, we have the following powerful result linking the quality of a regression estimator to the excess risk.

Theorem 5.1 Assume we use the 0 − 1 loss. Let m̂ be a regression estimator and ĉ_m be the corresponding classifier. Then

E(ĉ_m) ≤ 2 ∫ |m̂(x) − m(x)| dP(x) ≤ 2 √( ∫ |m̂(x) − m(x)|² dP(x) ).

Namely, if we have a regression estimator whose overall quality is good, the corresponding classifier will also have a small excess risk (i.e., it performs comparably to the optimal classifier).

5.3 Density Estimation Approach (Naive Bayes)

In addition to using a regression function to construct a classifier, we can use a density estimator for
classification. This approach is often known as the naive Bayes approach.

A key insight comes from Bayes' rule:

P(y|x) = P(Y = y|X = x) = p(x, y)/p(x) = p(x|y)P(Y = y)/p(x) = p_y(x)P_Y(y)/p(x),

where p(x) = Σ_y p(x, y) = p(x, 0) + p(x, 1) = p0(x)P_Y(0) + p1(x)P_Y(1). Thus, the Bayes classifier can be written as

c∗(x) = { 0, if P(0|x) ≥ P(1|x);  1, if P(1|x) > P(0|x) }
      = { 0, if p0(x)P_Y(0)/p(x) ≥ p1(x)P_Y(1)/p(x);  1, if p1(x)P_Y(1)/p(x) > p0(x)P_Y(0)/p(x) }
      = { 0, if p0(x)P_Y(0) ≥ p1(x)P_Y(1);  1, if p1(x)P_Y(1) > p0(x)P_Y(0) }.
Thus, if we can estimate p0(x), p1(x), and P_Y(y), we can construct a classifier.
P_Y(y) is very easy to estimate: it is the probability of seeing an observation with label y, so a simple estimator is the proportion of observations with that label. Namely,

P̂_Y(y) = (1/n) Σ_{i=1}^n I(Yi = y).

p_y(x) is just the conditional density of X given the label being y. Thus, we can simply apply a density estimator to the observations with class label y.
Example: kernel density estimator. Using a kernel density estimator (KDE), we obtain

p̂_{y,kde}(x) = (1/(n_y h)) Σ_{i=1}^n I(Yi = y) K((Xi − x)/h),

where n_y = Σ_{i=1}^n I(Yi = y) is the number of observations with label y. Note that P̂_Y(y) = n_y/n. Thus, a classifier based on a KDE is

ĉ_KDE(x) = { 0, if p̂_{0,kde}(x)P̂_Y(0) ≥ p̂_{1,kde}(x)P̂_Y(1);  1, if p̂_{1,kde}(x)P̂_Y(1) > p̂_{0,kde}(x)P̂_Y(0) }
         = { 0, if Σ_{i=1}^n I(Yi = 0)K((Xi − x)/h) ≥ Σ_{i=1}^n I(Yi = 1)K((Xi − x)/h);  1, if Σ_{i=1}^n I(Yi = 1)K((Xi − x)/h) > Σ_{i=1}^n I(Yi = 0)K((Xi − x)/h) }.


The classifier ĉ_KDE(x) is also called the kernel classifier.
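A short sketch of this KDE-based naive Bayes classifier, again assuming a Gaussian kernel and a one-dimensional feature for illustration:

```python
import numpy as np

def kde_classifier(x0, X, Y, h=0.5):
    """Naive Bayes classification via class-conditional KDEs.

    Predicts 1 if p1_hat(x0)*P_Y(1) > p0_hat(x0)*P_Y(0), else 0.
    """
    weights = np.exp(-0.5 * ((X - x0) / h) ** 2)   # Gaussian kernel values
    score0 = np.sum(weights[Y == 0])               # proportional to p0_hat(x0) * P_Y(0)
    score1 = np.sum(weights[Y == 1])               # proportional to p1_hat(x0) * P_Y(1)
    return int(score1 > score0)

# Toy usage with the same kind of data as before.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
Y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])
print(kde_classifier(0.2, X, Y), kde_classifier(2.8, X, Y))  # likely 0 and 1
```

Note that the common factor 1/(nh) cancels in the comparison, which is why the two class-wise kernel sums can be compared directly, matching the last line of the classifier above.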
Example: density basis approach. We can use the basis approach as well. Assume that we consider M basis functions from the cosine basis {φ1(x), φ2(x), · · · }. The estimator of p_y(x) is the density estimator using only the observations with label y. Let

θ̂_{y,ℓ} = (1/n_y) Σ_{i=1}^n I(Yi = y) φ_ℓ(Xi)

be the estimator of the ℓ-th coefficient using only the observations with label y. The corresponding density estimator is

p̂_{y,M}(x) = Σ_{ℓ=1}^M θ̂_{y,ℓ} φ_ℓ(x) = Σ_{ℓ=1}^M (1/n_y) Σ_{i=1}^n I(Yi = y) φ_ℓ(Xi) φ_ℓ(x) = (1/n_y) Σ_{i=1}^n I(Yi = y) Σ_{ℓ=1}^M φ_ℓ(Xi) φ_ℓ(x).

Thus, the corresponding classifier is

ĉ_M(x) = { 0, if p̂_{0,M}(x)P̂_Y(0) ≥ p̂_{1,M}(x)P̂_Y(1);  1, if p̂_{1,M}(x)P̂_Y(1) > p̂_{0,M}(x)P̂_Y(0) }
        = { 0, if Σ_{i=1}^n I(Yi = 0) Σ_{ℓ=1}^M φ_ℓ(Xi)φ_ℓ(x) ≥ Σ_{i=1}^n I(Yi = 1) Σ_{ℓ=1}^M φ_ℓ(Xi)φ_ℓ(x);  1, if Σ_{i=1}^n I(Yi = 1) Σ_{ℓ=1}^M φ_ℓ(Xi)φ_ℓ(x) > Σ_{i=1}^n I(Yi = 0) Σ_{ℓ=1}^M φ_ℓ(Xi)φ_ℓ(x) }.

5.4 Confusion Matrix

Given a classifier and a set of labeled data, we can illustrate the quality of classification using a confusion matrix. In binary classification, a confusion matrix is a 2 × 2 matrix (you can view it as a contingency table) as follows:

                     Actual label: 0    Actual label: 1
Predicted label: 0        n00                n01
Predicted label: 1        n10                n11

nij is the number of instances/observations where the predicted label is i and the actual label is j.
The quantity

(n10 + n01) / (n00 + n01 + n10 + n11)

is called the misclassification rate and is an empirical estimate of the risk of the classifier.
If the class label 0 stands for 'normal case' while the label 1 stands for 'anomaly', then we can interpret the confusion matrix as

                     Actual label: 0    Actual label: 1
Predicted label: 0    True negative      False negative
Predicted label: 1    False positive     True positive

This interpretation is commonly used in engineering problems and medical research for detecting abnormal situations.
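A small sketch computing a confusion matrix and the misclassification rate from predicted and actual labels (the arrays here are illustrative):

```python
import numpy as np

def confusion_matrix(pred, actual):
    """Return the 2x2 confusion matrix n[i, j] = #{predicted i, actual j}."""
    n = np.zeros((2, 2), dtype=int)
    for i in (0, 1):
        for j in (0, 1):
            n[i, j] = np.sum((pred == i) & (actual == j))
    return n

pred = np.array([0, 1, 1, 0, 1, 0, 1, 1])
actual = np.array([0, 1, 0, 0, 1, 1, 1, 1])
n = confusion_matrix(pred, actual)
misclassification_rate = (n[1, 0] + n[0, 1]) / n.sum()
print(n)
print("misclassification rate:", misclassification_rate)
```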

5.5 k-NN Approach

The k-NN approach can be applied to classification as well. The idea is very simple – for a given point x0, we find its k nearest data points. Then we compare the labels of these k points and assign x0 the majority label among these k points.
Take the data in the following picture as an example. There are two classes: black dots and red crosses. We are interested in the class label at the two blue boxes (x1 and x2).
Assume we use a 3-NN classifier ĉ_{3-NN}. At point x1, its 3 nearest neighbors contain two black dots and one red cross, so ĉ_{3-NN}(x1) = black dot. At point x2, its 3 nearest neighbors contain one black dot and two red crosses, so ĉ_{3-NN}(x2) = red cross. Note that if there is a tie, we randomly assign one of the class labels attaining the tie.

[Figure: scatterplot of the two classes (black dots and red crosses) with the two query points x1 and x2 marked as blue boxes.]

The k-NN approach is simple and easy to operate. It can be easily generalized to multiple classes – the idea is the same: we assign the class label according to the majority in the neighborhood.
How do we choose k? The choice of k is often made by a technique called cross-validation. The basic principle is: we split the data into two parts, use one part to train our classifier, and evaluate the performance on the other part. Repeating this procedure multiple times and applying it to each k, we obtain an estimate of the performance, and we then choose the k with the best performance. We will talk about this topic later.
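A minimal k-NN classifier sketch (Euclidean distance and a majority vote; the distance metric and the tie-breaking rule here are illustrative choices):

```python
import numpy as np

def knn_classify(x0, X, Y, k=3):
    """Predict the label of x0 by majority vote among its k nearest neighbors.

    X: (n, d) array of training features; Y: length-n array of 0/1 labels.
    """
    dists = np.linalg.norm(X - x0, axis=1)   # Euclidean distances to x0
    nearest = np.argsort(dists)[:k]          # indices of the k nearest points
    votes = Y[nearest]
    # Majority vote; ties (only possible for even k) are broken toward label 1 here.
    return int(votes.sum() * 2 >= k)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
Y = np.concatenate([np.zeros(30, dtype=int), np.ones(30, dtype=int)])
print(knn_classify(np.array([0.1, -0.2]), X, Y), knn_classify(np.array([3.2, 2.9]), X, Y))
```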

5.6 Logistic Regression

Logistic regression is a regression model that is commonly applied to classification problems as well. Like the method of converting a regression estimator into a classifier, logistic regression uses a regression function as an intermediate step and then forms a classifier. We first discuss some motivating examples.
Example. In graduate school admission, we wonder how a student's GPA affects the chance that the applicant receives admission. In this case, each observation is a student and the response variable Y represents whether the student received admission (Y = 1) or not (Y = 0). GPA is the covariate X. Thus, we can model the probability

P(admitted | GPA = x) = P(Y = 1|X = x) = q(x).

Example. In medical research, people often wonder whether the heritability of type-2 diabetes is related to some mutation of a gene. Researchers record whether the subject has type-2 diabetes (the response) and measure the mutation signature of genes (the covariate X). Thus, the response variable Y = 1 if the subject has type-2 diabetes. A statistical model associating the covariate X and the response Y is

P(subject has type-2 diabetes | mutation signature = x) = P(Y = 1|X = x) = q(x).

Thus, the function q(x) plays a key role in determining how the response Y and the covariate X are associated. Logistic regression provides a simple and elegant way to characterize the function q(x) in a 'linear' way. Because q(x) represents a probability, it ranges within [0, 1], so naively using a linear regression will not work. However, consider the following quantity:

O(x) = q(x)/(1 − q(x)) = P(Y = 1|X = x)/P(Y = 0|X = x) ∈ [0, ∞).

The quantity O(x) is called the odds; it measures the contrast between the events Y = 1 and Y = 0. When the odds is greater than 1, we have a higher chance of getting Y = 1 than Y = 0. The odds has an interestingly asymmetric form – if P(Y = 1|X = x) = 2P(Y = 0|X = x), then O(x) = 2, but if P(Y = 0|X = x) = 2P(Y = 1|X = x), then O(x) = 1/2. To symmetrize the odds, a straightforward approach is to take the (natural) logarithm:

log O(x) = log( q(x)/(1 − q(x)) ).

This quantity is called the log odds. The log odds has several beautiful properties; for instance, when the two probabilities are the same (P(Y = 1|X = x) = P(Y = 0|X = x)), log O(x) = 0, and

P(Y = 1|X = x) = 2P(Y = 0|X = x) ⇒ log O(x) = log 2,
P(Y = 0|X = x) = 2P(Y = 1|X = x) ⇒ log O(x) = − log 2.

Logistic regression imposes a linear model on the log odds. Namely, the logistic regression models

log O(x) = log( q(x)/(1 − q(x)) ) = β0 + β^T x,

which leads to

P(Y = 1|X = x) = q(x) = e^{β0 + β^T x} / (1 + e^{β0 + β^T x}).

Thus, the quantity q(x) = q(x; β0, β) depends on the two parameters β0, β. Here β0 behaves like the intercept and β behaves like the slope vector (they are the intercept and slope in terms of the log odds).
When we observe data, how can we estimate these two parameters? In general, we will use the maximum likelihood approach to estimate them. You can view the negative likelihood function as the loss function in the classification (actually, we will use the negative log-likelihood as the loss function), and the goal is to find the parameters by minimizing this loss.
Recall that we observe an IID random sample

(X1, Y1), · · · , (Xn, Yn).

Let pX(x) denote the probability density of X; note that we will not use it in estimating β0, β. For a given pair Xi, Yi, recall that the random variable Yi given Xi is just a Bernoulli random variable with parameter q(Xi). Thus, the PMF of Yi given Xi is

L(β0, β | Xi, Yi) = P(Y = Yi|Xi) = q(Xi)^{Yi} (1 − q(Xi))^{1−Yi}
                  = ( e^{β0 + β^T Xi} / (1 + e^{β0 + β^T Xi}) )^{Yi} ( 1 / (1 + e^{β0 + β^T Xi}) )^{1−Yi}
                  = e^{β0 Yi + β^T Xi Yi} / (1 + e^{β0 + β^T Xi}).

Note that here we construct the likelihood function using only the conditional PMF because, similarly to linear regression, the distribution of the covariate X does not depend on the parameters β0, β. Thus, the

log-likelihood function is

ℓ(β0, β | X1, Y1, · · · , Xn, Yn) = Σ_{i=1}^n log L(β0, β | Xi, Yi)
                                  = Σ_{i=1}^n log( e^{β0 Yi + β^T Xi Yi} / (1 + e^{β0 + β^T Xi}) )
                                  = Σ_{i=1}^n ( β0 Yi + β^T Xi Yi − log(1 + e^{β0 + β^T Xi}) ).

Our estimates are

(β̂0, β̂) = argmax_{β0,β} ℓ(β0, β | X1, Y1, · · · , Xn, Yn)
          = argmin_{β0,β} [ −ℓ(β0, β | X1, Y1, · · · , Xn, Yn) ]
          = argmin_{β0,β} (1/n) Σ_{i=1}^n [ −ℓ(β0, β | Xi, Yi) ],

where the summand −ℓ(β0, β | Xi, Yi) plays the role of the loss function and its average over i is the empirical estimate of the loss (the empirical risk). Here the loss function is

−ℓ(β0, β | Xi, Yi) = log(1 + e^{β0 + β^T Xi}) − β0 Yi − β^T Xi Yi.

(β̂0, β̂) does not have a closed-form solution in general, so we cannot write down a simple expression for the estimator. Despite this disadvantage, the log-likelihood function can be optimized by an iterative approach such as gradient ascent or Newton-Raphson1.
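A minimal sketch of fitting β0, β by gradient ascent on the log-likelihood (plain gradient ascent with a fixed step size rather than Newton-Raphson, to keep the example short; the step size and iteration count are illustrative):

```python
import numpy as np

def fit_logistic(X, Y, lr=0.1, n_iter=2000):
    """Gradient ascent on the logistic log-likelihood.

    X: (n, d) covariates, Y: length-n 0/1 labels. Returns (beta0, beta).
    """
    n, d = X.shape
    beta0, beta = 0.0, np.zeros(d)
    for _ in range(n_iter):
        q = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))  # q(X_i) = P(Y=1 | X_i)
        resid = Y - q
        # Gradient of the log-likelihood: sum_i (Y_i - q(X_i)) * (1, X_i)
        beta0 += lr * resid.mean()
        beta += lr * (X.T @ resid) / n
    return beta0, beta

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
true_q = 1.0 / (1.0 + np.exp(-(0.5 + X @ np.array([2.0, -1.0]))))
Y = rng.binomial(1, true_q)
print(fit_logistic(X, Y))  # roughly recovers (0.5, [2, -1])
```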

5.7 Decision Tree

The decision tree is another common approach to classification. A decision tree is like a regression tree – we partition the space of covariates into rectangular regions and then assign each region a class label. If the tree is given, the class label of each region is determined by majority vote (using the majority of labels in that region).
The following picture provides an example of a decision tree using the same data as in the previous section.
In the left panel, we display the scatterplot and the regions separated by a decision tree; the background color denotes the estimated label. The right panel displays the tree structure.
In the case of binary classification (the class label Y = 0 or 1), the decision tree can be written as follows. Let (X1, Y1), · · · , (Xn, Yn) denote the data and let R1, · · · , Rk be a rectangular partition of the space of the covariates. Let R(x) denote the rectangular region that x falls within. The decision tree is

ĉ_DT(x) = I( Σ_{i=1}^n Yi I(Xi ∈ R(x)) / Σ_{i=1}^n I(Xi ∈ R(x)) > 1/2 ).

Here, you can see that the decision tree is essentially a classifier converted from a regression tree.
1 some references can be found: https://www.cs.princeton.edu/~bee/courses/lec/lec_jan24.pdf

[Figure: left panel – scatterplot with the rectangular regions produced by the decision tree (background color = estimated label); right panel – the tree structure, with splits on x2 (at 5 and 6.6) and on x1 (at 5.4 and 5).]
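A small sketch of the majority-vote rule above for a fixed rectangular partition (the partition and data here are hypothetical, just to illustrate the formula; growing the tree itself is not shown):

```python
import numpy as np

# A hypothetical fixed partition of [0, 10]^2 into rectangles, each a dict of bounds.
regions = [
    {"x1": (0, 5.4), "x2": (0, 5)},
    {"x1": (5.4, 10), "x2": (0, 5)},
    {"x1": (0, 10), "x2": (5, 10)},
]

def in_region(x, r):
    return r["x1"][0] <= x[0] < r["x1"][1] and r["x2"][0] <= x[1] < r["x2"][1]

def tree_classify(x, X, Y):
    """Predict by majority vote of the training labels in the region containing x."""
    for r in regions:
        if in_region(x, r):
            labels = Y[np.array([in_region(xi, r) for xi in X])]
            return int(labels.mean() > 0.5) if len(labels) > 0 else 0
    return 0  # fallback if x lies outside all regions

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(100, 2))
Y = (X[:, 1] > 5).astype(int)        # toy labels determined by x2
print(tree_classify(np.array([2.0, 7.0]), X, Y))  # likely 1
```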

5.8 Random Forest

A modification of the decision tree is the random forest, which is a bagging approach. The idea of bagging is as follows. Suppose that we have one method of constructing a classifier. Given the original data set, we perform the bootstrap – sampling with replacement from the original data – to generate several bootstrap samples. Suppose that we have generated B bootstrap samples. Training the classifier on a bootstrap sample yields one classifier, so by training the classifier independently on each bootstrap sample we obtain B classifiers ĉ1, · · · , ĉB. We then combine these B classifiers into our final classifier using a majority vote, i.e.,

c̃(x) = { 1, if (1/B) Σ_{b=1}^B ĉ_b(x) ≥ 1/2;  0, otherwise. }

The formal name of bagging is bootstrap aggregation.


The random forest is the bagging decision tree with one additional modification – when training each classifier, instead of using all features, we only use a randomly selected subset of features. Informally, you can say

Random Forest = Decision Tree + Bagging + Random features.

There are three tuning parameters in training a random forest: the number of leaves in each decision tree k, the number of bootstrap samples B, and the number of randomly selected features d0. Note that sometimes, instead of using the bootstrap to generate a bootstrap sample, people use a subsampling method that randomly selects only a fraction of the original sample in the bagging stage (so each bagging sample is a different subset of the original data). If we use the subsampling method, there is one additional tuning parameter – the size of the subsample.
The random forest can be applied not only to the classification problem but also to the regression problem. In the regression problem, the bagging stage averages the regression functions from the bootstrap samples, and the averaged value is the final regression estimator. Namely,

m̃(x) = (1/B) Σ_{b=1}^B m̂_b(x),

where m̂_1(x), · · · , m̂_B(x) are the regression tree estimators from each bootstrap sample.
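A compact sketch of the bagging majority vote, treating the base classifier as a plug-in function (the base learner, the number of bootstrap samples, and the data are illustrative; the random feature selection that would make this a random forest is omitted for brevity):

```python
import numpy as np

def bagging_classifier(x0, X, Y, base_fit_predict, B=50, seed=0):
    """Majority vote over B classifiers, each trained on a bootstrap sample.

    base_fit_predict(Xb, Yb, x0) trains a classifier on (Xb, Yb) and returns
    its 0/1 prediction at x0.
    """
    rng = np.random.default_rng(seed)
    n = len(Y)
    votes = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap: sample with replacement
        votes.append(base_fit_predict(X[idx], Y[idx], x0))
    return int(np.mean(votes) >= 0.5)

# Example base learner: 1-nearest-neighbor (purely illustrative).
def one_nn(Xb, Yb, x0):
    return int(Yb[np.argmin(np.linalg.norm(Xb - x0, axis=1))])

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
Y = np.concatenate([np.zeros(40, dtype=int), np.ones(40, dtype=int)])
print(bagging_classifier(np.array([3.0, 3.0]), X, Y, one_nn))  # likely 1
```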

5.8.1 Some theory about random forest

A useful tutorial summarizing recent advances in the theoretical behaviors of random forest can be found in

Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25(2), 197-227.

One useful theoretical result on random forests is the following theorem:

Theorem 5.2 (Biau 2012) Suppose the regression function m is Lipschitz and only depends on a subset of features S ⊂ {1, · · · , d}, and the probability of selecting a feature j ∈ S is (1/‖S‖)(1 + o(1)), where ‖S‖ is the cardinality of S. Then, when the number of leaves k_n ≍ n^{4‖S‖ log 2/(4‖S‖ log 2 + 3)},

E(|m̃(X) − m(X)|²) = O( (1/n)^{3/(4‖S‖ log 2 + 3)} ).

This is from

Biau, G. (2012). Analysis of a random forests model. Journal of Machine Learning Research,
13(Apr), 1063-1095.

When using a subsampling method in training the random forest, here is an improved theoretical result that also regularizes the tuning parameters under the additive model.

Theorem 5.3 (Scornet, Biau and Vert (2015)) Suppose that Y = Σ_j m_j(X_j) + ε, where X ∼ Uni[0, 1]^d, ε ∼ N(0, σ²), and each m_j is continuous. Assume that the split is chosen using the maximum drop in sums of squares. Let t_n be the number of leaves on each tree and a_n be the size of each subsample. If t_n, a_n → ∞ and t_n (log a_n)^9 / a_n → 0, then

E(|m̃(X) − m(X)|²) → 0.

The above result is from

Scornet, E., Biau, G., & Vert, J. P. (2015). Consistency of random forests. The Annals of
Statistics, 43(4), 1716-1741.

Random forest is also related to the kNN approach; see the following paper

Lin, Y. and Jeon, Y. (2006). Random Forests and Adaptive Nearest Neighbors. Journal of the
American Statistical Association, 101, p 578.

It is possible to use the random forest to perform statistical inference (e.g., hypothesis tests and/or confidence intervals). See the following two papers:

Mentch, L., & Hooker, G. (2014). Ensemble trees and clts: Statistical inference for supervised
learning. stat, 1050, 25.
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using
random forests. Journal of the American Statistical Association, 113(523), 1228-1242.

5.8.2 Kernel based random forest

Recently, there is a new approach that combines the random forest and kernel regression. For a point x, let A_b(x) be the cell of the b-th regression tree that x falls in. The b-th regression tree estimator can then be written as

m̂_b(x) = Σ_{i=1}^n Yi I(Xi ∈ A_b(x)) / Σ_{i=1}^n I(Xi ∈ A_b(x)).
Thus, the random forest regression function is

m̃(x) = (1/B) Σ_{b=1}^B m̂_b(x)
      = (1/B) Σ_{b=1}^B Σ_{i=1}^n Yi I(Xi ∈ A_b(x)) / Σ_{ℓ=1}^n I(X_ℓ ∈ A_b(x))
      = (1/B) Σ_{b=1}^B Σ_{i=1}^n Yi I(Xi ∈ A_b(x)) / N_b(x)
      = (1/B) Σ_{i=1}^n Σ_{b=1}^B W_{bi} Yi
      = Σ_{i=1}^n W_i Yi,

where N_b(x) = Σ_{i=1}^n I(Xi ∈ A_b(x)) is the total number of observations of the b-th sample in the cell A_b(x), and W_{bi} = I(Xi ∈ A_b(x))/N_b(x) and W_i = Σ_{b=1}^B W_{bi}/B are both weights.

Notice that the weight W_{bi} ∝ I(Xi ∈ A_b(x)). This motivates us to construct a kernel function that is proportional to Σ_{b=1}^B I(Xi ∈ A_b(x)), which leads to

m̂_KeRF(x) = Σ_{i=1}^n Yi K(x, Xi) / Σ_{ℓ=1}^n K(x, X_ℓ),

where

K(x, y) = (1/B) Σ_{b=1}^B I(y ∈ A_b(x)).
This method is called the kernel based random forest (KeRF). Note that in KeRF, the kernel function is a random function. Its randomness can be described by the IID random regions

A_1, · · · , A_B ∼ G,

where G is a distribution of random regions. Although this looks complicated, the randomness of each A_b is determined by (1) the bootstrap sample and (2) the randomly chosen features, so G is not so difficult to analyze.
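A small sketch of the forest-induced kernel K(x, y) and the KeRF estimator. It represents each tree's cells by a hypothetical `cell_id(b, x)` function that returns which cell of tree b the point x falls in; how the trees themselves are grown is not shown, and the toy "trees" below are just random binnings used for illustration.

```python
import numpy as np

def kerf_predict(x0, X, Y, cell_id, B):
    """KeRF regression estimate at x0.

    cell_id(b, x) is assumed to return an identifier of the cell of the b-th
    tree containing x; K(x0, Xi) = (1/B) * #{b : Xi falls in the same cell as x0}.
    """
    K = np.zeros(len(Y))
    for b in range(B):
        target_cell = cell_id(b, x0)
        K += np.array([cell_id(b, xi) == target_cell for xi in X], dtype=float)
    K /= B
    return np.sum(K * Y) / np.sum(K) if np.sum(K) > 0 else Y.mean()

# Toy illustration: "trees" that just bin the first coordinate with random offsets.
rng = np.random.default_rng(6)
offsets = rng.uniform(0, 1, size=10)                      # one random offset per "tree"
cell_id = lambda b, x: int(np.floor(x[0] + offsets[b]))   # hypothetical cell membership
X = rng.uniform(0, 5, size=(200, 2))
Y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)
print(kerf_predict(np.array([2.0, 0.5]), X, Y, cell_id, B=10))  # roughly sin(2)
```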
Warning! The KeRF is different from the random forest although the construction of KeRF is motivated
by random forest.
More details about KeRF can be found in the following paper:

Scornet, E. (2016). Random forests and kernel methods. IEEE Transactions on Information Theory, 62(3), 1485-1500.

5.9 Training Classifiers as an Optimization Problem

Even without any probabilistic model, the classification problem can be viewed as an optimization problem. The key element is to replace the risk function E(L(c(X), Y)) by the empirical risk (training error)

R̂_n(c) = (1/n) Σ_{i=1}^n L(c(Xi), Yi).

Consider a collection of classifiers C. The goal is then to find the best classifier c∗ ∈ C such that the empirical risk R̂_n(c) is minimized. Namely,

c∗ = argmin_{c∈C} R̂_n(c).

Ideally, this should work because

E(R̂_n(c)) = E( (1/n) Σ_{i=1}^n L(c(Xi), Yi) ) = E(L(c(X1), Y1)) = R(c)     (5.4)

is the actual risk function. Namely, R̂_n(c) is an unbiased estimator of the risk function.

Here are some concrete examples.


Linear classifier. Assume that we consider a linear classifier:

c_{β0,β}(x) = I(β0 + β^T x > 0),

where β0 is a number like the intercept and β is a vector like the slopes of each covariate/feature. Namely, how we assign the class label depends purely on the value of β0 + β^T x: if this value is positive, we assign the label 1; if it is negative, we assign the label 0. Then the set

C_lin = {c_{β0,β} : β0 ∈ R, β ∈ R^d}

is the collection of all linear classifiers. The idea of empirical risk minimization is to find the classifier in C_lin such that the empirical risk R̂_n(·) is minimized. Because every classifier is indexed by the two quantities (parameters) β0, β, finding the one that minimizes the empirical risk is equivalent to finding the best β0, β minimizing R̂_n(·).
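A crude sketch of empirical risk minimization over linear classifiers under the 0-1 loss, using random search over (β0, β) since the 0-1 loss is not differentiable (the number of random candidates is an arbitrary illustrative choice):

```python
import numpy as np

def erm_linear_classifier(X, Y, n_candidates=5000, seed=0):
    """Pick (beta0, beta) minimizing the empirical 0-1 risk over random candidates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best, best_risk = None, np.inf
    for _ in range(n_candidates):
        beta0 = rng.normal()
        beta = rng.normal(size=d)
        pred = (beta0 + X @ beta > 0).astype(int)
        emp_risk = np.mean(pred != Y)          # empirical 0-1 risk R_hat_n(c)
        if emp_risk < best_risk:
            best, best_risk = (beta0, beta), emp_risk
    return best, best_risk

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
Y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])
(beta0, beta), risk = erm_linear_classifier(X, Y)
print("empirical risk:", risk)
```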

Logistic regression. Similar to the linear classifier, the logistic regression can be viewed as a classifier of the form

c̃_{β0,β}(x) = I( e^{β0 + β^T x} / (1 + e^{β0 + β^T x}) > 1/2 ).

Then we can define the set

C_logistic = {c̃_{β0,β} : β0 ∈ R, β ∈ R^d}

as the collection of all classifiers from a logistic regression model. The MLE approach is then an empirical risk minimization method using a particular loss function – the negative log-likelihood.
Decision tree (fixed k, fixed leaves). The decision tree classifier ĉ_DT can also be viewed as a classifier obtained from empirical risk minimization. For simplicity, we assume that the regions/leaves of the decision tree R1, · · · , Rk are fixed. Then any decision tree classifier can be written as

c_DT(x) = Σ_{j=1}^k α_j I(x ∈ R_j),
where α1, · · · , αk are quantities/parameters that determine how we predict the class label of regions R1, · · · , Rk, respectively. The collection of all possible classifiers is then

C_DT = {c_DT(x) : α_j ∈ {0, 1}, j = 1, · · · , k}.

Note that there are only 2^k classifiers in the set C_DT. The estimator ĉ_DT is just the one that minimizes the empirical risk R̂_n(·) with the 0 − 1 loss.
Decision tree (fixed k, non-fixed leaves). When the regions/leaves are not fixed, the collection of all possible decision tree classifiers is much more complex. Here is an abstract way of describing such a collection. Recall that at each split of the tree, we pick one region, one feature, and a threshold (e.g., x1 > 10 versus x1 ≤ 10), so every split can be represented by three indices: the region we are splitting, the feature index (which feature the split occurs on), and the threshold index (the split level). Because at the ℓ-th split there are only ℓ regions, the ℓ-th split is characterized by a triplet (ω, m, λ), where ω ∈ {1, 2, · · · , ℓ}, m ∈ {1, 2, · · · , d}, and λ ∈ R. If a tree has k leaves, there will be k − 1 splits. So any decision tree classifier is indexed by

α1, · · · , αk, (ω1, m1, λ1), · · · , (ω_{k−1}, m_{k−1}, λ_{k−1}).

Namely, given a set of these values, we can construct a unique decision tree classifier. Thus, the collection of all decision trees with k leaves can be written as

C_DT(k) = {c_DT(x) = c_{α,ω,m,λ}(x) : α_j ∈ {0, 1}, (ω_ℓ, m_ℓ, λ_ℓ) ∈ {1, · · · , ℓ} × {1, 2, · · · , d} × R, j = 1, · · · , k, ℓ = 1, · · · , k − 1}.

In reality, when we train a decision tree classifier with a fixed number k, the regions are also computed from the data. Thus, we are actually finding ĉ_DT such that

ĉ_DT = argmin_{c ∈ C_DT(k)} R̂_n(c).

Decision tree (both k and leaves are non-fixed). If we train the classifier with k unspecified, then we are finding ĉ_DT from ∪_{k∈N} C_DT(k) that minimizes the empirical risk. However, if we really consider all possible k, such an optimal decision tree is not unique and may be problematic. When k > n, we can make each leaf contain at most one observation, so the predicted label is just the label of that observation. Such a classifier has 0 empirical risk, but it may have very poor performance in future prediction because we are overfitting the data. Overfitting the data implies that R̂_n(c) and R(c) are very different, even though R̂_n(c) is an unbiased estimator of R(c)!

Why does this happen? R̂_n(c) is just the sample-average version of R(c), right? Does this contradict the law of large numbers, which says R̂_n(c) converges to R(c)?

It is true that R̂_n(c) is an unbiased estimator of R(c), and indeed the law of large numbers is applicable in this case. BUT a key requirement for using the law of large numbers is that the classifier c is fixed. Namely, if the classifier c is fixed, then the law of large numbers guarantees that the empirical risk R̂_n(c) converges to the true risk R(c).
However, when we are searching for the best classifier, we are considering many possible classifiers c. Although the law of large numbers works for any given classifier c, it may not work uniformly when we consider many classifiers. Empirical risk minimization works if

sup_{c∈C} |R̂_n(c) − R(c)| → 0 in probability.

Namely, the convergence is uniform for all classifiers in the collection that we are considering. In the next
few lectures we will be talking about how the above uniform convergence may be established.
Remark.

• Regression problem. The empirical risk minimization method can also be applied to the regression problem. We just replace the classifier by the regression function, and the loss function can be chosen as the L2 loss (squared distance). In this formulation, we obtain

R(m) = E( ‖Y − m(X)‖² )

and

R̂_n(m) = (1/n) Σ_{i=1}^n ‖Yi − m(Xi)‖².

Estimating a regression function can thus be written as an empirical risk minimization – we minimize R̂_n(m) over m ∈ M to obtain our regression estimator. The set M is a collection of regression functions. Similar to the classification problem, we need a uniform convergence of the empirical risk to make sure we obtain a good regression estimator.
• Penalty function. Another approach to handling the difference sup_{c∈C} |R̂_n(c) − R(c)| is to add an extra quantity to R̂_n(c) so that this uniform difference is controlled. Instead of minimizing R̂_n(c), we minimize

R̂_n(c) + P_λ(c),

where P_λ is a penalty function. This is just the penalized regression approach, now applied to a classification problem. When the penalty function is chosen well, we can make sure the optimal classifier/regression estimator from the (penalized) empirical risk minimization indeed has a very small risk.

5.10 Other approaches

There are many other classification methods not covered in this lecture, but I highly recommend that you learn them. Here are some famous methods/keywords: boosting (AdaBoost), neural nets (deep learning), ensemble learning.
