
Advanced Statistical Learning

Chapter 2: Classification
Bernd Bischl, Julia Moosbauer, Andreas Groll

Department of Statistics – TU Dortmund


Winter term 2020/21
CLASSIFICATION TASKS
In classification, we aim at predicting a discrete output

y ∈ Y = {C1, ..., Cg}

with 2 ≤ g < ∞, given data D.

In this course, we assume the classes to be encoded as

Y = {0, 1} or Y = {−1, 1} (in the binary case, g = 2)
Y = {1, 2, ..., g} (in the multiclass case, g ≥ 3)

Classifiers

CLASSIFICATION MODELS
We defined models f : X → R^g as functions that output (continuous)
scores / probabilities and not (discrete) classes. Why?
From an optimization perspective, it is much (!) easier to optimize
costs for continuous-valued functions
Scores / probabilities (for classes) contain more information than
the class labels alone
As we will see later, scores can easily be transformed into class
labels; but class labels cannot be transformed into scores
We distinguish scoring and probabilistic classifiers.

CLASSIFICATION MODELS
Scoring Classifiers:
Construct g discriminant / scoring functions f1, ..., fg : X → R
Scores f1(x), ..., fg(x) are transformed into classes by choosing
the class with the maximum score

h(x) = arg max_{k ∈ {1, 2, ..., g}} fk(x).

For g = 2, a single discriminant function f(x) = f1(x) − f−1(x) is
sufficient (note that it would be more natural here to label the
classes with {+1, −1}):

h(x) = 1
⇐⇒ f1(x) > f−1(x)
⇐⇒ f1(x) − f−1(x) > 0
⇐⇒ f(x) > 0
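As a small illustration of these rules, here is a minimal Python/NumPy sketch; the weight matrix, offsets, and the example point are made up for illustration and are not part of the lecture:

```python
import numpy as np

# Made-up linear scoring functions f_1, ..., f_g for g = 3 classes:
# f_k(x) = W[k] @ x + b[k].
W = np.array([[ 2.0, -1.0],
              [ 0.5,  0.5],
              [-1.0,  1.5]])
b = np.array([0.0, 0.2, -0.1])

def scores(x):
    return W @ x + b                      # vector (f_1(x), ..., f_g(x))

def h(x):
    return int(np.argmax(scores(x))) + 1  # classes encoded as 1, ..., g

x = np.array([1.0, 0.5])
print(scores(x), h(x))

# Binary special case: a single discriminant f(x) = f_1(x) - f_{-1}(x),
# predicted class h(x) = sgn(f(x)).
f = scores(x)[0] - scores(x)[1]
print(1 if f > 0 else -1)
```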
CLASSIFICATION MODELS
Class labels are constructed by h(x) = sgn(f(x))
|f(x)| is called the “confidence”

CLASSIFICATION MODELS
Probabilistic Classifiers:
Construct g probability functions π1, ..., πg : X → [0, 1] with Σ_i πi = 1
Probabilities π1(x), ..., πg(x) are transformed into labels by
predicting the class with the maximum probability

h(x) = arg max_{k ∈ {1, 2, ..., g}} πk(x)

If g = 2, most models output a single probability function π(x)
(note that it would be more natural here to label the classes
with {0, 1})
If g = 2, transformation into discrete classes by thresholding:
h(x) := 1(π(x) ≥ c) for some threshold c ∈ [0, 1] (for binary
classification we usually set c = 0.5); see the sketch below
Probabilistic classifiers can also be seen as scoring classifiers
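A minimal sketch of both rules in Python; the probability model below is a made-up logistic score used only for illustration:

```python
import numpy as np

def pi(x):
    """Made-up binary probability model pi(x) = P(y = 1 | x)."""
    return 1.0 / (1.0 + np.exp(-(1.5 * x[0] - x[1])))

def h_binary(x, c=0.5):
    return int(pi(x) >= c)            # h(x) = 1(pi(x) >= c)

def h_multiclass(probs):
    return int(np.argmax(probs)) + 1  # class with maximum probability, encoded 1, ..., g

x = np.array([0.3, 1.0])
print(pi(x), h_binary(x))
print(h_multiclass(np.array([0.2, 0.5, 0.3])))
```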

CLASSIFICATION MODELS
Remark: If we want to emphasize that our model outputs probabilities,
we denote the model as π : X → [0, 1]^g; if we are talking about models
in a general sense, we write f, comprising both probabilistic and scoring
classifiers (context will make this clear!)

CLASSIFICATION MODELS
Both scoring and probabilistic classifiers can be turned into
discrete classes via thresholding (binary case) or by predicting the
class with the maximum score (multiclass case)
Discrete classes, which are often produced internally from scores,
cannot be transformed back into scores

DECISION BOUNDARIES

A decision boundary is a hypersurface that partitions
the input space X into g (the number of classes) decision regions

Xk = {x ∈ X : h(x) = k}

Ties between those regions form the decision boundaries

{x ∈ X : ∃ i ≠ j s.t. fi(x) = fj(x) and fi(x), fj(x) ≥ fk(x) ∀ k ≠ i, j}

(in the general multiclass case) or

f(x) = c,

if c ∈ R was used as threshold (usually c = 0 for scoring
classifiers and c = 0.5 for probabilistic classifiers) in the binary
case.
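Decision boundaries are typically visualized by evaluating the classifier on a dense grid over the input space. A minimal sketch with a made-up linear binary scorer (not one of the lecture's classifiers), assuming matplotlib is available:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up binary scorer f(x) = w^T x + b; the decision boundary is {x : f(x) = 0}.
w, b = np.array([1.0, -2.0]), 0.5

xx, yy = np.meshgrid(np.linspace(-3, 3, 300), np.linspace(-3, 3, 300))
f_vals = (np.c_[xx.ravel(), yy.ravel()] @ w + b).reshape(xx.shape)

plt.contourf(xx, yy, f_vals > 0, alpha=0.3)   # the two decision regions X_0, X_1
plt.contour(xx, yy, f_vals, levels=[0.0])     # the decision boundary f(x) = 0
plt.xlabel("x1"); plt.ylabel("x2")
plt.show()
```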

DECISION BOUNDARIES
[Figure: iris data (Sepal.Length vs. Sepal.Width; classes setosa, versicolor, virginica),
shown in four panels with the decision regions of four classifiers.]
Different shapes of decision boundaries. Classifiers: QDA, decision tree, nonlinear
SVM, Naive Bayes.

LINEAR CLASSIFIERS
If the discriminant functions fk (x) can be specified as linear functions
(possibly through a rank-preserving, monotone transformation
g : R → R), i.e.

g(fk(x)) = wk⊤x + bk,

we will call the classifier a linear classifier.

We can then write a tie between scores as

fi(x) = fj(x)
⇐⇒ g(fi(x)) = g(fj(x))
⇐⇒ wi⊤x + bi = wj⊤x + bj
⇐⇒ (wi − wj)⊤x + (bi − bj) = 0
⇐⇒ wij⊤x + bij = 0,

LINEAR CLASSIFIERS
with wij := wi − wj and bij := bi − bj. This is a hyperplane separating
two classes.

Note that linear classifiers can represent non-linear decision boundaries
in the original input space if we use derived features like higher-order
interactions, polynomial features, basis function expansions, etc.; a
small sketch of this follows below.
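To make the remark concrete, here is a small sketch (assuming scikit-learn is available; the data are simulated and not from the lecture): logistic regression is a linear classifier, but trained on degree-2 polynomial features it can separate a class structure that is circular in the original two-dimensional space.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Simulated two-class data with a circular class structure.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# Logistic regression is linear in its inputs; with degree-2 polynomial
# features it can represent the (circular) boundary in the original space.
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```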

SCALING: SIGMOIDS (BINARY)
Probabilistic classifiers are often preferred because probabilities yield a
more natural interpretation. Any score-generating model can be turned
into a probability estimator.

For g = 2, we can use a transformation function s : R → [0, 1]

π(x) := s (f (x)) ∈ [0, 1]


to map scores to probabilities.

A commonly used type of transformation function is the sigmoid
function: a sigmoid is a bounded, differentiable, real-valued function
s : R → [0, 1] that has a non-negative derivative at each point.

In deep learning, sigmoids are used as activation functions.

SCALING: SIGMOIDS (BINARY)
Examples of sigmoid functions:
Arctan function: s(t) = arctan(t)
Hyperbolic tangent: s(t) = tanh(t) = (e^t − e^(−t)) / (e^t + e^(−t))
Logistic function: s(t) = 1 / (1 + e^(−t))
Any cumulative distribution function (cdf), e.g. that of the normal
distribution (then also called the probit function)

[Figure: the sigmoid functions arctan, tanh, logistic, and probit plotted over x ∈ [−5, 5].]

SCALING: SIGMOIDS (BINARY)
The logistic function

s(t) = 1 / (1 + e^(−t))

is a popular choice for transforming scores to probabilities (used, for
example, in logistic regression). Properties of the logistic function s(t):
lim_{t → −∞} s(t) = 0 and lim_{t → ∞} s(t) = 1
∂s(t)/∂t = exp(t) / (1 + exp(t))^2 = s(t)(1 − s(t))
s(t) is symmetrical about the point (0, 1/2)
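Both properties can be verified numerically; a small sketch:

```python
import numpy as np

def s(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-6, 6, 1001)

# Derivative identity: s'(t) = s(t) (1 - s(t)), checked with a central finite difference.
eps = 1e-6
numeric = (s(t + eps) - s(t - eps)) / (2 * eps)
print(np.max(np.abs(numeric - s(t) * (1 - s(t)))))   # close to 0 (up to floating-point error)

# Symmetry about (0, 1/2): s(-t) = 1 - s(t).
print(np.max(np.abs(s(-t) - (1 - s(t)))))            # close to 0
```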
[Figure: the logistic function s(t) plotted against t.]

SCALING: SOFTMAX (MULTICLASS)
Any multiclass scoring classifier can be transformed into probabilities
using a transformation that maps the scores to a vector of probabilities

(f1(x), ..., fg(x)) ↦ (π1(x), ..., πg(x)),

fulfilling
πk(x) ∈ [0, 1] for all k = 1, ..., g
Σ_{k=1}^g πk(x) = 1.

A commonly used function is the softmax function, which is defined on
a numerical vector z (which could be our scores):

s(z)_k = exp(zk) / Σ_j exp(zj)
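A direct NumPy sketch of the softmax (subtracting the maximum entry before exponentiating is a common numerical-stability trick and does not change the result, since adding a constant to all entries cancels in the ratio); it also includes the g = 2 check mentioned on the next slide:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()                    # numerical stabilization, result unchanged
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, -0.5])
probs = softmax(scores)
print(probs, probs.sum())              # entries in [0, 1], summing to 1

# Check for g = 2: the softmax reduces to the logistic function of the
# score difference, softmax((f1, f2))[0] = 1 / (1 + exp(-(f1 - f2))).
f1, f2 = 1.3, -0.4
print(softmax([f1, f2])[0], 1.0 / (1.0 + np.exp(-(f1 - f2))))
```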

SCALING: SOFTMAX (MULTICLASS)
It is a generalization of the logistic function (check for g = 2). It
“squashes” a g-dimensional real-valued vector z to a vector of the same
dimension, with every entry in the range [0, 1] and all entries adding up
to 1.

For a categorical response variable y ∈ {1, . . . , g} with g > 2 the
model extends to

πk(x) = exp(fk(x)) / Σ_{j=1}^g exp(fj(x)).

Compared to the arg max operator, the soft max keeps information about the other,
non-maximal elements in a reversible way. Thus the function is called soft max.

Generative vs. Discriminative Approaches

GENERATIVE APPROACHES
Two fundamental approaches exist to construct classifiers: The
generative approach and the discriminant approach.

The generative approach employs the Bayes theorem:

πk(x) = P(y = k | x) = P(x | y = k) P(y = k) / P(x) ∝ P(x | y = k) πk

and models P(x | y = k) (usually by assuming something about the
structure of this distribution) to allow the computation of πk(x). The
discriminant functions in this approach are

πk(x) or log P(x | y = k) + log πk

Discriminant approaches try to model the discriminant functions
directly, often by loss minimization.

GENERATIVE APPROACHES
Examples:
Linear discriminant analysis: generative, linear. Each class’s
density is a multivariate Gaussian with

P(x | y = k) ∼ N(µk, Σ)

with equal covariances.

Quadratic discriminant analysis: generative, non-linear. Each
class’s density is a multivariate Gaussian with

P(x | y = k) ∼ N(µk, Σk)

with unequal covariances Σk.
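A minimal sketch of this recipe in the QDA flavour (one Gaussian per class with its own covariance), using plug-in maximum-likelihood estimates on simulated data; this is an illustration, not the lecture's implementation, and assumes SciPy is available:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Simulated two-class data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([2, 2], 0.5, size=(100, 2))])
y = np.repeat([0, 1], 100)

# Generative approach: estimate prior pi_k and class-conditional N(mu_k, Sigma_k).
classes = np.unique(y)
priors = [np.mean(y == k) for k in classes]
dists = [multivariate_normal(X[y == k].mean(axis=0), np.cov(X[y == k].T))
         for k in classes]

def discriminant(x):
    # log P(x | y = k) + log pi_k for each class k
    return np.array([d.logpdf(x) + np.log(p) for d, p in zip(dists, priors)])

x_new = np.array([1.0, 1.0])
print(discriminant(x_new), np.argmax(discriminant(x_new)))
```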

GENERATIVE APPROACHES
Naive Bayes: generative, non-linear. “Naive” conditional
independence assumption, i.e. the features are conditionally
independent of each other given the category y:

P(x | y = k) = P((x1, ..., xp) | y = k) = Π_{j=1}^p P(xj | y = k).

(A small sketch of this follows after the list.)

Logistic regression: discriminant, (usually) linear
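And a matching sketch of the naive-Bayes factorization, where each factor P(xj | y = k) is modelled by a univariate Gaussian (an assumption of this sketch; other per-feature densities work as well), again with plug-in estimates on simulated data:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0, 0], 1.0, size=(100, 3)),
               rng.normal([1, 2, 0], 1.0, size=(100, 3))])
y = np.repeat([0, 1], 100)

def nb_log_posterior(x):
    """log P(x | y = k) + log pi_k with P(x | y = k) = prod_j P(x_j | y = k),
    each factor modelled as a univariate Gaussian."""
    out = []
    for k in np.unique(y):
        Xk = X[y == k]
        log_lik = norm.logpdf(x, loc=Xk.mean(axis=0), scale=Xk.std(axis=0)).sum()
        out.append(log_lik + np.log(np.mean(y == k)))
    return np.array(out)

print(np.argmax(nb_log_posterior(np.array([0.5, 1.0, 0.0]))))
```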

