
Advanced Statistical Learning

Chapter 2: Classification
Bernd Bischl, Julia Moosbauer, Andreas Groll

Department of Statistics – TU Dortmund


Winter term 2020/21
CLASSIFICATION TASKS
In classification, we aim at predicting a discrete output

y ∈ Y = {C1, ..., Cg}

with 2 ≤ g < ∞, given data D.

In this course, we assume the classes to be encoded as

Y = {0, 1} or Y = {−1, 1} (in the binary case, g = 2)
Y = {1, 2, ..., g} (in the multiclass case, g ≥ 3)

Classifiers

CLASSIFICATION MODELS
We defined models f : X → R^g as functions that output (continuous)
scores / probabilities and not (discrete) classes. Why?
From an optimization perspective, it is much (!) easier to optimize
costs for continuous-valued functions
Scores / probabilities (for classes) contain more information than
the class labels alone
As we will see later, scores can easily be transformed into class
labels; but class labels cannot be transformed into scores
We distinguish scoring and probabilistic classifiers.

CLASSIFICATION MODELS
Scoring Classifiers:
Construct g discriminant / scoring functions f1, ..., fg : X → R
Scores f1(x), ..., fg(x) are transformed into classes by choosing
the class with the maximum score

h(x) = arg max_{k ∈ {1, 2, ..., g}} fk(x).

For g = 2, a single discriminant function f(x) = f1(x) − f−1(x) is
sufficient (note that it would be more natural here to label the
classes with {+1, −1}):

h(x) = 1
⇐⇒ f1(x) > f−1(x)
⇐⇒ f1(x) − f−1(x) > 0
⇐⇒ f(x) > 0
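As a small illustration of these rules, here is a minimal Python/NumPy sketch; the weight matrix, offsets, and the example point are made up for illustration and are not part of the lecture:

```python
import numpy as np

# Made-up linear scoring functions f_1, ..., f_g for g = 3 classes:
# f_k(x) = W[k] @ x + b[k].
W = np.array([[ 2.0, -1.0],
              [ 0.5,  0.5],
              [-1.0,  1.5]])
b = np.array([0.0, 0.2, -0.1])

def scores(x):
    return W @ x + b                      # vector (f_1(x), ..., f_g(x))

def h(x):
    return int(np.argmax(scores(x))) + 1  # classes encoded as 1, ..., g

x = np.array([1.0, 0.5])
print(scores(x), h(x))

# Binary special case: a single discriminant f(x) = f_1(x) - f_{-1}(x),
# predicted class h(x) = sgn(f(x)).
f = scores(x)[0] - scores(x)[1]
print(1 if f > 0 else -1)
```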
CLASSIFICATION MODELS
Class labels are constructed by h(x) = sgn(f(x))
|f(x)| is called the “confidence”

CLASSIFICATION MODELS
Probabilistic Classifiers:
Construct g probability functions π1, ..., πg : X → [0, 1] with Σ_i πi = 1
Probabilities π1(x), ..., πg(x) are transformed into labels by
predicting the class with the maximum probability

h(x) = arg max_{k ∈ {1, 2, ..., g}} πk(x)

If g = 2, most models output a single probability function π(x)
(note that it would be more natural here to label the classes
with {0, 1})
If g = 2, transformation into discrete classes by thresholding:
h(x) := 1(π(x) ≥ c) for some threshold c ∈ [0, 1] (for binary
classification we usually set c = 0.5); see the sketch below
Probabilistic classifiers can also be seen as scoring classifiers
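A minimal sketch of both rules in Python; the probability model below is a made-up logistic score used only for illustration:

```python
import numpy as np

def pi(x):
    """Made-up binary probability model pi(x) = P(y = 1 | x)."""
    return 1.0 / (1.0 + np.exp(-(1.5 * x[0] - x[1])))

def h_binary(x, c=0.5):
    return int(pi(x) >= c)            # h(x) = 1(pi(x) >= c)

def h_multiclass(probs):
    return int(np.argmax(probs)) + 1  # class with maximum probability, encoded 1, ..., g

x = np.array([0.3, 1.0])
print(pi(x), h_binary(x))
print(h_multiclass(np.array([0.2, 0.5, 0.3])))
```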

CLASSIFICATION MODELS
Remark: If we want to emphasize that our model outputs probabilities,
we denote the model as π : X → [0, 1]^g; if we are talking about models
in a general sense, we write f, comprising both probabilistic and scoring
classifiers (context will make this clear!)

CLASSIFICATION MODELS
Both scoring and probabilistic classifiers can be turned into
discrete classes via thresholding (binary case) or by predicting the
class with the maximum score (multiclass case)
Discrete classes, which are often produced internally from scores,
cannot be transformed back into scores

DECISION BOUNDARIES

A decision boundary is a hypersurface that partitions
the input space X into g (the number of classes) decision regions

Xk = {x ∈ X : h(x) = k}

Ties between those regions form the decision boundaries

{x ∈ X : ∃ i ≠ j s.t. fi(x) = fj(x) and fi(x), fj(x) ≥ fk(x) ∀ k ≠ i, j}

(in the general multiclass case) or

f(x) = c,

if c ∈ R was used as threshold (usually c = 0 for scoring
classifiers and c = 0.5 for probabilistic classifiers) in the binary
case.
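Decision boundaries are typically visualized by evaluating the classifier on a dense grid over the input space. A minimal sketch with a made-up linear binary scorer (not one of the lecture's classifiers), assuming matplotlib is available:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up binary scorer f(x) = w^T x + b; the decision boundary is {x : f(x) = 0}.
w, b = np.array([1.0, -2.0]), 0.5

xx, yy = np.meshgrid(np.linspace(-3, 3, 300), np.linspace(-3, 3, 300))
f_vals = (np.c_[xx.ravel(), yy.ravel()] @ w + b).reshape(xx.shape)

plt.contourf(xx, yy, f_vals > 0, alpha=0.3)   # the two decision regions X_0, X_1
plt.contour(xx, yy, f_vals, levels=[0.0])     # the decision boundary f(x) = 0
plt.xlabel("x1"); plt.ylabel("x2")
plt.show()
```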

DECISION BOUNDARIES
[Figure: iris data (Sepal.Length vs. Sepal.Width; classes setosa, versicolor, virginica),
shown in four panels with the decision regions of four classifiers.]
Different shapes of decision boundaries. Classifiers: QDA, decision tree, nonlinear
SVM, Naive Bayes.

LINEAR CLASSIFIERS
If the discriminant functions fk (x) can be specified as linear functions
(possibly through a rank-preserving, monotone transformation
g : R → R), i.e.

g(fk(x)) = wk⊤x + bk,

we will call the classifier a linear classifier.

We can then write a tie between scores as

fi(x) = fj(x)
⇐⇒ g(fi(x)) = g(fj(x))
⇐⇒ wi⊤x + bi = wj⊤x + bj
⇐⇒ (wi − wj)⊤x + (bi − bj) = 0
⇐⇒ wij⊤x + bij = 0,

LINEAR CLASSIFIERS
with wij := wi − wj and bij := bi − bj. This is a hyperplane separating
two classes.

Note that linear classifiers can represent non-linear decision boundaries
in the original input space if we use derived features like higher-order
interactions, polynomial features, basis function expansions, etc.; a
small sketch of this follows below.
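To make the remark concrete, here is a small sketch (assuming scikit-learn is available; the data are simulated and not from the lecture): logistic regression is a linear classifier, but trained on degree-2 polynomial features it can separate a class structure that is circular in the original two-dimensional space.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Simulated two-class data with a circular class structure.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# Logistic regression is linear in its inputs; with degree-2 polynomial
# features it can represent the (circular) boundary in the original space.
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```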

SCALING: SIGMOIDS (BINARY)
Probabilistic classifiers are often preferred because probabilities yield a
more natural interpretation. Any score-generating model can be turned
into a probability estimator.

For g = 2, we can use a transformation function s : R → [0, 1]

π(x) := s (f (x)) ∈ [0, 1]


to map scores to probabilities.

A commonly used type of transformation function is the sigmoid
function: a sigmoid is a bounded, differentiable, real-valued function
s : R → [0, 1] that has a non-negative derivative at each point.

In deep learning, sigmoids are used as activation functions.

SCALING: SIGMOIDS (BINARY)
Examples of sigmoid functions:
Arctan function: s(t) = arctan(t)
Hyperbolic tangent: s(t) = tanh(t) = (e^t − e^(−t)) / (e^t + e^(−t))
Logistic function: s(t) = 1 / (1 + e^(−t))
Any cumulative distribution function (cdf), e.g. that of the normal
distribution (then also called the probit function)

[Figure: the sigmoid functions arctan, tanh, logistic, and probit plotted over x ∈ [−5, 5].]

SCALING: SIGMOIDS (BINARY)
The logistic function

s(t) = 1 / (1 + e^(−t))

is a popular choice for transforming scores to probabilities (used, for
example, in logistic regression). Properties of the logistic function s(t):
lim_{t → −∞} s(t) = 0 and lim_{t → ∞} s(t) = 1
∂s(t)/∂t = exp(t) / (1 + exp(t))^2 = s(t)(1 − s(t))
s(t) is symmetrical about the point (0, 1/2)
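Both properties can be verified numerically; a small sketch:

```python
import numpy as np

def s(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-6, 6, 1001)

# Derivative identity: s'(t) = s(t) (1 - s(t)), checked with a central finite difference.
eps = 1e-6
numeric = (s(t + eps) - s(t - eps)) / (2 * eps)
print(np.max(np.abs(numeric - s(t) * (1 - s(t)))))   # close to 0 (up to floating-point error)

# Symmetry about (0, 1/2): s(-t) = 1 - s(t).
print(np.max(np.abs(s(-t) - (1 - s(t)))))            # close to 0
```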
[Figure: the logistic function s(t) plotted against t.]

SCALING: SOFTMAX (MULTICLASS)
Any multiclass scoring classifier can be transformed into probabilities
using a transformation that maps the scores to a vector of probabilities

(f1(x), ..., fg(x)) ↦ (π1(x), ..., πg(x)),

fulfilling
πk(x) ∈ [0, 1] for all k = 1, ..., g
Σ_{k=1}^g πk(x) = 1.

A commonly used function is the softmax function, which is defined on
a numerical vector z (which could be our scores):

s(z)_k = exp(zk) / Σ_j exp(zj)
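A direct NumPy sketch of the softmax (subtracting the maximum entry before exponentiating is a common numerical-stability trick and does not change the result, since adding a constant to all entries cancels in the ratio); it also includes the g = 2 check mentioned on the next slide:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()                    # numerical stabilization, result unchanged
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, -0.5])
probs = softmax(scores)
print(probs, probs.sum())              # entries in [0, 1], summing to 1

# Check for g = 2: the softmax reduces to the logistic function of the
# score difference, softmax((f1, f2))[0] = 1 / (1 + exp(-(f1 - f2))).
f1, f2 = 1.3, -0.4
print(softmax([f1, f2])[0], 1.0 / (1.0 + np.exp(-(f1 - f2))))
```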

SCALING: SOFTMAX (MULTICLASS)
It is a generalization of the logistic function (check for g = 2). It
“squashes” a g-dimensional real-valued vector z to a vector of the same
dimension, with every entry in the range [0, 1] and all entries adding up
to 1.

For a categorical response variable y ∈ {1, . . . , g} with g > 2 the
model extends to

πk(x) = exp(fk(x)) / Σ_{j=1}^g exp(fj(x)).

Compared to the arg max operator, the soft max keeps information about the other,
non-maximal elements in a reversible way. Thus the function is called soft max.

Generative vs. Discriminative Approaches

GENERATIVE APPROACHES
Two fundamental approaches exist to construct classifiers: The
generative approach and the discriminant approach.

The generative approach employs the Bayes theorem:

πk(x) = P(y = k | x) = P(x | y = k) P(y = k) / P(x) ∝ P(x | y = k) πk

and models P(x | y = k) (usually by assuming something about the
structure of this distribution) to allow the computation of πk(x). The
discriminant functions in this approach are

πk(x) or log P(x | y = k) + log πk

Discriminant approaches try to model the discriminant functions
directly, often by loss minimization.

GENERATIVE APPROACHES
Examples:
Linear discriminant analysis: generative, linear. Each class’s
density is a multivariate Gaussian with

P(x | y = k) ∼ N(µk, Σ)

with equal covariances.

Quadratic discriminant analysis: generative, non-linear. Each
class’s density is a multivariate Gaussian with

P(x | y = k) ∼ N(µk, Σk)

with unequal covariances Σk.
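A minimal sketch of this recipe in the QDA flavour (one Gaussian per class with its own covariance), using plug-in maximum-likelihood estimates on simulated data; this is an illustration, not the lecture's implementation, and assumes SciPy is available:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Simulated two-class data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([2, 2], 0.5, size=(100, 2))])
y = np.repeat([0, 1], 100)

# Generative approach: estimate prior pi_k and class-conditional N(mu_k, Sigma_k).
classes = np.unique(y)
priors = [np.mean(y == k) for k in classes]
dists = [multivariate_normal(X[y == k].mean(axis=0), np.cov(X[y == k].T))
         for k in classes]

def discriminant(x):
    # log P(x | y = k) + log pi_k for each class k
    return np.array([d.logpdf(x) + np.log(p) for d, p in zip(dists, priors)])

x_new = np.array([1.0, 1.0])
print(discriminant(x_new), np.argmax(discriminant(x_new)))
```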

GENERATIVE APPROACHES
Naive Bayes: generative, non-linear. “Naive” conditional
independence assumption, i.e. the features are conditionally
independent of each other given the category y:

P(x | y = k) = P((x1, ..., xp) | y = k) = Π_{j=1}^p P(xj | y = k).

(A small sketch of this follows after the list.)

Logistic regression: discriminant, (usually) linear
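And a matching sketch of the naive-Bayes factorization, where each factor P(xj | y = k) is modelled by a univariate Gaussian (an assumption of this sketch; other per-feature densities work as well), again with plug-in estimates on simulated data:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0, 0], 1.0, size=(100, 3)),
               rng.normal([1, 2, 0], 1.0, size=(100, 3))])
y = np.repeat([0, 1], 100)

def nb_log_posterior(x):
    """log P(x | y = k) + log pi_k with P(x | y = k) = prod_j P(x_j | y = k),
    each factor modelled as a univariate Gaussian."""
    out = []
    for k in np.unique(y):
        Xk = X[y == k]
        log_lik = norm.logpdf(x, loc=Xk.mean(axis=0), scale=Xk.std(axis=0)).sum()
        out.append(log_lik + np.log(np.mean(y == k)))
    return np.array(out)

print(np.argmax(nb_log_posterior(np.array([0.5, 1.0, 0.0]))))
```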

