
Lecture Pattern Analysis

Part 01: Introduction and First Sampling

Christian Riess
IT Security Infrastructures Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
28. April 2025
Pattern Recognition Recap and Unsupervised Learning

• Remember the steps of the classical pattern recognition pipeline:

  (Data) → Sampling → Preprocessing → Feature Extraction → x → Classification: f(x) → y (Class)

• Fundamental ML assumption: good feature representations map similar


objects to similar features
• Classifier training is almost always supervised,
  i.e., a training sample is a tuple (x_i, y_i) (cf. lecture “Pattern Recognition”)
• Unsupervised ML works without labels, i.e., it only operates on inputs (x_i)
• Unsupervised ML can be seen as a representation or summary of a distribution
• So, “classification versus representation” could be a jingle to further distinguish
PR from PA (cf. our discussion in the joint meeting)

C. Riess | Part 01: Introduction and First Sampling 28. April 2025 1
Further Aspects of Interest: Parameters and Hyperparameters

• Every machine learning model has parameters


• For example, linear regression with d parameters β_i predicts y for a
  d-dimensional input x̃ = (1, x_1, ..., x_{d−1})^⊤ via the hyperplane

      y = β^⊤ x̃ = Σ_{i=0}^{d−1} β_i · x̃_i    (1)

• Fewer parameters make the model more robust; more parameters make the
  model more flexible
• To continue the example, consider linear regression on a basis expansion of
  a scalar unknown x, e.g., fitting a degree-d polynomial via the basis vector
  (1, x, x², ..., x^d): larger d enables more complex polynomials (see the sketch below)
• The dimension d is a hyperparameter, i.e., a parameter that somehow
parameterizes the choice of parameters
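
To make the polynomial example concrete, here is a minimal sketch of fitting such a basis expansion by least squares; the toy data, the degree grid, and the use of numpy's Vandermonde matrix are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy observations of a scalar function (toy data, assumed for illustration)
x = np.linspace(-1, 1, 30)
y = np.sin(2.5 * x) + 0.1 * rng.standard_normal(x.shape)

def fit_polynomial(x, y, degree):
    """Least-squares fit of the basis expansion (1, x, x^2, ..., x^degree)."""
    X = np.vander(x, degree + 1, increasing=True)   # design matrix, one column per basis function
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # parameters beta_0, ..., beta_degree
    return beta

def predict(beta, x):
    X = np.vander(x, len(beta), increasing=True)
    return X @ beta

# The degree is the hyperparameter: larger values give a more flexible, less robust model
for degree in (1, 3, 9):
    beta = fit_polynomial(x, y, degree)
    err = np.mean((predict(beta, x) - y) ** 2)
    print(f"degree {degree}: {degree + 1} parameters, training MSE {err:.4f}")
```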
C. Riess | Part 01: Introduction and First Sampling 28. April 2025 2
Further Aspects of Interest: Local Operators and High
Dimensional Spaces

• Thinking about model flexibility: more “local” models are more flexible, but
require more parameters and are less robust
• How can we find a good trade-off? This is the model selection problem

• Another issue: all local models perform poorly in higher-dimensional spaces
• A perhaps surprising consequence is that high-dimensional methods must
  be non-local along some direction

• Summarization methods (e.g., clustering) also perform poorly in
  higher-dimensional spaces

• All these points motivate also looking into dimensionality reduction

C. Riess | Part 01: Introduction and First Sampling 28. April 2025 3
A Study of Distributions

• In PA, we look at data in feature spaces


• To understand and manipulate these data points, they are mathematically
  commonly represented as probability density functions (PDFs)
• Additionally, inference allows us to draw conclusions from distributions

• Common operations on distributions:


• Fitting a distribution model to the data (parametric or non-parametric)
represents the data as a distribution
• Sampling from a distribution creates new data points that follow the
distribution (i.e., they are plausible)
• Factorizing a distribution is a key technique for reducing the complexity

C. Riess | Part 01: Introduction and First Sampling 28. April 2025 4
Recap on Probability Vocabulary

• Let X , Y denote two random variables


• Important vocabulary and equations are:
      Joint distribution                         p(X, Y)
      Conditional distribution of X given Y      p(X | Y)
      Sum rule / marginalization over Y          p(X) = Σ_Y p(X, Y)
      Product rule                               p(X, Y) = p(Y | X) · p(X)
      Bayes rule                                 p(Y | X) = p(X | Y) · p(Y) / p(X)
      Bayes rule in the language of ML           posterior = likelihood · prior / evidence

• Please browse the book by Bishop, Sec. 1.2.3, to refresh your mind if
necessary!
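
As a small refresher, these rules can be checked numerically on a discrete joint distribution; the table values below are made up purely for illustration:

```python
import numpy as np

# Joint distribution p(X, Y) for two discrete random variables,
# rows index X, columns index Y (illustrative values that sum to 1)
p_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

# Sum rule / marginalization: p(X) = sum_Y p(X, Y), p(Y) = sum_X p(X, Y)
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Product rule: p(X, Y) = p(Y | X) * p(X)
p_y_given_x = p_xy / p_x[:, None]
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)

# Bayes rule: p(Y | X) = p(X | Y) * p(Y) / p(X)
p_x_given_y = p_xy / p_y[None, :]
bayes = p_x_given_y * p_y[None, :] / p_x[:, None]
assert np.allclose(bayes, p_y_given_x)

print("p(X) =", p_x, " p(Y) =", p_y)
```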

C. Riess | Part 01: Introduction and First Sampling 28. April 2025 5
Sampling from a PDF

• Oftentimes, it is necessary to draw samples from a PDF


• Example:
• Logistic Regression fits a single regression curve to the data (cf. PR)
• Bayesian Logistic Regression fits a distribution of curves

The distribution is narrow near the observations and wider elsewhere


• Sample curves from the distribution to obtain its spread (“uncertainty”)

• Special PDFs like Gaussians have closed-form solutions for sampling


• We now look at a sampling method that works on arbitrary PDFs
C. Riess | Part 01: Introduction and First Sampling 28. April 2025 6
Idea of the Sampling Algorithm

• The key idea is to use the cumulative distribution function (CDF) P(z) of p(X),

      P(z) = ∫_{−∞}^{z} p(X) dX    (2)

• A value drawn uniformly from the CDF's y-axis intersects P(z) at some location z

• This position z is our random draw from p(X):

  [Figure: the density p(X) over X, and its CDF P(z) with range [0, 1]; a uniform draw on the CDF's vertical axis is mapped through P(z) to a sample z on the horizontal axis]

C. Riess | Part 01: Introduction and First Sampling 28. April 2025 7
Sampling Algorithm

• Split the domain of p(X ) into discrete bins, enumerate the bins
• On these bins, calculate the cumulative distribution function (CDF) P(z) of p(X)
• Draw a uniformly distributed number u between 0 and 1
• The sample from the PDF is

      z* = argmin_z P(z) ≥ u ,    (3)

  i.e., the value where u intersects the CDF

• Note that the range of a CDF is always [0; 1]


• The CDF “warps” the uniform sample into a sample from distribution p(X ).
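
A minimal sketch of this binned inverse-CDF sampling, assuming the target density is only available as values on a regular 1-D bin grid; the discretization and the bin count are arbitrary choices for illustration:

```python
import numpy as np

def sample_from_pdf(pdf_values, bin_edges, n_samples, rng=None):
    """Draw samples from a density given by per-bin values (binned inverse-CDF sampling)."""
    rng = np.random.default_rng() if rng is None else rng
    widths = np.diff(bin_edges)
    probs = pdf_values * widths
    probs = probs / probs.sum()                  # normalize to a discrete distribution over bins
    cdf = np.cumsum(probs)                       # P(z) on the bin grid, range [0, 1]
    u = rng.uniform(0.0, 1.0, size=n_samples)    # uniform draws on the CDF's y-axis
    idx = np.minimum(np.searchsorted(cdf, u), len(probs) - 1)   # smallest bin with P(z) >= u, cf. eq. (3)
    # place each sample uniformly inside its bin so that we do not return only bin edges
    return bin_edges[idx] + rng.uniform(0.0, 1.0, size=n_samples) * widths[idx]

# Example: sample from a standard normal discretized into 200 bins
edges = np.linspace(-4.0, 4.0, 201)
centers = 0.5 * (edges[:-1] + edges[1:])
pdf = np.exp(-0.5 * centers**2) / np.sqrt(2 * np.pi)
samples = sample_from_pdf(pdf, edges, 10_000)
print(samples.mean(), samples.std())             # close to 0 and 1
```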

C. Riess | Part 01: Introduction and First Sampling 28. April 2025 8
Practical Realization and Limitations

• Theoretically, we can split any space into bins


• In practice, however, it is not clear how small or large these bins shall be
• Do empty bins indicate that it is impossible to sample from there? Or do they
indicate a lack of observations?
• Hence, the method requires many observations (as a naive rule-of-thumb,
one can use 30 · b observations for b bins)
• Note that for a fixed bin width, the number of bins grows exponentially with
the dimensionality of the space

• These limitations make the method somewhat unattractive in practice.


We will look at more advanced sampling strategies later

C. Riess | Part 01: Introduction and First Sampling 28. April 2025 9
Lecture Pattern Analysis

Part 02: Non-Parametric Density Estimation

Christian Riess
IT Security Infrastructures Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
28. April 2025
Introduction

• Density Estimation = create a PDF from a set of samples


• The lecture Pattern Recognition introduces parametric density estimation:
• There, a parametric model (e.g., a Gaussian) is fitted to the data
• Maximum Likelihood (ML) estimator:

      θ* = argmax_θ p(x_1, ..., x_N | θ)    (1)

• Maximum a Posteriori (MAP) estimator:

      θ* = argmax_θ p(θ | x_1, ..., x_N) = argmax_θ  p(x_1, ..., x_N | θ) · p(θ) / p(x_1, ..., x_N)    (Bayes)    (2)

• Browse the PR slides if you would like to know more

• Parametric density estimators require a good function representation


• Non-parametric density estimators can operate on arbitrary distributions
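
As a concrete reminder of the parametric case, here is a minimal sketch of the ML estimate (1) for a 1-D Gaussian, whose maximum has the closed-form solution of the sample mean and (biased) sample standard deviation; the toy data are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=1000)   # samples x_1, ..., x_N (toy data)

# ML estimate theta* = (mu, sigma) maximizing p(x_1, ..., x_N | theta) for a Gaussian model
mu_ml = x.mean()
sigma_ml = x.std()          # note: the ML estimate uses 1/N, not the unbiased 1/(N-1)

def gaussian_log_likelihood(x, mu, sigma):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

print(f"mu* = {mu_ml:.3f}, sigma* = {sigma_ml:.3f}, "
      f"log-likelihood = {gaussian_log_likelihood(x, mu_ml, sigma_ml):.1f}")
```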
C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 1
Non-Parametric Density Estimation: Histograms

• Non-parametric estimators do not use functions with a limited set of


parameters
• A simple non-parametric baseline is to create a histogram of samples1
• The number of bins is important to obtain a good fit

• Pro: Good for a quick visualization


• Pro: “Cheap” for many samples in low-dimensional space
• Con: Discontinuities at bin boundaries
• Con: Scales poorly to high dimensions (cf. curse of dimensionality later)
1 See the introduction of Bishop Sec. 2.5
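
A minimal sketch of the histogram baseline in 1-D, assuming toy bimodal data; the number of bins is the hyperparameter discussed above:

```python
import numpy as np

rng = np.random.default_rng(2)
samples = np.concatenate([rng.normal(-2.0, 0.5, 400),      # toy bimodal data
                          rng.normal(1.0, 1.0, 600)])

# density=True normalizes the counts so that the histogram integrates to 1
counts, edges = np.histogram(samples, bins=30, density=True)

def hist_density(x, counts, edges):
    """Evaluate the piecewise-constant histogram estimate at positions x."""
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(counts) - 1)
    inside = (x >= edges[0]) & (x <= edges[-1])
    return np.where(inside, counts[idx], 0.0)               # zero outside the observed range

print(hist_density(np.array([-2.0, 0.0, 5.0]), counts, edges))
```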

C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 2


Improving on the Histogram Approach

• A kernel-based method and a nearest-neighbor method are slightly better


• Both variants share their mathematical framework:
• Let p(x) be a PDF in D-dimensional space, and R a small region around x
  → The probability mass in R is p = ∫_R p(x) dx
• Assumption 1: R contains many points → p is a relative frequency,

      p = K / N = (# points in R) / (total # of points)    (3)

• Assumption 2: R is small enough s.t. p(x) is approximately constant,

      p = ∫_R p(x) dx ≈ p(x) · ∫_R dx = p(x) · V    (4)

• Both assumptions together are slightly contradictory, but they yield

      p(x) = K / (N · V) = (# points in R) / (total # of points · volume of R)    (5)
C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 3
Kernel-based DE: Parzen Window Estimator (1/2)

• The Parzen window estimator fixes V and leaves K /N variable2


• D-dimensional Parzen window kernel function (a.k.a. “box kernel”):

      k(u) = { 1   if |u_i| ≤ 1/2  ∀ i = 1, ..., D
             { 0   otherwise                            (6)

• Calculate K with this kernel function:

      K(x) = Σ_{i=1}^{N} k( (x − x_i) / h )    (7)

  where h is a scaling factor that adjusts the box size


• Hence, the whole density is

      p(x) = (1/N) · Σ_{i=1}^{N} (1/h^D) · k( (x − x_i) / h )    (8)

2 See Bishop Sec. 2.5.1
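
A minimal sketch of eqs. (6)-(8), assuming the samples are given as an (N, D) array; the toy data and the choice of h are illustrative:

```python
import numpy as np

def box_kernel(u):
    """Box kernel k(u): 1 if all coordinates satisfy |u_i| <= 1/2, else 0 (eq. (6))."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_density(x, samples, h):
    """Evaluate p(x) = 1/N * sum_i 1/h^D * k((x - x_i) / h) at the query points x (eq. (8))."""
    x = np.atleast_2d(x)                   # (M, D) query points
    samples = np.atleast_2d(samples)       # (N, D) training samples
    N, D = samples.shape
    u = (x[:, None, :] - samples[None, :, :]) / h      # (M, N, D) scaled differences
    return box_kernel(u).sum(axis=1) / (N * h**D)

# Toy 1-D example; h is the hyperparameter that controls the box size
rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, size=(500, 1))
query = np.array([[0.0], [1.0], [3.0]])
print(parzen_density(query, data, h=0.5))
```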

C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 4


Kernel-based DE: Parzen Window Estimator (2/2)

• The kernel removes much of the discretization error of the fixed-distance


histogram bins, but it still leads to blocky estimates
• Replacing the box kernel by a Gaussian kernel further smooths the result,

      p(x) = (1/N) · Σ_{i=1}^{N} ( 1 / (2π h²) )^{D/2} · exp( −∥x − x_i∥²₂ / (2h²) ) ,    (9)

  where h is the standard deviation of the Gaussian


• Mathematically, any other kernel is also possible if these conditions hold:

      k(u) ≥ 0          (10)

      ∫ k(u) du = 1     (11)
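
A brief 1-D sketch of the Gaussian-kernel estimate of eq. (9), shown next to scipy's gaussian_kde for comparison; note that scipy's bw_method is a factor relative to the data's standard deviation, not h itself, so the two outputs are not expected to match exactly:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(4)
data = np.concatenate([rng.normal(-1.5, 0.4, 300), rng.normal(1.0, 0.8, 700)])

def gauss_kde_manual(x, samples, h):
    """Direct implementation of eq. (9) for 1-D data with standard deviation h."""
    diffs = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * diffs**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

x = np.linspace(-4, 4, 5)
print(gauss_kde_manual(x, data, h=0.3))

# scipy's estimator; a scalar bw_method scales the bandwidth relative to the data's std
kde = gaussian_kde(data, bw_method=0.3)
print(kde(x))
```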

C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 5


K-Nearest Neighbors (k-NN) Density Estimation

• Recall our derived equation for estimating the density,

      p(x) = K / (N · V) = (# points in R) / (total # of points · volume of R)    (12)

• The Parzen window estimator fixes V , and K varies


• The k-Nearest Neighbors estimator fixes K , and V varies
• k-NN calculates V from the distance to the K-th nearest neighbor3

• Note that both the Parzen window estimator and the k-NN estimator are
“non-parametric”, but they are not free of parameters
• The kernel scaling h and the number of neighbors k are hyper-parameters,
i.e., some form of prior knowledge to guide the model creation
• The model parameters are the samples themselves. Both estimators need to
store all samples, which is why they are also called memory methods
3 See Bishop Sec. 5.2.2
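
A minimal sketch of the k-NN estimator from eq. (12), assuming V is the volume of the ball whose radius is the distance to the K-th nearest neighbor; the toy data are illustrative:

```python
import numpy as np
from math import gamma, pi

def knn_density(x, samples, K):
    """Estimate p(x) = K / (N * V), where V is the volume of the smallest ball
    around x that contains the K nearest training samples."""
    x = np.atleast_2d(x)              # (M, D) query points
    samples = np.atleast_2d(samples)  # (N, D) training samples
    N, D = samples.shape
    dists = np.linalg.norm(x[:, None, :] - samples[None, :, :], axis=-1)   # (M, N)
    r = np.sort(dists, axis=1)[:, K - 1]          # distance to the K-th nearest neighbor
    unit_ball = pi**(D / 2) / gamma(D / 2 + 1)    # volume of the D-dimensional unit ball
    V = unit_ball * r**D
    return K / (N * V)

rng = np.random.default_rng(5)
data = rng.normal(0.0, 1.0, size=(1000, 1))
print(knn_density(np.array([[0.0], [2.0]]), data, K=20))
```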

C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 6


First Glance at the Model Selection Problem
• Optimizing the hyperparameters is also called the model selection problem
• Hyperparameters must be optimized on a held-out part of the training data,
  the validation set:
  train on the training data with different hyperparameter sets h_i, evaluate on the
  validation data to get the best-performing set h* via maximum likelihood (ML)
• What if hyperparameters are optimized directly on the training data?
Then the most complex (largest, most flexible) model wins, because it
achieves the lowest training error
• When training data is limited, cross-validation (CV) may be a good
  approximation of the generalization error
• In CV, the data is subdivided into k folds (partitions). Do k training/evaluation runs
  (using each fold once for validation and the rest for training), and select the
  h* with maximum likelihood across all folds (see the sketch below)
• The choice of k is a hyper-hyperparameter that trades off computation time
  against the quality of the predicted error (cf. Hastie et al. Chap. 7)
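
A generic sketch of this k-fold procedure; train and validation_score are hypothetical placeholders for whatever model and quality measure are being tuned:

```python
import numpy as np

def cross_validate(X, hyperparams, train, validation_score, k=5, rng=None):
    """Return the hyperparameter set with the best average validation score over k folds."""
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(len(X)), k)
    best_h, best_score = None, -np.inf
    for h in hyperparams:
        scores = []
        for j in range(k):
            val_idx = folds[j]
            train_idx = np.concatenate([folds[i] for i in range(k) if i != j])
            model = train(X[train_idx], h)                      # fit on k-1 folds
            scores.append(validation_score(model, X[val_idx]))  # evaluate on the held-out fold
        if np.mean(scores) > best_score:
            best_h, best_score = h, np.mean(scores)
    return best_h, best_score
```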
C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 7
Hyperparameter Tuning for Unsupervised Methods

• Unsupervised tasks do not predict labels and hence there is no performance


measure like accuracy
• This also affects model selection: how to decide for model A over model B?
• Hence, we need other ways to calculate a quantitative performance measure.
We will explore several approaches to performance measurement for
unsupervised methods

• One generic approach is to measure the success in a downstream


application of the unsupervised method
• For density estimation, this can be the likelihood that some held-out
observations are drawn from the predicted density
• Hence, the best hyperparameters for density estimation are maximum
likelihood estimates for producing these held-out observations

C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 8


Hyperparameter Tuning for Densities on Limited Data

• Here is a specific instantiation of hyperparameter estimation for density


estimators in a cross-validation setting (i.e., on limited data)
• The trick is to optimize the DE hyperparameters by using the likelihood of
held-out samples as objective:
• Split the N samples x_i from dataset S into J folds:

      S_test^j = { x_{⌊N/J⌋·j}, ..., x_{⌊N/J⌋·(j+1)−1} } ,    S_train^j = S \ S_test^j

• Let α be the unknown hyperparameters, and
  let p_j(x|α) be the density estimate for the samples S_train^j with hyperparameters α.
Then, the ML estimate is

      α* = argmax_α  Π_{j=0}^{J−1}  Π_{x ∈ S_test^j}  p_j(x | α)    (13)

• In practice, take the logarithm (“log likelihood”) to mitigate numerical issues


→ the product becomes a sum
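
A sketch of this scheme for a 1-D Gaussian-kernel density estimate, with the bandwidth h playing the role of α and the summed held-out log-likelihood as objective; the candidate grid and toy data are assumptions:

```python
import numpy as np

def gauss_kde_logpdf(x_eval, x_train, h):
    """Log of the Gaussian-kernel density estimate (eq. (9), D = 1) at x_eval."""
    diffs = (x_eval[:, None] - x_train[None, :]) / h
    dens = np.exp(-0.5 * diffs**2).sum(axis=1) / (len(x_train) * h * np.sqrt(2 * np.pi))
    return np.log(dens + 1e-300)          # guard against log(0) for far-away points

def select_bandwidth(x, candidates, J=5, rng=None):
    """ML hyperparameter estimate (eq. (13)) via the summed held-out log-likelihood."""
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(len(x)), J)
    scores = []
    for h in candidates:
        total = 0.0
        for j in range(J):
            test = x[folds[j]]
            train = x[np.concatenate([folds[i] for i in range(J) if i != j])]
            total += gauss_kde_logpdf(test, train, h).sum()   # the log turns the product into a sum
        scores.append(total)
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(6)
data = np.concatenate([rng.normal(-2, 0.3, 200), rng.normal(1, 1.0, 300)])
print(select_bandwidth(data, candidates=np.array([0.05, 0.1, 0.2, 0.5, 1.0])))
```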

C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 9


Lecture Pattern Analysis

Part 03: Bias and Variance

Christian Riess
IT Security Infrastructures Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
28. April 2025
Introduction

• The motivation behind the hyperparameter optimization is to aim for


generalization to new data
• For kernel density estimation, the pitfalls are:
• Too large kernel: covers all space with some probability mass, but the density
is too uniform (does not represent the structure)
• Too small kernel: closely represents the training data, but might assign too low
probabilities in areas without training data
• In contrast, the “optimal”1 kernel size: represents the structure of the training
data and also covers unobserved areas to some extent
• This is an instance of the bias-variance tradeoff2

1 This may sound as if there were a unique minimum, maybe even of a convex function; in practice, there is not that one single best solution, so read this as a somewhat hypothetical statement
2 See the PR lecture or Hastie/Tibshirani/Friedman Sec. 7-7.3 if more details are desired

C. Riess | Part 03: Bias and Variance 28. April 2025 1


Bias and Variance in Regression

• Bias is the square of the average deviation of an estimator from the ground
truth
• Variance is the variance of the estimates, i.e., the expected squared
  deviation from the estimator's mean3
• Informal interpretation:
• High bias indicates model undercomplexity: we obtain a poor fit to the data
• High variance indicates model overcomplexity: the fit models not just
  the structure of the data, but also its noise
• Higher model complexity (= more model parameters) tends toward lower bias and
  higher variance
• We will usually not be able to get bias and variance simultaneously to 0
• Regularization increases bias and lowers variance
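
For reference, the decomposition behind these statements (following Hastie/Tibshirani/Friedman, Eqn. (7.9), see footnote 3) reads:

```latex
% Expected squared prediction error at x_0 for Y = f(X) + eps, Var(eps) = sigma_eps^2:
\operatorname{Err}(x_0)
  = \mathbb{E}\!\left[(Y - \hat f(x_0))^2 \mid X = x_0\right]
  = \underbrace{\sigma_\varepsilon^2}_{\text{irreducible noise}}
  + \underbrace{\bigl(\mathbb{E}[\hat f(x_0)] - f(x_0)\bigr)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\!\bigl[(\hat f(x_0) - \mathbb{E}[\hat f(x_0)])^2\bigr]}_{\text{Variance}}
```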

3 See Hastie/Tibshirani/Friedman Sec. 7.3, Eqn. (7.9), for a detailed derivation

C. Riess | Part 03: Bias and Variance 28. April 2025 2


Sketches for Model Undercomplexity and Overcomplexity

• Note that this example implicitly contains a smoothness assumption


• It does not claim that there is a universally best fit on arbitrary input
distributions (because of the No-Free-Lunch Theorem)
C. Riess | Part 03: Bias and Variance 28. April 2025 3
Transferring Bias and Variance to our Density Estimators

• Our kernel framework can directly replicate these investigations by


retargeting our kernels to regression or classification:
• Regression:
• Estimate f (x) at position x as a kernel-weighted sum of the neighbors or
• as a k -NN mean of k neighbors
• Classification:
• Estimate individual densities for classes c_1 and c_2, evaluate p_c1(x) and p_c2(x),
  and select the class with the higher probability, or
• Select the majority class among the k nearest neighbors
• We will then observe that
• Larger kernel support / larger k increases bias and lowers variance
• Smaller kernel support / smaller k lowers bias and increases variance

• Analogously, we can also apply the notion of bias/variance to our initial
  unsupervised density estimation task (a regression sketch follows below)
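
A brief sketch of the regression retargeting, with a kernel-weighted neighbor average (one common way to normalize the kernel-weighted sum) and a k-NN mean; the Gaussian kernel and the toy data are assumptions for illustration:

```python
import numpy as np

def kernel_regression(x_query, x_train, y_train, h):
    """Estimate f(x) as a kernel-weighted average of the training targets."""
    diffs = (x_query[:, None] - x_train[None, :]) / h
    w = np.exp(-0.5 * diffs**2)                    # Gaussian kernel weights
    return (w * y_train[None, :]).sum(axis=1) / w.sum(axis=1)

def knn_regression(x_query, x_train, y_train, k):
    """Estimate f(x) as the mean target of the k nearest neighbors."""
    idx = np.argsort(np.abs(x_query[:, None] - x_train[None, :]), axis=1)[:, :k]
    return y_train[idx].mean(axis=1)

# Toy data: smaller h / smaller k follow the noise more closely (higher variance),
# larger h / larger k smooth more aggressively (higher bias)
rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + 0.2 * rng.standard_normal(x.shape)
xq = np.linspace(0, 2 * np.pi, 5)
print(kernel_regression(xq, x, y, h=0.3))
print(knn_regression(xq, x, y, k=15))
```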

C. Riess | Part 03: Bias and Variance 28. April 2025 4
