pa_01_density_estimation
Christian Riess
IT Security Infrastructures Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
28. April 2025
Pattern Recognition Recap and Unsupervised Learning
[Diagram: input x is mapped by the model f(x) to the output y]
Further Aspects of Interest: Parameters and Hyperparameters
• Fewer parameters make the model more robust; more parameters make the model more flexible
• To continue the example, consider linear regression on a basis expansion of a scalar unknown x, e.g., fitting a polynomial of degree d via the expanded vector $(1, x, x^2, \dots, x^d)$: larger d enables more complex polynomials
• The degree d is a hyperparameter, i.e., a parameter that governs the structure of the model and thereby the choice of the actual model parameters (the polynomial coefficients); a small code sketch follows below
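As a small illustration (a sketch with assumed synthetic data, not part of the slides): the degree d is chosen by hand, while the fitted coefficients are the model parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.shape)  # noisy 1-D samples

for d in (1, 3, 9):                    # d is the hyperparameter
    coeffs = np.polyfit(x, y, deg=d)   # the d+1 coefficients are the model parameters
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {d}: {len(coeffs)} parameters, training MSE {mse:.4f}")
```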
Further Aspects of Interest: Local Operators and High-Dimensional Spaces
• Thinking about model flexibility: more “local” models are more flexible, but
require more parameters and are less robust
• How can we find a good trade-off? This is the model selection problem
• Another issue: all local models perform poorly in high-dimensional spaces (see the numerical illustration below)
• A perhaps surprising consequence is that high-dimensional methods must be non-local along some direction
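A small numerical illustration of this effect (a sketch of the standard unit-cube argument, not taken from the slides): to capture a fixed fraction of uniformly distributed points, a sub-cube of the unit hypercube needs an edge length that approaches 1 as the dimension D grows, so the neighborhood stops being local.

```python
# Edge length of a sub-cube of the unit hypercube that contains a given
# fraction of uniformly distributed data: edge = fraction ** (1 / D).
fraction = 0.10
for D in (1, 2, 10, 100):
    edge = fraction ** (1.0 / D)
    print(f"D = {D:3d}: edge length {edge:.3f} per axis")
# Already for D = 10 the "local" neighborhood spans roughly 80% of each axis.
```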
A Study of Distributions
Recap on Probability Vocabulary
• Please browse the book by Bishop, Sec. 1.2.3, to refresh your mind if
necessary!
Sampling from a PDF
• The key idea is to use the cumulative distribution function (CDF) P(z) of p(X),
$P(z) = \int_{-\infty}^{z} p(X)\,\mathrm{d}X$  (2)
[Figure: the PDF p(X) over X and the CDF P(z) over z; the CDF rises from 0 to 1]
Sampling Algorithm
• Split the domain of p(X ) into discrete bins, enumerate the bins
• On these bins, calculate the cumulative distribution function (CDF) P(z) of p(X)
• Draw a uniformly distributed number u between 0 and 1
• The sample from the PDF is
$z^* = \operatorname*{argmin}_{z} \; P(z) \geq u$  (3)
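A minimal sketch of this binned inverse-CDF sampling in Python (the target density, the number of bins, and all variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretize the domain of p(X) into bins and evaluate an example PDF on them
bins = np.linspace(-6.0, 6.0, 1000)
pdf = 0.3 * np.exp(-0.5 * (bins + 2.0) ** 2) + 0.7 * np.exp(-0.5 * (bins - 1.0) ** 2)
pdf /= pdf.sum()                 # normalize the discrete bins so they sum to 1

# Cumulative distribution function on the bins, cf. Eq. (2)
cdf = np.cumsum(pdf)
cdf /= cdf[-1]                   # guard against rounding: force the last value to exactly 1

# Draw u ~ U(0, 1) and take the first bin with P(z) >= u, cf. Eq. (3)
u = rng.uniform(size=10_000)
samples = bins[np.searchsorted(cdf, u)]
```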
Practical Realization and Limitations
Lecture Pattern Analysis
Introduction
² See Bishop Sec. 2.5.1
$p(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{2\pi h^2} \right)^{D/2} \exp\left( -\frac{\lVert \mathbf{x} - \mathbf{x}_i \rVert_2^2}{2h^2} \right)$  (9)
$k(u) \geq 0$  (10)
$\int k(u)\,\mathrm{d}u = 1$  (11)
$p(\mathbf{x}) = \frac{K}{N \cdot V} = \frac{\#\text{ points in } R}{\text{total } \#\text{ of points} \cdot \text{volume of } R}$  (12)
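A minimal 1-D sketch of both estimators (Eq. (9) with a Gaussian kernel and Eq. (12) with an interval as region R; the data set, h, and k are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)                       # training samples x_i, D = 1

def parzen_gauss(x, X, h):
    """Gaussian Parzen window estimate at a scalar query x, cf. Eq. (9)."""
    D = 1
    norm = (1.0 / (2.0 * np.pi * h ** 2)) ** (D / 2)
    return np.mean(norm * np.exp(-(x - X) ** 2 / (2.0 * h ** 2)))

def knn_density(x, X, k):
    """k-NN estimate at a scalar query x, cf. Eq. (12): K = k points in a region of volume V."""
    dists = np.sort(np.abs(x - X))
    V = 2.0 * dists[k - 1]                     # length of the interval holding the k nearest samples
    return k / (X.shape[0] * V)

# Both values should lie near the true N(0, 1) density at 0, which is about 0.40
print(parzen_gauss(0.0, X, h=0.3), knn_density(0.0, X, k=10))
```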
• Note that both the Parzen window estimator and the k-NN estimator are “non-parametric”, but they are not free of parameters
• The kernel scaling h and the number of neighbors k are hyperparameters, i.e., some form of prior knowledge that guides the model creation
• The model parameters are the samples themselves. Both estimators need to store all samples, which is why they are also called memory-based methods
³ See Bishop Sec. 5.2.2
$\alpha^* = \operatorname*{argmax}_{\alpha} \prod_{j=0}^{J-1} \prod_{\mathbf{x} \in S_j^{\text{test}}} p_j(\mathbf{x} \,|\, \alpha)$  (13)
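A minimal sketch of Eq. (13) for choosing the Parzen bandwidth h (so α = h): the data are split into J folds, the density is estimated on the training part of each fold, and the held-out log-likelihood is accumulated. Data set, fold count, and candidate values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=300)                        # 1-D data set
J = 5
folds = np.array_split(rng.permutation(X), J)   # the sets S_j

def parzen(queries, train, h):
    """Gaussian Parzen window density (D = 1) at the query points."""
    norm = 1.0 / np.sqrt(2.0 * np.pi * h ** 2)
    diff = queries[:, None] - train[None, :]
    return np.mean(norm * np.exp(-diff ** 2 / (2.0 * h ** 2)), axis=1)

best_h, best_score = None, -np.inf
for h in (0.05, 0.1, 0.2, 0.5, 1.0):            # candidate hyperparameter values alpha = h
    score = 0.0
    for j in range(J):
        train = np.concatenate([folds[i] for i in range(J) if i != j])
        # log of the inner product over x in S_j^test in Eq. (13)
        score += np.sum(np.log(parzen(folds[j], train, h)))
    if score > best_score:
        best_h, best_score = h, score
print("selected bandwidth h:", best_h)
```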
Introduction
¹ This may sound as if there were a unique minimum, maybe even of a convex function. In practice, there is not one single best solution, so read this as a somewhat hypothetical statement
² See PR lecture or Hastie/Tibshirani/Friedman Sec. 7-7.3 if more details are desired
• Bias is the square of the average deviation of an estimator from the ground truth
• Variance is the variance of the estimates, i.e., the expected squared deviation from the estimated mean³
• Informal interpretation:
• High bias indicates model undercomplexity: we obtain a poor fit to the data
• High variance indicates model overcomplexity: the fit models not just the structure of the data, but also its noise
• Higher model complexity (= more model parameters) tends toward lower bias and higher variance (see the sketch at the end of this section)
• We will usually not be able to drive bias and variance to zero simultaneously
• Regularization increases bias and lowers variance
³ See Hastie/Tibshirani/Friedman Sec. 7.3, Eqn. (7.9) for a detailed derivation
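To make the bullet points above concrete, the following sketch (an illustration with assumed data, not part of the slides) estimates squared bias and variance of polynomial fits at one query point over many independently drawn noisy training sets; lower degrees typically show the higher bias, higher degrees the higher variance.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = lambda x: np.sin(np.pi * x)          # assumed ground-truth function
x_train = np.linspace(-1, 1, 20)
x0 = 0.5                                     # fixed query point

for degree in (1, 3, 9):                     # increasing model complexity
    preds = []
    for _ in range(500):                     # many independent noisy training sets
        y = truth(x_train) + 0.3 * rng.standard_normal(x_train.shape)
        coeffs = np.polyfit(x_train, y, deg=degree)
        preds.append(np.polyval(coeffs, x0))
    preds = np.array(preds)
    bias_sq = (preds.mean() - truth(x0)) ** 2   # squared deviation of the mean prediction
    variance = preds.var()                      # spread of the predictions around their mean
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```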