
Lecture Pattern Analysis

Part 01: Introduction and First Sampling

Christian Riess
IT Security Infrastructures Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
28. April 2025
Pattern Recognition Recap and Unsupervised Learning

• Remember the steps of the classical pattern recognition pipeline:

  (Data) → Sampling → Preprocessing → Feature Extraction → x → Classification: f(x) → y (Class)

• Fundamental ML assumption: good feature representations map similar


objects to similar features
• Classifier training is almost always supervised,
  i.e., a training sample is a tuple (x_i, y_i) (cf. lecture “Pattern Recognition”)
• Unsupervised ML works without labels, i.e., it only operates on inputs (x_i)
• Unsupervised ML can be seen as a representation or summary of a distribution
• So, “classification versus representation” could be a jingle to further distinguish
PR from PA (cf. our discussion in the joint meeting)

C. Riess | Part 01: Introduction and First Sampling 28. April 2025 1
Further Aspects of Interest: Parameters and Hyperparameters

• Every machine learning model has parameters


• For example, linear regression with d parameters β_i predicts y for a
  d-dimensional input x̃ = (1, x_1, ..., x_{d−1})^⊤ via the hyperplane

      y = β^⊤ x̃ = Σ_{i=0}^{d−1} β_i · x̃_i    (1)

• Fewer parameters make the model more robust; more parameters make the
  model more flexible
• To continue the example, consider linear regression on a basis expansion of
  a scalar unknown x, e.g., fitting a degree-d polynomial via the basis vector
  (1, x, x², ..., x^d): larger d enables more complex polynomials (see the sketch below)
• The dimension d is a hyperparameter, i.e., a parameter that somehow
parameterizes the choice of parameters
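
To make the polynomial example concrete, here is a minimal sketch of fitting such a basis expansion by least squares; the toy data, the degree grid, and the use of numpy's Vandermonde matrix are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy observations of a scalar function (toy data, assumed for illustration)
x = np.linspace(-1, 1, 30)
y = np.sin(2.5 * x) + 0.1 * rng.standard_normal(x.shape)

def fit_polynomial(x, y, degree):
    """Least-squares fit of the basis expansion (1, x, x^2, ..., x^degree)."""
    X = np.vander(x, degree + 1, increasing=True)   # design matrix, one column per basis function
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # parameters beta_0, ..., beta_degree
    return beta

def predict(beta, x):
    X = np.vander(x, len(beta), increasing=True)
    return X @ beta

# The degree is the hyperparameter: larger values give a more flexible, less robust model
for degree in (1, 3, 9):
    beta = fit_polynomial(x, y, degree)
    err = np.mean((predict(beta, x) - y) ** 2)
    print(f"degree {degree}: {degree + 1} parameters, training MSE {err:.4f}")
```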
C. Riess | Part 01: Introduction and First Sampling 28. April 2025 2
Further Aspects of Interest: Local Operators and High
Dimensional Spaces

• Thinking about model flexibility: more “local” models are more flexible, but
require more parameters and are less robust
• How can we find a good trade-off? This is the model selection problem

• Another issue: all local models perform poorly in higher-dimensional spaces
• A perhaps surprising consequence is that high-dimensional methods must
  be non-local along some direction

• Summarization methods (e.g., clustering) also perform poorly in
  higher-dimensional spaces

• All these points motivate also looking into dimensionality reduction

C. Riess | Part 01: Introduction and First Sampling 28. April 2025 3
A Study of Distributions

• In PA, we look at data in feature spaces


• To understand and manipulate these data points, they are mathematically
  commonly represented as probability density functions (PDFs)
• Additionally, inference allows us to draw conclusions from distributions

• Common operations on distributions:


• Fitting a distribution model to the data (parametric or non-parametric)
represents the data as a distribution
• Sampling from a distribution creates new data points that follow the
distribution (i.e., they are plausible)
• Factorizing a distribution is a key technique for reducing the complexity

C. Riess | Part 01: Introduction and First Sampling 28. April 2025 4
Recap on Probability Vocabulary

• Let X , Y denote two random variables


• Important vocabulary and equations are:
      Joint distribution                         p(X, Y)
      Conditional distribution of X given Y      p(X | Y)
      Sum rule / marginalization over Y          p(X) = Σ_Y p(X, Y)
      Product rule                               p(X, Y) = p(Y | X) · p(X)
      Bayes rule                                 p(Y | X) = p(X | Y) · p(Y) / p(X)
      Bayes rule in the language of ML           posterior = likelihood · prior / evidence

• Please browse the book by Bishop, Sec. 1.2.3, to refresh your mind if
necessary!
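
As a small refresher, these rules can be checked numerically on a discrete joint distribution; the table values below are made up purely for illustration:

```python
import numpy as np

# Joint distribution p(X, Y) for two discrete random variables,
# rows index X, columns index Y (illustrative values that sum to 1)
p_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

# Sum rule / marginalization: p(X) = sum_Y p(X, Y), p(Y) = sum_X p(X, Y)
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Product rule: p(X, Y) = p(Y | X) * p(X)
p_y_given_x = p_xy / p_x[:, None]
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)

# Bayes rule: p(Y | X) = p(X | Y) * p(Y) / p(X)
p_x_given_y = p_xy / p_y[None, :]
bayes = p_x_given_y * p_y[None, :] / p_x[:, None]
assert np.allclose(bayes, p_y_given_x)

print("p(X) =", p_x, " p(Y) =", p_y)
```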

C. Riess | Part 01: Introduction and First Sampling 28. April 2025 5
Sampling from a PDF

• Oftentimes, it is necessary to draw samples from a PDF


• Example:
• Logistic Regression fits a single regression curve to the data (cf. PR)
• Bayesian Logistic Regression fits a distribution of curves

The distribution is narrow near the observations and wider elsewhere


• Sample curves from the distribution to obtain its spread (“uncertainty”)

• Special PDFs like Gaussians have closed-form solutions for sampling


• We now look at a sampling method that works on arbitrary PDFs
C. Riess | Part 01: Introduction and First Sampling 28. April 2025 6
Idea of the Sampling Algorithm

• The key idea is to use the cumulative distribution function (CDF) P(z) of p(X),

      P(z) = ∫_{−∞}^{z} p(X) dX    (2)

• A value drawn uniformly from the CDF's y-axis intersects P(z) at some location z

• This position z is our random draw from p(X):

  [Figure: the density p(X) over X, and its CDF P(z) with range [0, 1]; a uniform draw on the CDF's vertical axis is mapped through P(z) to a sample z on the horizontal axis]

C. Riess | Part 01: Introduction and First Sampling 28. April 2025 7
Sampling Algorithm

• Split the domain of p(X ) into discrete bins, enumerate the bins
• On these bins, calculate the cumulative distribution function (CDF) P(z) of p(X)
• Draw a uniformly distributed number u between 0 and 1
• The sample from the PDF is

      z* = argmin_z P(z) ≥ u ,    (3)

  i.e., the value where u intersects the CDF

• Note that the range of a CDF is always [0; 1]


• The CDF “warps” the uniform sample into a sample from distribution p(X ).
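
A minimal sketch of this binned inverse-CDF sampling, assuming the target density is only available as values on a regular 1-D bin grid; the discretization and the bin count are arbitrary choices for illustration:

```python
import numpy as np

def sample_from_pdf(pdf_values, bin_edges, n_samples, rng=None):
    """Draw samples from a density given by per-bin values (binned inverse-CDF sampling)."""
    rng = np.random.default_rng() if rng is None else rng
    widths = np.diff(bin_edges)
    probs = pdf_values * widths
    probs = probs / probs.sum()                  # normalize to a discrete distribution over bins
    cdf = np.cumsum(probs)                       # P(z) on the bin grid, range [0, 1]
    u = rng.uniform(0.0, 1.0, size=n_samples)    # uniform draws on the CDF's y-axis
    idx = np.minimum(np.searchsorted(cdf, u), len(probs) - 1)   # smallest bin with P(z) >= u, cf. eq. (3)
    # place each sample uniformly inside its bin so that we do not return only bin edges
    return bin_edges[idx] + rng.uniform(0.0, 1.0, size=n_samples) * widths[idx]

# Example: sample from a standard normal discretized into 200 bins
edges = np.linspace(-4.0, 4.0, 201)
centers = 0.5 * (edges[:-1] + edges[1:])
pdf = np.exp(-0.5 * centers**2) / np.sqrt(2 * np.pi)
samples = sample_from_pdf(pdf, edges, 10_000)
print(samples.mean(), samples.std())             # close to 0 and 1
```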

C. Riess | Part 01: Introduction and First Sampling 28. April 2025 8
Practical Realization and Limitations

• Theoretically, we can split any space into bins


• In practice, however, it is not clear how small or large these bins shall be
• Do empty bins indicate that it is impossible to sample from there? Or do they
indicate a lack of observations?
• Hence, the method requires many observations (as a naive rule-of-thumb,
one can use 30 · b observations for b bins)
• Note that for a fixed bin width, the number of bins grows exponentially with
the dimensionality of the space

• These limitations make the method somewhat unattractive in practice.


We will look at more advanced sampling strategies later

C. Riess | Part 01: Introduction and First Sampling 28. April 2025 9
Lecture Pattern Analysis

Part 02: Non-Parametric Density Estimation

Christian Riess
IT Security Infrastructures Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
28. April 2025
Introduction

• Density Estimation = create a PDF from a set of samples


• The lecture Pattern Recognition introduces parametric density estimation:
• There, a parametric model (e.g., a Gaussian) is fitted to the data
• Maximum Likelihood (ML) estimator:

      θ* = argmax_θ p(x_1, ..., x_N | θ)    (1)

• Maximum a Posteriori (MAP) estimator:

      θ* = argmax_θ p(θ | x_1, ..., x_N) = argmax_θ  p(x_1, ..., x_N | θ) · p(θ) / p(x_1, ..., x_N)    (Bayes)    (2)

• Browse the PR slides if you would like to know more

• Parametric density estimators require a good function representation


• Non-parametric density estimators can operate on arbitrary distributions
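
As a concrete reminder of the parametric case, here is a minimal sketch of the ML estimate (1) for a 1-D Gaussian, whose maximum has the closed-form solution of the sample mean and (biased) sample standard deviation; the toy data are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=1000)   # samples x_1, ..., x_N (toy data)

# ML estimate theta* = (mu, sigma) maximizing p(x_1, ..., x_N | theta) for a Gaussian model
mu_ml = x.mean()
sigma_ml = x.std()          # note: the ML estimate uses 1/N, not the unbiased 1/(N-1)

def gaussian_log_likelihood(x, mu, sigma):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

print(f"mu* = {mu_ml:.3f}, sigma* = {sigma_ml:.3f}, "
      f"log-likelihood = {gaussian_log_likelihood(x, mu_ml, sigma_ml):.1f}")
```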
C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 1
Non-Parametric Density Estimation: Histograms

• Non-parametric estimators do not use functions with a limited set of


parameters
• A simple non-parametric baseline is to create a histogram of samples1
• The number of bins is important to obtain a good fit

• Pro: Good for a quick visualization


• Pro: “Cheap” for many samples in low-dimensional space
• Con: Discontinuities at bin boundaries
• Con: Scales poorly to high dimensions (cf. curse of dimensionality later)
1 See the introduction of Bishop Sec. 2.5
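
A minimal sketch of the histogram baseline in 1-D, assuming toy bimodal data; the number of bins is the hyperparameter discussed above:

```python
import numpy as np

rng = np.random.default_rng(2)
samples = np.concatenate([rng.normal(-2.0, 0.5, 400),      # toy bimodal data
                          rng.normal(1.0, 1.0, 600)])

# density=True normalizes the counts so that the histogram integrates to 1
counts, edges = np.histogram(samples, bins=30, density=True)

def hist_density(x, counts, edges):
    """Evaluate the piecewise-constant histogram estimate at positions x."""
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(counts) - 1)
    inside = (x >= edges[0]) & (x <= edges[-1])
    return np.where(inside, counts[idx], 0.0)               # zero outside the observed range

print(hist_density(np.array([-2.0, 0.0, 5.0]), counts, edges))
```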

C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 2


Improving on the Histogram Approach

• A kernel-based method and a nearest-neighbor method are slightly better


• Both variants share their mathematical framework:
• Let p(x) be a PDF in D-dimensional space, and R a small region around x
  → The probability mass in R is p = ∫_R p(x) dx
• Assumption 1: R contains many points → p is a relative frequency,

      p = K / N = (# points in R) / (total # of points)    (3)

• Assumption 2: R is small enough s.t. p(x) is approximately constant,

      p = ∫_R p(x) dx ≈ p(x) · ∫_R dx = p(x) · V    (4)

• Both assumptions together are slightly contradictory, but they yield

      p(x) = K / (N · V) = (# points in R) / (total # of points · volume of R)    (5)
C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 3
Kernel-based DE: Parzen Window Estimator (1/2)

• The Parzen window estimator fixes V and leaves K /N variable2


• D-dimensional Parzen window kernel function (a.k.a. “box kernel”):

      k(u) = { 1   if |u_i| ≤ 1/2  ∀ i = 1, ..., D
             { 0   otherwise                            (6)

• Calculate K with this kernel function:

      K(x) = Σ_{i=1}^{N} k( (x − x_i) / h )    (7)

  where h is a scaling factor that adjusts the box size


• Hence, the whole density is

      p(x) = (1/N) · Σ_{i=1}^{N} (1/h^D) · k( (x − x_i) / h )    (8)

2 See Bishop Sec. 2.5.1
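
A minimal sketch of eqs. (6)-(8), assuming the samples are given as an (N, D) array; the toy data and the choice of h are illustrative:

```python
import numpy as np

def box_kernel(u):
    """Box kernel k(u): 1 if all coordinates satisfy |u_i| <= 1/2, else 0 (eq. (6))."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_density(x, samples, h):
    """Evaluate p(x) = 1/N * sum_i 1/h^D * k((x - x_i) / h) at the query points x (eq. (8))."""
    x = np.atleast_2d(x)                   # (M, D) query points
    samples = np.atleast_2d(samples)       # (N, D) training samples
    N, D = samples.shape
    u = (x[:, None, :] - samples[None, :, :]) / h      # (M, N, D) scaled differences
    return box_kernel(u).sum(axis=1) / (N * h**D)

# Toy 1-D example; h is the hyperparameter that controls the box size
rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, size=(500, 1))
query = np.array([[0.0], [1.0], [3.0]])
print(parzen_density(query, data, h=0.5))
```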

C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 4


Kernel-based DE: Parzen Window Estimator (2/2)

• The kernel removes much of the discretization error of the fixed-distance


histogram bins, but it still leads to blocky estimates
• Replacing the box kernel by a Gaussian kernel further smooths the result,

      p(x) = (1/N) · Σ_{i=1}^{N} ( 1 / (2π h²) )^{D/2} · exp( −∥x − x_i∥²₂ / (2h²) ) ,    (9)

  where h is the standard deviation of the Gaussian


• Mathematically, any other kernel is also possible if these conditions hold:

      k(u) ≥ 0          (10)

      ∫ k(u) du = 1     (11)
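
A brief 1-D sketch of the Gaussian-kernel estimate of eq. (9), shown next to scipy's gaussian_kde for comparison; note that scipy's bw_method is a factor relative to the data's standard deviation, not h itself, so the two outputs are not expected to match exactly:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(4)
data = np.concatenate([rng.normal(-1.5, 0.4, 300), rng.normal(1.0, 0.8, 700)])

def gauss_kde_manual(x, samples, h):
    """Direct implementation of eq. (9) for 1-D data with standard deviation h."""
    diffs = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * diffs**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

x = np.linspace(-4, 4, 5)
print(gauss_kde_manual(x, data, h=0.3))

# scipy's estimator; a scalar bw_method scales the bandwidth relative to the data's std
kde = gaussian_kde(data, bw_method=0.3)
print(kde(x))
```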

C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 5


K-Nearest Neighbors (k-NN) Density Estimation

• Recall our derived equation for estimating the density,

      p(x) = K / (N · V) = (# points in R) / (total # of points · volume of R)    (12)

• The Parzen window estimator fixes V , and K varies


• The k-Nearest Neighbors estimator fixes K , and V varies
• k-NN calculates V from the distance to the K-th nearest neighbor3

• Note that both the Parzen window estimator and the k-NN estimator are
“non-parametric”, but they are not free of parameters
• The kernel scaling h and the number of neighbors k are hyper-parameters,
i.e., some form of prior knowledge to guide the model creation
• The model parameters are the samples themselves. Both estimators need to
store all samples, which is why they are also called memory methods
3 See Bishop Sec. 5.2.2
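
A minimal sketch of the k-NN estimator from eq. (12), assuming V is the volume of the ball whose radius is the distance to the K-th nearest neighbor; the toy data are illustrative:

```python
import numpy as np
from math import gamma, pi

def knn_density(x, samples, K):
    """Estimate p(x) = K / (N * V), where V is the volume of the smallest ball
    around x that contains the K nearest training samples."""
    x = np.atleast_2d(x)              # (M, D) query points
    samples = np.atleast_2d(samples)  # (N, D) training samples
    N, D = samples.shape
    dists = np.linalg.norm(x[:, None, :] - samples[None, :, :], axis=-1)   # (M, N)
    r = np.sort(dists, axis=1)[:, K - 1]          # distance to the K-th nearest neighbor
    unit_ball = pi**(D / 2) / gamma(D / 2 + 1)    # volume of the D-dimensional unit ball
    V = unit_ball * r**D
    return K / (N * V)

rng = np.random.default_rng(5)
data = rng.normal(0.0, 1.0, size=(1000, 1))
print(knn_density(np.array([[0.0], [2.0]]), data, K=20))
```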

C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 6


First Glance at the Model Selection Problem
• Optimizing the hyperparameters is also called the model selection problem
• Hyperparameters must be optimized on a held-out part of the training data,
  the validation set:
  train on the training data with different hyperparameter sets h_i, evaluate on the
  validation data to get the best-performing set h* via maximum likelihood (ML)
• What if hyperparameters are optimized directly on the training data?
Then the most complex (largest, most flexible) model wins, because it
achieves the lowest training error
• When training data is limited, cross-validation (CV) may be a good
  approximation of the generalization error
• In CV, the data is subdivided into k folds (partitions). Do k training/evaluation runs
  (using each fold once for validation and the rest for training), and select the
  h* with maximum likelihood across all folds (see the sketch below)
• The choice of k is a hyper-hyperparameter that trades off computation time
  against the quality of the predicted error (cf. Hastie et al. Chap. 7)
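
A generic sketch of this k-fold procedure; train and validation_score are hypothetical placeholders for whatever model and quality measure are being tuned:

```python
import numpy as np

def cross_validate(X, hyperparams, train, validation_score, k=5, rng=None):
    """Return the hyperparameter set with the best average validation score over k folds."""
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(len(X)), k)
    best_h, best_score = None, -np.inf
    for h in hyperparams:
        scores = []
        for j in range(k):
            val_idx = folds[j]
            train_idx = np.concatenate([folds[i] for i in range(k) if i != j])
            model = train(X[train_idx], h)                      # fit on k-1 folds
            scores.append(validation_score(model, X[val_idx]))  # evaluate on the held-out fold
        if np.mean(scores) > best_score:
            best_h, best_score = h, np.mean(scores)
    return best_h, best_score
```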
C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 7
Hyperparameter Tuning for Unsupervised Methods

• Unsupervised tasks do not predict labels and hence there is no performance


measure like accuracy
• This also affects model selection: how to decide for model A over model B?
• Hence, we need other ways to calculate a quantitative performance measure.
We will explore several approaches to performance measurement for
unsupervised methods

• One generic approach is to measure the success in a downstream


application of the unsupervised method
• For density estimation, this can be the likelihood that some held-out
observations are drawn from the predicted density
• Hence, the best hyperparameters for density estimation are maximum
likelihood estimates for producing these held-out observations

C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 8


Hyperparameter Tuning for Densities on Limited Data

• Here is a specific instantiation of hyperparameter estimation for density


estimators in a cross-validation setting (i.e., on limited data)
• The trick is to optimize the DE hyperparameters by using the likelihood of
held-out samples as objective:
• Split the N samples x_i from dataset S into J folds:

      S_test^j = { x_{⌊N/J⌋·j}, ..., x_{⌊N/J⌋·(j+1)−1} } ,    S_train^j = S \ S_test^j

• Let α be the unknown hyperparameters, and
  let p_j(x|α) be the density estimate for the samples S_train^j with hyperparameters α.
Then, the ML estimate is

      α* = argmax_α  Π_{j=0}^{J−1}  Π_{x ∈ S_test^j}  p_j(x | α)    (13)

• In practice, take the logarithm (“log likelihood”) to mitigate numerical issues


→ the product becomes a sum
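
A sketch of this scheme for a 1-D Gaussian-kernel density estimate, with the bandwidth h playing the role of α and the summed held-out log-likelihood as objective; the candidate grid and toy data are assumptions:

```python
import numpy as np

def gauss_kde_logpdf(x_eval, x_train, h):
    """Log of the Gaussian-kernel density estimate (eq. (9), D = 1) at x_eval."""
    diffs = (x_eval[:, None] - x_train[None, :]) / h
    dens = np.exp(-0.5 * diffs**2).sum(axis=1) / (len(x_train) * h * np.sqrt(2 * np.pi))
    return np.log(dens + 1e-300)          # guard against log(0) for far-away points

def select_bandwidth(x, candidates, J=5, rng=None):
    """ML hyperparameter estimate (eq. (13)) via the summed held-out log-likelihood."""
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(len(x)), J)
    scores = []
    for h in candidates:
        total = 0.0
        for j in range(J):
            test = x[folds[j]]
            train = x[np.concatenate([folds[i] for i in range(J) if i != j])]
            total += gauss_kde_logpdf(test, train, h).sum()   # the log turns the product into a sum
        scores.append(total)
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(6)
data = np.concatenate([rng.normal(-2, 0.3, 200), rng.normal(1, 1.0, 300)])
print(select_bandwidth(data, candidates=np.array([0.05, 0.1, 0.2, 0.5, 1.0])))
```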

C. Riess | Part 02: Non-Parametric Density Estimation 28. April 2025 9


Lecture Pattern Analysis

Part 03: Bias and Variance

Christian Riess
IT Security Infrastructures Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
28. April 2025
Introduction

• The motivation behind the hyperparameter optimization is to aim for


generalization to new data
• For kernel density estimation, the pitfalls are:
• Too large kernel: covers all space with some probability mass, but the density
is too uniform (does not represent the structure)
• Too small kernel: closely represents the training data, but might assign too low
probabilities in areas without training data
• In contrast, the “optimal”1 kernel size: represents the structure of the training
data and also covers unobserved areas to some extent
• This is an instance of the bias-variance tradeoff2

1 This may sound as if there were a unique minimum, maybe even of a convex function; in practice, there is not that one single best solution, so read this as a somewhat hypothetical statement
2 See the PR lecture or Hastie/Tibshirani/Friedman Sec. 7-7.3 if more details are desired

C. Riess | Part 03: Bias and Variance 28. April 2025 1


Bias and Variance in Regression

• Bias is the square of the average deviation of an estimator from the ground
truth
• Variance is the variance of the estimates, i.e., the expected squared
  deviation from the estimator's mean3
• Informal interpretation:
• High bias indicates model undercomplexity: we obtain a poor fit to the data
• High variance indicates model overcomplexity: the fit models not just
  the structure of the data, but also its noise
• Higher model complexity (= more model parameters) tends toward lower bias and
  higher variance
• We will usually not be able to get bias and variance simultaneously to 0
• Regularization increases bias and lowers variance
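
For reference, the decomposition behind these statements (following Hastie/Tibshirani/Friedman, Eqn. (7.9), see footnote 3) reads:

```latex
% Expected squared prediction error at x_0 for Y = f(X) + eps, Var(eps) = sigma_eps^2:
\operatorname{Err}(x_0)
  = \mathbb{E}\!\left[(Y - \hat f(x_0))^2 \mid X = x_0\right]
  = \underbrace{\sigma_\varepsilon^2}_{\text{irreducible noise}}
  + \underbrace{\bigl(\mathbb{E}[\hat f(x_0)] - f(x_0)\bigr)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\!\bigl[(\hat f(x_0) - \mathbb{E}[\hat f(x_0)])^2\bigr]}_{\text{Variance}}
```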

3 See Hastie/Tibshirani/Friedman Sec. 7.3, Eqn. (7.9), for a detailed derivation

C. Riess | Part 03: Bias and Variance 28. April 2025 2


Sketches for Model Undercomplexity and Overcomplexity

• Note that this example implicitly contains a smoothness assumption


• It does not claim that there is a universally best fit on arbitrary input
distributions (because of the No-Free-Lunch Theorem)
C. Riess | Part 03: Bias and Variance 28. April 2025 3
Transferring Bias and Variance to our Density Estimators

• Our kernel framework can directly replicate these investigations by


retargeting our kernels to regression or classification:
• Regression:
• Estimate f (x) at position x as a kernel-weighted sum of the neighbors or
• as a k -NN mean of k neighbors
• Classification:
• Estimate individual densities for classes c_1 and c_2, evaluate p_c1(x) and p_c2(x),
  and select the class with the higher probability, or
• Select the majority class among the k nearest neighbors
• We will then observe that
• Larger kernel support / larger k increases bias and lowers variance
• Smaller kernel support / smaller k lowers bias and increases variance

• Analogously, we can also apply the notion of bias/variance to our initial
  unsupervised density estimation task (a regression sketch follows below)
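
A brief sketch of the regression retargeting, with a kernel-weighted neighbor average (one common way to normalize the kernel-weighted sum) and a k-NN mean; the Gaussian kernel and the toy data are assumptions for illustration:

```python
import numpy as np

def kernel_regression(x_query, x_train, y_train, h):
    """Estimate f(x) as a kernel-weighted average of the training targets."""
    diffs = (x_query[:, None] - x_train[None, :]) / h
    w = np.exp(-0.5 * diffs**2)                    # Gaussian kernel weights
    return (w * y_train[None, :]).sum(axis=1) / w.sum(axis=1)

def knn_regression(x_query, x_train, y_train, k):
    """Estimate f(x) as the mean target of the k nearest neighbors."""
    idx = np.argsort(np.abs(x_query[:, None] - x_train[None, :]), axis=1)[:, :k]
    return y_train[idx].mean(axis=1)

# Toy data: smaller h / smaller k follow the noise more closely (higher variance),
# larger h / larger k smooth more aggressively (higher bias)
rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + 0.2 * rng.standard_normal(x.shape)
xq = np.linspace(0, 2 * np.pi, 5)
print(kernel_regression(xq, x, y, h=0.3))
print(knn_regression(xq, x, y, k=15))
```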

C. Riess | Part 03: Bias and Variance 28. April 2025 4
