U4 Probability Density Estimation
I. INTRODUCTION
A. What is (probability) density estimation?
To understand what density estimation is, we should first recapitulate what a probability density function (pdf) is: given a random variable X, we can specify the probability density as a function f whose values are the relative likelihoods that the value of the random variable X equals a given sample. So if we would like to know the probability that a sample falls into an interval from a to b, we calculate the area under the graph of the density function f, as given by formula (1):

P(a < X < b) = \int_a^b f(x)\,dx \qquad (1)
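To make formula (1) concrete, here is a minimal sketch (assuming Python with NumPy, and with a standard normal density standing in for f, which in practice is unknown): the integral is approximated numerically and checked against the relative frequency of samples that fall into (a, b).

```python
import numpy as np

def f(x):
    # Standard normal density as a stand-in for the (in practice unknown) pdf f.
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

a, b = -1.0, 2.0

# P(a < X < b): area under f between a and b, here via the trapezoidal rule.
grid = np.linspace(a, b, 10_001)
fx = f(grid)
p_integral = np.sum(0.5 * (fx[1:] + fx[:-1]) * np.diff(grid))

# Sanity check: the relative frequency of samples falling into (a, b)
# should be close to the value of the integral.
rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)
p_empirical = np.mean((samples > a) & (samples < b))

print(f"P({a} < X < {b}) ~ {p_integral:.4f} (integral), {p_empirical:.4f} (frequency)")
```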
The density function f is continuous, nonnegative, and its integral over the whole range of X is one. With density estimation we try to estimate this unknown probability density function [1] from the observed data points X1, ..., Xn. In the following we call this estimated function fˆ.

B. Why do we need (probability) density estimation?

One of the most common uses of pdfs is in the basic investigation of the properties of observed data, like skewness and multimodality. Figure 1 shows the probability density of different heights of a steel surface from a collection of observations. We see that the highest density is at around 35 µm and that the probability that the height of the steel surface lies in the range from 20 µm to 40 µm is quite high. If we calculate the integral over this range we get the exact probability. We can also see that the distribution of the height is skewed: the tail on the left side is longer than the tail on the right side.

In the example of Figure 2 we see the density estimate of a turtle dataset. This dataset contains the directions in which each of the turtles was observed to swim when released [2]. The graph has two maxima, one global maximum at 60 and one local maximum at about 260, which means that most of the turtles swam in the 60 direction and a small group preferred the opposite direction. This multimodality is easily comprehensible from the density estimate.

Fig. 2. Density estimate constructed from the observations of the direction of turtle data. [4]

While we saw that the plotted density estimate of a given dataset is a good starting point for an initial analysis, it is also very useful for presenting results to a client. For this application density estimates are often a good fit because of their comprehensibility to non-mathematicians. Apart from the graphical output, density estimates are also often used as intermediate products for other algorithms and applications, for example classification [5] or discriminant analysis [6].

C. Intuitive approaches to pdf

So how do we get an estimate of the unknown density of our samples X1 to Xn? Well, let's start with the simplest way
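One of the simplest things we can do with the raw samples is to plot each observation on its own; this is the kind of scatterplot that the histogram discussion later refers back to as figure I-C. A minimal sketch, assuming Python with Matplotlib and synthetic data in place of a real dataset:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for observed samples X1, ..., Xn.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(25, 5, 150), rng.normal(45, 8, 50)])

# One-dimensional scatterplot: every observation is drawn as its own point.
# Easy to produce, but hard to read a density off of, because of its high variance.
plt.scatter(samples, np.zeros_like(samples), marker="|", s=200)
plt.yticks([])
plt.xlabel("observed value")
plt.title("Raw samples as a one-dimensional scatterplot")
plt.show()
```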
[Figure: axis labels "Age [years]" and "Density estimate".]
our histogram estimator must differ quite a lot from the true density function. This difference between the expected value of our estimator and the true density function of the data is called bias. In the case of the histogram, one easy way to decrease our bias is to increase our number of bins. But if we maximize our number of bins we end up with a scatterplot as in figure I-C. As we discussed in the Introduction, a scatterplot is difficult to read because of its enormous variance. Therefore we are forced into a difficult tradeoff: if we decrease bias, we increase variance, and vice versa.
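To make this tradeoff concrete, here is a minimal Monte Carlo sketch (assuming Python with NumPy, with a standard normal standing in for the true density rather than the datasets shown in the figures): we repeatedly draw samples, build histogram estimates with few and with many bins, and measure bias and variance of the estimate at a single point.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_density(x):
    # Standard normal pdf, used as the known ground truth.
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def histogram_estimate_at(samples, x0, bins):
    # Histogram density estimate evaluated at the single point x0.
    heights, edges = np.histogram(samples, bins=bins, range=(-4, 4), density=True)
    return heights[np.searchsorted(edges, x0, side="right") - 1]

x0, n, repeats = 0.0, 200, 2000
for bins in (5, 200):
    estimates = np.array([
        histogram_estimate_at(rng.standard_normal(n), x0, bins)
        for _ in range(repeats)
    ])
    bias = estimates.mean() - true_density(x0)
    variance = estimates.var()
    print(f"{bins:4d} bins: bias ~ {bias:+.4f}, variance ~ {variance:.5f}")
```

With 5 wide bins the estimate at the peak is systematically too low (noticeable bias, small variance); with 200 narrow bins it is nearly unbiased but scatters strongly from one sample to the next (small bias, large variance).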
This tradeoff is characteristic for density estimation. Another problem is that the values of adjacent bins can vary extremely simply because of fluctuations in the sample. If we decrease the number of bins (i.e. use wider bins) we can fix this problem, but then we create the problem of oversmoothing our density and thereby losing details in our distribution. One way to solve this is to use adaptive methods, which we'll introduce with the nearest neighbour method.

Fig. 5. Starting position 5, "Population in Afghanistan" 2015 with the large bin-width of 10 years.
A. Multivariate Histograms
The biggest challenge for kernel estimators is varying data density. With the fixed window width of the kernel estimator we often get spurious noise in the tails of the estimate, as in Fig. 10a. If we smooth out the noise in the tails, we also smooth out the detail in the main part of the distribution, as we see in Fig. 10b. But often we would like to preserve the variance in regions with high density and smooth the data in regions with very low density.

There are different adaptive methods to deal with this problem. One method is to use small bandwidths in regions with high density and large bandwidths in regions with low density [3]. This approach is also known as the adaptive kernel estimator. In the next chapter we'll talk about a different approach to this smoothing problem, called the (k-th) nearest neighbor estimator.

Nevertheless, the kernel density estimator is, apart from the histogram, probably the most commonly used estimator and certainly the most studied mathematically [1].
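To make the bandwidth adaptation concrete, here is a minimal sketch (assuming Python with NumPy and a Gaussian kernel; it uses the common pilot-estimate construction with local bandwidth factors, which is one way to realise the adaptive idea, not necessarily the exact variant behind Fig. 10):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde_fixed(x, samples, h):
    # Ordinary kernel density estimate with one global bandwidth h.
    u = (x[:, None] - samples[None, :]) / h
    return gaussian_kernel(u).mean(axis=1) / h

def kde_adaptive(x, samples, h, alpha=0.5):
    # Adaptive kernel estimate: each sample i gets its own bandwidth h * lam[i],
    # small where a pilot estimate of the density is high, large in sparse regions.
    pilot = kde_fixed(samples, samples, h)
    g = np.exp(np.mean(np.log(pilot)))          # geometric mean of the pilot values
    lam = (pilot / g) ** (-alpha)               # local bandwidth factors
    u = (x[:, None] - samples[None, :]) / (h * lam[None, :])
    return (gaussian_kernel(u) / (h * lam[None, :])).mean(axis=1)

# Synthetic data: a dense main part plus a wide, sparse tail component.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(0, 1, 400), rng.normal(6, 3, 100)])
grid = np.linspace(-5, 15, 401)
f_fixed = kde_fixed(grid, samples, h=0.4)        # noisy in the sparse tail
f_adaptive = kde_adaptive(grid, samples, h=0.4)  # smoother tail, detail kept in the core
```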
C. Nearest Neighbor Estimator (NNE)

The nearest neighbor estimator is proportional to the distance to the k-th nearest sample, instead of being based on the number of samples falling into a bin with fixed width.
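The definition the text refers to as (9) is not reproduced in this excerpt; a common form of the estimator (see e.g. [1]) is fˆ(t) = k / (2 n d_k(t)), where d_k(t) is the distance from t to its k-th nearest sample. A minimal sketch, assuming Python with NumPy:

```python
import numpy as np

def nne(t, samples, k):
    # k-th nearest neighbour estimate: fhat(t) = k / (2 * n * d_k(t)),
    # where d_k(t) is the distance from t to its k-th nearest sample.
    t = np.atleast_1d(t).astype(float)
    n = samples.size
    dists = np.abs(t[:, None] - samples[None, :])   # |t - X_i| for every pair
    d_k = np.sort(dists, axis=1)[:, k - 1]          # distance to the k-th nearest sample
    return k / (2.0 * n * d_k)

rng = np.random.default_rng(0)
samples = rng.standard_normal(500)
grid = np.linspace(-4.0, 4.0, 9)
print(nne(grid, samples, k=20))   # estimates on a small grid of evaluation points
```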
Because the distance d_k(t) in the tails is larger than in the dense part of the distribution, the estimate is smaller there and thereby smooths adaptively to the underlying density.

While we see from definition (9) that fˆ is continuous, its derivative at the positions of d_k is discontinuous. A major difference to the KDE is that the nearest neighbor estimate doesn't integrate to one, because the tails of fˆ decrease very slowly. So if we're only interested in a smaller part of the data the NNE is fine, but if we're interested in the entire dataset we'd be better off with a different estimator.

IV. CONCLUSION AND PROSPECT

Nonparametric density estimation has the huge advantage that we can make less rigid assumptions about the underlying data. But we still have to choose a density estimator. The histogram is the most common pick for visualising results for a client or getting a first impression of the data. If a more advanced estimator is needed, for example if we need derivatives of our estimate, the kernel density estimator is often a good fit.

The k-th nearest neighbour estimator is not that common for density estimation, but in cases with spurious noise in the tails, where the KDE has problems finding a good smoothing parameter, it can still be a good fit. The idea of the k-th nearest neighbour, on the other hand, is very common for classification problems [2]. The MISE and the bias and variance tradeoff are also very useful in other disciplines, for example in signal processing.
REFERENCES
[1] B. Silverman, Density Estimation for Statistics and Data Analysis, ser. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 1986. [Online]. Available: https://books.google.de/books?id=e-xsrjsL7WkC
[2] E. Fix and J. L. Hodges Jr., "Discriminatory analysis - nonparametric discrimination: consistency properties," DTIC Document, Tech. Rep., 1951.
[3] D. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, ser. Wiley Series in Probability and Statistics. Wiley, 2009. [Online]. Available: https://books.google.de/books?id=wdc8Xme_FfkC
[4] B. Silverman, "Choosing the window width when estimating a density," Biometrika, vol. 65, no. 1, pp. 1-11, 1978.
[5] M. Kobos and J. Mandziuk, "Classification based on combination of kernel density estimators," Artificial Neural Networks - ICANN 2009, pp. 125-134, 2009.
[6] X. Qiu and L. Wu, "Nearest neighbor discriminant analysis," International Journal of Pattern Recognition and Artificial Intelligence, 2006.
[7] "Afghanistan: Standard demographic and health surveys, 2015," 2015.
[8] R. Bellman, Adaptive Control Processes. Princeton, NJ: Princeton University Press, 1961.