Machine Learning for Signal Processing
Data Science, Algorithms, and Computational
Statistics
Max A. Little
Great Clarendon Street, Oxford, OX2 6DP,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Max A. Little 2019
The moral rights of the author have been asserted
First Edition published in 2019
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above
You must not circulate this work in any other form
and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2019944777
ISBN 978–0–19–871493–4
DOI: 10.1093/oso/9780198714934.001.0001
Printed and bound by
CPI Group (UK) Ltd, Croydon, CR0 4YY
Preface
Digital signal processing (DSP) is one of the ‘foundational’, but somewhat invisible, engineering topics of the modern world, without which many of the technologies we take for granted (the digital telephone, digital radio, television, CD and MP3 players, WiFi, radar, to name just a few) would not be possible. A relative newcomer by comparison,
statistical machine learning is the theoretical backbone of exciting tech-
nologies that are by now starting to reach a level of ubiquity, such as
automatic techniques for car registration plate recognition, speech recog-
nition, stock market prediction, defect detection on assembly lines, robot
guidance and autonomous car navigation. Statistical machine learning
has origins in the recent merging of classical probability and statistics
with artificial intelligence, which exploits the analogy between intelligent
information processing in biological brains and sophisticated statistical
modelling and inference.
DSP and statistical machine learning are of such wide importance to
the knowledge economy that both have undergone rapid changes and
seen radical improvements in scope and applicability. Both DSP and
statistical machine learning make use of key topics in applied mathe-
matics such as probability and statistics, algebra, calculus, graphs and
networks. Therefore, intimate formal links between the two subjects
exist and because of this, an emerging consensus view is that DSP and
statistical machine learning should not be seen as separate subjects. The
many overlaps that exist between the two subjects can be exploited to
produce new digital signal processing tools of surprising utility and effi-
ciency, and wide applicability, highly suited to the contemporary world
of pervasive digital sensors and high-powered and yet cheap, comput-
ing hardware. This book gives a solid mathematical foundation to the
topic of statistical machine learning for signal processing, including the
contemporary concepts of the probabilistic graphical model (PGM) and
nonparametric Bayes, concepts which have only more recently emerged
as important for solving DSP problems.
The book is aimed at advanced undergraduates or first-year PhD stu-
dents as well as researchers and practitioners. It addresses the founda-
tional mathematical concepts, illustrated with pertinent and practical
examples across a range of problems in engineering and science. The aim
is to enable students with an undergraduate background in mathemat-
ics, statistics or physics, from a wide range of quantitative disciplines,
to get quickly up to speed with the latest techniques and concepts in
this fast-moving field. The accompanying software will enable readers
to test out the techniques on their own signal analysis problems. The
presentation of the mathematics is much along the lines of a standard
undergraduate physics or statistics textbook, free of distracting techni-
cal complexities and jargon, while not sacrificing rigour. It would be an
excellent textbook for emerging courses in machine learning for signals.
Contents
Preface v
List of Figures xv
1 Mathematical foundations 1
1.1 Abstract algebras 1
Groups 1
Rings 3
1.2 Metrics 4
1.3 Vector spaces 5
Linear operators 7
Matrix algebra 7
Square and invertible matrices 8
Eigenvalues and eigenvectors 9
Special matrices 10
1.4 Probability and stochastic processes 12
Sample spaces, events, measures and distributions 12
Joint random variables: independence, conditionals, and
marginals 14
Bayes’ rule 16
Expectation, generating functions and characteristic func-
tions 17
Empirical distribution function and sample expectations 19
Transforming random variables 20
Multivariate Gaussian and other limiting distributions 21
Stochastic processes 23
Markov chains 25
1.5 Data compression and information
theory 28
The importance of the information map 31
Mutual information and Kullback-Leibler (K-L)
divergence 32
1.6 Graphs 34
Special graphs 35
1.7 Convexity 36
1.8 Computational complexity 37
Complexity order classes and big-O notation 38
2 Optimization 41
2.1 Preliminaries 41
Continuous differentiable problems and critical
points 41
Continuous optimization under equality constraints: La-
grange multipliers 42
Inequality constraints: duality and the Karush-Kuhn-Tucker
conditions 44
Convergence and convergence rates for iterative
methods 45
Non-differentiable continuous problems 46
Discrete (combinatorial) optimization problems 47
2.2 Analytical methods for continuous convex problems 48
L2 -norm objective functions 49
Mixed L2 -L1 norm objective functions 50
2.3 Numerical methods for continuous convex problems 51
Iteratively reweighted least squares (IRLS) 51
Gradient descent 53
Adapting the step sizes: line search 54
Newton’s method 56
Other gradient descent methods 58
2.4 Non-differentiable continuous convex problems 59
Linear programming 59
Quadratic programming 60
Subgradient methods 60
Primal-dual interior-point methods 62
Path-following methods 64
2.5 Continuous non-convex problems 65
2.6 Heuristics for discrete (combinatorial) optimization 66
Greedy search 67
(Simple) tabu search 67
Simulated annealing 68
Random restarting 69
3 Random sampling 71
3.1 Generating (uniform) random numbers 71
3.2 Sampling from continuous distributions 72
Quantile function (inverse CDF) and inverse transform
sampling 72
Random variable transformation methods 74
Rejection sampling 74
Adaptive rejection sampling (ARS) for log-concave densities 75
Special methods for particular distributions 78
3.3 Sampling from discrete distributions 79
Inverse transform sampling by sequential search 79
Bibliography 345
Index 353
1 Mathematical foundations

1.1 Abstract algebras

Groups
An algebra is a structure that defines the rules of what happens when
pairs of elements of a set are acted upon by operations. A kind of algebra known as a group is exemplified by (+, R), the usual notion of addition acting on pairs of real numbers. It is a group because it has an identity, the number zero (when zero is added to any number it remains unchanged, i.e. a + 0 = 0 + a = a), and every element in the set has an inverse (for any number a, there is an inverse −a, which means that a + (−a) = 0). Finally,
the operation is associative, which is to say that when operating on three
or more numbers, addition does not depend on the order in which the
numbers are added (i.e. a + (b + c) = (a + b) + c). Addition also has
the intuitive property that a + b = b + a, i.e. it does not matter if the
numbers are swapped: the operator is called commutative, and the group
is then called an Abelian group. Mirroring addition is multiplication
acting on the set of real numbers with zero removed (×, R − {0}), which
is also an Abelian group. The identity element is 1, and the inverse of any element a is its reciprocal 1/a.
Rings
Whilst groups deal with one operation on a set of numbers, rings are a
slightly more complex structure that often arises when two operations
are applied to the same set. The most immediately tangible example
is the operations of addition and multiplication on the set of integers Z
(the positive and negative whole numbers with zero). Using the defini-
tion above, the set of integers under addition forms an Abelian group,
whereas under multiplication the integers form a simple structure known
as a monoid – a group without inverses. Multiplication with the integers
is associative, and there is an identity (the positive number one), but the
multiplicative inverses are not integers (they are fractions such as 1/2,
−1/5 etc.) Finally, in combination, integer multiplication distributes
over integer addition: a × (b + c) = a × b + a × c = (b + c) × a. These
properties define a ring: it has one operation that together with the set
forms an Abelian group, and another operation that, with the set, forms
a monoid, and the second operation distributes over the first. As with
integers, the set of real numbers under the usual addition and multiplication also has the structure of a ring. Another very important example is the set of square matrices all of size N × N with real elements under normal matrix addition and multiplication. Here the multiplicative identity element is the identity matrix of size N × N, and the additive identity element is the same size square matrix with all zero elements.

[Fig. 1.3: The table for the symmetry group V4 = (◦, {e, h, v, r}) (top), and the group M8 = (×8, {1, 3, 5, 7}) (bottom), showing the isomorphism between them obtained by mapping e ↦ 1, h ↦ 3, v ↦ 5 and r ↦ 7.]
Rings are powerful structures that can lead to very substantial com-
putational savings for many statistical machine learning and signal pro-
cessing problems. For example, if we remove the condition that the
additive operation must have inverses, then we have a pair of monoids
that are distributive. This structure is known as a semiring or semifield
and it turns out that the existence of this structure in many machine
learning and signal processing problems makes these otherwise computa-
tionally intractable problems feasible. For example, the classical Viterbi
algorithm for determining the most likely sequence of hidden states in a
Hidden Markov Model (HMM) is an application of the max-sum semifield
on the dynamic Bayesian network that defines the stochastic dependen-
cies in the model.
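To make the semiring idea concrete, the following short Python sketch (not taken from the book; it assumes numpy, and the transition, emission and initial log-probability arrays are hypothetical inputs) writes the Viterbi recursion as repeated max-sum steps over log-domain scores.

import numpy as np

def viterbi_max_sum(log_init, log_trans, log_emit, observations):
    """Illustrative max-sum (Viterbi) decoding.
    log_init: (K,) log p(z_1); log_trans: (K, K) with entry [i, j] = log p(z_t = j | z_{t-1} = i);
    log_emit: (K, M) log p(x_t | z_t); observations: sequence of integer symbol indices."""
    K = log_init.shape[0]
    T = len(observations)
    delta = np.zeros((T, K))           # best log-score of any path ending in each state at time t
    psi = np.zeros((T, K), dtype=int)  # back-pointers
    delta[0] = log_init + log_emit[:, observations[0]]
    for t in range(1, T):
        # semiring step: "sum" (max) over previous states of the "product" (+) of scores
        scores = delta[t - 1][:, None] + log_trans
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + log_emit[:, observations[t]]
    # backtrack the most likely state sequence
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

Replacing the max/argmax step with a log-sum-exp gives the sum-product semiring and hence the forward algorithm on the same graph, which is the computational point about shared semiring structure.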
Both Dummit and Foote (2004) and Rotman (2000) contain detailed
introductions to abstract algebra including groups and rings.
1.2 Metrics
Distance is a fundamental concept in mathematics. Distance functions
play a key role in machine learning and signal processing, particularly
as measures of similarity between objects, for example, digital signals
encoded as items of a set. We will also see that a statistical model often
implies the use of a particular measure of distance, and this measure
determines the properties of statistical inferences that can be made.
A geometry is obtained by attaching a notion of distance to a set: it becomes a metric space. A metric takes two points in the set and returns a single (usually real) value representing the distance between them. A metric must have the following properties to satisfy intuitive notions of distance:

(1) Non-negativity: d(x, y) ≥ 0,
(2) Symmetry: d(x, y) = d(y, x),
(3) Coincidence: d(x, x) = 0, and
(4) Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).

Respectively, these requirements are that (1) distance cannot be negative, (2) the distance going from x to y is the same as that from y to x, (3) only points lying on top of each other have zero distance between them, and (4) the length of any one side of a triangle defined by three points cannot be greater than the sum of the length of the other two sides. For example, the Euclidean metric on a D-dimensional set is:

d(\mathbf{x}, \mathbf{y}) = \sqrt{ \sum_{i=1}^{D} (x_i - y_i)^2 }    (1.1)
\| \mathbf{u} \|_p = \left( \sum_{i=1}^{N} |u_i|^p \right)^{1/p}    (1.4)
(1) Non-negativity: u · u ≥ 0,
(2) Symmetry: u · v = v · u, and
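As a quick numerical illustration (not from the book; it assumes numpy, and the vectors are made up), the Euclidean metric (1.1) and the general p-norm (1.4) can be computed directly:

import numpy as np

def euclidean_metric(x, y):
    # eq. (1.1): square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def lp_norm(u, p):
    # eq. (1.4): the Lp norm of a vector
    return np.sum(np.abs(u) ** p) ** (1.0 / p)

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 3.0])
print(euclidean_metric(x, y))    # 5.0
print(lp_norm(x - y, 2))         # same value: the L2 norm of the difference
print(lp_norm(x - y, 1))         # the L1 (city-block) distance: 7.0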
Linear operators
A linear operator or map acts on vectors to create other vectors and, while doing so, preserves the operations of vector addition and scalar multiplication. Linear operators are homomorphisms between vector spaces. Linear operators are fundamental to classical digital signal processing and statistics, and so find heavy use in machine learning. Linear operators L have the linear combination property:

L[\alpha_1 \mathbf{u}_1 + \alpha_2 \mathbf{u}_2 + \cdots + \alpha_N \mathbf{u}_N] = \alpha_1 L[\mathbf{u}_1] + \alpha_2 L[\mathbf{u}_2] + \cdots + \alpha_N L[\mathbf{u}_N]    (1.9)

What this says is that the operator commutes with scalar multiplication and vector addition: we get the same result if we first scale, then add the vectors, and then apply the operator to the result, or, apply the operator to each vector, then scale them, and then add up the results (Figure 1.6).

[Fig. 1.6: A ‘flow diagram’ depicting linear operators. All linear operators share the property that the operator L applied to the scaled sum of (two or more) vectors α1 u1 + α2 u2 (top panel) is the same as the scaled sum of the same operator applied to each of these vectors first (bottom panel). In other words, it does not matter whether the operator is applied before or after the scaled sum.]
Matrices (which we discuss next), differentiation and integration, and
expectation in probability are all examples of linear operators. The
linearity of integration and differentiation are standard rules which can
be derived from the basic definitions. Linear maps in two-dimensional
space have a nice geometric interpretation: straight lines in the vector
space are mapped onto other straight lines (or onto a point if they are
degenerate maps). This idea extends to higher dimensional vector spaces
in the natural way.
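A small numerical check of the linear combination property (1.9) for a matrix treated as a linear operator; this is an illustrative sketch only (numpy assumed, with arbitrary vectors and weights):

import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(3, 3))          # an arbitrary linear map on R^3
u1, u2 = rng.normal(size=3), rng.normal(size=3)
a1, a2 = 2.5, -1.3

lhs = L @ (a1 * u1 + a2 * u2)        # operator applied to the scaled sum
rhs = a1 * (L @ u1) + a2 * (L @ u2)  # scaled sum of the operator applied to each vector
print(np.allclose(lhs, rhs))         # True, up to floating-point rounding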
Matrix algebra
When vectors are ‘stacked together’ they form a powerful structure
gether in any order. A square matrix A with all zero elements except for the main diagonal, i.e. a_{ij} = 0 unless i = j, is called a diagonal matrix. A special diagonal matrix, I, is the identity matrix where the main diagonal entries a_{ii} = 1. (The N × N identity matrix is denoted I_N, or simply I when the context is clear and the size can be omitted.) Then, if the equality AB = I = BA holds, the matrix B must be well-defined (and unique) and it is the inverse of A, i.e. B = A^{-1}. We then say that A is invertible; if it is not invertible then it is degenerate or singular.
There are many equivalent conditions for matrix invertibility, for ex-
ample, the only solution to the equation Ax = 0 is the vector x = 0
or the columns of A are linearly independent. But one particularly
important way to test the invertibility of a matrix is to calculate the
determinant |A|: if the matrix is singular, the determinant is zero. It
follows that all invertible matrices have |A| ≠ 0. The determinant calculation is quite elaborate for a general square matrix; formulas exist, but
geometric intuition helps to understand these calculations: when a linear
map defined by a matrix acts on a geometric object in vector space with
a certain volume, the determinant is the scaling factor of the mapping.
Volumes under the action of the map are scaled by the magnitude of
the determinant. If the determinant is negative, the orientation of any
geometric object is reversed. Therefore, invertible transformations are
those that do not collapse the volume of any object in the vector space
to zero (Figure 1.7).
Another matrix operator which finds significant use is the trace tr(A) of a square matrix: this is just the sum of the diagonal elements, i.e. tr(A) = \sum_{i=1}^{N} a_{ii}. The trace is invariant to addition, tr(A + B) = tr(A) + tr(B), transposition, tr(A^T) = tr(A), and multiplication order, tr(AB) = tr(BA). With products of three or more matrices the trace is invariant to cyclic permutations; with three matrices: tr(ABC) = tr(CAB) = tr(BCA).
Eigenvalues and eigenvectors

A\mathbf{v} = \lambda \mathbf{v}    (1.12)

Any non-zero N × 1 vector v which solves this equation is known as an eigenvector of A, and the scalar value λ is known as the associated eigenvalue. Eigenvectors are not unique: they can be multiplied by any non-zero scalar and still remain eigenvectors with the same eigenvalues. Thus, often the unit length eigenvectors are sought as the solutions to (1.12).

It should be noted that (1.12) arises for vector spaces in general, e.g. linear operators. An important example occurs in the vector space of functions f(x) with the differential operator L = d/dx. Here, the corresponding eigenvalue problem is the differential equation L[f(x)] = λf(x), for which the solution is f(x) = a e^{λx} for any (non-zero) scalar value a. This is known as an eigenfunction of the differential operator L.

[Fig. 1.8: An example of diagonalizing a matrix. The diagonalizable square matrix A has diagonal matrix D containing the eigenvalues, and transformation matrix P containing the eigenbasis, so A = PDP^{-1}. A maps the rotated square (top) to the rectangle in the same orientation (at left). This is equivalent to first ‘unrotating’ the square (the effect of P^{-1}) such that it is aligned with the co-ordinate axes, then stretching/compressing the square along each axis (the effect of D), and finally rotating back to the original orientation (the effect of P).]
If they exist, the eigenvectors and eigenvalues of a square matrix A can
be found by obtaining all scalar values λ such that |(A − λI)| = 0. This
holds because Av − λv = 0 if and only if |(A − λI)| = 0. Expanding out
this determinant equation leads to an N -th order polynomial equation
in λ, namely aN λN + aN −1 λN −1 + · · · + a0 = 0, and the roots of this
equation are the eigenvalues.
This polynomial is known as the characteristic polynomial for A and
determines the existence of a set of eigenvectors that is also a basis for
the space, in the following way. The fundamental theorem of algebra
states that this polynomial has exactly N roots, but some may be re-
peated (i.e. occur more than once). If there are no repeated roots of the
characteristic polynomial, then the eigenvalues are all distinct, and so
there are N eigenvectors which are all linearly independent. This means
that they form a basis for the vector space, which is the eigenbasis for
the matrix.
Not all matrices have an eigenbasis. However matrices that do are
also diagonalizable, that is, they have the same geometric effect as a
diagonal matrix, but in a different basis other than the standard one.
This basis can be found by solving the eigenvalue problem. Placing all
the eigenvectors into the columns of a matrix P and all the corresponding
eigenvalues into a diagonal matrix D, then the matrix can be rewritten:
A = PDP−1 (1.13)
See Figure 1.8. A diagonal matrix simply scales all the coordinates of
the space by a different, fixed amount. They are very simple to deal with,
and have important applications in signal processing and machine learn-
ing. For example, the Gaussian distribution over multiple variables, one
of the most important distributions in practical applications, encodes
the probabilistic relationship between each variable in the problem with
the covariance matrix. By diagonalizing this matrix, one can find a lin-
ear mapping which makes all the variables statistically independent of
each other: this dramatically simplifies many subsequent calculations.
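The following hedged numpy sketch (not from the book; the matrix and sample size are arbitrary) illustrates both points: reconstructing a matrix from its eigendecomposition A = PDP^{-1}, and rotating correlated Gaussian data into the eigenbasis of its estimated covariance so that the transformed coordinates are approximately uncorrelated.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
evals, P = np.linalg.eig(A)                        # columns of P are eigenvectors
D = np.diag(evals)
print(np.allclose(A, P @ D @ np.linalg.inv(P)))    # True: A = P D P^{-1}

# Decorrelation: project correlated Gaussian samples onto the eigenbasis of
# their sample covariance; the new covariance is approximately diagonal.
rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0], cov=A, size=50_000)
evals_c, P_c = np.linalg.eigh(np.cov(X.T))         # eigh: for the symmetric covariance
Z = X @ P_c
print(np.round(np.cov(Z.T), 2))                    # approximately diagonal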
Despite the central importance of the eigenvectors and eigenvalues of a linear problem, it is generally not possible to find all the eigenvalues by
analytical calculation. Therefore one generally turns to iterative numer-
ical algorithms to obtain an answer to a certain precision.
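One classical iterative method is power iteration, which approximates the dominant eigenvalue and its eigenvector; the sketch below is illustrative only (numpy assumed, arbitrary test matrix) and is not an algorithm given in the book.

import numpy as np

def power_iteration(A, num_iters=200):
    # repeatedly apply A and renormalize; the iterate converges towards the
    # eigenvector of the eigenvalue with largest magnitude (under mild conditions)
    v = np.random.default_rng(0).normal(size=A.shape[0])
    for _ in range(num_iters):
        v = A @ v
        v /= np.linalg.norm(v)
    lam = v @ A @ v                      # Rayleigh quotient estimate of the eigenvalue
    return lam, v

A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, v = power_iteration(A)
print(lam, np.linalg.eigvalsh(A).max())  # both close to the largest eigenvalue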
Special matrices
Beyond what has been already discussed, there is not that much more to
be said about general matrices which have N × M degrees of freedom.
Special matrices with fewer degrees of freedom have very interesting
properties and occur frequently in practice.
Some of the most interesting special matrices are symmetric matrices
with real entries – self-transpose and so square by definition, i.e. A^T = A. These matrices are always diagonalizable, and have an orthogonal eigenbasis. The eigenvalues are always real. If the inverse exists, it is also symmetric.
where f(u, v) is the joint PDF. The sample space is now the plane \mathbb{R}^2, and so in order to satisfy the unit measure axiom, it must be that \int_{\mathbb{R}} \int_{\mathbb{R}} f(u, v) \, du \, dv = 1. The probability of any region A of \mathbb{R}^2 is then the multiple integral over that region: P(A) = \int_A f(u, v) \, du \, dv.

The corresponding discrete case has that P(A \times B) = \sum_{a \in X(A)} \sum_{b \in Y(B)} f(a, b) for any product of events where A \in \Omega_X, B \in \Omega_Y, and \Omega_X, \Omega_Y are the sample spaces of X and Y respectively, and f(a, b) is the joint PMF. The joint PMF must sum to one over the whole product sample space: \sum_{a \in X(\Omega_X)} \sum_{b \in Y(\Omega_Y)} f(a, b) = 1.
More general joint events over N variables are defined similarly and
associated with multiple CDFs, PDFs and PMFs, e.g.
f_{X_1 X_2 \ldots X_N}(x_1, x_2, \ldots, x_N) and, when the context is clear from the ar-
guments of the function, we drop the subscript in the name of the
function for notational simplicity. This naturally allows us to define
distribution functions over vectors of random variables, e.g. f (x) for
X = (X_1, X_2, \ldots, X_N)^T where typically, each element of the vector comes
from the same sample space.
Given the joint PMF/PDF, we can always ‘remove’ one or more of
the variables in the joint set by integrating out this variable, e.g.:
f(x_1, x_3, \ldots, x_N) = \int_{\mathbb{R}} f(x_1, x_2, x_3, \ldots, x_N) \, dx_2    (1.23)
This computation is known as marginalization.
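In the discrete case, marginalization is simply summing the joint PMF over the unwanted variable; for PDFs the sum becomes the integral in (1.23). A minimal illustration (not from the book; the joint PMF values are made up, numpy assumed):

import numpy as np

f_xy = np.array([[0.10, 0.20, 0.10],    # rows index x, columns index y
                 [0.05, 0.25, 0.30]])
assert np.isclose(f_xy.sum(), 1.0)       # unit measure over the joint sample space

f_x = f_xy.sum(axis=1)                   # marginal over x: sum out y
f_y = f_xy.sum(axis=0)                   # marginal over y: sum out x
print(f_x, f_y)                          # [0.4 0.6] and [0.15 0.45 0.4]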
When considering joint events, we can perform calculations about the
conditional probability of one event occurring, when another has already
occurred (or is otherwise fixed). This conditional probability is written
using the bar notation P ( X = x| Y = y): described as the ‘probability
that the random variable X = x, given that Y = y’. For PMFs and
PDFs we will shorten this to f ( x| y). This probability can be calculated
from the joint and single distributions of the conditioning variable:
f(x|y) = \frac{f(x, y)}{f(y)}    (1.24)
In effect, the conditional PMF/PDF is what we obtain from restricting
the joint sample space to the set for which Y = y, and calculating the
measure of the intersection of the joint sample space for any chosen x.
The division by f(y) ensures that the conditional distribution is itself a
normalized measure on this restricted sample space, as we can show by
marginalizing out X from the right hand side of the above equation.
If the distribution of X does not depend upon Y , we say that X
is independent of Y . In this case f ( x| y) = f (x). This implies that
f (x, y) = f (x) f (y), i.e. the joint distribution over X, Y factorizes into
a product of the marginal distributions over X, Y . Independence is a
central topic in statistical DSP and machine learning because whenever
two or more variables are independent, this can lead to very significant
simplifications that, in some cases, make the difference between a problem being tractable or not. In fact, it is widely recognized these days that
the main goal of statistical machine learning is to find good factorizations
of the joint distribution over all the random variables of a problem.
Bayes’ rule
If we have the distribution function of a random variable conditioned on
another, is it possible to swap the role of conditioned and conditioning
variables? The answer is yes: provided that we have all the marginal
distributions. This leads us into the territory of Bayesian reasoning.
The calculus is straightforward, but the consequences are of profound
importance to statistical DSP and machine learning. We will illustrate
the concepts using continuous random variables, but the principles are
general and apply to random variables over any sample space. Suppose
we have two random variables X, Y and we know the conditional dis-
tribution of X given Y , then the conditional distribution of Y on X
is:
f(y|x) = \frac{f(x|y) \, f(y)}{f(x)}    (1.25)
This is known as Bayes’ rule. In the Bayesian formalism, f ( x| y) is
known as the likelihood, f (y) is known as the prior, f (x) is the evidence
and f ( y| x) is the posterior.
Often, we do not know the distribution over X; but since the nu-
merator in Bayes’ rule is the joint probability of X and Y , this can be
obtained by marginalizing out Y from the numerator:
f(y|x) = \frac{f(x|y) \, f(y)}{\int_{\mathbb{R}} f(x|y) \, f(y) \, dy}    (1.26)
This form of Bayes’ rule is ubiquitous because it allows calculation of
the posterior knowing only the likelihood and the prior.
Unfortunately, one of the hardest and most computationally intractable
problems in applying the Bayesian formalism arises when attempting
to evaluate integrals over many variables to calculate the posterior in
(1.26). Fortunately however, there are common situations in which it is
not necessary to know the evidence probability. A third restatement of
Bayes’ rule makes it clear that the evidence probability can be consid-
ered a ‘normalizer’ for the posterior, ensuring that the posterior satisfies
the unit measure property:
f(y|x) \propto f(x|y) \, f(y)    (1.27)
This form is very commonly encountered in many statistical inference
problems in machine learning. For example, when we wish to know
the value of a parameter or random variable given some data which
maximizes the posterior, and the evidence probability is independent of
this variable or parameter, then we can exclude the evidence probability
from the calculations.
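A minimal numerical illustration of the proportional form (1.27) for a discrete variable (not from the book; the prior and likelihood values are made up): the posterior is obtained by normalizing the product of likelihood and prior, so the evidence is never computed explicitly.

import numpy as np

prior = np.array([0.7, 0.3])           # f(y) over two hypotheses y in {0, 1}
likelihood = np.array([0.2, 0.9])      # f(x | y) for the observed data x
unnormalized = likelihood * prior      # f(x | y) f(y), proportional to the posterior
posterior = unnormalized / unnormalized.sum()
print(posterior)                       # approximately [0.341, 0.659]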
The real variable s becomes the new independent variable replacing the
discrete variable x. When the sum (1.35) converges absolutely, then the
MGF exists and can be used to find all the moments for the distribution
of X:
E\left[X^k\right] = \frac{d^k M}{ds^k}(0)    (1.36)
This can be shown to follow from the series expansion of the exponential
function. Using the Bernoulli example above, the MGF is M (s) =
1 − p + p exp (s). Often, the distribution of a random variable has a
simple form under the MGF that makes the task of manipulating random
variables relatively easy. For example, given a linear combination of
independent random variables:
X_N = \sum_{n=1}^{N} a_n X_n    (1.37)
it is not a trivial matter to calculate the distribution of XN . However,
the MGF of the sum is just:
M_{X_N}(s) = \prod_{n=1}^{N} M_{X_n}(a_n s)    (1.38)
We can use this to show that for a linear combination (1.37) of independent Gaussian random variables with mean \mu_n and variance \sigma_n^2, the CF of the sum is:

\psi_{X_N}(s) = \exp\left( i s \sum_{n=1}^{N} a_n \mu_n - \frac{1}{2} s^2 \sum_{n=1}^{N} a_n^2 \sigma_n^2 \right)    (1.43)

which can be recognized as another Gaussian with mean \sum_{n=1}^{N} a_n \mu_n and variance \sum_{n=1}^{N} a_n^2 \sigma_n^2. This shows that the Gaussian is invariant to linear transformations, a property known as (statistical) stability, which is of fundamental importance in classical statistical DSP.
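A simple Monte Carlo check of this stability property (illustrative only; the coefficients, means and variances below are arbitrary, numpy assumed):

import numpy as np

rng = np.random.default_rng(2)
a = np.array([0.5, -1.0, 2.0])
mu = np.array([1.0, 0.0, -2.0])
sigma = np.array([1.0, 0.5, 1.5])

# each row is one draw of the three independent Gaussians; combine linearly with a
samples = rng.normal(mu, sigma, size=(200_000, 3)) @ a
print(samples.mean(), a @ mu)                    # both close to -3.5
print(samples.var(), np.sum(a**2 * sigma**2))    # both close to 9.5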
It can be shown that, in a specific, probabilistic sense, this estimator converges on the true CDF given an infinite amount of data. By differentiating (1.44), the associated PMF (PDF) is a sum of Kronecker (Dirac) delta functions:

f_N(x) = \frac{1}{N} \sum_{n=1}^{N} \delta[x_n - x]    (1.45)

See Figure 1.10. The simplicity of this estimator makes it very useful in practice. For example, the expectation with respect to the function g(X) of the ECDF for a continuous random variable is:

E[g(X)] = \int_{\Omega} g(x) f_N(x) \, dx = \int_{\Omega} g(x) \frac{1}{N} \sum_{n=1}^{N} \delta[x_n - x] \, dx = \frac{1}{N} \sum_{n=1}^{N} \int_{\Omega} g(x) \delta[x_n - x] \, dx = \frac{1}{N} \sum_{n=1}^{N} g(x_n)    (1.46)

using the sift property of the delta function, \int f(x) \delta[x - a] \, dx = f(a). Therefore, the expectation of a random variable can be estimated from the average of the expectation function applied to the data. These estimates are known as sample expectations, the most well-known of which is the sample mean \mu = E[X] = \frac{1}{N} \sum_{n=1}^{N} x_n.

[Fig. 1.10: Empirical cumulative distribution function (ECDF) and density functions (EPDF) based on a sample of size N = 20 from a Gaussian random variable with µ = −2 and σ = 1.0. For the estimated ECDF (top), the black ‘steps’ occur at the value of each sample, whereas the blue curve is the theoretical CDF for the Gaussian. The EPDF (bottom) consists of an infinitely tall Dirac ‘spike’ occurring at each sample.]
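The following short sketch (not from the book; it mimics the setup of Figure 1.10 with numpy) evaluates the ECDF at a point and forms sample expectations as averages of a function of the data:

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=-2.0, scale=1.0, size=20)   # same setup as Figure 1.10 (mu = -2, sigma = 1)

def ecdf(t, data):
    # fraction of samples at or below t: the empirical CDF evaluated at t
    return np.mean(data <= t)

print(ecdf(-2.0, x))       # roughly 0.5 near the true mean
print(x.mean())            # sample mean, an estimate of E[X]
print(np.mean(x**2))       # sample expectation of g(X) = X^2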
One of the most important results in probability is the central limit theorem: for an infinite sequence of random variables with finite mean and variance which are all independent of each other, the distribution of the sum of this sequence of random variables will tend to the Gaussian. A simple proof using CFs exists. In many contexts this theorem is given as a justification for choosing the Gaussian distribution as a model for some given data.
These desirable properties of a single Gaussian random variable carry over to vectors with D elements, over which the multivariate Gaussian distribution is:

f(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^D |\boldsymbol{\Sigma}|}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)    (1.47)

where \mathbf{x} = (x_1, x_2, \ldots, x_D)^T, the mean vector \boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_D)^T and \boldsymbol{\Sigma} is the covariance matrix. Equal probability density contours of the multivariate Gaussian are, in general, (hyper-)ellipses in D dimensions. The maximum probability density occurs at \mathbf{x} = \boldsymbol{\mu}. The positive-definite covariance matrix can be decomposed into a rotation and expansion or contraction in each axis, around the point \boldsymbol{\mu}. Another very special and important property of the multivariate Gaussian is that all marginals are also (multivariate) Gaussians. Similarly, this means that conditioning one multivariate Gaussian on another gives another multivariate Gaussian. All these properties are depicted in Figure 1.11 and described algebraically below.

[Fig. 1.11: An example of the multivariate (D = 2) Gaussian with PDF f((x_1, x_2)^T; µ, Σ) (top). Height is probability density value, and the other two axes are x_1, x_2. The contours of constant probability are ellipses (middle), here shown for probability density values of 0.03 (blue), 0.05 (cyan) and 0.08 (black). The maximum probability density coincides with the mean (here µ = (−1, 1)^T). The marginals are all single-variable Gaussians; an example PDF of one of the marginals, with mean µ = −1, is shown at bottom.]

It is simple to extend the statistical stability property of the univariate Gaussian to multiple dimensions to show that this property also applies to the multivariate normal. Firstly, we need the notion of the CF for joint random variables X \in \mathbb{R}^D, which is \psi_X(\mathbf{s}) = E\left[\exp\left(i \mathbf{s}^T X\right)\right]. The CF \prod_{i=1}^{D} \exp\left(-\frac{1}{2} s_i^2\right) = \exp\left(-\frac{1}{2} \mathbf{s}^T \mathbf{s}\right) is that of the standard multivariate Gaussian, which has zero mean vector and identity covariance, X \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). What happens if we apply the (affine) transformation
Y = AX + b with the (full-rank) matrix A \in \mathbb{R}^{D \times D} and b \in \mathbb{R}^D? The effect of this transformation on an arbitrary CF is \psi_Y(\mathbf{s}) = \exp\left(i \mathbf{s}^T \mathbf{b}\right) \psi_X\left(\mathbf{A}^T \mathbf{s}\right), so that for the standard multivariate Gaussian:

\psi_Y(\mathbf{s}) = \exp\left(i \mathbf{s}^T \mathbf{b}\right) \exp\left(-\frac{1}{2} \left(\mathbf{A}^T \mathbf{s}\right)^T \left(\mathbf{A}^T \mathbf{s}\right)\right) = \exp\left(i \mathbf{s}^T \mathbf{b} - \frac{1}{2} \mathbf{s}^T \mathbf{A} \mathbf{A}^T \mathbf{s}\right)    (1.49)

which is the CF of another multivariate normal, from which we can say that Y is multivariate normal with mean \boldsymbol{\mu} = \mathbf{b} and covariance \boldsymbol{\Sigma} = \mathbf{A}\mathbf{A}^T, or Y \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}). It is now straightforward to predict that application of another affine transformation Z = BY + c leads to another multivariate Gaussian:

\psi_Z(\mathbf{s}) = \exp\left(i \mathbf{s}^T \mathbf{c}\right) \exp\left(i \left(\mathbf{B}^T \mathbf{s}\right)^T \boldsymbol{\mu} - \frac{1}{2} \left(\mathbf{B}^T \mathbf{s}\right)^T \boldsymbol{\Sigma} \left(\mathbf{B}^T \mathbf{s}\right)\right) = \exp\left(i \mathbf{s}^T (\mathbf{B}\boldsymbol{\mu} + \mathbf{c}) - \frac{1}{2} \mathbf{s}^T \mathbf{B} \boldsymbol{\Sigma} \mathbf{B}^T \mathbf{s}\right)    (1.50)
\mathbf{m} = \boldsymbol{\mu}_P + \boldsymbol{\Sigma}_{P\bar{P}} \boldsymbol{\Sigma}_{\bar{P}\bar{P}}^{-1} (\mathbf{x}_{\bar{P}} - \boldsymbol{\mu}_{\bar{P}}) and \mathbf{S} = \boldsymbol{\Sigma}_{PP} - \boldsymbol{\Sigma}_{P\bar{P}} \boldsymbol{\Sigma}_{\bar{P}\bar{P}}^{-1} \boldsymbol{\Sigma}_{\bar{P}P}. For detailed proofs, see Murphy (2012, section 4.3.4).
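The affine-transformation result is easy to confirm by simulation. In this hedged sketch (not from the book; A, b and the sample size are arbitrary), samples of X ~ N(0, I) are transformed by Y = AX + b and the empirical mean and covariance are compared with b and AA^T:

import numpy as np

rng = np.random.default_rng(4)
A = np.array([[1.0, 0.5],
              [0.0, 2.0]])
b = np.array([1.0, -1.0])

X = rng.normal(size=(200_000, 2))       # standard multivariate Gaussian samples
Y = X @ A.T + b                         # apply y = A x + b to each sample
print(np.round(Y.mean(axis=0), 2))      # approximately b
print(np.round(np.cov(Y.T), 2))         # approximately A @ A.T
print(A @ A.T)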
Although the Gaussian is special, it is not the only distribution with
the statistical stability property: another important example is the α-
stable distribution (which includes the Gaussian as a special case); in
fact, a generalization of the central limit theorem states that the distribution of the sum of independent random variables with infinite variance tends towards the α-stable distribution. Yet another broad class of distributions that are also invariant to linear transformations are the elliptical distributions, whose densities, when they exist, are defined in terms of a function of the Mahalanobis distance from the mean, d(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}. These useful distributions also have
elliptically-distributed marginals, and the multivariate Gaussian is a spe-
cial case.
Just as the central limit theorem is often given as a justification for
choosing the Gaussian, the extreme value theorem is often a justification
for choosing one of the extreme value distributions. Consider an infinite
sequence of identically distributed but independent random variables,
the maximum of this sequence is either Frechet, Weibull or Gumbel dis-
tributed, regardless of the distribution of the random variables in the
sequence.
Stochastic processes
Stochastic processes are key objects in statistical signal processing and
machine learning. In essence, a stochastic process is just a collection of
random variables Xt on the same sample space Ω (also known as the
state space), where the index t comes from an arbitrary set T which may
be finite or infinite in size, uncountable or countable. When the index
set is finite such as T = {1, 2 . . . N }, then we may consider the collection
to coincide with a vector of random variables.
In this book the index is nearly always synonymous with (relative)
time, and therefore plays a crucial role since nearly all signals are time-
based. When the index is a real number, i.e. T = R, the collection is
known as a continuous-time stochastic process. Countable collections are
known as discrete-time stochastic processes. Although continuous-time
processes are important for theoretical purposes, all recorded signals we
capture from the real world are discrete-time and finite. These signals
must be stored digitally so that signal processing computations may be
performed on them. A signal is typically sampled from the real world at
uniform intervals in time, and each sample takes up a finite number of
digital bits.
This latter constraint influences the choice of sample space for each
of the random variables in the process. A finite number of bits can only
encode a finite range of discrete values, so, being faithful to the digital
representation might suggest Ω = {0, 1, 2 . . . K − 1} where K = 2B for
B bits, but this digital representation can also be arranged to encode
real numbers with a finite precision, a common example being the 32
bit floating point representation, and this might be more faithful to how
we model the real-world process we are sampling (see Chapter 8 for
details). Therefore, it is often mathematically realistic and convenient
to work with stochastic processes on the sample space of real numbers
Ω = R. Nevertheless, the choice of sample space depends crucially
upon the real-world interpretation of the recorded signal, and we should
not forget that the actual computations may only ever amount to an
approximation of the mathematical model upon which they are based.
If each member of the collection of variables is independent of every
other and each one has the same distribution, then to characterize the
distributional properties of the process, all that matters is the distribu-
tion of every Xt , say f (x) for the PDF over the real state space. Simple
processes with this property are said to be independent and identically
distributed (i.i.d.) and they are of crucial importance in many applica-
tions in this book, because the joint distribution over the entire process
factorizes into a product of the individual distributions which leads to
considerable computational simplifications in statistical inference prob-
lems.
However, much more interesting signals are those for which the stochas-
tic process is time-dependent, where each of the Xt is not, in general, in-
dependent. The distributional properties of any process can be analyzed
by consideration of the joint distributions of all finite-length collections
of the constituent random variables, known as the finite-dimensional
distributions (f.d.d.s) of the process. For the real state space the f.d.d.s
are defined by a vector of N time indices \mathbf{t} = (t_1, t_2, \ldots, t_N)^T, where the vector (X_{t_1}, X_{t_2}, \ldots, X_{t_N})^T has the PDF f_{\mathbf{t}}(\mathbf{x}) for \mathbf{x} = (x_1, x_2, \ldots, x_N)^T.
It is these f.d.d.s which encode the dependence structure between the
random variables in the process.
The f.d.d.s encode much more besides dependence structure. Statis-
tical DSP and machine learning make heavy use of this construct. For
example, Gaussian processes which are fundamental to ubiquitous topics
such as Kalman filtering in signal processing and nonparametric Bayes’
inference in machine learning are processes for which the f.d.d.s are all
multivariate Gaussians. Another example is the Dirichlet process used
in nonparametric Bayes’ which has Dirichlet distributions as f.d.d.s. (see
Chapter 10).
Strongly stationary processes are special processes in which the f.d.d.s
are invariant to time translation, i.e. (X_{t_1}, X_{t_2}, \ldots, X_{t_N})^T has the same f.d.d. as (X_{t_1+\tau}, X_{t_2+\tau}, \ldots, X_{t_N+\tau})^T, for τ > 0. This says that the local distributional properties will be the same at all times. This is yet
another mathematical simplification and it is used extensively. A less
restrictive notion of stationarity occurs when the first and second joint
moments are invariant to time delays: cov [Xt , Xs ] = cov [Xt+τ , Xs+τ ]
and E [Xt ] = E [Xs ] for all t, s and τ > 0. This is known as weak
stationarity. Strong stationarity implies weak stationarity, but the im-
plication does not necessarily work the other way except in special cases
(stationary Gaussian processes are one example for which this is true).
The temporal covariance (autocovariance) of a weakly stationary process
depends only on time translation τ : cov [Xt , Xt+τ ] = cov [X0 , Xτ ] (see
Chapter 7).
Markov chains
Another broad class of simple non-i.i.d. stochastic processes are those
for which the dependence in time has a finite effect, known as Markov
chains. Given a discrete-time process with index t ∈ Z over a discrete state space, such a process satisfies the Markov property:

f(X_{t+1} = x_{t+1} | X_t = x_t, X_{t-1} = x_{t-1}, \ldots, X_1 = x_1) = f(X_{t+1} = x_{t+1} | X_t = x_t)

for all t ≥ 1 and all x_t. This conditional probability is called the transi-
tion distribution.
What this says is that the probability of the random variable at any
time t depends only upon the previous time index, or stated another
way, the process is independent given knowledge of the random variable
at the previous time index. The Markov property leads to considerable
simplifications which allow broad computational savings for inference in
hidden Markov models, for example. (Note that a chain which depends on M ≥ 1 previous time indices is known as an M-th order Markov chain, but in this book we will generally only consider the case where M = 1, because any higher-order chain can always be rewritten as a 1st-order chain by the choice of an appropriate state space.) Markov chains are illustrated in Figure 1.12.

[Fig. 1.12: Markov (left) and non-Markov (right) chains. For a Markov chain, the distribution of the random variable at time t, X_t, depends only upon the value of the previous random variable X_{t-1}; this is true for the chain on the left. The chain on the right is non-Markov because X_3 depends upon X_1, and X_{T-1} depends upon an ‘external’ random variable Y.]

A further simplification occurs if the temporal dependency is time translation invariant:
f(X_{t+1} = x_{t+1} | X_t = x_t) = f(X_2 = x_2 | X_1 = x_1)    (1.53)
for all t. Such Markov processes are strongly stationary. Now, using the
basic properties of probability we can find the probability distribution
over the state at time t + 1:
\sum_{x'} f(X_{t+1} = x_{t+1} | X_t = x') f(X_t = x') = \sum_{x'} f(X_{t+1} = x_{t+1}, X_t = x') = f(X_{t+1} = x_{t+1})    (1.54)
or in matrix-vector form:
f(X = i | X' = j) f(X' = j) = f(X = j | X' = i) f(X' = i)

f(X = i, X' = j) = f(X = j, X' = i)    (1.61)
which implies that pij = pji , in other words, these chains have symmetric
transition matrices, P = PT , which are doubly stochastic in that both
the columns and rows are normalized. Interestingly, the distribution q
is also a stationary distribution of the chain because:
\sum_j p_{ij} q_j = \sum_j p_{ji} q_i = q_i \sum_j p_{ji} = q_i    (1.62)
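A small numerical check (illustrative, with a made-up transition matrix) that a symmetric, doubly stochastic transition matrix leaves the uniform distribution unchanged, and that iterating the chain converges to it:

import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])      # symmetric; rows and columns each sum to 1
q = np.ones(3) / 3.0                 # uniform distribution
print(np.allclose(P @ q, q))         # True: q is a stationary distribution

p = np.array([1.0, 0.0, 0.0])        # start from a concentrated distribution
for _ in range(50):
    p = P @ p                        # repeated application of the transition matrix
print(np.round(p, 3))                # approximately [0.333 0.333 0.333]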
This symmetry also implies that such chains are reversible if in the
stationary state, i.e. pt = q:
i.e. a Gaussian with mean a X_{t-1} and standard deviation σ. This important example of a Gaussian Markov chain is known as an AR(1) or autoregressive process of order 1.
f(X_{t+1} = x_{t+1}) = \int_{\mathbb{R}} f(X_{t+1} = x_{t+1} | X_t = x') f(X_t = x') \, dx'    (1.66)
f(X = x | X' = x') f(X' = x') = f(X = x' | X' = x) f(X' = x)    (1.68)

Indeed, for the AR(1) case described above, it can be shown that the chain is reversible with Gaussian stationary distribution f(x) = \mathcal{N}\left(x; 0, \sigma^2 / (1 - a^2)\right).
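A quick simulation check of this stationary distribution (illustrative only; the recursion X_t = a X_{t-1} + e_t with e_t ~ N(0, σ²) is the standard way to realize the transition density quoted above, and the parameter values are arbitrary):

import numpy as np

a, sigma, T = 0.8, 1.0, 200_000
rng = np.random.default_rng(5)
x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal(scale=sigma)   # AR(1) recursion

print(x[1000:].var())              # close to sigma^2 / (1 - a^2)
print(sigma**2 / (1 - a**2))       # = 2.777...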
For further reading, Grimmett and Stirzaker (2001) is a very comprehensive reference.
P(X|Y) = \frac{P(X, Y)}{P(Y)} \implies H[X|Y] = H[X, Y] - H[Y]    (1.70)
This is intuitive: it states that the information in X given Y , is what
we get by subtracting the information in Y from the joint information
contained in X and Y simultaneously. For this reason we will generally
work with the Shannon entropy in this book, so that by referring to
‘entropy’ we mean Shannon entropy.
For random processes, Kolmogorov complexity and entropy are inti-
mately related. Consider a discrete-time, i.i.d. process Xn on the dis-
crete, finite set Ω. Also, denote by K [S|N ] the conditional Kolmogorov
complexity where the length of the sequence N is assumed known to the
program. We can show that, as N → ∞ (Cover and Thomas, 2006, page 154):
-\log_2 f(\mathbf{x}) \approx -\sum_{i=1}^{M} N p_i \log_2 p_i = -N \sum_{i=1}^{M} p_i \log_2 p_i = N H[X]
which approximately coincides with both the total entropy, and the ex-
pected Kolmogorov complexity, asymptotically. The exact nature of this
approximation is made rigorous through the concepts of typical sets and
the asymptotic equipartition principle (Cover and Thomas, 2006, chap-
ter 3). So, the information map gets us directly from the probabilistic
to the data compression modeling points of view in machine learning.
A key algebraic observation here is that applying the information map
− ln P (X) has the effect of converting multiplicative probability calcu-
lations into additive information calculations. For conditional probabil-
ities:
-\ln P(X|Y) = -\ln \frac{P(X, Y)}{P(Y)} = -\ln P(X, Y) + \ln P(Y)    (1.73)
and for independent random variables, since P(X, Y) = P(X) P(Y), the joint information is simply the sum of the individual informations: -\ln P(X, Y) = -\ln P(X) - \ln P(Y).
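A short numerical check of the identity in (1.70) for a made-up joint PMF (illustrative only, numpy assumed):

import numpy as np

def entropy(p):
    # Shannon entropy in bits of a PMF given as an array of probabilities
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P_xy = np.array([[0.25, 0.25],
                 [0.10, 0.40]])            # joint PMF of (X, Y); rows index X, columns index Y
P_y = P_xy.sum(axis=0)                      # marginal over Y

H_xy = entropy(P_xy.ravel())
H_y = entropy(P_y)
H_x_given_y = H_xy - H_y                    # conditional entropy via (1.70)
print(round(H_x_given_y, 4))

# direct computation as a weighted average of conditional entropies agrees
direct = sum(P_y[j] * entropy(P_xy[:, j] / P_y[j]) for j in range(2))
print(round(direct, 4))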